ATLAS Production System TWiki



Tuesday, February 26, 2013

Curious case of related datasets

Going through the use cases for DEFT/JEDI, I have noticed a feature implicit in how the current workflow works, and not well understood by many people (including myself): the mechanism that "bundles together" the datasets that are related in the processing logic. For example, a PanDA task may produce datasets B and C based on an input dataset A, and datasets D and E based on another input dataset F. Since there is no concept of a Meta-Task in ProdSys I/PanDA, the inter-task chains of dependencies among the datasets are maintained semi-manually. The situation is represented by the simplified diagram below:

In this diagram, different colors represent different logical connections between datasets; for example, the dataset DS4 is consumed by task T3, resulting in the creation of the dataset DS7 (also colored red). The same applies to the "blue" datasets in the diagram. For completeness' sake, we also have the dataset DS5 there, which is the result of processing both "blue" and "red" input datasets.

At first glance, this presents a complication for the graph model employed in ProdSys II, since it introduces dependencies involving datasets, which are already modeled as edges in the Meta-Task graph. One approach would be to create a conjugate graph in which the edges (datasets) become nodes and their connections (tasks) become edges (in graph theory, the conjugate graph is also known as the line graph). Such a graph could then be stored in a separate GraphML file, or in the same file, which would result in two disconnected graphs being created upon parsing.
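
As a rough illustration, the conjugate graph can be built directly with NetworkX, which provides a line-graph function out of the box (the task and dataset names below are illustrative, not taken from an actual Meta-Task):

    import networkx as nx

    # Original Meta-Task graph: tasks are nodes, datasets are edges.
    g = nx.DiGraph()
    g.add_edge("T1", "T3", dataset="DS4")  # T1 outputs DS4, consumed by T3
    g.add_edge("T2", "T3", dataset="DS6")

    # The conjugate (line) graph: each edge of g becomes a node.
    lg = nx.line_graph(g)

    # Relabel the line-graph nodes by dataset name for readability.
    lg = nx.relabel_nodes(lg, {e: g.edges[e]["dataset"] for e in lg})

    # Either graph can then go to its own GraphML file, or both into one.
    nx.write_graphml(g, "meta_task.graphml")
    nx.write_graphml(lg, "meta_task_conjugate.graphml")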

Either solution creates its own set of problems, and both share the need for crafty logic to keep referential integrity when manipulating Meta-Task graphs, in operations such as template processing.

Alternatively, one may try to implement an equivalent of a "port" in the task node. Unfortunately, this feature is not supported by the parser that comes with NetworkX. While it is always possible to roll our own, doing so would likely negate one of the advantages of using this package.
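
That said, a conceivable workaround (a sketch only, not something ProdSys II commits to) is to emulate a port with an ordinary edge attribute and group edges by it during traversal:

    import networkx as nx

    g = nx.MultiDiGraph()
    # The hypothetical "port" attribute records which input/output
    # slot of the task a dataset is attached to.
    g.add_edge("DS4", "T3", port="in_red")
    g.add_edge("DS6", "T3", port="in_blue")
    g.add_edge("T3", "DS7", port="out_red")

    # Traversal logic then has to do the grouping by hand:
    red_inputs = [u for u, v, d in g.in_edges("T3", data=True)
                  if d["port"] == "in_red"]

This keeps the file readable by the stock parser, but pushes the port semantics into application code, which is exactly the kind of crafty logic we would rather avoid.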

A possible solution to this dilemma is to go back to the basic definition of the term "task", which was originally thought of as a unit of processing with one dataset coming in and one dataset being output. If we lift the restriction on the number of datasets in that definition but introduce "conservation of color", i.e. define the "task" as a processing unit that only operates on related "sets of datasets", we end up with what is essentially a partitioned but connected graph.

This actually amounts to a relatively straightforward reformulation of the graph presented above, with the creation of additional nodes. For example, the picture above is transformed into the following:

The new feature in this graph is that different tasks can depend on the same group of datasets, such as DS2 and DS3. What allows this to work in practice is that while the edges representing these different parts of the graph are distinct in the graph description, they refer to the same datasets, whose state is managed by JEDI.

Note that the "prime" tasks in this diagram share all or most of their attributes with the original ones and differ only in their input and output; this, however, is taken care of in the graph itself.
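
In NetworkX terms, the reformulated graph could be sketched as follows (the diagram itself is not reproduced here, so the task and dataset names are illustrative):

    import networkx as nx

    g = nx.DiGraph()
    # The "red" chain is handled by the original task T3...
    g.add_edge("DS4", "T3")
    g.add_edge("T3", "DS7")
    # ...and the "blue" chain by the prime task T3', which shares
    # most of its attributes with T3 but has its own edges.
    g.add_edge("DS6", "T3'")
    g.add_edge("T3'", "DS8")
    # Different tasks may depend on the same group of datasets:
    # DS2 and DS3 appear only once and are shared by T1 and T1'.
    for task in ("T1", "T1'"):
        g.add_edge("DS2", task)
        g.add_edge("DS3", task)

The graph stays connected through the shared dataset nodes, while each task node now "conserves color" by touching only one related set of datasets.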



Monday, February 11, 2013

February-March 2013: ProdSys II progress report (Maxim)


Documentation work:
  • This blog: created tags "prodsys1" and "prodsys2" for better search capability.
  • Created a common navigation header (bar) that can be included in all ProdSys TWiki pages.
  • References to DEFT and further details added to documentation on the ProdSys pages.
  • Corrections in DEFT/JEDI interface description as per Tadashi's comments.
  • Prepared Abstract for the ProdSys paper (CHEP). Abstract approved by ATLAS and submitted.
  • Presentation for the CMS/ATLAS Common Analysis Platform on 2/28/2013:
    • Based on the announcement on 2/14 of a CMS development largely parallel to what we do in ATLAS
    • Potential redundancy, under-utilization of PanDA capability, suboptimal database load
    • Clear potential for common development and platform
  • Presentation for the ATLAS Software and Computing Workshop, March 11-15 2013
  • Meeting with Wolfgang to discuss progress and requirements
Development:
  • DEFT prototype: functionality complete
  • SVN project created, code checked in
    • Continuous updates and checkpoints
    • Naming of the SVN tree as per Tadashi's comments
  • Tested database schemas for the Task, Dataset and Meta-Task objects.
  • Extensive refactoring and rewrite of the main code unit: due to the large amount of new functionality and increased complexity, the application has become a simple CLI driver for the underlying classes.
  • Dedicated test of the code state-switching functionality
  • Improvements in logging functionality; a Logger class was created based on the standard Python logging package (see the sketch after this list)
  • Started work on the Dependency Model for datasets
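
The Logger class itself is not shown here; a minimal sketch of what such a wrapper around the standard logging package might look like (the class name and interface are assumptions, not the actual DEFT code):

    import logging

    class Logger:
        """Thin wrapper around the standard logging package."""

        def __init__(self, name="deft", verbose=False):
            self._log = logging.getLogger(name)
            if not self._log.handlers:  # avoid adding duplicate handlers
                handler = logging.StreamHandler()
                handler.setFormatter(logging.Formatter(
                    "%(asctime)s %(levelname)s %(message)s"))
                self._log.addHandler(handler)
            self._log.setLevel(logging.DEBUG if verbose else logging.INFO)

        def info(self, msg):
            self._log.info(msg)

        def debug(self, msg):
            self._log.debug(msg)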