ATLAS Production System Twiki


Permanent Documentation Links
Blog Tags: prodsys1, prodsys2

Wednesday, December 19, 2012

Deft and Jedi

Design Parameters

In the initial design, Jedi was designated as the component most closely coupled with the PanDA service itself, while Deft was conceived as a more platform-neutral database engine for the bookkeeping of tasks (hence its name). If the decision to separate the two is made, it makes sense to leverage this factorization and adopt a component-based architecture. In the following we assume that Deft will have a minimal dependency on PanDA specifics, while Jedi takes full advantage of integration with PanDA. Wherever possible, Jedi will have custody of the content stored in the database tables, and Deft will largely use references to these data.

Communication

The complete logic of the interaction of Deft with Jedi (and vice versa) is being worked out and will be documented on the ProdSys TWiki when it's ready. For now, let's note a few features of the Deft/Jedi tandem design:
  • in managing the workflow in Deft, it is unnecessary (and in fact undesirable) to implement subprocesses and/or multi-threaded execution such as that found in PaPy and similar packages. This is because Deft is NOT a processing engine, or even a resource-provisioning engine, as is the case in some Python-based workflow engines. It is a state machine that does not necessarily need to work in real time (or even near real time).
  • there are a few ways to establish communication between Deft and Jedi, which are not mutually exclusive. For example, there may be callbacks, and database token updates, which may in fact co-exist. If a message is missed for some reason, the information can be picked up during a periodic sweep of the database.
An interesting question is whether the database should be exposed through a typical access-library API or as a Web Service. The latter provides much better flexibility at the cost of a potential reduction in performance; this needs to be evaluated.

The job database in Jedi will experience a significant read/write load. It is important to insulate Deft and its task databases from this excessive read/write activity. This can be accomplished in a number of ways; for example, the number of jobs finished in a task can be updated periodically rather than at the actual job completion rate.
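As an illustration of this throttling idea, here is a minimal Python sketch. All names here are hypothetical, not the real ProdSys/Jedi API: Deft refreshes per-task completion counts at a fixed period via a single aggregate query, instead of reacting to every job-state change in Jedi.

```python
import time

class TaskProgressCache:
    """Sketch of periodic (throttled) progress updates for Deft.

    fetch_counts is a callable returning {task_id: jobs_finished},
    e.g. one aggregate query against the Jedi job table. The name and
    interface are invented for illustration.
    """

    def __init__(self, fetch_counts, period=300):
        self._fetch_counts = fetch_counts
        self._period = period          # seconds between refreshes
        self._counts = {}
        self._last_sweep = 0.0

    def jobs_finished(self, task_id):
        # Refresh at most once per period, regardless of how often
        # callers ask -- this insulates Deft from the job-level write rate.
        now = time.time()
        if now - self._last_sweep >= self._period:
            self._counts = self._fetch_counts()
            self._last_sweep = now
        return self._counts.get(task_id, 0)
```

The same pattern covers missed callbacks: whatever notification is lost is recovered on the next periodic sweep.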

Data Services Interaction

Design documentation for Jedi is presented on a CERN TWiki page.

Questions/comments:
  • in the "Workflow" section, the dataset name is generated automatically. Does this part belong in Jedi? Presumably yes, although it could also be handled by a separate service
  • Deft operates by traversing its graph model of the workflow, taking action at each node depending on the node's incoming edges. It is therefore possible that operations such as insertion into the dataset table are done by calling a method in a plug-in, or by calling a web service. If the latter is implemented, it becomes a separate component complementary to Deft/Jedi.
  • Incremental task execution: currently the proposed solution is the same on both the Deft and Jedi sides. See the "Augmentation" section of the ProdSys Workflow page, and the Incremental Task section of the Jedi page. Basically, the state of the relevant task nodes is reset, and new edges (datasets) are added. The issue of what to do with tasks that have already transitioned to the Archived state can be resolved by Meta-Task cloning. Specifically, if any of the tasks went to "archived", the whole Meta-Task must go to the "archived" state and further changes are disallowed, but cloning is allowed.
  • General note: Deft does not contain any reference to individual files. One consequence of this is that "open dataset handling" is done in Jedi. It might be a good idea to avoid this mode of operation, if possible.
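The node-by-node traversal described above can be sketched in a few lines of Python. This is not the real Deft code; the state names and the adjacency-map structure are invented for illustration:

```python
# Sketch of one Deft-style sweep over a Meta-Task graph.
# states: {task: state}; preds: {task: [upstream tasks]};
# edge_ready: {(src, dst): True if the dataset on that edge is available}.

def sweep(states, preds, edge_ready):
    """Flip 'waiting' tasks whose input datasets (incoming edges)
    are all available to 'ready'; return the tasks that changed."""
    newly_ready = []
    for node, state in states.items():
        if state != "waiting":
            continue
        # A task with no predecessors (an entry node) is ready at once.
        if all(edge_ready[(p, node)] for p in preds.get(node, [])):
            states[node] = "ready"
            newly_ready.append(node)
    return newly_ready
```

Because Deft is a state machine rather than a processing engine, a sweep like this can simply run periodically over the persisted graph; no long-lived threads are required.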

Sunday, December 2, 2012

December 2012: ProdSys II Progress Report (Maxim)

12/16/12 to 12/31/12

Documentation work:
  • Quick links added to the top of all core pages in the ProdSys TWiki
  • Added "Meta-Task Recovery" to the Requirements and Workflow pages, to reflect previously documented requirements for the system. These date back to February 2012 and have been mentioned elsewhere.
Development:
  • Evaluation of Python packages for Workflow Management:
    • Soma workflow
    • Weaver workflow
    • Other items as per TWiki documentation
  • Evaluation: Graph representation in XML, standard solutions studied in conjunction with parsing tools
  • Preliminary selection of GraphML as the language, since it enjoys standardization, fairly simple syntax, and parsing support
  • A prototype of the graph builder and workflow engine created, based on:
    • "GraphML" as the input language
    • "Networkx" as the serializer/deserializer
    • "PyUtilib" as the workflow engine
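As a rough illustration of the GraphML/networkx combination (the tiny Meta-Task below is invented, not a real production graph), deserialization and ordering of the steps looks like this:

```python
import networkx as nx

# A minimal GraphML document describing a three-step Meta-Task chain.
GRAPHML = """<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="metatask" edgedefault="directed">
    <node id="evgen"/>
    <node id="simul"/>
    <node id="recon"/>
    <edge source="evgen" target="simul"/>
    <edge source="simul" target="recon"/>
  </graph>
</graphml>"""

# networkx deserializes GraphML into a directed graph...
g = nx.parse_graphml(GRAPHML)
# ...and a topological sort gives the execution order of the steps,
# which is what a workflow engine such as PyUtilib can consume.
order = list(nx.topological_sort(g))
```

Serialization back to GraphML (nx.generate_graphml) closes the round trip, so the same file format serves as both the input language and the persistence format of the prototype.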

12/01/12 to 12/15/12

Meetings:
  • BNL
    • a meeting was held at BNL to discuss the current status of the ProdSys documentation and the initial design of the new system. Present: K.De, V.Fine, A.Klimentov, S.Panitkin, M.Potekhin, T.Wenaus
    • the Graph Model was reviewed and approved
    • decision made to avoid heavy-weight XML-based solutions
  • FNAL/CMS
    • a Common Platform Meeting was held at FNAL (Dec. 5-7). M.Potekhin attended in person, and T.Wenaus remotely.
    • a review of the work previously done under the mandate of Feasibility Study of PanDA/GlideinWMS Integration. Fairly detailed discussion of scalability issues.
    • a brief overview of the current work on ProdSys II. Burt suggested looking into the DAGMan application (note: rejected after a documentation review).
  • FNAL/LBNE
    • Initial meeting at FNAL with LBNE team, some ProdSys ideas discussed, among other issues pertaining to the general software management in LBNE.
  • FNAL/BNL
    • Initial meeting at BNL with the local LBNE team. Followed up on the FNAL discussion and agreed on basic principles of software coordination
    • A draft document created describing the principles of Software Coordination at LBNE

Documentation work:
  • Workflow page has been created
  • Considerable cleanup of TWiki pages for Panda, ProdSys, TaskModel
  • Some presentation material added
  • Added "Augmenting Live Meta-Task" and "Partially Defined Meta-Task" use cases
  • Permanent links added to this Blog (front page)
Development:
  • Evaluation of Python packages for Workflow Management:
    • pyutilib.workflow
    • romanchyla/workflow on Github

Monday, November 19, 2012

November 2012: ProdSys II Progress Report (Maxim)

11/16/12 to 11/30/12

Design and Documentation work:
  • Added the description of a few more tables to the DB page (current DB)
  • Added a chapter on RDBMS representation of the graph model, with four methods of graph representation considered
  • Worked on the ProdSys object model and schemas for the following components:
    • Meta-Task
    • Task
    • Adjacency map, as apparently the most efficient way of representing Tasks in an RDBMS
What's new - the model:
  • If datasets are properties of the edges in the graph representing the Meta-Task, this makes for a natural implementation of the workflow logic, since in the model currently used by the Coordinators, the dependency between adjacent tasks in the graph is established on the basis of the data being available for the next step
  • New set of "states" for the task, aligned with JEDI
  • Introduced Pseudo-tasks: entry and exit, a common practice in Grid-based workflow management
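The three points above can be combined into a small sketch of the adjacency-map representation: each database row stores one edge of the Meta-Task graph, carrying the dataset as an edge property, with "_entry" and "_exit" pseudo-tasks terminating the chain. Table, column, and dataset names are invented for illustration; the real schema is on the ProdSys DB pages.

```python
import sqlite3

# Hypothetical adjacency-map table: one row per edge of the Meta-Task.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE task_edge (
    meta_task_id INTEGER,
    src_task     TEXT,   -- upstream task ('_entry' for the pseudo-task)
    dst_task     TEXT,   -- downstream task ('_exit' for the pseudo-task)
    dataset      TEXT    -- dataset carried on this edge
)""")
edges = [(1, "_entry", "evgen", None),
         (1, "evgen",  "simul", "mc12.evgen.EVNT"),
         (1, "simul",  "recon", "mc12.simul.HITS"),
         (1, "recon",  "_exit", "mc12.recon.AOD")]
conn.executemany("INSERT INTO task_edge VALUES (?,?,?,?)", edges)

# Recovering a task's input datasets is a single lookup on dst_task.
inputs = conn.execute(
    "SELECT dataset FROM task_edge WHERE meta_task_id=? AND dst_task=?",
    (1, "recon")).fetchall()
```

This is why the adjacency map is attractive in an RDBMS: both "inputs of a task" and "outputs of a task" are plain indexed queries, with no recursive traversal needed for the common operations.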

11/01/12 to 11/15/12

Documentation work
  • Cleaned up documentation on the main ProdSys page
  • Added descriptions of a few more "T-tables" to the DB page. Up to 20 tables have been identified as no longer used, orphaned or invalid
  • More information has been added to the Main ProdSys Twiki page, based on the Production Group documentation and inspection of the code used in preparation of the LIST data.
  • An additional Task Model page has been created for better organization of the documentation.
  • The description of the Production Database has been supplemented with information about additional tables
Operations and Development
  • Performed maintenance of the development server at BNL, necessary due to migration to new hardware
  • Continued practicing with the "Spreadsheet Process" workflow management scripts, inspected produced data, documented the experience on the ProdSys page



Wednesday, November 14, 2012

Performance study of the event table for JeDi


The event table is a new table for JeDi which keeps track of the progress of jobs at the event level. We are planning to use the table for the event server and for event-level job splitting. Here is the first result of the performance test for the event table. The table was created in INTR with the following schema:

Name                     Type
PANDAID                  NOT NULL NUMBER(11)
FILEID                   NOT NULL NUMBER(11)
JOB_PROCESSID            NOT NULL NUMBER(10)
DEF_MIN_EVENTID          NUMBER(10)
DEF_MAX_EVENTID          NUMBER(10)
PROCESSED_UPTO_EVENTID   NUMBER(10)

where PANDAID and FILEID are the IDs in the job and file tables, JOB_PROCESSID is the ID of a subprocess, DEF_MIN_EVENTID and DEF_MAX_EVENTID define the range of events for the subprocess, and PROCESSED_UPTO_EVENTID records how many events have been processed so far. The primary key is the combination of PANDAID, FILEID, and JOB_PROCESSID. The table is index-organized and range-partitioned on PandaID, which is handy for avoiding row-by-row deletion and tree fragmentation. Each partition fits 1 million PandaIDs.

The idea of the event server is shown in Fig.1.

Figure 1. A schematic view of the event server

In the event server scheme, multiple pilots process the same job and file in parallel, but each of them takes care of a different range of events. When the panda server receives a request from a pilot, it sends a range of events (e.g., DEF_MIN_EVENTID=0 and DEF_MAX_EVENTID=99) to the pilot together with the job specification, and one record is inserted into the event table. The pilot sends a heartbeat every N events processed, so that PROCESSED_UPTO_EVENTID of the record is updated in the event table. When another pilot comes, the panda server scans the event table and sends a new range of events (e.g., DEF_MIN_EVENTID=100 and DEF_MAX_EVENTID=299) to the pilot if there are events remaining for the job and file.
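The range-assignment step can be sketched as follows. This is an illustration, not the actual PanDA server code: 'records' stands in for the event-table rows of one (PandaID, FileID) pair, with field names mirroring the schema above.

```python
def next_event_range(records, n_events_total, chunk=100):
    """Return (DEF_MIN_EVENTID, DEF_MAX_EVENTID) for the next pilot,
    or None if all events of the file are already assigned.

    Each record is a dict carrying DEF_MIN_EVENTID / DEF_MAX_EVENTID,
    as in the event table.
    """
    # Find the highest event ID already handed out for this job/file.
    assigned_upto = -1
    for rec in records:
        assigned_upto = max(assigned_upto, rec["DEF_MAX_EVENTID"])
    if assigned_upto + 1 >= n_events_total:
        return None                      # nothing left to assign
    def_min = assigned_upto + 1
    def_max = min(def_min + chunk - 1, n_events_total - 1)
    return (def_min, def_max)
```

In the real system this scan-and-insert must of course be atomic (a transaction on the event table), so that two concurrent pilot requests cannot be handed overlapping ranges.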

A script was implemented to emulate the interactions between the panda server and the database. The script spawned 5000 child processes and 1000 jobs were processed in parallel, i.e., 5 child processes were used for one job. Each child process sent a heartbeat every 2 seconds. The script processed roughly 0.4 million jobs per day, which corresponds to half of the number of jobs processed per day in the current system. Note that INTR is hosted on a low-performance machine, since it is a testbed, and not all jobs will use the event server scheme. Although this result might already be acceptable, we will continue stress tests to see if further optimization is possible.

Friday, November 2, 2012

Notes on the "Spreadsheet Process"

Information regarding the current methods of processing task requests is being added to the Main ProdSys page. The spreadsheet is used to model the graph representation of the Meta-Task (the object that is missing in the ProdSys I model), and to serve simultaneously as the database and the UI for the workflow management system.

Here, we present a few points as an overview of the "spreadsheet process":

  • A spreadsheet is created according to a specific template. The format is that of Apache OpenOffice (ODS).
  • The information in the spreadsheet is accessed by parsing the XML contents of the file in which it is saved. The module xml.dom.minidom is used in the processing scripts.
  • In general, parsing is done for each stage of the "chain", i.e. event generation, simulation and reconstruction. There can be merging steps performed in between.
  • Each script, when run, produces a text file with information specifying the task parameters for the specific step.
  • The scripts can detect certain types of errors, which will be flagged in the output and can be found, e.g. by using "grep".
  • The text files generated in this process can be submitted to the Production System in one of two ways: (a) via a Web interface, where the user can copy and paste the contents of a file, or (b) via a CLI script which accesses the same Web service
Apart from the process described above, validation procedures can be applied to the data. One important aspect of the existing suite of scripts is SVN access, which may present portability issues (e.g. when running at a site outside of the CERN perimeter). These notes will be updated and augmented as we proceed with analyzing the code and data flow.
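The parsing step can be illustrated with a toy fragment of an ODS content.xml (in reality the spreadsheet is unzipped first and its content.xml is fed to the parser). The OpenDocument tag names below are standard; the cell values are invented:

```python
import xml.dom.minidom

# Toy fragment of an ODS content.xml with one row of two cells.
CONTENT = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <table:table-row>
    <table:table-cell><text:p>evgen</text:p></table:table-cell>
    <table:table-cell><text:p>simul</text:p></table:table-cell>
  </table:table-row>
</office:document-content>"""

dom = xml.dom.minidom.parseString(CONTENT)
# Each table-cell wraps a text:p element whose text child is the value.
cells = [c.firstChild.firstChild.data
         for c in dom.getElementsByTagName("table:table-cell")]
```

A real cell may be empty or carry repeat-count attributes, so production parsing needs more care than this sketch, but the mechanism is the same.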

Monday, October 29, 2012

ProdSys splinter meeting (October 2012) action items


  1. Combined list requests (Dmitry, Sasha), October 2012 ☑ done
  2. SSO for list request (Dmitry), October 2012 ☑ done
  3. Automatic task splitting (Dmitry), End October - November 2012
  4. Tasks cloning (Alexei, Valeri, Dmitry), October, 2012
    1. I/F part is ready. Nov 5, 2012. : http://pandamon.cern.ch/tasks/clonetask
    2. Task Request I/F parameters checking is in progress
  5. Running from nightlies (Andrej, Rod), October 2012
  6. Tag definition I/F. New implementation (Sasha, Dmitry), December 2012
  7. TR features for Group Production. Hiding unnecessary fields (Nurcan). Not assigned. More info is needed to implement it
  8. Pile up tasks start up before simulation is done. (Sasha), October 2012 ☑ done
  9. 1% issue. Implementation is postponed
  10. Scouts info usage for simulation tasks (Andrej, Rod, Wolfgang), Oct-Nov 2012 ☑ done
  11. FTK, file naming convention for merging step (Sasha, Graeme), October 2012 ☑ done
  12. CPU consumption information taken from TRF (Sasha, Graeme, 'Wuppertal group'), November 2012
  13. Meta-Language for Task Requests. ☑ done  ( GraphML schema chosen - Maxim)
  14. CAPTCHA in TR I/F (Dmitry), October-November 2012 ☑ done
    1. it will be implemented as SSO and CAPTCHA option won't be needed anymore
  15. Requestor I/F . RIF. Wolfgang, Maxim, Valeri
    1. RIF specs (Wolfgang, Maxim)
    2. Twiki from Maxim : https://twiki.cern.ch/twiki/bin/viewauth/Atlas/ProdSys
    3. mid-Nov : Wolfgang will prepare an initial list of requirements
  16. Task Request CLI
    1. postponed until ProdSys II
  17. Documentation, Savannah, Twiki (Maxim, Dmitry)
  18. AGIS/PanDA integration (Ale, Alden, AlexeyA)
    1. "Alden part " December 2012
    2. End-to-end test Jan 2013
    3. Production version, Feb 2013
  19. Task search options (Alexei), October 2012, ☑ done
  20. Monitoring.
    1. long running jobs/tasks
    2. Task progress based on  task's submission info
    3. Failed jobs monitoring (by error type)
    4. 'Stuck' tasks 
    5. Integration of existing group production monitoring tools with PanDA Classical and Dashboard monitoring (Nurcan, Jarka, Laura, Valeri)
    6. PanDA classical pages response time (Valeri)


Wednesday, October 24, 2012

Notes on Workflow Management

Workflow management is a rich topic in both the theoretical and practical sense. There are a large number of concepts and software products which support WM, with specialization according to domain, such as managing business workflows vs. scientific workflows. There are some commonalities across these boundaries, e.g. the widely adopted use of XML as a means to describe or capture the state of the system. However, business-oriented WMS do not appear well suited for computational workflow management. We note YAWL and BPEL as business-oriented systems, while Kepler was designed to drive scientific workflow applications.

Paper on workflow scheduling on the Grid.

There is a useful paper on Workflow Patterns. The illustrations below represent two of the workflow patterns of interest in our application, among others.

Tuesday, October 23, 2012

Introduction to ATLAS PanDA Production System

October 23, 2012

The Role of ProdSys

The natural unit of workload handled by PanDA is a single payload job. Defining the exact nature of the payload, the source and destination of data, and the various other parameters that characterize a job is outside the scope of core PanDA itself.

The ATLAS Production System serves the extremely important role of defining jobs for a large part of the workload handled by PanDA. Jobs are defined in large sets that constitute "tasks", which are formulated to fulfill "task requests". Each task has a number of attributes, set in accordance with a particular request, and is typically translated into a large number of jobs. The existing Production System consists of a task request interface, a set of scripts that translate tasks into their respective jobs, and a few tools for modifying certain parameters of active tasks.

Individual job definitions in the existing system are created based on the task parameters and remain static for the duration of the task execution. Data pertaining to requests, tasks and jobs reside in the database, and operation of the Production System can be described as transforming one object into another, starting with requirements, formulating tasks and then creating a list of jobs for each task, for execution in PanDA.
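The request-to-task-to-jobs transformation can be sketched schematically. The fields and the splitting rule below are invented for illustration; real tasks carry far more attributes:

```python
def expand_task(task_id, n_input_events, events_per_job):
    """Expand one task into a list of static job definitions,
    as in the existing system: definitions are fixed at task
    creation and do not change during execution."""
    jobs = []
    for first in range(0, n_input_events, events_per_job):
        jobs.append({
            "task_id": task_id,
            "first_event": first,
            "n_events": min(events_per_job, n_input_events - first),
        })
    return jobs
```

It is precisely this static, up-front expansion that ProdSys II aims to relax by making job definition dynamic at execution time.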


Motivations for system evolution

Motivations for evolving the ATLAS Production System come from the realization that we need to address the following:

  • The concept of the Meta-Task. Absent in the original product (ProdSys I), it emerged from operational experience with PanDA and its workflow. It has become the central object in workflow management and must be properly introduced into the system.
  • Operator intervention and Meta-Task recovery: there must be adequate opportunities for operators and managers to direct Meta-Task processing, to start certain steps before others are defined, to augment a task, and to recover from failures in an optimal way.
  • Flexibility of job definition (e.g. making it dynamic, as opposed to static once the task is created): a number of advantages can be realized once jobs can be defined dynamically, based on the resources and other conditions present when the task moves into the execution stage
  • Maintainability: the code of the existing Production System was written "organically", to actively support emerging requests from users, and it is starting to show its age
  • Scalability: there are issues with the way the ProdSys software interacts with the database back-end, which lead to database lockups when a transaction is handled, as well as generally insufficient throughput when inserting tasks and other data into the system
  • Ease of use: there is currently a great amount of detail that the end user (Physics Coordination) must supply in order to create a valid task request. It is desirable to automate the task creation process, so that cumbersome logic is handled within the application and the user interface is more concise and transparent.