ATLAS Production System Twiki

Monday, November 19, 2012

November 2012: ProdSys II Progress Report (Maxim)

11/16/12 to 11/30/12

Design and Documentation work:
  • Added the description of a few more tables to the DB page (current DB)
  • Added a chapter on RDBMS representation of the graph model, with four methods of graph representation considered
  • Worked on the ProdSys object model and schemas for the following components:
    • Meta-Task
    • Task
    • Adjacency map, apparently the most efficient way to represent the Tasks in an RDBMS
What's new - the model:
  • If datasets are properties of the edges in the graph representing the Meta-Task, this makes for a reasonable implementation of the workflow logic: in the model currently used by the Coordinators, the dependencies between adjacent tasks in the graph are established on the basis of the data being available for the next step (see the sketch after this list)
  • New set of "states" for the task, aligned with JEDI
  • Introduced Pseudo-tasks: entry and exit, a common practice in Grid-based workflow management
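
To make the adjacency-map idea more concrete, below is a minimal Python sketch of a Meta-Task graph with datasets attached to the edges and entry/exit pseudo-tasks. The class, task names and dataset names are purely illustrative assumptions for this note and are not the actual ProdSys II schema.

# Minimal sketch of the Meta-Task graph model (illustrative names only).

ENTRY = "__entry__"   # pseudo-task marking the start of the Meta-Task
EXIT  = "__exit__"    # pseudo-task marking the end of the Meta-Task

class MetaTask:
    """Meta-Task as a directed graph stored as an adjacency map.

    adjacency[task] maps each successor task to the dataset that flows
    along that edge (datasets as properties of the edges).
    """
    def __init__(self):
        self.adjacency = {ENTRY: {}, EXIT: {}}

    def add_edge(self, source, target, dataset=None):
        # dataset is produced by `source` and required before `target` can run
        self.adjacency.setdefault(source, {})
        self.adjacency.setdefault(target, {})
        self.adjacency[source][target] = dataset

    def ready_tasks(self, available_datasets):
        # a task is ready when every dataset on its incoming edges is
        # available -- the dependency rule used by the Coordinators' workflow
        ready = []
        for task in self.adjacency:
            if task in (ENTRY, EXIT):
                continue
            inputs = [ds for edges in self.adjacency.values()
                      for succ, ds in edges.items() if succ == task]
            if all(ds is None or ds in available_datasets for ds in inputs):
                ready.append(task)
        return ready

# Example chain: evgen -> simul -> recon (hypothetical dataset names)
mt = MetaTask()
mt.add_edge(ENTRY, "evgen")
mt.add_edge("evgen", "simul", "mc12.evgen.EVNT")
mt.add_edge("simul", "recon", "mc12.simul.HITS")
mt.add_edge("recon", EXIT, "mc12.recon.AOD")

print(mt.ready_tasks({"mc12.evgen.EVNT"}))   # -> ['evgen', 'simul']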

11/01/12 to 11/15/12

Documentation work
  • Cleaned up documentation on the main ProdSys page
  • Added descriptions of a few more "T-tables" to the DB page. Up to 20 tables have been identified as no longer used, orphaned or invalid
  • More information has been added to the Main ProdSys Twiki page, based on the Production Group documentation and inspection of the code used in preparation of the LIST data.
  •  An additional Task Model Page has been created for better organization of the documentation.
  • The description of the Production Database has been supplemented with information about additional tables
Operations and Development
  • Performed maintenance of the development server at BNL, necessary due to migration to new hardware
  • Continued practicing with the "Spreadsheet Process" workflow management scripts, inspected produced data, documented the experience on the ProdSys page



Wednesday, November 14, 2012

Performance study of the event table for JeDi


The event table is a new table for JeDi which keeps track of the progress of jobs at the event level. We are planning to use the table for the event server and for event-level job splitting. Here is the first result of the performance test for the event table. The table was created in INTR with the following schema:

Name                      Type
PANDAID                   NOT NULL NUMBER(11)
FILEID                    NOT NULL NUMBER(11)
JOB_PROCESSID             NOT NULL NUMBER(10)
DEF_MIN_EVENTID           NUMBER(10)
DEF_MAX_EVENTID           NUMBER(10)
PROCESSED_UPTO_EVENTID    NUMBER(10)

where PANDAID and FILEID are IDs in the job and file tables, JOB_PROCESSID is the ID of a subprocess, DEF_MIN_EVENTID and DEF_MAX_EVENTID define the range of events for the subprocess, and PROCESSED_UPTO_EVENTID records how many events have been done so far. The primary key is the combination of PANDAID, FILEID, and JOB_PROCESSID. The table's physical layout is range-partitioned on PANDAID. The table is index-organized but also partitioned, which is handy for avoiding row-by-row deletion and index-tree fragmentation: obsolete records can be removed by dropping whole partitions. Each partition holds 1 million PandaIDs.
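
For illustration, a rough sketch of how such an index-organized, range-partitioned table could be created is given below. The table name, partition names and the cx_Oracle wrapper are assumptions made for this example and do not necessarily match the actual DDL used in INTR.

# Sketch only: assumed table/partition names, not the actual INTR schema.
import cx_Oracle

EVENT_TABLE_DDL = """
CREATE TABLE JEDI_EVENTS (
    PANDAID                NUMBER(11) NOT NULL,
    FILEID                 NUMBER(11) NOT NULL,
    JOB_PROCESSID          NUMBER(10) NOT NULL,
    DEF_MIN_EVENTID        NUMBER(10),
    DEF_MAX_EVENTID        NUMBER(10),
    PROCESSED_UPTO_EVENTID NUMBER(10),
    CONSTRAINT JEDI_EVENTS_PK
        PRIMARY KEY (PANDAID, FILEID, JOB_PROCESSID)
)
ORGANIZATION INDEX
PARTITION BY RANGE (PANDAID) (
    PARTITION EVENTS_1M VALUES LESS THAN (1000000),
    PARTITION EVENTS_2M VALUES LESS THAN (2000000),
    PARTITION EVENTS_3M VALUES LESS THAN (3000000)
)
"""

def create_event_table(user, password, dsn):
    # each partition covers 1 million PandaIDs; dropping an exhausted
    # partition then replaces row-by-row deletion of old records
    conn = cx_Oracle.connect(user, password, dsn)
    try:
        conn.cursor().execute(EVENT_TABLE_DDL)
    finally:
        conn.close()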

The idea of the event server is shown in Fig.1.

Figure 1. A schematic view of the event server

In the event server scheme, multiple pilots process the same job and file in parallel, but each of them takes care of a different range of events. When the panda server receives a request from a pilot, it sends a range of events (e.g., DEF_MIN_EVENTID=0 and DEF_MAX_EVENTID=99) to the pilot together with the job specification, and one record is inserted into the event table. The pilot sends a heartbeat every N processed events, so that PROCESSED_UPTO_EVENTID of the record is updated in the event table. When another pilot arrives, the panda server scans the event table and sends a new range of events (e.g., DEF_MIN_EVENTID=100 and DEF_MAX_EVENTID=299) to that pilot if there are events remaining for the job and file.
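
The server-side bookkeeping can be sketched roughly as follows. This is a simplified, in-memory stand-in for the real interaction with the Oracle event table; the function names and the fixed range size are assumptions, not the actual panda server API.

# Simplified stand-in for the panda server logic; not the actual API.

EVENTS_PER_RANGE = 100       # assumed chunk size for the example
event_records = []           # one dict per row of the event table

def assign_event_range(panda_id, file_id, n_events_in_file):
    # scan existing records for this job/file and hand out the next range
    used_upto = -1
    for rec in event_records:
        if rec["PANDAID"] == panda_id and rec["FILEID"] == file_id:
            used_upto = max(used_upto, rec["DEF_MAX_EVENTID"])
    if used_upto + 1 >= n_events_in_file:
        return None          # no events left for this job and file
    rec = {
        "PANDAID": panda_id,
        "FILEID": file_id,
        "JOB_PROCESSID": len(event_records),
        "DEF_MIN_EVENTID": used_upto + 1,
        "DEF_MAX_EVENTID": min(used_upto + EVENTS_PER_RANGE,
                               n_events_in_file - 1),
        "PROCESSED_UPTO_EVENTID": used_upto + 1,
    }
    event_records.append(rec)
    return rec

def heartbeat(panda_id, file_id, job_process_id, processed_upto):
    # pilot reports progress: update PROCESSED_UPTO_EVENTID of its record
    for rec in event_records:
        if (rec["PANDAID"], rec["FILEID"],
                rec["JOB_PROCESSID"]) == (panda_id, file_id, job_process_id):
            rec["PROCESSED_UPTO_EVENTID"] = processed_upto

# first pilot gets events 0-99, the next one 100-199, and so on
print(assign_event_range(1234, 1, 250))
print(assign_event_range(1234, 1, 250))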

A script was implemented to emulate the interactions between the panda server and the database. The script spawned 5000 child processes so that 1000 jobs were processed in parallel, i.e., 5 child processes were used for each job. Each child process sent a heartbeat every 2 seconds. The script processed roughly 0.4 million jobs per day, which corresponds to half of the number of jobs processed per day in the current system. Note that INTR is hosted on a low-performance machine since it is a testbed, and not all jobs will use the event server scheme. Although the result might already be acceptable, we will continue stress tests to see if further optimization is possible.
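
For reference, a heavily reduced skeleton of such an emulation is shown below; the numbers are scaled down, a stub replaces the real Oracle update, and this is a possible structure only, not the actual script used in the test.

# Reduced sketch of an emulation script; numbers and helpers are
# placeholders, not the actual test configuration or database code.
import multiprocessing
import time

N_PROCESSES = 5              # the real test spawned 5000 child processes
PILOTS_PER_JOB = 5           # 5 child processes emulate the pilots of one job
HEARTBEAT_INTERVAL = 2       # seconds between heartbeats, as in the test
HEARTBEATS_PER_RANGE = 3     # assumed number of heartbeats per event range

def update_event_record(panda_id, processed_upto):
    # stand-in for the UPDATE of PROCESSED_UPTO_EVENTID in the event table
    pass

def emulate_pilot(child_index):
    panda_id = child_index // PILOTS_PER_JOB   # pilots of one job share a job ID
    processed = 0
    for _ in range(HEARTBEATS_PER_RANGE):
        time.sleep(HEARTBEAT_INTERVAL)
        processed += 10
        update_event_record(panda_id, processed)

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=emulate_pilot, args=(i,))
             for i in range(N_PROCESSES)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()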

Friday, November 2, 2012

Notes on the "Spreadsheet Process"

Information regarding the current methods of processing task requests is being added to the Main ProdSys page. The spreadsheet is used to model the graph representation of the Meta-Task (the object that is missing in the ProdSys I model), and serves simultaneously as the database and the UI for the workflow management system.

 Here, we present a few points as an overview of the "spreadsheet process":

  • A spreadsheet is created according to a specific template. The format is that of Apache OpenOffice (ODS).
  • The information in the spreadsheet is accessed by parsing the XML contents of the file in which it is saved. The module xml.dom.minidom is used in the processing scripts (see the sketch after this list).
  • In general, parsing is done for each stage of the "chain", i.e. event generation, simulation and reconstruction. There can be merging steps performed in between.
  • Each script, when run, produces a text file with the information specifying the task parameters for that specific step.
  • The scripts can detect certain types of errors, which are flagged in the output so that they can be found later, e.g. with "grep".
  • The text files generated in this process can be submitted to the Production System using one of two methods: (a) the Web interface, where the user copies and pastes the contents of a file, or (b) a CLI script which accesses the same Web service.
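
As an illustration of the parsing step, a minimal sketch is given below. The file name is hypothetical and the cell handling is simplified (e.g. repeated-cell attributes in the ODS format are ignored), so this should be read as an outline of the approach rather than the actual production script.

# Minimal sketch of reading the ODS spreadsheet; file name is hypothetical.
import zipfile
from xml.dom import minidom

def read_ods_rows(ods_path):
    # an ODS file is a zip archive; the sheet data lives in content.xml
    with zipfile.ZipFile(ods_path) as ods:
        dom = minidom.parseString(ods.read("content.xml"))
    rows = []
    for row in dom.getElementsByTagName("table:table-row"):
        cells = []
        for cell in row.getElementsByTagName("table:table-cell"):
            paragraphs = cell.getElementsByTagName("text:p")
            if paragraphs and paragraphs[0].firstChild is not None:
                cells.append(paragraphs[0].firstChild.nodeValue)
            else:
                cells.append("")
        rows.append(cells)
    return rows

# Each row would then be turned into the task parameters for one step
# and written to the text file submitted via the Web interface or CLI.
for row in read_ods_rows("meta_task_request.ods"):
    print(row)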
Apart from the process described above, there can be validation procedures applied to the data. One important aspect of the existing suite of scripts is SVN access, which may present portability issues (e.g. when running at a site outside the CERN perimeter). These notes will be updated and augmented as we proceed with analyzing the code and data flow.