October 23, 2012
The natural unit of workload handled by PanDA is a single payload job.
Defining the exact nature of the payload, the source and destination of
data, and the various other parameters that characterize a job is
outside the scope of core PanDA itself.
The ATLAS Production System plays an important role here: it defines the jobs for a large part of the workload handled by PanDA.
Jobs are defined in large sets that constitute "tasks", and are
formulated to fulfill "task requests". Each task has a number of
attributes, set in accordance with a particular request. Each task is
typically translated into a large number of jobs. The existing
Production System consists of a task request interface, a set of scripts
that translate tasks into respective jobs, and a few tools for
modification of certain parameters of active tasks.
Individual job definitions in the existing system are created based on
the task parameters and remain static for the duration of the task
execution. Data pertaining to requests, tasks and jobs reside in the
database, and operation of the Production System can be described as
transforming one object into another, starting with requirements,
formulating tasks and then creating a list of jobs for each task, for
execution in PanDA.
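The request, task and job chain described above can be illustrated with a minimal Python sketch. It is not the ProdSys code: every class, field and function name here (`TaskRequest`, `Task`, `Job`, `formulate_task`, `create_jobs`, `events_per_job`) is hypothetical, and splitting a task by event count is only an assumed example of a task parameter. The point it shows is that the job list is fully determined by the task parameters at creation time and does not change afterwards.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskRequest:
    """A request for processing (hypothetical fields)."""
    name: str
    total_events: int


@dataclass(frozen=True)
class Task:
    """A task formulated to fulfill a request (hypothetical fields)."""
    task_id: int
    request_name: str
    total_events: int
    events_per_job: int


@dataclass(frozen=True)
class Job:
    """A single payload job; frozen to mirror static job definitions."""
    task_id: int
    first_event: int
    n_events: int


def formulate_task(request: TaskRequest, task_id: int, events_per_job: int) -> Task:
    """Turn a request into a task with its attributes set."""
    if events_per_job <= 0:
        raise ValueError("events_per_job must be positive")
    return Task(task_id, request.name, request.total_events, events_per_job)


def create_jobs(task: Task) -> List[Job]:
    """Translate a task into its full list of jobs, once, up front."""
    jobs = []
    for first in range(0, task.total_events, task.events_per_job):
        # The last job takes whatever events remain.
        n_events = min(task.events_per_job, task.total_events - first)
        jobs.append(Job(task.task_id, first, n_events))
    return jobs


if __name__ == "__main__":
    request = TaskRequest(name="example-request", total_events=2500)
    task = formulate_task(request, task_id=1, events_per_job=1000)
    for job in create_jobs(task):
        print(job)
```

Because the job list in this sketch is computed once from the task attributes, later changes in available resources cannot influence how the work is split, which is the limitation of static job definition.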
Motivations for system evolution
The motivations for evolving the ATLAS Production System come from the realization that we need to address the following:
- The concept of the Meta-Task. Absent from the original product (ProdSys I), it emerged from operational experience with PanDA and its workflow. It has become the central object in workflow management and must be properly introduced into the system.
- Operator intervention and Meta-Task recovery: operators and managers must have adequate opportunities to direct Meta-Task processing, start certain steps before others are defined, augment a task, and recover from failures in an optimal way.
- Flexibility of job definition (e.g. making it dynamic as opposed to
static once the task is created): there are a number of advantages that
we hope can be realized once there is a capability to define jobs
dynamically, based on the resources and other conditions present once
the task moves into the execution stage.
- Maintainability: the code of the existing Production System
was written "organically", to actively support emerging requests from
users, and is starting to show its age.
- Scalability: the way the ProdSys software interacts with the
database back-end leads to lockup conditions in the database when a
transaction is handled, and throughput is generally insufficient when
inserting tasks and other data into the system.
- Ease of use: there is currently a great amount of detail that
the end user (Physics Coordination) must define in order to achieve a
valid task request. It's desirable to automate the task creation
process, whereby cumbersome logic is handled within the application, and
the user interface is more concise and transparent.