Introduction

lsst::pipe::base provides a data processing pipeline infrastructure. Data processing is performed by "tasks", which are instances of Task or CmdLineTask. Tasks perform a wide range of data processing operations, from basic operations such as assembling raw images into CCD images (trimming overscan), fitting a WCS or detecting sources on an image, to complex combinations of such operations.

Tasks are hierarchical. Each task may may call other tasks to perform some of its data processing; we say that a "parent task" calls a "subtask". The highest-level task is called the "top-level task". To call a subtask, the parent task constructs the subtask and then calls methods on it. Thus data transfer between tasks is simply a matter of passing the data as arguments to function calls.

Command-line tasks are tasks that can be run from the command line. You might think of them as the LSST equivalent of a data processing pipeline. Despite their extra capabilities, command-line tasks can also be used as ordinary tasks and called as subtasks by other tasks. Command-line tasks are subclasses of CmdLineTask.

Each task is configured using the pex_config package, using a task-specific subclass of pex.config.config.Config. The task's configuration includes all subtasks that the task may call. As a result, it is easy to replace (or "retarget") one subtask with another. A common use for this is to provide a camera-specific variant of a particular task, e.g. use one version for SDSS imager data and another version for Subaru Hyper Suprime-Cam data).

Tasks may process multiple items of data in parallel, using Python's multiprocessing library. Support for this is built into the ArgumentParser and TaskRunner.

Most tasks have a run method that performs the primary data processing. Each task's run method should return a Struct. This allows named access to returned data, which provides safer evolution than relying on the order of returned values. All task methods that return more than one or two items of data should return the data in a Struct.

Many tasks are found in the pipe_tasks package, especially tasks that use many different packages and don't seem to belong in any one of them. Tasks that are associated with a particular package should be in that package; for example the instrument signature removal task ip.isr.isrTask.IsrTask is in the ip_isr package.

pipe_base is written purely in Python. The most important contents are:

CmdLineTask: base class for pipeline tasks that can be run from the command line.
Task: base class for subtasks that are not meant to be run from the command line.
Struct: object returned by the run method of a task.
ArgumentParser: command line parser for pipeline tasks.
timeMethod: decorator to log performance information for a Task method.
TaskRunner: a class that runs command-line tasks, using multiprocessing when requested. This will work "as is" for most command-line tasks, but will need to be be subclassed if, for instance, the task's run method needs something other than a single data reference.

Running Command-Line Tasks

Each command-line task typically has a short "task runner script" to run the task in the bin/ directory of whatever package the task is defined in. This section deals with the command-line options of these task runner scripts.

Specify --help to print help. When in doubt give this a try.

The first argument to a task must be the path to the input repository (or --help). For example:

myTask.py path/to/input --id... is valid: input path is the first argument
myTask.py --id ... path/to/input is invalid: an option comes before the input path

--output specifies the path to the output repository. Some tasks also support --calib: the path to input calibration data. To shorten input, output and calib paths see Environment Variables.

Data is usually specified by the --id argument with key=value pairs as the value, where the keys depend on the camera and type of data. If you run the task and specify both an input data repository and --help then the printed help will show you valid keys (the input repository tells the task what kind of camera data is being processed). See Specifying Data IDs for more information about data IDs. A few tasks take more than one kind of data ID, or have renamed the --id argument; run the task with --help or see the task's documentation for details.

You may show the config, subtasks and/or data using --show. By default --show quits after printing the information, but --show run allows the task to run. For example:

--show config data tasks shows the config, data and subtasks, and then quits.
--show tasks run show the subtasks and then runs the task.

For long or repetitive command lines you may wish to specify some arguments in separate text files. See Argument Files for details.

Specifying Data IDs

--id and other data identifier arguments are used to specify IDs for input and output data. The ID keys depend on the camera and on the kind of data being processed. For example, lsstSim calibrated exposures are identified by the following keys: visit, filter, raft and sensor (and a given visit has exactly one filter).

Omit a key to specify all values of that key. For example, for lsstSim calibrated exposures:

--id visit=54123 specifies all rafts and sensors for visit 54123 (and all filters, but there is just one filter per visit).
--id visit=54123 raft=1,0 specifies all sensors for visit raft 1,0 of visit 54123

To specify multiple data IDs you may separate values with ^ (a character that does not have special meaning to the unix command parser). The result is the outer product (all possible combinations). For example:

--id visit=54123^55523 raft=1,1^2,1 specifies four IDs: visits 54123 and 55523 of rafts 1,1 and 2,1

You may specify a data identifier argument as many times as you like. Each one is treated independently. Thus the following example specifies all sensors for four combinations of visit and raft, plus all sensors for one raft of two other visits for calibrated lsstSim data:

--id visit=54123^55523 raft=1,1^2,1 --id visit=623459^293423 raft=0,0

The --rerun Option

The --rerun option is an alternate way to specify the output and, optionally, input repositories for a command line task. Unlike --output, the value supplied to --rerun is relative to the rerun directory of the root input data repository (i.e. follow the chain of parent repositories—indicated by files in the repository root named _parent—all the way back, then add rerun to that). --rerun saves the user from typing in the input repository twice. For example,

processCcd.py $root/hsc --output $root/hsc/rerun/Vishal/aRerun

which (for some semantically insignificant version of $root) will run processCcd.py on the data in $root/hsc and put the output in $root/hsc/rerun/Vishal/aRerun, can be shortened by using --rerun to

processCcd.py $root/hsc --rerun Vishal/aRerun

This is useful because outputs from the pipeline are conventionally placed in a subdirectory of a "rerun" directory present in a per-camera root repository. For example:

$root/hsc/rerun/public # camera hsc; a public rerun useful for many people.
$root/hsc/rerun/Jim/aTicket/fixed-psf # camera hsc; one of several private reruns used to debug a single issue.
$root/lsstSim/Simon/aTicket/rerun/Jim/anotherTicket # camera lsstSim; simulated by user Simon; processed by user Jim.

Additionally, --rerun path1:path2 will read from the rerun specified by path1 and write to the rerun specified by path2. This is referred to as "chaining" reruns. For example:

processCcd.py $root/hsc --rerun Paul/procCcds ...                        # Paul processes some CCDs.
makeCoaddTempExp.py $root/hsc --rerun Paul/procCcds:Jim/coaddRerun ...   # Jim uses Paul's CCD processing
                                                                         # to start building a coadd.

Note that:

If the argument to --rerun starts with a /, it will be interpreted as an absolute path rather than as being relative to the root input data repository.
The argument(s) supplied to --rerun may refer to symbolic links to directories; data will be read or written from their targets.

Argument Files

You may specify long or repetitive command-line arguments in text files and reference those files using @path.

The contents of the files are identical to the command line, except that long lines must not have a \ continuation character. For example if the file foo.txt contains:

--id visit=54123^55523 raft=1,1^2,1

--config someParam=someValue --configfile configOverrideFilePath

you can then reference it with @foo.txt and mix that with other command-line arguments (including --id and --config):

myTask.py inputPath @foo.txt --config anotherParam=anotherValue --output outputPath

Overriding Configuration Parameters

The argument parser automatically loads specific configuration override files based on the camera name and its obs_ package. See Automatically Loaded Config Override Files. The format of a configuration override file matches the configuration shown using --show config (in particular, note that config in a configuration override file is the word that matches self.config in a task when the task uses its config).

In addition, you can specify configuration override files on the command line using --configfile and override some (but not all) configuration parameters directly on the command line using --config, as shown in these examples:

--config str1=foo str2="fancier string" int1=5 intList=2,4,-87 float1=1.53 floatList=3.14,-5.6e7
--configfile config.py, where file config.py contains:
config.strList = "first string", "second string"

Note: config in a configuration override file is equivalent to self.config in a task.

You can use a config file (e.g. as specified by --configfile) to override any config parameter, but the simpler --config command-line option has significant limitations:

You cannot retarget a subtask specified by a lsst.pex.config.ConfigurableField (which is the most common case)
For items in registries, you can only specify values for the active (current) item
You cannot specify values for lists of strings
You cannot specify a subset of list; you must specify all values at once

See Retargeting Subtasks for more information about overriding configuration parmaters in subtasks.

Retargeting Subtasks

As a special case of overriding configuration parameters, users may replace one subtask with another; this is called "retargeting" the subtask. One common use case is to use a camera-specific variant of a subtask. Examples include:

lsst.obs.subaru.isr.SuprimeCamIsrTask: a version of instrument signature removal (ISR or detrending) for Suprime-Cam and Hyper Suprime-Cam
lsst.obs.sdss.selectSdssImages.SelectSdssImagesTask: an version of the task that selects images for co-addition of SDSS stripe 82 images

How you retarget a subtask and override its config parameters depends on whether the task is specified as an lsst.pex.config.ConfigurableField (the most common case, and the source of the term "retarget") or as an lsst.pex.config.RegistryField.

To override a subtask specified as an lsst.pex.config.ConfigurableField use retarget method of the field. To override config parameters use simple dotted notation. Here is an example:

    # Example of retargeting a subtask and overriding its configuration
    # for a subtask specified by an lsst.pex.config.ConfigurableField

    # import the task and then retarget it
    from ... import FooTask
    config.configurableSubtask.retarget(FooTask)

    # override a config parameter
    config.configurableSubtask.subtaskParam1 = newValue

Warning: when you retarget a task specified by an lsst.pex.config.ConfigurableField you lose all configuration overrides for both the old and new task. This limitation is not shared by lsst.pex.config.RegistryField.

To retarget a subtask specified as an lsst.pex.config.RegistryField set the name property of the field. There are two ways to override configuration parameters for tasks in a registry: you may set parameters for the active task using the registry field's active parameter, and you may set parameters for any registered task using dictionary notation and the name by which the task is registered. Here is an example that assumes a task FooTask is defined in module .../foo.py and registered using name "foo". :

    # Example of retargeting a subtask and overriding its configuration
    # for a subtask specified by an lsst.pex.config.RegistryField

    # Import the task's module, so the task registers itself,
    # then set the name property of the field to the name by which FooTask is registered:
    import .../foo.py
    config.registrySubtask.name = "foo"

    # You can override the active subtask's configuration using attribute `active`
    config.registrySubtask.active.subtaskParam1 = newValue

    # Such overrides can also be specified on the command line, e.g.
    # --config registrySubtask.active.subtaskParam1 = newValue

    # You can also override parameters in any subtask in the registry using dictionary access,
    # but this can only be done in a config override file, not on the command line:
    config.registrySubtask["foo"].subtaskParam1 = newValue

Specifying Debug Variables

Some tasks support debug variables that can be set, while running from the command line, to display additional information. Each task documents which debug variables it supports. See Using lsstDebug to control debugging output for information about how to enable specific debug variables while running from the command line.

Automatically Loaded Config Override Files

When a pipeline task is run, two camera-specific configuration override files are loaded, if found; first one for the obs_ package then one for the camera. (There are two because some obs_ packages contain data for multiple cameras). These files may override configuration parameters or even retarget subtasks with camera-specific variants (e.g. for instrument signature removal). The configuration override files are, in order:

obs_path/config/task_name.py
obs_path/config/camera_name/task_name.py

where the path elements are:

task_name: the name of the pipeline task (the value of its _DefaultName class variable), e.g. "processCcd" for lsst.pipe.tasks.processCcd.ProcessCcdTask.
camera_name: the name of the camera, e.g. "lsstSim"
obs_path: the path to the obs_ package for the camera, e.g. the path to obs_lsstSim

Here are two examples:

obs_lsstSim/config/makeCoaddTempExp.py: specifies which version of the image selector task to use for co-adding LSST simulated images
obs_subaru/config/hsc/isr.py: provides overrides for the instrument signature removal (aka detrending) task for Hyper Suprime-Cam

Environment Variables

The command parser uses environment variables PIPE_INPUT_ROOT, PIPE_CALIB_ROOT, and PIPE_OUTPUT_ROOT, if available, to make it easier to specify the input, calib and output data repositories. Each environment variable is used as a root directory for relative paths and ignored for absolute paths. The default value for each of these environment variables is the current working directory. For example:

mytask foo # use $PIPE_INPUT_ROOT/foo as the input repository (or ./foo if $PIPE_INPUT_ROOT is undefined)
mytask . # use $PIPE_INPUT_ROOT (= $PIPE_INPUT_ROOT/.) as the input repository
mytask /a/b # use /a/b as the input repository ($PIPE_INPUT_ROOT is ignored for absolute paths)

Other Resources

pipe_tasks introduction includes links to:

A page listing documentation for many tasks
A manual on how to write a task
A manual on how to write a command-line task

Contents