LSST Applications 22.0.1
LSST Data Management Base Package
lsst.pipe.base.graphBuilder._PipelineScaffolding Class Reference

Public Member Functions

def __init__ (self, pipeline, *, registry)
 
def __repr__ (self)
 
def connectDataIds (self, registry, collections, userQuery, externalDataId)
 
def resolveDatasetRefs (self, registry, collections, run, commonDataIds, *, skipExisting=True)
 
def makeQuantumGraph (self)
 

Public Attributes

 tasks
 
 dimensions
 

Detailed Description

A helper data structure that organizes the information involved in
constructing a `QuantumGraph` for a `Pipeline`.

Parameters
----------
pipeline : `Pipeline`
    Sequence of tasks from which a graph is to be constructed.  Must
    have nested task classes already imported.
universe : `DimensionUniverse`
    Universe of all possible dimensions.

Notes
-----
The scaffolding data structure contains nested data structures for both
tasks (`_TaskScaffolding`) and datasets (`_DatasetDict`).  The dataset
data structures are shared between the pipeline-level structure (which
aggregates all datasets and categorizes them from the perspective of the
complete pipeline) and the individual tasks that use them as inputs and
outputs.

`QuantumGraph` construction proceeds in four steps, with each corresponding
to a different `_PipelineScaffolding` method:

1. When `_PipelineScaffolding` is constructed, we extract and categorize
   the DatasetTypes used by the pipeline (delegating to
   `PipelineDatasetTypes.fromPipeline`), then use these to construct the
   nested `_TaskScaffolding` and `_DatasetDict` objects.

2. In `connectDataIds`, we construct and run the "Big Join Query", which
   returns related tuples of all dimensions used to identify any regular
   input, output, and intermediate datasets (not prerequisites).  We then
   iterate over these tuples of related dimensions, identifying the subsets
   that correspond to distinct data IDs for each task and dataset type,
   and then create `_QuantumScaffolding` objects.

3. In `resolveDatasetRefs`, we run follow-up queries against all of the
   dataset data IDs previously identified, transforming unresolved
   DatasetRefs into resolved DatasetRefs where appropriate.  We then look
   up prerequisite datasets for all quanta.

4. In `makeQuantumGraph`, we construct a `QuantumGraph` from the lists of
   per-task `_QuantumScaffolding` objects.
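
The four steps can be chained as in the following minimal driver sketch.
This is illustrative only, not the actual `GraphBuilder.makeGraph`
implementation; ``pipeline``, ``registry``, ``collections``, ``run``,
``userQuery``, and ``externalDataId`` are assumed to be supplied by the
caller:

    # Hypothetical driver for the four steps above.
    scaffolding = _PipelineScaffolding(pipeline, registry=registry)  # step 1
    # connectDataIds returns its results inside a context manager, so the
    # remaining steps must run before the temporary table is dropped.
    with scaffolding.connectDataIds(registry, collections, userQuery,
                                    externalDataId) as commonDataIds:  # step 2
        scaffolding.resolveDatasetRefs(registry, collections, run,
                                       commonDataIds, skipExisting=True)  # step 3
        graph = scaffolding.makeQuantumGraph()  # step 4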

Definition at line 357 of file graphBuilder.py.

Constructor & Destructor Documentation

◆ __init__()

def lsst.pipe.base.graphBuilder._PipelineScaffolding.__init__(self, pipeline, *, registry)

Definition at line 401 of file graphBuilder.py.

401  def __init__(self, pipeline, *, registry):
402      _LOG.debug("Initializing data structures for QuantumGraph generation.")
403      self.tasks = []
404      # Aggregate and categorize the DatasetTypes in the Pipeline.
405      datasetTypes = PipelineDatasetTypes.fromPipeline(pipeline, registry=registry)
406      # Construct dictionaries that map those DatasetTypes to structures
407      # that will (later) hold additional information about them.
408      for attr in ("initInputs", "initIntermediates", "initOutputs",
409                   "inputs", "intermediates", "outputs", "prerequisites"):
410          setattr(self, attr, _DatasetDict.fromDatasetTypes(getattr(datasetTypes, attr),
411                                                            universe=registry.dimensions))
412      # Aggregate all dimensions for all non-init, non-prerequisite
413      # DatasetTypes.  These are the ones we'll include in the big join
414      # query.
415      self.dimensions = self.inputs.dimensions.union(self.intermediates.dimensions,
416                                                     self.outputs.dimensions)
417      # Construct scaffolding nodes for each Task, and add backreferences
418      # to the Task from each DatasetScaffolding node.
419      # Note that there's only one scaffolding node for each DatasetType,
420      # shared by _PipelineScaffolding and all _TaskScaffoldings that
421      # reference it.
422      if isinstance(pipeline, Pipeline):
423          pipeline = pipeline.toExpandedPipeline()
424      self.tasks = [_TaskScaffolding(taskDef=taskDef, parent=self, datasetTypes=taskDatasetTypes)
425                    for taskDef, taskDatasetTypes in zip(pipeline,
426                                                         datasetTypes.byTask.values())]
427 

Member Function Documentation

◆ __repr__()

def lsst.pipe.base.graphBuilder._PipelineScaffolding.__repr__(self)

Definition at line 428 of file graphBuilder.py.

428  def __repr__(self):
429      # Default dataclass-injected __repr__ gets caught in an infinite loop
430      # because of back-references.
431      return f"_PipelineScaffolding(tasks={self.tasks}, ...)"
432 

◆ connectDataIds()

def lsst.pipe.base.graphBuilder._PipelineScaffolding.connectDataIds(self, registry, collections, userQuery, externalDataId)
Query for the data IDs that connect nodes in the `QuantumGraph`.

This method populates `_TaskScaffolding.dataIds` and
`_DatasetScaffolding.dataIds` (except for those in `prerequisites`).

Parameters
----------
registry : `lsst.daf.butler.Registry`
    Registry for the data repository; used for all data ID queries.
collections
    Expressions representing the collections to search for input
    datasets.  May be any of the types accepted by
    `lsst.daf.butler.CollectionSearch.fromExpression`.
userQuery : `str` or `None`
    User-provided expression to limit the data IDs processed.
externalDataId : `DataCoordinate`
    Externally-provided data ID that should be used to restrict the
    results, just as if these constraints had been included via ``AND``
    in ``userQuery``.  This includes (at least) any instrument named
    in the pipeline definition.

Returns
-------
commonDataIds : \
        `lsst.daf.butler.registry.queries.DataCoordinateQueryResults`
    An interface to a database temporary table containing all data IDs
    that will appear in this `QuantumGraph`.  Returned inside a
    context manager, which will drop the temporary table at the end of
    the `with` block in which this method is called.
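
While the `with` block is open, the returned object supports iteration and
follow-up dataset searches against the same temporary table, which is how
`resolveDatasetRefs` uses it.  A minimal sketch (``scaffolding`` and the
query arguments are assumed to be defined; ``datasetType`` is any dataset
type from the pipeline):

    with scaffolding.connectDataIds(registry, collections, userQuery,
                                    externalDataId) as commonDataIds:
        for dataId in commonDataIds:  # iterate over the common data IDs
            ...
        # Follow-up search against the same temporary table, as done in
        # resolveDatasetRefs:
        refs = commonDataIds.subset(datasetType.dimensions,
                                    unique=True).findDatasets(
                                        datasetType,
                                        collections=collections,
                                        findFirst=True)
    # The temporary table is dropped when the with block exits.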

Definition at line 482 of file graphBuilder.py.

482  def connectDataIds(self, registry, collections, userQuery, externalDataId):
483      """Query for the data IDs that connect nodes in the `QuantumGraph`.
484
485      This method populates `_TaskScaffolding.dataIds` and
486      `_DatasetScaffolding.dataIds` (except for those in `prerequisites`).
487
488      Parameters
489      ----------
490      registry : `lsst.daf.butler.Registry`
491          Registry for the data repository; used for all data ID queries.
492      collections
493          Expressions representing the collections to search for input
494          datasets.  May be any of the types accepted by
495          `lsst.daf.butler.CollectionSearch.fromExpression`.
496      userQuery : `str` or `None`
497          User-provided expression to limit the data IDs processed.
498      externalDataId : `DataCoordinate`
499          Externally-provided data ID that should be used to restrict the
500          results, just as if these constraints had been included via ``AND``
501          in ``userQuery``.  This includes (at least) any instrument named
502          in the pipeline definition.
503
504      Returns
505      -------
506      commonDataIds : \
507              `lsst.daf.butler.registry.queries.DataCoordinateQueryResults`
508          An interface to a database temporary table containing all data IDs
509          that will appear in this `QuantumGraph`.  Returned inside a
510          context manager, which will drop the temporary table at the end of
511          the `with` block in which this method is called.
512      """
513      _LOG.debug("Building query for data IDs.")
514      # Initialization datasets always have empty data IDs.
515      emptyDataId = DataCoordinate.makeEmpty(registry.dimensions)
516      for datasetType, refs in itertools.chain(self.initInputs.items(),
517                                               self.initIntermediates.items(),
518                                               self.initOutputs.items()):
519          refs[emptyDataId] = DatasetRef(datasetType, emptyDataId)
520      # Run one big query for the data IDs for task dimensions and regular
521      # inputs and outputs.  We limit the query to only dimensions that are
522      # associated with the input dataset types, but don't (yet) try to
523      # obtain the dataset_ids for those inputs.
524      _LOG.debug("Submitting data ID query and materializing results.")
525      with registry.queryDataIds(self.dimensions,
526                                 datasets=list(self.inputs),
527                                 collections=collections,
528                                 where=userQuery,
529                                 dataId=externalDataId,
530                                 ).materialize() as commonDataIds:
531          _LOG.debug("Expanding data IDs.")
532          commonDataIds = commonDataIds.expanded()
533          _LOG.debug("Iterating over query results to associate quanta with datasets.")
534          # Iterate over query results, populating data IDs for datasets and
535          # quanta and then connecting them to each other.
536          n = 0
537          for n, commonDataId in enumerate(commonDataIds):
538              # Create DatasetRefs for all DatasetTypes from this result row,
539              # noting that we might have created some already.
540              # We remember both those that already existed and those that we
541              # create now.
542              refsForRow = {}
543              for datasetType, refs in itertools.chain(self.inputs.items(), self.intermediates.items(),
544                                                       self.outputs.items()):
545                  datasetDataId = commonDataId.subset(datasetType.dimensions)
546                  ref = refs.get(datasetDataId)
547                  if ref is None:
548                      ref = DatasetRef(datasetType, datasetDataId)
549                      refs[datasetDataId] = ref
550                  refsForRow[datasetType.name] = ref
551              # Create _QuantumScaffolding objects for all tasks from this
552              # result row, noting that we might have created some already.
553              for task in self.tasks:
554                  quantumDataId = commonDataId.subset(task.dimensions)
555                  quantum = task.quanta.get(quantumDataId)
556                  if quantum is None:
557                      quantum = _QuantumScaffolding(task=task, dataId=quantumDataId)
558                      task.quanta[quantumDataId] = quantum
559                  # Whether this is a new quantum or an existing one, we can
560                  # now associate the DatasetRefs for this row with it.  The
561                  # fact that a Quantum data ID and a dataset data ID both
562                  # came from the same result row is what tells us they
563                  # should be associated.
564                  # Many of these associations will be duplicates (because
565                  # another query row that differed from this one only in
566                  # irrelevant dimensions already added them), and we use
567                  # sets to skip them.
568                  for datasetType in task.inputs:
569                      ref = refsForRow[datasetType.name]
570                      quantum.inputs[datasetType.name][ref.dataId] = ref
571                  for datasetType in task.outputs:
572                      ref = refsForRow[datasetType.name]
573                      quantum.outputs[datasetType.name][ref.dataId] = ref
574          _LOG.debug("Finished processing %d rows from data ID query.", n)
575          yield commonDataIds
576 

◆ makeQuantumGraph()

def lsst.pipe.base.graphBuilder._PipelineScaffolding.makeQuantumGraph(self)
Create a `QuantumGraph` from the quanta already present in
the scaffolding data structure.

Returns
-------
graph : `QuantumGraph`
    The full `QuantumGraph`.

Definition at line 755 of file graphBuilder.py.

755  def makeQuantumGraph(self):
756      """Create a `QuantumGraph` from the quanta already present in
757      the scaffolding data structure.
758
759      Returns
760      -------
761      graph : `QuantumGraph`
762          The full `QuantumGraph`.
763      """
764      graph = QuantumGraph({task.taskDef: task.makeQuantumSet() for task in self.tasks})
765      return graph
766 

◆ resolveDatasetRefs()

def lsst.pipe.base.graphBuilder._PipelineScaffolding.resolveDatasetRefs(self, registry, collections, run, commonDataIds, *, skipExisting=True)
Perform follow-up queries for each dataset data ID produced in
`connectDataIds`.

This method populates `_DatasetScaffolding.refs` (except for those in
`prerequisites`).

Parameters
----------
registry : `lsst.daf.butler.Registry`
    Registry for the data repository; used for all data ID queries.
collections
    Expressions representing the collections to search for input
    datasets.  May be any of the types accepted by
    `lsst.daf.butler.CollectionSearch.fromExpression`.
run : `str`, optional
    Name of the `~lsst.daf.butler.CollectionType.RUN` collection for
    output datasets, if it already exists.
commonDataIds : \
        `lsst.daf.butler.registry.queries.DataCoordinateQueryResults`
    Result of a previous call to `connectDataIds`.
skipExisting : `bool`, optional
    If `True` (default), a Quantum is not created if all its outputs
    already exist in ``run``.  Ignored if ``run`` is `None`.

Raises
------
OutputExistsError
    Raised if an output dataset already exists in the output run
    and ``skipExisting`` is `False`.  The case where some but not all
    of a quantum's outputs are present and ``skipExisting`` is `True`
    cannot be identified at this stage; it is handled later in this
    method, when resolved refs are copied into each quantum.
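
A hedged sketch of the ``run``/``skipExisting`` interaction from the
caller's side (``scaffolding`` and ``commonDataIds`` come from the earlier
steps; the error handling shown is illustrative, not part of this class):

    try:
        # skipExisting=False: any output already present in `run` raises.
        # skipExisting=True (default): quanta whose outputs all exist are
        # skipped; a quantum with only *some* outputs present still raises,
        # but later, in the per-quantum loop of this method.
        scaffolding.resolveDatasetRefs(registry, collections, run,
                                       commonDataIds, skipExisting=False)
    except OutputExistsError:
        # The caller could retry with a fresh RUN collection, or re-run
        # with skipExisting=True.
        raise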

Definition at line 577 of file graphBuilder.py.

577  def resolveDatasetRefs(self, registry, collections, run, commonDataIds, *, skipExisting=True):
578      """Perform follow-up queries for each dataset data ID produced in
579      `connectDataIds`.
580
581      This method populates `_DatasetScaffolding.refs` (except for those in
582      `prerequisites`).
583
584      Parameters
585      ----------
586      registry : `lsst.daf.butler.Registry`
587          Registry for the data repository; used for all data ID queries.
588      collections
589          Expressions representing the collections to search for input
590          datasets.  May be any of the types accepted by
591          `lsst.daf.butler.CollectionSearch.fromExpression`.
592      run : `str`, optional
593          Name of the `~lsst.daf.butler.CollectionType.RUN` collection for
594          output datasets, if it already exists.
595      commonDataIds : \
596              `lsst.daf.butler.registry.queries.DataCoordinateQueryResults`
597          Result of a previous call to `connectDataIds`.
598      skipExisting : `bool`, optional
599          If `True` (default), a Quantum is not created if all its outputs
600          already exist in ``run``.  Ignored if ``run`` is `None`.
601
602      Raises
603      ------
604      OutputExistsError
605          Raised if an output dataset already exists in the output run
606          and ``skipExisting`` is `False`.  The case where some but not all
607          of a quantum's outputs are present and ``skipExisting`` is `True`
608          cannot be identified at this stage; it is handled later in this
609          method, when resolved refs are copied into each quantum.
610      """
611      # Look up [init] intermediate and output datasets in the output
612      # collection, if there is an output collection.
613      if run is not None:
614          for datasetType, refs in itertools.chain(self.initIntermediates.items(),
615                                                   self.initOutputs.items(),
616                                                   self.intermediates.items(),
617                                                   self.outputs.items()):
618              _LOG.debug("Resolving %d datasets for intermediate and/or output dataset %s.",
619                         len(refs), datasetType.name)
620              isInit = datasetType in self.initIntermediates or datasetType in self.initOutputs
621              resolvedRefQueryResults = commonDataIds.subset(
622                  datasetType.dimensions,
623                  unique=True
624              ).findDatasets(
625                  datasetType,
626                  collections=run,
627                  findFirst=True
628              )
629              for resolvedRef in resolvedRefQueryResults:
630                  # TODO: we could easily support per-DatasetType
631                  # skipExisting and I could imagine that being useful - it's
632                  # probably required in order to support writing initOutputs
633                  # before QuantumGraph generation.
634                  assert resolvedRef.dataId in refs
635                  if skipExisting or isInit:
636                      refs[resolvedRef.dataId] = resolvedRef
637                  else:
638                      raise OutputExistsError(f"Output dataset {datasetType.name} already exists in "
639                                              f"output RUN collection '{run}' with data ID"
640                                              f" {resolvedRef.dataId}.")
641      # Look up input and initInput datasets in the input collection(s).
642      for datasetType, refs in itertools.chain(self.initInputs.items(), self.inputs.items()):
643          _LOG.debug("Resolving %d datasets for input dataset %s.", len(refs), datasetType.name)
644          resolvedRefQueryResults = commonDataIds.subset(
645              datasetType.dimensions,
646              unique=True
647          ).findDatasets(
648              datasetType,
649              collections=collections,
650              findFirst=True
651          )
652          dataIdsNotFoundYet = set(refs.keys())
653          for resolvedRef in resolvedRefQueryResults:
654              dataIdsNotFoundYet.discard(resolvedRef.dataId)
655              refs[resolvedRef.dataId] = resolvedRef
656          if dataIdsNotFoundYet:
657              raise RuntimeError(
658                  f"{len(dataIdsNotFoundYet)} dataset(s) of type "
659                  f"'{datasetType.name}' was/were present in a previous "
660                  f"query, but could not be found now. "
661                  f"This is either a logic bug in QuantumGraph generation "
662                  f"or the input collections have been modified since "
663                  f"QuantumGraph generation began."
664              )
665      # Copy the resolved DatasetRefs to the _QuantumScaffolding objects,
666      # replacing the unresolved refs there, and then look up prerequisites.
667      for task in self.tasks:
668          _LOG.debug(
669              "Applying resolutions and finding prerequisites for %d quanta of task with label '%s'.",
670              len(task.quanta),
671              task.taskDef.label
672          )
673          lookupFunctions = {
674              c.name: c.lookupFunction
675              for c in iterConnections(task.taskDef.connections, "prerequisiteInputs")
676              if c.lookupFunction is not None
677          }
678          dataIdsToSkip = []
679          for quantum in task.quanta.values():
680              # Process output datasets only if there is a run to look for
681              # outputs in and skipExisting is True.  Note that if
682              # skipExisting is False, any output datasets that already exist
683              # would have already caused an exception to be raised.
684              # We never update the DatasetRefs in the quantum because those
685              # should never be resolved.
686              if run is not None and skipExisting:
687                  resolvedRefs = []
688                  unresolvedRefs = []
689                  for datasetType, originalRefs in quantum.outputs.items():
690                      for ref in task.outputs.extract(datasetType, originalRefs.keys()):
691                          if ref.id is not None:
692                              resolvedRefs.append(ref)
693                          else:
694                              unresolvedRefs.append(ref)
695                  if resolvedRefs:
696                      if unresolvedRefs:
697                          raise OutputExistsError(
698                              f"Quantum {quantum.dataId} of task with label "
699                              f"'{quantum.task.taskDef.label}' has some outputs that exist "
700                              f"({resolvedRefs}) "
701                              f"and others that don't ({unresolvedRefs})."
702                          )
703                      else:
704                          # All outputs are already present; skip this
705                          # quantum and continue to the next.
706                          dataIdsToSkip.append(quantum.dataId)
707                          continue
708              # Update the input DatasetRefs to the resolved ones we already
709              # searched for.
710              for datasetType, refs in quantum.inputs.items():
711                  for ref in task.inputs.extract(datasetType, refs.keys()):
712                      refs[ref.dataId] = ref
713              # Look up prerequisite datasets in the input collection(s).
714              # These may have dimensions that extend beyond those we queried
715              # for originally, because we want to permit those data ID
716              # values to differ across quanta and dataset types.
717              for datasetType in task.prerequisites:
718                  lookupFunction = lookupFunctions.get(datasetType.name)
719                  if lookupFunction is not None:
720                      # The PipelineTask has provided its own function to do
721                      # the lookup.  This always takes precedence.
722                      refs = list(
723                          lookupFunction(datasetType, registry, quantum.dataId, collections)
724                      )
725                  elif (datasetType.isCalibration()
726                        and datasetType.dimensions <= quantum.dataId.graph
727                        and quantum.dataId.graph.temporal):
728                      # This is a master calibration lookup, which we have to
729                      # handle specially because the query system can't do a
730                      # temporal join on a non-dimension-based timespan yet.
731                      timespan = quantum.dataId.timespan
732                      try:
733                          refs = [registry.findDataset(datasetType, quantum.dataId,
734                                                       collections=collections,
735                                                       timespan=timespan)]
736                      except KeyError:
737                          # This dataset type is not present in the registry,
738                          # which just means there are no datasets here.
739                          refs = []
740                  else:
741                      # Most general case.
742                      refs = list(registry.queryDatasets(datasetType,
743                                                         collections=collections,
744                                                         dataId=quantum.dataId,
745                                                         findFirst=True).expanded())
746                  quantum.prerequisites[datasetType].update({ref.dataId: ref for ref in refs
747                                                             if ref is not None})
748          # Actually remove any quanta that we decided to skip above.
749          if dataIdsToSkip:
750              _LOG.debug("Pruning %d quanta for task with label '%s' because all of their outputs exist.",
751                         len(dataIdsToSkip), task.taskDef.label)
752              for dataId in dataIdsToSkip:
753                  del task.quanta[dataId]
754 
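
The ``lookupFunctions`` branch above calls a task-supplied function as
``lookupFunction(datasetType, registry, quantumDataId, collections)`` and
gives it precedence over the default queries.  A hedged sketch of providing
one from the connections side; the class, names, storage class, and
dimensions below are illustrative assumptions, not taken from
graphBuilder.py:

    from lsst.pipe.base import PipelineTaskConnections, connectionTypes

    def _lookupMyPrereq(datasetType, registry, quantumDataId, collections):
        # Must return an iterable of DatasetRefs, per the call site in
        # resolveDatasetRefs.  Here we simply delegate to the same registry
        # query the default branch uses; a real function could rewrite the
        # data ID or apply its own selection first.
        return registry.queryDatasets(datasetType, collections=collections,
                                      dataId=quantumDataId, findFirst=True)

    class MyConnections(PipelineTaskConnections,
                        dimensions=("instrument", "visit")):
        myPrereq = connectionTypes.PrerequisiteInput(
            name="my_prereq",  # hypothetical dataset type name
            storageClass="StructuredDataDict",
            dimensions=("instrument",),
            lookupFunction=_lookupMyPrereq,
        )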

Member Data Documentation

◆ dimensions

lsst.pipe.base.graphBuilder._PipelineScaffolding.dimensions

Definition at line 415 of file graphBuilder.py.

◆ tasks

lsst.pipe.base.graphBuilder._PipelineScaffolding.tasks

Definition at line 403 of file graphBuilder.py.


The documentation for this class was generated from the following file:

graphBuilder.py