def __init__(self, root=None, mapper=None, inputs=None, outputs=None, **mapperArgs)
def __repr__(self)
def defineAlias(self, alias, datasetType)
def getKeys(self, datasetType=None, level=None, tag=None)
def getDatasetTypes(self, tag=None)
def queryMetadata(self, datasetType, format, dataId={}, **rest)
def datasetExists(self, datasetType, dataId={}, write=False, **rest)
def get(self, datasetType, dataId=None, immediate=True, **rest)
def put(self, obj, datasetType, dataId={}, doBackup=False, **rest)
def subset(self, datasetType, level=None, dataId={}, **rest)
def dataRef(self, datasetType, level=None, dataId={}, **rest)
def getUri(self, datasetType, dataId=None, write=False, **rest)
def __reduce__(self)

Butler provides a generic mechanism for persisting and retrieving data using mappers.
A Butler manages a collection of datasets known as a repository. Each dataset has a type representing its
intended usage and a location. Note that the dataset type is not the same as the C++ or Python type of the
object containing the data. For example, an ExposureF object might be used to hold the data for a raw
image, a post-ISR image, a calibrated science image, or a difference image. These would all be different
dataset types.
A Butler can produce a collection of possible values for a key (or tuples of values for multiple keys) if
given a partial data identifier. It can check for the existence of a file containing a dataset given its
type and data identifier. The Butler can then retrieve the dataset. Similarly, it can persist an object to
an appropriate location when given its associated data identifier.
Note that the Butler has two more advanced features when retrieving a dataset. First, the retrieval can be
lazy: when `get` is called with `immediate=False`, input does not occur until the dataset is actually
accessed. This allows datasets to be retrieved
and placed on a clipboard prospectively with little cost, even if the algorithm of a stage ends up not
using them. Second, the Butler will call a standardization hook upon retrieval of the dataset. This
function, contained in the input mapper object, must perform any necessary manipulations to force the
retrieved object to conform to standards, including translating metadata.
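The two retrieval features above can be illustrated with a minimal sketch. It is not the Butler's own
proxy implementation; `LazyDataset`, `load_raw` and `standardize` are hypothetical names introduced only
for this example, and the metadata keys are made up.

```python
class LazyDataset:
    """Hypothetical stand-in for a lazily loaded dataset (not a Butler class)."""

    def __init__(self, loader, standardize=None):
        self._loader = loader            # zero-argument callable that performs the read
        self._standardize = standardize  # optional hook applied once, right after loading
        self._loaded = False
        self._value = None

    def resolve(self):
        # Perform the read only on first access, then reuse the cached result.
        if not self._loaded:
            value = self._loader()
            if self._standardize is not None:
                value = self._standardize(value)
            self._value = value
            self._loaded = True
        return self._value


reads = []


def load_raw():
    reads.append("raw")  # record that input actually happened
    return {"EXPTIME": "30.0"}


def standardize(metadata):
    # Example manipulation: translate a string header value into a float.
    return {"exposure_time": float(metadata["EXPTIME"])}


dataset = LazyDataset(load_raw, standardize)
nothing_read_yet = (reads == [])  # creating the proxy costs no input
first = dataset.resolve()         # the read and the hook run here
second = dataset.resolve()        # cached; no second read
```

Creating the proxy is cheap, so a caller can request many datasets up front and pay the input cost only
for those it actually uses.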
Public methods:
__init__(self, root=None, mapper=None, inputs=None, outputs=None, **mapperArgs)
defineAlias(self, alias, datasetType)
getKeys(self, datasetType=None, level=None, tag=None)
getDatasetTypes(self, tag=None)
queryMetadata(self, datasetType, format, dataId={}, **rest)
datasetExists(self, datasetType, dataId={}, write=False, **rest)
get(self, datasetType, dataId=None, immediate=True, **rest)
put(self, obj, datasetType, dataId={}, doBackup=False, **rest)
subset(self, datasetType, level=None, dataId={}, **rest)
dataRef(self, datasetType, level=None, dataId={}, **rest)
getUri(self, datasetType, dataId=None, write=False, **rest)
Initialization:
The preferred method of initialization is to use the `inputs` and `outputs` __init__ parameters. These
are described in the parameters section, below.
For backward compatibility, `__init__` can also take a POSIX root path and, optionally, a mapper class
instance or a class type that will be instantiated using the `mapperArgs` input argument. However, to
work in a backward-compatible way, this creates a single repository that is used as both an input and an
output repository. This is NOT preferred, and will likely break any provenance system we have in place.
Parameters
----------
root : string
.. note:: Deprecated in 12_0
`root` will be removed in TBD; it is replaced by `inputs` and `outputs` for
multiple-repository support.
A file system path. Will only work with a PosixRepository.
mapper : string or instance
.. note:: Deprecated in 12_0
`mapper` will be removed in TBD; it is replaced by `inputs` and `outputs` for
multiple-repository support.
Provides a mapper to be used with Butler.
mapperArgs : dict
.. note:: Deprecated in 12_0
`mapperArgs` will be removed in TBD; it is replaced by `inputs` and `outputs` for
multiple-repository support.
Provides arguments to be passed to the mapper if the mapper input argument is a class type to be
instantiated by Butler.
inputs : RepositoryArgs, dict, or string
Can be a single item or a list. Provides arguments to load an existing repository (or repositories).
A string is assumed to be a URI and is used as the cfgRoot (the URI of the location of the cfg file). A
local file system URI does not have to start with 'file://', so it can be a relative path. The
`RepositoryArgs` class can be used to provide more parameters with which to initialize a repository
(such as `mapper`, `mapperArgs`, `tags`, etc.); see the `RepositoryArgs` documentation for more
details. A dict may be used as shorthand for a `RepositoryArgs` class instance. The dict keys must
match parameters to the `RepositoryArgs.__init__` function.
outputs : RepositoryArgs, dict, or string
Provides arguments to load one or more existing repositories or create new ones. The different types
are handled the same as for `inputs`.
The Butler init sequence loads all of the input and output repositories.
This creates the object hierarchy to read from and write to them. Each
repository can have 0 or more parents, which also get loaded as inputs.
This becomes a DAG of repositories. Ultimately, Butler creates a list of
these Repositories in the order that they are used.
Initialization Sequence
=======================
During initialization Butler creates a Repository class instance and support structure for each object
passed to `inputs` and `outputs`, as well as for the parent repositories recorded in the `RepositoryCfg`
of each existing readable repository.
This process is complex. It is explained below to shed some light on the intent of each step.
1. Input Argument Standardization
---------------------------------
In `Butler._processInputArguments` the input arguments are verified to be legal (and a RuntimeError is
raised if not), and they are converted into an expected format that is used for the rest of the Butler
init sequence. See the docstring for `_processInputArguments`.
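As a rough illustration of what standardizing these arguments could look like, the sketch below
normalizes a single item or a list of strings and dicts into a list of dicts. It is an assumption-laden
simplification, not the body of `_processInputArguments`: `normalize_repo_args` is a hypothetical helper,
`RepositoryArgs` instances are left out, and only the `cfgRoot` key is taken from the description of
string inputs above.

```python
def normalize_repo_args(args):
    """Hypothetical helper: turn `inputs`/`outputs`-style arguments into a list of dicts."""
    if args is None:
        return []
    # A single string or dict is treated as a one-element list.
    if isinstance(args, (str, dict)):
        args = [args]
    normalized = []
    for item in args:
        if isinstance(item, str):
            # A string is assumed to be a URI and is used as the cfgRoot.
            normalized.append({"cfgRoot": item})
        elif isinstance(item, dict):
            normalized.append(dict(item))  # copy so the caller's dict is not modified
        else:
            # Illegal arguments are reported with a RuntimeError.
            raise RuntimeError(f"Unsupported repository argument: {item!r}")
    return normalized


single = normalize_repo_args("relative/path/to/repo")
mixed = normalize_repo_args(["repoA", {"cfgRoot": "repoB", "tags": "calib"}])

try:
    normalize_repo_args([42])
    illegal_rejected = False
except RuntimeError:
    illegal_rejected = True
```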
2. Create RepoData Objects
--------------------------
Butler uses an object, called `RepoData`, to keep track of information about each repository; each
repository is contained in a single `RepoData`. The attributes are explained in its docstring.
After `_processInputArguments`, a RepoData is instantiated and put in a list for each repository in
`outputs` and `inputs`. This list of RepoData, the `repoDataList`, now represents all the output and input
repositories (but not parent repositories) that this Butler instance will use.
3. Get `RepositoryCfg`s
-----------------------
`Butler._getCfgs` gets the `RepositoryCfg` for each repository in the `repoDataList`. The behavior is
described in its docstring.
4. Add Parents
--------------
`Butler._addParents` then considers the parents list in the `RepositoryCfg` of each `RepoData` in the
`repoDataList` and inserts new `RepoData` objects for each parent not represented in the proper location
in the `repoDataList`. Ultimately a flat list is built that represents the DAG of readable repositories
in depth-first order.
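The flattening of a repository DAG into depth-first order can be sketched as follows. This is a
simplified model, not the logic of `_addParents`: repositories are plain strings, `parents_of` is a
hypothetical mapping standing in for the parents lists in each `RepositoryCfg`, and the sketch assumes
each repository should appear only once in the result.

```python
def flatten_with_parents(top_level, parents_of):
    """Hypothetical helper: list repositories depth-first, each followed by its parents."""
    ordered = []

    def visit(repo):
        if repo in ordered:  # assumption: a repository is listed only once
            return
        ordered.append(repo)
        for parent in parents_of.get(repo, []):
            visit(parent)

    for repo in top_level:
        visit(repo)
    return ordered


# "out" was written using "in" as its input; "in" in turn has a parent "calib".
parents_of = {"out": ["in"], "in": ["calib"]}
order = flatten_with_parents(["out", "in"], parents_of)
```

In the resulting list every repository precedes its parents, which is the property the later steps of
the init sequence rely on.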
5. Set and Verify Parents of Outputs
------------------------------------
To be able to load parent repositories when output repositories are used as inputs, the input repositories
are recorded as parents in the `RepositoryCfg` file of new output repositories. When an output repository
already exists, for consistency the Butler's inputs must match the list of parents specified in the
already-existing output repository's `RepositoryCfg` file.
In `Butler._setAndVerifyParentsLists`, the list of parents is recorded in the `RepositoryCfg` of new
repositories. For existing repositories the list of parents is compared with the `RepositoryCfg`'s parents
list, and if they do not match a `RuntimeError` is raised.
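A minimal sketch of this set-or-verify rule is shown below. It is not the code of
`_setAndVerifyParentsLists`: the cfg is modelled as a plain dict with a hypothetical `parents` key, and
`set_or_verify_parents` is a name introduced only for this example.

```python
def set_or_verify_parents(cfg, input_roots, is_new):
    """Hypothetical helper: record parents for a new output, verify them for an existing one."""
    if is_new:
        cfg["parents"] = list(input_roots)
    elif cfg.get("parents", []) != list(input_roots):
        raise RuntimeError(
            f"Inputs {list(input_roots)!r} do not match recorded parents {cfg.get('parents', [])!r}"
        )
    return cfg


new_cfg = set_or_verify_parents({}, ["in", "calib"], is_new=True)

existing_cfg = {"parents": ["in", "calib"]}
set_or_verify_parents(existing_cfg, ["in", "calib"], is_new=False)  # matches; no error

try:
    set_or_verify_parents(existing_cfg, ["other_input"], is_new=False)
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True
```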
6. Set the Default Mapper
-------------------------
If all the input repositories use the same mapper then we can assume that mapper to be the
"default mapper". If there are new output repositories whose `RepositoryArgs` do not specify a mapper and
there is a default mapper then the new output repository will be set to use that default mapper.
This is handled in `Butler._setDefaultMapper`.
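The default-mapper rule can be sketched in a few lines. This is an illustration under simplifying
assumptions, not the body of `_setDefaultMapper`: mappers are represented by name strings, and
`find_default_mapper` and `apply_default_mapper` are hypothetical helpers.

```python
def find_default_mapper(input_mappers):
    """Return the single mapper shared by all inputs, or None if there is no such mapper."""
    unique = set(input_mappers)
    return next(iter(unique)) if len(unique) == 1 else None


def apply_default_mapper(output_args, default_mapper):
    """Fill in the mapper of each output that does not specify one (returns new dicts)."""
    result = []
    for args in output_args:
        args = dict(args)
        if args.get("mapper") is None and default_mapper is not None:
            args["mapper"] = default_mapper
        result.append(args)
    return result


default = find_default_mapper(["CameraMapper", "CameraMapper"])
no_default = find_default_mapper(["CameraMapper", "OtherMapper"])
outputs = apply_default_mapper(
    [{"cfgRoot": "out1"}, {"cfgRoot": "out2", "mapper": "OtherMapper"}], default
)
```

An output that names its own mapper keeps it; only outputs without a mapper pick up the default.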
7. Cache References to Parent RepoDatas
---------------------------------------
In `Butler._connectParentRepoDatas`, in each `RepoData` in `repoDataList`, a list of `RepoData` object
references is built that matches the parents specified in that `RepoData`'s `RepositoryCfg`.
This list is used later to find things in that repository's parents (e.g. the registry of a parent)
without considering peer repositories' parents.
8. Set Tags
-----------
Tags are described at https://ldm-463.lsst.io/v/draft/#tagging
In `Butler._setRepoDataTags`, for each `RepoData`, the tags specified by its `RepositoryArgs` are recorded
in a set, and added to the tags set in each of its parents, for ease of lookup when mapping.
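One way to picture this tag propagation is the sketch below. It is not the code of `_setRepoDataTags`:
repositories are plain strings, `propagate_tags` is a hypothetical helper, and it assumes the
repositories are processed in the depth-first order built earlier, so that tags added to a parent are
passed on when that parent is processed in turn.

```python
def propagate_tags(order, own_tags, parents_of):
    """Hypothetical helper: give each repository its own tags plus those of its children."""
    tags = {repo: set(own_tags.get(repo, ())) for repo in order}
    for repo in order:  # children come before their parents in `order`
        for parent in parents_of.get(repo, []):
            tags[parent] |= tags[repo]
    return tags


order = ["out", "in", "calib"]
parents_of = {"out": ["in"], "in": ["calib"]}
tags = propagate_tags(order, {"out": {"A"}, "in": {"B"}}, parents_of)
```

With the tags stored on each parent, a lookup restricted to tag "A" can tell directly that "calib" is
reachable from a repository carrying that tag.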
9. Find Parent Registry and Instantiate RepoData
------------------------------------------------
At this point there is enough information to instantiate the `Repository` instances. There is one final
step before instantiating the Repository, which is to try to get a parent registry that can be used by the
child repository. The criteria for "can be used" are spelled out in `Butler._setParentRegistry`. However,
to get the registry from the parent, the parent must be instantiated. The `repoDataList`, in depth-first
search order, is built so that the most-dependent repositories are first, and the least dependent
repositories are last. So the `repoDataList` is reversed and the Repositories are instantiated in that
order; for each RepoData a parent registry is searched for, and then the Repository is instantiated with
whatever registry could be found.
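The effect of reversing the list can be sketched as follows. This is a simplified model rather than the
Butler's code: repositories and registries are plain strings, `instantiate_in_dependency_order` is a
hypothetical helper, and the real "can be used" criteria of `_setParentRegistry` are replaced by the
assumption that a repository without its own registry takes the first registry found among its parents.

```python
def instantiate_in_dependency_order(repo_data_list, parents_of, own_registry):
    """Hypothetical helper: visit parents first and pick a registry for each repository."""
    instantiated = []
    registry_of = {}
    for repo in reversed(repo_data_list):  # least-dependent repositories first
        registry = own_registry.get(repo)
        if registry is None:
            # Parents were handled earlier in this loop, so their registries are known.
            for parent in parents_of.get(repo, []):
                if registry_of.get(parent) is not None:
                    registry = registry_of[parent]
                    break
        registry_of[repo] = registry
        instantiated.append(repo)
    return instantiated, registry_of


repo_data_list = ["out", "in", "calib"]  # depth-first order: most-dependent first
parents_of = {"out": ["in"], "in": ["calib"]}
instantiated, registry_of = instantiate_in_dependency_order(
    repo_data_list, parents_of, {"calib": "calib_registry"}
)
```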
Definition at line 323 of file butler.py.