Basic Usage

In C++:

#include "lsst/afw/table.h"
using namespace lsst::afw::table;
Schema schema;
Key<int> k1 = schema.addField<int>("f1", "doc for f1");
Key<float> k2 = schema.addField<float>("f2", "doc for f2", "units for f2");
Key< Array<double> > k3 = schema.addField< Array<double> >("f3", "doc for f3", "units for f2", 5);
BaseCatalog catalog(schema);
for (int i = 1; i <= 3; ++i) {
    std::shared_ptr<BaseRecord> record = catalog.addNew();
    record->set(k1, i);
    record->set(k2, i*3.14);
    ndarray::Array<double,1> a3 = (*record)[k3]; // operator[] allows in-place edits for some fields
    a3.asEigen().setRandom();
    a3.deep() += i;
}
BaseColumnView columns = catalog.getColumnView();
std::cout << columns[k1] << ", " << columns[k2] << ", " << columns[k3];

In Python:

from lsst.afw.table import *
schema = Schema()
k1 = schema.addField("f1", type=int, doc="doc for f1")
k2 = schema.addField("f2", type=numpy.float32, "doc for f2", units="units for f2")
k3 = schema.addField("f3", type="ArrayD", doc="doc for f3", units="units for f2", size=5)
catalog = BaseCatalog(schema)
for i in (1, 2, 3):
    record = catalog.addNew()
    record.set(k1, i)
    record.set(k2, i * 3.14)
    record[k3] = numpy.random.rand(5) + i # ndarray::Array == numpy.ndarray in Python
print catalog[k1], catalog[k2], catalog[k3]   # in Python, can access columns directly from the catalog

Overview

The primary objects users will interact with in the table library are Schemas, Fields, Keys, Tables, Records, Catalogs, and ColumnViews. Schema is a concrete class that defines the columns of a table; it behaves like a heterogeneous container of SchemaItem<T> objects, which are in turn composed of Field and Key objects. A Field contains name, documentation, and units, while the Key object is a lightweight opaque object used to actually access elements of the table. Using keys for access allows reads and writes to compile down to little (if any) more than a pointer offset and dereference. String field names can be used instead of keys in Python (though this is less efficient), but this is not possible in C++.

Record and table classes are defined in pairs; each record class has a 1-to-1 correspondence with a table class. A record at its simplest is just a row in a table, though both classes are polymorphic and derived classes can add additional functionality (such as the SourceRecord and SourceTable classes, for instance). A table acts as a factory for records; all record creating (even cloning) goes through a table member function. This is because the underlying data for records is allocated in multi-record blocks by the table. Records thus hold a shared_ptr back to their table, as well as a pointer to their memory block, and data members that are shared by multiple records (such as the Schema) are accessed through the table rather than held separately in each record. A table does not hold pointers to the records it is associated with, however, or provide ways to iterate over records.

Instead, we have a separate concept of a container class, which holds a single table pointer and multiple records, and is usually just a wrapper around an STL container of record shared_ptrs. While we've left open the possibility of additional containers in the future, the CatalogT and SimpleCatalogT implementations (better known by the BaseCatalog, SimpleCatalog, and SourceCatalog typedefs) will be the most commonly used. Catalogs are based on a std::vector of record shared_ptrs, but provided accessors and iterators that yield record references, so you can use iter->method instead of (**iter).method. Most Catalog operations are analogous to std::vector operations, though they often provide support for both deep and shallow copies, and the SimpleCatalogT class adds sorting and lookups based on unique IDs. Future derived classes of BaseRecord and BaseTable will hopefully be able to just use one of the existing catalog templates and provide a new typedef rather than require a new catalog class. Catalogs have a significantly different interface in Python, where they mimic a (limited) Python list instead of a C++ vector (see the Python in-line help).

ColumnView (i.e. BaseColumnView) objects can be created from Catalogs, and provide views into columns of the catalog as ndarray::Array objects (in C++) or numpy.ndarray objects (in Python). In Python, columns can be directly accessed from a Catalog (using a private ColumnView under the hood). Not all records are allocated in contiguous memory, however (as discussed below), and only contiguous records can be viewed as columns. To get columns from a non-contiguous Catalog, you need to do a deep copy of it, which will automatically ensure the new catalog is contiguous. In Python, that looks like this:

newCat = oldCat.copy(deep=True)

Memory and Copy Semantics

Tables and records are noncopyable, and are always allocated in shared_ptrs. Both can be deep copied, however - tables have a clone() member function, and records can be copied by calling copyRecord() on the table. Records are also default constructable, in the sense that a table must always be able to create a record with no additional arguments besides what the table itself can provide (SourceTable, for instance, sets the ID of default-constructed records using an internal IdFactory object).

The memory in a table is allocated in blocks, as in most implementations of the C++ STL's deque. When the capacity of the most recent block is exhausted, a new block will be allocated for future records. The pointers to records themselves are not pointers into these blocks (record instances are allocated as usual with new or make_shared) - the block memory is accessed via a void pointer in the BaseRecord class, and derived record classes shouldn't have to deal with it at all.

One of the advantages of this approach is that most - but not all - records will be close to their neighbors in memory. More importantly, unlike std::vector, the whole table is never silently reallocated. Finally, if a sequence of records have been allocated from the same block, their columns may be accessed as strided ndarray objects (and hency NumPy arrays in Python) using ColumnView.

FITS I/O

Records can be saved/loaded to/from FITS binary tables using the writeFits and readFits member functions on the library's container classes. Not all FITS binary table column types are supported, but the most common ones are (notable exceptions are strings, complex numbers, and variable-length arrays). As long as a FIT binary table contains only allowed column types, it should be possible to read it into an afw/table container, though FITS tables from external sources will not be able to tell our FITS reader to use specialized field types like Point or Covariance - any multi-element column will be read in as an array unless special keys are present in the FITS header.

The FITS I/O functionally is implemented in the io::FitsReader and io::FitsWriter classes, which inherit from the more general io::Reader and io::Writer classes. New types of serialization for tables should follow the same pattern and create new subclasses of io::Reader and io::Writer. In addition, new table/record types will usually want to implement a new FitsWriter and FitsReader subclass (which can delegate most of the work to the base classes) to save derived-class data members and ensure loaded objects have the correct type.

Each Table derived class can also define the interpretation of a set of bitflags that can be used to control the details of how catalogs are saved and loaded. For instance, SourceTable uses these to determine whether to read or write Footprints with each source, and if so, whether to read or write HeavyFootprints as regular Footprints (see SourceFitsFlags). All Catalog FITS I/O routines accept a flags argument, even if the underlying table object will ignore it.

Schemas

Schema objects are append-only objects - you can add new fields, but you can never remove them. This is because the schema creates and returns keys to fields as they are added, and removing a field from the schema would invalidate not only the key for that field, but also keys for any fields that were added after it. Copying a schema and adding new records to the copy will allow keys created from the original to work with tables and records that use the copy; we can consider the original in this case to be a subset of the original schema, and we can test for this using Schema::contains. Containment tests and the Schema equality comparison operators only consider the position, type and length of fields (in other words, in the information contained in a Key) - you can renam a field in a schema without invalidating keys or changing how it is compared to other schemas. (Note that one schema being a subset or superset of another is completely unrelated to the SubSchema class, which is used to implement the dotted namespaces discussed below).

Field Names

By convention, field names are camel case and have '_'-separated elements. Only letters, numbers, and underscores should be used. These rules are not enforced, but names that do not meet these requirements may not round-trip correctly in FITS.

Underscores should be used as a sort of namespace separator (much like '.' in Python; we use underscore instead so we don't need to translate field names when using them in SQL). These namespace elements typically indicate things like the module or Task containing the algorithm that produced the field, the name of the algorithm, and (finally) the name of the particular output produced by that algorithm.

For example:

base_SdssShape_xx is produced by the SdssShape algorithm located in the meas_base package (we drop "meas_" since so many algorithms are in meas_* packages; see lsst.meas.base.generateAlgorithmName), and this particular field represents the x-x second-moment of the source.
deblend_nChild is produced by SourceDeblendTask, where deblend is an abbreviation of the Task, and nChild is a value produced by it. There are no hard rules for how to abbreviate the name of a Task when generating its field names; we trust developer judgement in selecting a prefix that is both concise and unambiguous.

Note that some underscore-separated elements are themselves multiple words, such as SdssShape or nChild, and we use CamelCase, not more underscores, to separate words. Two rules of thumb are:

if two words are not individually meaningful (or mean something different when separated), join them with CamelCase;
if the prefix of a field name represents a conceptual group of multiple fields, use underscores to join the group name to the group elements.

We also do not have hard rules about whether words begin with uppercase or lowercase, with three exceptions:

The first namespace element (typically a module or Task abbreviation) should start with a lowercase letter.
The last namespace element (the name of a particular value) should begin with a lowercase letter.
Names that correspond to a Python or C++ class that produce the field should start with uppercase (to match the name of the class).

Schema provides extra functionality for names with underscore-separated elements; these elements can be accessed individually with the bracket operators. More information on this behavior can be found in the Schema and SubSchema class documentation.

Other field strings (documentation and units) are essentially arbitrary, but should not contain single quotes, as these may also confuse FITS parsers (even when escaped).

Variable-Length Arrays

Variable-length array fields (Array<T> fields with a size of 0) work in a fundamentally different way from all other types - instead of using some of the contiguous memory block that other fields occupy, these fields are simply a reference-counted pointer and size to separately-allocated memory. That means it's impossible to get column views for them, and there's a bit more overhead for each one. Whenever an array size can be known in advance, it's better to use a fixed-length array field instead.

Variable-length array fields also behave slightly differently when assigned to, as compared to fixed-length array fields: using BaseRecord.set() to assign assign a new array to a variable-length field will replace the old value entirely with a reference to the new array, without copying any values ("shallow assignment"), while assigning to a fixed-length array field will copy the values into the record's memory block ("deep assignment"). In order to copy values into a variable-length array, retrieve a reference to the array using square bracket [] operators, then assign to that. In code, if key is a Key to a variable-length array field,

record.set(key, array); // shallow: doesn't copy values, just reset pointers, size can change

record[key] = array; // deep: copies values, doesn't modify pointers, sizes must match

Field Types

In C++, field types are defined by the template arguments to Key and Field (among others). Empty tag templates (Array, Point, Moments, Covariance) are used for compound fields. In Python, strings are used to set field types. The Key and Field classes for each type can be accessed through dictionaries (e.g. Key["F"]), but usually these type strings are only explicitly written when passed as the 'type' argument of Schema.addField. Some Python types can also be used in place of type strings for fields (e.g. int, afw.geom.SpherePoint). Note that Python type strings with angle brackets do not have the extra spaces that are necessary when writing templates in C++98.

Some field types require a size argument to be passed to the Field constructor or Schema::addField; while this size can be set at compile time, all records must have the same size.

Not all field types support all types of data access. All field types support atomic access to the entire field at once through BaseRecord::get and BaseRecord::set. Some field types support square bracket access to references or reference-like objects (i.e. ndarray::ArrayRef) as well. Only scalar and array fields support column access through ColumnView.

A Key for an individual element of a compound field can also be obtained from the compound Key object or (for 'named subfields') from the Schema directly (see Schema and the KeyBase specializations). Element keys can be used just like any other scalar Key, and hence provide access to column views.

C++ Type	Python Type String	Python Aliases Types	C++ Value (get/set) Type	Reference Access	ColumnView Support	Dynamic Size	Notes
Flag	"Flag"		bool	No	Yes	No	Stored internally as a single bit
boost::int32_t	"I"	int, numpy.int32	boost::int32_t	Yes	Yes	No
boost::int64_t	"L"	long, numpy.int64	boost::int64_t	Yes	Yes	No
float	"F"	numpy.float32	float	Yes	Yes	No
double	"D"	float, numpy.float64	double	Yes	Yes	No
Angle	"Angle"	afw.geom.Angle	afw::geom::Angle	Yes	Yes	No	ColumnView access in Python returns an array of numpy.float64 (in radians).
std::string	"String"	str	std::string	No	No	Yes	Strings are fixed-size; sizes must be declared at schema-creation time as with arrays.
Array<int>	"ArrayI"		ndarray::Array<int,1>	Yes	Yes	Yes	operator[] returns an ndarray::ArrayRef
Array<float>	"ArrayF"		ndarray::Array<float,1>	Yes	Yes	Yes	operator[] returns an ndarray::ArrayRef
Array<double>	"ArrayD"		ndarray::Array<double,1>	Yes	Yes	Yes	operator[] returns an ndarray::ArrayRef