LSST Applications g011c388f00+1570843bc3,g0265f82a02+1c31ac625c,g08a116f7bc+9b5a8b4fc4,g113e161629+00d7be2b8d,g16a3bce237+1c31ac625c,g2079a07aa2+6a65a43b64,g2bbee38e9b+1c31ac625c,g337abbeb29+1c31ac625c,g3ddfee87b4+15db52e637,g50ff169b8f+f00c948d2c,g52b1c1532d+81bc2a20b4,g858d7b2824+00d7be2b8d,g88964a4962+8a1a53efdf,g8a2af25fa3+33d8adeb5f,g8a8a8dda67+81bc2a20b4,g99855d9996+dcceda0d40,g9ddcbc5298+fe33e4d80d,ga1e77700b3+ec8c1568a5,ga8c6da7877+b6c80215ae,gae46bcf261+1c31ac625c,gb700894bec+3b32ebc22c,gb8350603e9+de515223a7,gba4ed39666+d9abe90c32,gbeb006f7da+e2f003f3e5,gc86a011abf+00d7be2b8d,gcf0d15dbbd+15db52e637,gd162630629+0ff1f5d43c,gdaeeff99f8+6ceac51f81,ge79ae78c31+1c31ac625c,gee10cc3b42+81bc2a20b4,gf041782ebf+0cc2057818,gf11f55472b+99ee6e9747,gf1cff7945b+00d7be2b8d,gf748b16de2+9283e76039,gf9db590de0+15db52e637,v26.0.1.rc2
LSST Data Management Base Package
Public Member Functions
    __init__ (self, *args, **kwargs)
    columnLevelNames (self)
    columnLevels (self)
    toDataFrame (self, columns=None, droplevels=True)

Public Attributes
    columns
    columnLevels

Protected Member Functions
    _getColumnIndex (self)
    _getColumns (self)
    _colsFromDict (self, colDict)
    _stringify (self, cols)

Protected Attributes
    _columnLevelNames
Wrapper to access a dataframe with a multi-level column index from Parquet.

This subclass of `ParquetTable` is necessary to handle the multi-level index, because there is no convenient way to request specific table subsets by level via Parquet through pyarrow, as there is with a `pandas.DataFrame`.

Additionally, pyarrow stores multi-level index information in a very strange way. Pandas stores it as a tuple, so that one can access a single column from a pandas dataframe as `df[('ref', 'HSC-G', 'coord_ra')]`. However, for some reason pyarrow saves these indices as "stringified" tuples, such that in order to read this same column from a table written to Parquet, you would have to do the following:

    pf = pyarrow.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])

See also https://github.com/apache/arrow/issues/1771, where we've raised this issue.

As multi-level-indexed dataframes can be very useful to store data like multiple filters' worth of data in the same table, this case deserves a wrapper to enable easier access; that's what this object is for. For example,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

will return just the coordinate columns; the equivalent of calling `df['meas']['HSC-G'][['coord_ra', 'coord_dec']]` on the total dataframe, but without having to load the whole frame into memory; this reads just those columns from disk. You can also request a sub-table; e.g.,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of `df['meas']['HSC-G']` on the total dataframe.

Parameters
----------
filename : str, optional
    Path to Parquet file.
dataFrame : dataFrame, optional
Definition at line 157 of file parquetTable.py.
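The tuple-vs-stringified-tuple mismatch described above can be demonstrated with plain pandas; the level names and values below are illustrative placeholders, not real LSST data:

```python
import numpy as np
import pandas as pd

# Hypothetical three-level column index mimicking the
# (dataset, filter, column) layout this class wraps.
columns = pd.MultiIndex.from_tuples(
    [
        ("meas", "HSC-G", "coord_ra"),
        ("meas", "HSC-G", "coord_dec"),
        ("ref", "HSC-G", "coord_ra"),
    ],
    names=["dataset", "filter", "column"],
)
df = pd.DataFrame(np.zeros((2, 3)), columns=columns)

# In pandas, a single column is addressed by its tuple...
ra = df[("meas", "HSC-G", "coord_ra")]

# ...whereas pyarrow writes each tuple's string repr to Parquet,
# so a Parquet-level read must use the "stringified" form instead.
stringified = str(("meas", "HSC-G", "coord_ra"))
print(stringified)  # ('meas', 'HSC-G', 'coord_ra')
```

This is why the list form of `toDataFrame`'s `columns` argument must match pyarrow's stored names exactly, while the dict form lets the wrapper do the translation for you.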
lsst.pipe.tasks.parquetTable.MultilevelParquetTable.__init__ (self, *args, **kwargs)
Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.
Definition at line 207 of file parquetTable.py.
lsst.pipe.tasks.parquetTable.MultilevelParquetTable._colsFromDict (self, colDict)

protected

Definition at line 317 of file parquetTable.py.
lsst.pipe.tasks.parquetTable.MultilevelParquetTable._getColumnIndex (self)

protected

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 227 of file parquetTable.py.
lsst.pipe.tasks.parquetTable.MultilevelParquetTable._getColumns (self)

protected

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 234 of file parquetTable.py.
lsst.pipe.tasks.parquetTable.MultilevelParquetTable._stringify (self, cols)

protected

Definition at line 332 of file parquetTable.py.
lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevelNames (self)
Definition at line 213 of file parquetTable.py.
lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels (self)
Names of levels in column index
Definition at line 222 of file parquetTable.py.
lsst.pipe.tasks.parquetTable.MultilevelParquetTable.toDataFrame (self, columns=None, droplevels=True)
Get table (or specified columns) as a pandas DataFrame.

To get specific columns in specified sub-levels:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

Parameters
----------
columns : list or dict, optional
    Desired columns. If `None`, then all columns will be returned. If a list, then the names of the columns must be *exactly* as stored by pyarrow; that is, stringified tuples. If a dictionary, then the entries of the dictionary must correspond to the level names of the column multi-index (that is, the `columnLevels` attribute). Not every level must be passed; if any level is left out, then all entries in that level will be implicitly included.
droplevels : bool
    If True, drop levels of the column index that have just one entry.
Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.
Definition at line 244 of file parquetTable.py.
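The dict form of `columns` expands into pyarrow's stringified tuple names by taking the Cartesian product over levels, with omitted levels filled in from every value present. A minimal sketch of that expansion, assuming illustrative level names and values (this is not the actual implementation of `_colsFromDict`/`_stringify`):

```python
from itertools import product


def cols_from_dict(col_dict, column_levels, level_values):
    """Expand a columnDict into stringified tuple column names.

    col_dict maps level name -> value or list of values; any level
    absent from col_dict is filled with all values at that level.
    """
    per_level = []
    for level in column_levels:
        vals = col_dict.get(level, level_values[level])
        if isinstance(vals, str):
            vals = [vals]
        per_level.append(vals)
    # Cartesian product over levels, stringified the way pyarrow
    # stores tuple column names in Parquet.
    return [str(tup) for tup in product(*per_level)]


levels = ["dataset", "filter", "column"]
values = {
    "dataset": ["meas", "ref"],
    "filter": ["HSC-G"],
    "column": ["coord_ra", "coord_dec"],
}
# "filter" is omitted, so every filter value is implicitly included.
cols = cols_from_dict(
    {"dataset": "meas", "column": ["coord_ra", "coord_dec"]},
    levels,
    values,
)
print(cols)
# ["('meas', 'HSC-G', 'coord_ra')", "('meas', 'HSC-G', 'coord_dec')"]
```

This mirrors the docstring's rule that "if any level is left out, then all entries in that level will be implicitly included".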
lsst.pipe.tasks.parquetTable.MultilevelParquetTable._columnLevelNames

protected

Definition at line 210 of file parquetTable.py.
lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels |
Definition at line 217 of file parquetTable.py.
lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columns |
Definition at line 216 of file parquetTable.py.