LSST Applications
21.0.0-172-gfb10e10a+18fedfabac,22.0.0+297cba6710,22.0.0+80564b0ff1,22.0.0+8d77f4f51a,22.0.0+a28f4c53b1,22.0.0+dcf3732eb2,22.0.1-1-g7d6de66+2a20fdde0d,22.0.1-1-g8e32f31+297cba6710,22.0.1-1-geca5380+7fa3b7d9b6,22.0.1-12-g44dc1dc+2a20fdde0d,22.0.1-15-g6a90155+515f58c32b,22.0.1-16-g9282f48+790f5f2caa,22.0.1-2-g92698f7+dcf3732eb2,22.0.1-2-ga9b0f51+7fa3b7d9b6,22.0.1-2-gd1925c9+bf4f0e694f,22.0.1-24-g1ad7a390+a9625a72a8,22.0.1-25-g5bf6245+3ad8ecd50b,22.0.1-25-gb120d7b+8b5510f75f,22.0.1-27-g97737f7+2a20fdde0d,22.0.1-32-gf62ce7b1+aa4237961e,22.0.1-4-g0b3f228+2a20fdde0d,22.0.1-4-g243d05b+871c1b8305,22.0.1-4-g3a563be+32dcf1063f,22.0.1-4-g44f2e3d+9e4ab0f4fa,22.0.1-42-gca6935d93+ba5e5ca3eb,22.0.1-5-g15c806e+85460ae5f3,22.0.1-5-g58711c4+611d128589,22.0.1-5-g75bb458+99c117b92f,22.0.1-6-g1c63a23+7fa3b7d9b6,22.0.1-6-g50866e6+84ff5a128b,22.0.1-6-g8d3140d+720564cf76,22.0.1-6-gd805d02+cc5644f571,22.0.1-8-ge5750ce+85460ae5f3,master-g6e05de7fdc+babf819c66,master-g99da0e417a+8d77f4f51a,w.2021.48
LSST Data Management Base Package
Public Member Functions

def __init__(self, *args, **kwargs)
def columnLevelNames(self)
def columnLevels(self)
def toDataFrame(self, columns=None, droplevels=True)
def write(self, filename)
def pandasMd(self)
def columnIndex(self)
def columns(self)
def toDataFrame(self, columns=None)

Public Attributes

filename
Wrapper to access a dataframe with a multi-level column index from Parquet.

This subclass of `ParquetTable` to handle the multi-level index is necessary
because there is no convenient way to request specific table subsets by level
via Parquet through pyarrow, as there is with a `pandas.DataFrame`.

Additionally, pyarrow stores multi-level index information in an unusual way.
Pandas stores it as a tuple, so that one can access a single column from a
pandas dataframe as `df[('ref', 'HSC-G', 'coord_ra')]`. However, pyarrow saves
these indices as "stringified" tuples, such that in order to read this same
column from a table written to Parquet, you would have to do the following:

    pf = pyarrow.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])

See also https://github.com/apache/arrow/issues/1771, where we've raised this
issue.

As multilevel-indexed dataframes can be very useful to store data like
multiple filters' worth of data in the same table, this case deserves a
wrapper to enable easier access; that's what this object is for. For example,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

will return just the coordinate columns; the equivalent of calling
`df['meas']['HSC-G'][['coord_ra', 'coord_dec']]` on the total dataframe, but
without having to load the whole frame into memory---this reads just those
columns from disk. You can also request a sub-table; e.g.,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas', 'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of `df['meas']['HSC-G']` on the total
dataframe.

Parameters
----------
filename : str, optional
    Path to Parquet file.
dataFrame : dataFrame, optional
Definition at line 148 of file parquetTable.py.
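The stringified-tuple quirk described above can be seen with plain pandas; a minimal sketch, using invented level names and values that mirror the docstring's examples (no Parquet file is actually read here):

```python
import pandas as pd

# Invented multi-level table: dataset / filter / column levels.
cols = pd.MultiIndex.from_tuples(
    [("meas", "HSC-G", "coord_ra"), ("meas", "HSC-G", "coord_dec")],
    names=["dataset", "filter", "column"],
)
df = pd.DataFrame([[150.1, 2.2]], columns=cols)

# In pandas, a single column is addressed with a plain tuple:
ra = df[("meas", "HSC-G", "coord_ra")]

# pyarrow, however, stores the column name as the *stringified* tuple,
# which is what must be passed to ParquetFile.read(columns=...):
stringified = str(("meas", "HSC-G", "coord_ra"))
print(stringified)  # ('meas', 'HSC-G', 'coord_ra')
```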
def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.__init__(self, *args, **kwargs)
Definition at line 198 of file parquetTable.py.
inherited
Columns as a pandas Index
Definition at line 91 of file parquetTable.py.
def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevelNames(self)
Definition at line 204 of file parquetTable.py.
def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels(self)
Names of the levels in the column index.
Definition at line 213 of file parquetTable.py.
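For a wrapped DataFrame, these level names correspond to the `names` of the pandas column MultiIndex; a small sketch with invented level names:

```python
import pandas as pd

# Invented three-level column index, as in the class docstring.
cols = pd.MultiIndex.from_tuples(
    [("meas", "HSC-G", "coord_ra"), ("meas", "HSC-G", "coord_dec")],
    names=["dataset", "filter", "column"],
)
# The level names are what columnLevels exposes:
print(list(cols.names))  # ['dataset', 'filter', 'column']
```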
inherited
List of column names (or column index if df is set). This may either be a list of column names or a pandas.Index object describing the column index, depending on whether the ParquetTable object is wrapping a ParquetFile or a DataFrame.
Definition at line 105 of file parquetTable.py.
inherited
Definition at line 83 of file parquetTable.py.
inherited
Get table (or specified columns) as a pandas DataFrame.

Parameters
----------
columns : list, optional
    Desired columns. If `None`, then all columns will be returned.
Definition at line 126 of file parquetTable.py.
def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.toDataFrame(self, columns=None, droplevels=True)
Get table (or specified columns) as a pandas DataFrame.

To get specific columns in specified sub-levels:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas',
                  'filter': 'HSC-G',
                  'column': ['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset': 'meas', 'filter': 'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

Parameters
----------
columns : list or dict, optional
    Desired columns. If `None`, then all columns will be returned. If a
    list, then the names of the columns must be *exactly* as stored by
    pyarrow; that is, stringified tuples. If a dictionary, then the entries
    of the dictionary must correspond to the level names of the column
    multi-index (that is, the `columnLevels` attribute). Not every level
    must be passed; if any level is left out, then all entries in that
    level will be implicitly included.
droplevels : bool
    If True, drop levels of the column index that have just one entry.
Definition at line 235 of file parquetTable.py.
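The `droplevels` semantics can be emulated with plain pandas; the following is a sketch of the behavior under that assumption, not the actual implementation, with an invented three-column table:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [("meas", "HSC-G", "coord_ra"), ("meas", "HSC-G", "coord_dec"),
     ("meas", "HSC-R", "coord_ra")],
    names=["dataset", "filter", "column"],
)
df = pd.DataFrame([[1.0, 2.0, 3.0]], columns=cols)

# Selecting {'dataset': 'meas', 'filter': 'HSC-G'} keeps every entry of
# the omitted 'column' level:
sel = df.loc[:, [("meas", "HSC-G", "coord_ra"),
                 ("meas", "HSC-G", "coord_dec")]]

# droplevels=True: levels of the column index that now hold a single
# entry ('dataset' and 'filter' here) are dropped, leaving plain names.
for lev in range(sel.columns.nlevels - 1, -1, -1):
    if sel.columns.nlevels > 1 and len(sel.columns.unique(level=lev)) == 1:
        sel.columns = sel.columns.droplevel(lev)

print(list(sel.columns))  # ['coord_ra', 'coord_dec']
```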
inherited
Write pandas dataframe to parquet.

Parameters
----------
filename : str
    Path to which to write.
Definition at line 69 of file parquetTable.py.
inherited
Definition at line 55 of file parquetTable.py.