LSST Applications v26.0.1.rc2
LSST Data Management Base Package
Public Member Functions | Public Attributes | Protected Member Functions | Protected Attributes | List of all members
lsst.pipe.tasks.parquetTable.MultilevelParquetTable Class Reference
Inheritance diagram for lsst.pipe.tasks.parquetTable.MultilevelParquetTable:
lsst.pipe.tasks.parquetTable.ParquetTable

Public Member Functions

 __init__ (self, *args, **kwargs)
 
 columnLevelNames (self)
 
 columnLevels (self)
 
 toDataFrame (self, columns=None, droplevels=True)
 

Public Attributes

 columns
 
 columnLevels
 

Protected Member Functions

 _getColumnIndex (self)
 
 _getColumns (self)
 
 _colsFromDict (self, colDict)
 
 _stringify (self, cols)
 

Protected Attributes

 _columnLevelNames
 

Detailed Description

Wrapper to access dataframe with multi-level column index from Parquet

This subclass of `ParquetTable` is necessary to handle the multi-level index,
because pyarrow provides no convenient way to request specific table subsets
by level from Parquet, as there is with a `pandas.DataFrame`.

Additionally, pyarrow stores multilevel index information in an unusual
way. Pandas stores each column key as a tuple, so that one can access a single
column of a pandas dataframe as `df[('ref', 'HSC-G', 'coord_ra')]`.  However,
pyarrow saves these indices as "stringified" tuples, such that
in order to read this same column from a table written to Parquet, you would
have to do the following:

    pf = pyarrow.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])
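The "stringified" form is just Python's `str()` of the tuple, which can be shown without pyarrow at all (this mirrors the `_stringify` helper documented below):

```python
# Pandas addresses a multi-level column by tuple; pyarrow's Parquet
# metadata records the same column under str(tuple).
col = ("ref", "HSC-G", "coord_ra")
stringified = str(col)
print(stringified)  # "('ref', 'HSC-G', 'coord_ra')"
```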

See also https://github.com/apache/arrow/issues/1771, where we've raised
this issue.

Because multilevel-indexed dataframes are very useful for storing data such as
multiple filters' worth of measurements in the same table, this case deserves
a wrapper to enable easier access; that is what this object provides.  For example,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G',
                  'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

will return just the coordinate columns: the equivalent of calling
`df['meas']['HSC-G'][['coord_ra', 'coord_dec']]` on the full dataframe,
but without having to load the whole frame into memory, because only those
columns are read from disk.  You can also request a sub-table; e.g.,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of `df['meas']['HSC-G']` on the total dataframe.
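For reference, the in-memory equivalence described above can be sketched with pandas alone, using a small hypothetical two-filter table:

```python
import pandas as pd

# Hypothetical table with a three-level column index (dataset/filter/column).
columns = pd.MultiIndex.from_tuples(
    [("meas", "HSC-G", "coord_ra"), ("meas", "HSC-G", "coord_dec"),
     ("meas", "HSC-R", "coord_ra"), ("meas", "HSC-R", "coord_dec")],
    names=["dataset", "filter", "column"],
)
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0]], columns=columns)

# Selecting a sub-table level by level, which toDataFrame(columns=columnDict)
# performs for on-disk data without loading the full frame:
sub = df["meas"]["HSC-G"]
print(list(sub.columns))  # ['coord_ra', 'coord_dec']
```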

Parameters
----------
filename : str, optional
    Path to Parquet file.
dataFrame : pandas.DataFrame, optional
    Existing dataframe to wrap.

Definition at line 157 of file parquetTable.py.

Constructor & Destructor Documentation

◆ __init__()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.__init__ (   self,
*  args,
**  kwargs 
)

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 207 of file parquetTable.py.

def __init__(self, *args, **kwargs):
    super(MultilevelParquetTable, self).__init__(*args, **kwargs)

    self._columnLevelNames = None

Member Function Documentation

◆ _colsFromDict()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._colsFromDict (   self,
  colDict 
)
protected

Definition at line 317 of file parquetTable.py.

def _colsFromDict(self, colDict):
    new_colDict = {}
    for i, lev in enumerate(self.columnLevels):
        if lev in colDict:
            if isinstance(colDict[lev], str):
                new_colDict[lev] = [colDict[lev]]
            else:
                new_colDict[lev] = colDict[lev]
        else:
            new_colDict[lev] = self.columnIndex.levels[i]

    levelCols = [new_colDict[lev] for lev in self.columnLevels]
    cols = product(*levelCols)
    return list(cols)
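The heart of `_colsFromDict` is expanding the dict into the full cross-product of per-level entries via `itertools.product`. A standalone sketch of that expansion, with hypothetical level names and values:

```python
from itertools import product

# Hypothetical column-index levels and the full set of entries per level.
columnLevels = ["dataset", "filter", "column"]
fullIndex = {"dataset": ["meas", "ref"], "filter": ["HSC-G", "HSC-R"],
             "column": ["coord_ra", "coord_dec"]}
# A user-supplied dict; 'filter' is left out, so every filter is included.
colDict = {"dataset": "meas", "column": ["coord_ra", "coord_dec"]}

# Normalize bare strings to one-element lists; missing levels get all entries.
levelCols = []
for lev in columnLevels:
    val = colDict.get(lev, fullIndex[lev])
    levelCols.append([val] if isinstance(val, str) else list(val))

cols = list(product(*levelCols))
print(cols)
# [('meas', 'HSC-G', 'coord_ra'), ('meas', 'HSC-G', 'coord_dec'),
#  ('meas', 'HSC-R', 'coord_ra'), ('meas', 'HSC-R', 'coord_dec')]
```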

◆ _getColumnIndex()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._getColumnIndex (   self)
protected

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 227 of file parquetTable.py.

def _getColumnIndex(self):
    if self._df is not None:
        return super()._getColumnIndex()
    else:
        levelNames = [f["name"] for f in self.pandasMd["column_indexes"]]
        return pd.MultiIndex.from_tuples(self.columns, names=levelNames)

◆ _getColumns()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._getColumns (   self)
protected

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 234 of file parquetTable.py.

def _getColumns(self):
    if self._df is not None:
        return super()._getColumns()
    else:
        columns = self._pf.metadata.schema.names
        n = len(self.pandasMd["column_indexes"])
        pattern = re.compile(", ".join(["'(.*)'"] * n))
        matches = [re.search(pattern, c) for c in columns]
        return [m.groups() for m in matches if m is not None]
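The regex in `_getColumns` recovers tuples from the stringified column names in the Parquet schema; non-matching bookkeeping columns are skipped. A self-contained sketch with hypothetical names:

```python
import re

# Column names as they might appear in Parquet metadata: stringified
# 3-level tuples, plus an index column that should not match.
names = ["('meas', 'HSC-G', 'coord_ra')", "('meas', 'HSC-G', 'coord_dec')",
         "__index_level_0__"]

n = 3  # number of column-index levels
pattern = re.compile(", ".join(["'(.*)'"] * n))  # "'(.*)', '(.*)', '(.*)'"
matches = [pattern.search(c) for c in names]
tuples = [m.groups() for m in matches if m is not None]
print(tuples)  # [('meas', 'HSC-G', 'coord_ra'), ('meas', 'HSC-G', 'coord_dec')]
```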

◆ _stringify()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._stringify (   self,
  cols 
)
protected

Definition at line 332 of file parquetTable.py.

def _stringify(self, cols):
    return [str(c) for c in cols]

◆ columnLevelNames()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevelNames (   self)

Definition at line 213 of file parquetTable.py.

def columnLevelNames(self):
    if self._columnLevelNames is None:
        self._columnLevelNames = {
            level: list(np.unique(np.array(self.columns)[:, i]))
            for i, level in enumerate(self.columnLevels)
        }
    return self._columnLevelNames
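Since `np.unique` on strings returns the sorted unique values, the per-level mapping built here can be reproduced with the standard library alone; a sketch with hypothetical column tuples:

```python
# Hypothetical stored column tuples and their level names.
columns = [("meas", "HSC-G", "coord_ra"), ("meas", "HSC-G", "coord_dec"),
           ("meas", "HSC-R", "coord_ra"), ("meas", "HSC-R", "coord_dec")]
columnLevels = ["dataset", "filter", "column"]

# Equivalent of np.unique per level: sorted unique values at each position.
columnLevelNames = {
    level: sorted({col[i] for col in columns})
    for i, level in enumerate(columnLevels)
}
print(columnLevelNames["filter"])  # ['HSC-G', 'HSC-R']
```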

◆ columnLevels()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels (   self)
Names of levels in column index

Definition at line 222 of file parquetTable.py.

def columnLevels(self):
    """Names of levels in column index
    """
    return self.columnIndex.names

◆ toDataFrame()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.toDataFrame (   self,
  columns = None,
  droplevels = True 
)
Get table (or specified columns) as a pandas DataFrame

To get specific columns in specified sub-levels:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
              'filter':'HSC-G',
              'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

Parameters
----------
columns : list or dict, optional
    Desired columns.  If `None`, then all columns will be
    returned.  If a list, then the names of the columns must
    be *exactly* as stored by pyarrow; that is, stringified tuples.
    If a dictionary, then the entries of the dictionary must
    correspond to the level names of the column multi-index
    (that is, the `columnLevels` attribute).  Not every level
    must be passed; if any level is left out, then all entries
    in that level will be implicitly included.
droplevels : bool
    If True, drop levels of the column index that have just one entry.

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 244 of file parquetTable.py.

def toDataFrame(self, columns=None, droplevels=True):
    """Get table (or specified columns) as a pandas DataFrame

    To get specific columns in specified sub-levels:

        parq = MultilevelParquetTable(filename)
        columnDict = {'dataset':'meas',
                      'filter':'HSC-G',
                      'column':['coord_ra', 'coord_dec']}
        df = parq.toDataFrame(columns=columnDict)

    Or, to get an entire subtable, leave out one level name:

        parq = MultilevelParquetTable(filename)
        columnDict = {'dataset':'meas',
                      'filter':'HSC-G'}
        df = parq.toDataFrame(columns=columnDict)

    Parameters
    ----------
    columns : list or dict, optional
        Desired columns.  If `None`, then all columns will be
        returned.  If a list, then the names of the columns must
        be *exactly* as stored by pyarrow; that is, stringified tuples.
        If a dictionary, then the entries of the dictionary must
        correspond to the level names of the column multi-index
        (that is, the `columnLevels` attribute).  Not every level
        must be passed; if any level is left out, then all entries
        in that level will be implicitly included.
    droplevels : bool
        If True drop levels of column index that have just one entry

    """
    if columns is None:
        if self._pf is None:
            return self._df
        else:
            return self._pf.read().to_pandas()

    if isinstance(columns, dict):
        columns = self._colsFromDict(columns)

    if self._pf is None:
        try:
            df = self._df[columns]
        except (AttributeError, KeyError):
            newColumns = [c for c in columns if c in self.columnIndex]
            if not newColumns:
                raise ValueError("None of the requested columns ({}) are available!".format(columns))
            df = self._df[newColumns]
    else:
        pfColumns = self._stringify(columns)
        try:
            df = self._pf.read(columns=pfColumns, use_pandas_metadata=True).to_pandas()
        except (AttributeError, KeyError):
            newColumns = [c for c in columns if c in self.columnIndex]
            if not newColumns:
                raise ValueError("None of the requested columns ({}) are available!".format(columns))
            pfColumns = self._stringify(newColumns)
            df = self._pf.read(columns=pfColumns, use_pandas_metadata=True).to_pandas()

    if droplevels:
        # Drop levels of column index that have just one entry
        levelsToDrop = [n for lev, n in zip(df.columns.levels, df.columns.names) if len(lev) == 1]

        # Prevent error when trying to drop *all* columns
        if len(levelsToDrop) == len(df.columns.names):
            levelsToDrop.remove(df.columns.names[-1])

        df.columns = df.columns.droplevel(levelsToDrop)

    return df
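The `droplevels` step can be illustrated with pandas alone: after selecting a single dataset and filter, those two levels each have one unique entry and are dropped, leaving a flat column index (hypothetical columns):

```python
import pandas as pd

# A selection where 'dataset' and 'filter' each have only one entry left.
columns = pd.MultiIndex.from_tuples(
    [("meas", "HSC-G", "coord_ra"), ("meas", "HSC-G", "coord_dec")],
    names=["dataset", "filter", "column"],
)
df = pd.DataFrame([[1.0, 2.0]], columns=columns)

# Drop every level with a single unique entry, as toDataFrame does.
levelsToDrop = [n for lev, n in zip(df.columns.levels, df.columns.names)
                if len(lev) == 1]
df.columns = df.columns.droplevel(levelsToDrop)
print(list(df.columns))  # ['coord_ra', 'coord_dec']
```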

Member Data Documentation

◆ _columnLevelNames

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._columnLevelNames
protected

Definition at line 210 of file parquetTable.py.

◆ columnLevels

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels

Definition at line 217 of file parquetTable.py.

◆ columns

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columns

Definition at line 216 of file parquetTable.py.


The documentation for this class was generated from the following file: parquetTable.py