LSST Data Management Base Package
lsst.pipe.tasks.parquetTable.MultilevelParquetTable Class Reference
Inheritance diagram for lsst.pipe.tasks.parquetTable.MultilevelParquetTable:
lsst.pipe.tasks.parquetTable.ParquetTable

Public Member Functions

def __init__ (self, *args, **kwargs)
 
def columnLevelNames (self)
 
def columnLevels (self)
 
def toDataFrame (self, columns=None, droplevels=True)
 
def write (self, filename)
 
def pandasMd (self)
 
def columnIndex (self)
 
def columns (self)
 
def toDataFrame (self, columns=None)
 

Public Attributes

 filename
 

Detailed Description

Wrapper to access a dataframe with a multi-level column index from Parquet

This subclass of `ParquetTable` is necessary to handle the multi-level index
because there is no convenient way to request specific table subsets
by level via Parquet through pyarrow, as there is with a `pandas.DataFrame`.

Additionally, pyarrow stores multilevel index information in a very strange
way. Pandas stores it as a tuple, so that one can access a single column
from a pandas dataframe as `df[('ref', 'HSC-G', 'coord_ra')]`.  However, for
some reason pyarrow saves these indices as "stringified" tuples, such that
in order to read this same column from a table written to Parquet, you would
have to do the following:

    pf = pyarrow.parquet.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])

See also https://github.com/apache/arrow/issues/1771, where we've raised
this issue.
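
The "stringified" names are just the `str()` representation of the pandas
tuple keys. As a minimal sketch of that conversion (the class performs it
internally via a private `_stringify` helper whose implementation is not
shown here, so treat this as illustrative only):

    # Minimal sketch: turn tuple column keys into the stringified names
    # that pyarrow stores on disk.
    def stringify_columns(columns):
        return [str(c) if isinstance(c, tuple) else c for c in columns]

    cols = [('ref', 'HSC-G', 'coord_ra'), ('ref', 'HSC-G', 'coord_dec')]
    stringify_columns(cols)
    # ["('ref', 'HSC-G', 'coord_ra')", "('ref', 'HSC-G', 'coord_dec')"]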

Multilevel-indexed dataframes are very useful for storing, e.g., multiple
filters' worth of data in the same table, so this case deserves a wrapper
that enables easier access; that's what this object is for.  For example,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G',
                  'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

will return just the coordinate columns; the equivalent of calling
`df['meas']['HSC-G'][['coord_ra', 'coord_dec']]` on the total dataframe,
but without having to load the whole frame into memory---this reads just
those columns from disk.  You can also request a sub-table; e.g.,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of `df['meas']['HSC-G']` on the total dataframe.
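
As a more complete, hypothetical end-to-end sketch (file name and values are
invented for illustration; the `dataFrame` keyword follows the Parameters
section below), a multilevel table can be written with `ParquetTable.write`
and read back selectively:

    import numpy as np
    import pandas as pd

    from lsst.pipe.tasks.parquetTable import MultilevelParquetTable, ParquetTable

    # Build a small dataframe with a three-level column index
    # (levels named 'dataset', 'filter', 'column').
    columns = pd.MultiIndex.from_tuples(
        [('meas', 'HSC-G', 'coord_ra'), ('meas', 'HSC-G', 'coord_dec'),
         ('meas', 'HSC-R', 'coord_ra'), ('meas', 'HSC-R', 'coord_dec')],
        names=['dataset', 'filter', 'column'])
    df = pd.DataFrame(np.random.rand(5, 4), columns=columns)

    # Write to Parquet, then read back only the HSC-G coordinates.
    ParquetTable(dataFrame=df).write('multilevel_example.parq')
    parq = MultilevelParquetTable('multilevel_example.parq')
    coords = parq.toDataFrame(columns={'dataset': 'meas',
                                       'filter': 'HSC-G',
                                       'column': ['coord_ra', 'coord_dec']})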

Parameters
----------
filename : str, optional
    Path to Parquet file.
dataFrame : pandas.DataFrame, optional
    DataFrame to wrap directly instead of reading from `filename`.

Definition at line 148 of file parquetTable.py.

Constructor & Destructor Documentation

◆ __init__()

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.__init__(self, *args, **kwargs)

Definition at line 198 of file parquetTable.py.

198  def __init__(self, *args, **kwargs):
199  super(MultilevelParquetTable, self).__init__(*args, **kwargs)
200 
201  self._columnLevelNames = None
202 

Member Function Documentation

◆ columnIndex()

def lsst.pipe.tasks.parquetTable.ParquetTable.columnIndex(self)
inherited
Columns as a pandas Index

Definition at line 91 of file parquetTable.py.

91  def columnIndex(self):
92  """Columns as a pandas Index
93  """
94  if self._columnIndex is None:
95  self._columnIndex = self._getColumnIndex()
96  return self._columnIndex
97 

◆ columnLevelNames()

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevelNames(self)

Definition at line 204 of file parquetTable.py.

204  def columnLevelNames(self):
205  if self._columnLevelNames is None:
206  self._columnLevelNames = {
207  level: list(np.unique(np.array(self.columns)[:, i]))
208  for i, level in enumerate(self.columnLevels)
209  }
210  return self._columnLevelNames
211 
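
For the hypothetical three-level table from the class-level example, this
attribute (it is accessed without parentheses inside the class, so it appears
to be a property) maps each level name to the unique values present at that
level:

    parq = MultilevelParquetTable('multilevel_example.parq')
    parq.columnLevels      # e.g. ['dataset', 'filter', 'column']
    parq.columnLevelNames  # e.g. {'dataset': ['meas'],
                           #       'filter': ['HSC-G', 'HSC-R'],
                           #       'column': ['coord_dec', 'coord_ra']}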

◆ columnLevels()

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels(self)
Names of levels in column index

Definition at line 213 of file parquetTable.py.

213  def columnLevels(self):
214  """Names of levels in column index
215  """
216  return self.columnIndex.names
217 

◆ columns()

def lsst.pipe.tasks.parquetTable.ParquetTable.columns(self)
inherited
List of column names (or column index if df is set)

This may either be a list of column names, or a
pandas.Index object describing the column index, depending
on whether the ParquetTable object is wrapping a ParquetFile
or a DataFrame.

Definition at line 105 of file parquetTable.py.

105  def columns(self):
106  """List of column names (or column index if df is set)
107 
108  This may either be a list of column names, or a
109  pandas.Index object describing the column index, depending
110  on whether the ParquetTable object is wrapping a ParquetFile
111  or a DataFrame.
112  """
113  if self._columns is None:
114  self._columns = self._getColumns()
115  return self._columns
116 

◆ pandasMd()

def lsst.pipe.tasks.parquetTable.ParquetTable.pandasMd(self)
inherited

Definition at line 83 of file parquetTable.py.

83  def pandasMd(self):
84  if self._pf is None:
85  raise AttributeError("This property is only accessible if ._pf is set.")
86  if self._pandasMd is None:
87  self._pandasMd = json.loads(self._pf.metadata.metadata[b"pandas"])
88  return self._pandasMd
89 
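
This property exposes the pandas metadata block that pyarrow embeds in the
Parquet file's key-value metadata, parsed from JSON. A rough equivalent using
pyarrow directly (the exact keys present depend on the pandas and pyarrow
versions that wrote the file, so treat this as a sketch):

    import json
    import pyarrow.parquet

    pf = pyarrow.parquet.ParquetFile('multilevel_example.parq')
    pandas_md = json.loads(pf.metadata.metadata[b'pandas'])
    # Keys such as 'columns', 'column_indexes' and 'index_columns' describe
    # how to reconstruct the original pandas column index.
    print(sorted(pandas_md.keys()))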

◆ toDataFrame() [1/2]

def lsst.pipe.tasks.parquetTable.ParquetTable.toDataFrame(self, columns=None)
inherited
Get table (or specified columns) as a pandas DataFrame

Parameters
----------
columns : list, optional
    Desired columns.  If `None`, then all columns will be
    returned.

Definition at line 126 of file parquetTable.py.

126  def toDataFrame(self, columns=None):
127  """Get table (or specified columns) as a pandas DataFrame
128 
129  Parameters
130  ----------
131  columns : list, optional
132  Desired columns. If `None`, then all columns will be
133  returned.
134  """
135  if self._pf is None:
136  if columns is None:
137  return self._df
138  else:
139  return self._df[columns]
140 
141  if columns is None:
142  return self._pf.read().to_pandas()
143 
144  df = self._pf.read(columns=columns, use_pandas_metadata=True).to_pandas()
145  return df
146 
147 
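
A brief, hypothetical sketch of the base-class behaviour: with a plain
(single-level) column index the requested names are passed straight through
to pyarrow, and omitting `columns` returns the whole table:

    from lsst.pipe.tasks.parquetTable import ParquetTable

    parq = ParquetTable('singlelevel_example.parq')   # hypothetical file
    coords = parq.toDataFrame(columns=['coord_ra', 'coord_dec'])
    everything = parq.toDataFrame()                   # all columns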

◆ toDataFrame() [2/2]

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.toDataFrame(self, columns=None, droplevels=True)
Get table (or specified columns) as a pandas DataFrame

To get specific columns in specified sub-levels:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
              'filter':'HSC-G',
              'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

Parameters
----------
columns : list or dict, optional
    Desired columns.  If `None`, then all columns will be
    returned.  If a list, then the names of the columns must
    be *exactly* as stored by pyarrow; that is, stringified tuples.
    If a dictionary, then the entries of the dictionary must
    correspond to the level names of the column multi-index
    (that is, the `columnLevels` attribute).  Not every level
    must be passed; if any level is left out, then all entries
    in that level will be implicitly included.
droplevels : bool, optional
    If True (the default), drop levels of the column index that have just
    one entry.

Definition at line 235 of file parquetTable.py.

235  def toDataFrame(self, columns=None, droplevels=True):
236  """Get table (or specified columns) as a pandas DataFrame
237 
238  To get specific columns in specified sub-levels:
239 
240  parq = MultilevelParquetTable(filename)
241  columnDict = {'dataset':'meas',
242  'filter':'HSC-G',
243  'column':['coord_ra', 'coord_dec']}
244  df = parq.toDataFrame(columns=columnDict)
245 
246  Or, to get an entire subtable, leave out one level name:
247 
248  parq = MultilevelParquetTable(filename)
249  columnDict = {'dataset':'meas',
250  'filter':'HSC-G'}
251  df = parq.toDataFrame(columns=columnDict)
252 
253  Parameters
254  ----------
255  columns : list or dict, optional
256  Desired columns. If `None`, then all columns will be
257  returned. If a list, then the names of the columns must
258  be *exactly* as stored by pyarrow; that is, stringified tuples.
259  If a dictionary, then the entries of the dictionary must
260  correspond to the level names of the column multi-index
261  (that is, the `columnLevels` attribute). Not every level
262  must be passed; if any level is left out, then all entries
263  in that level will be implicitly included.
264  droplevels : bool
265  If True drop levels of column index that have just one entry
266 
267  """
268  if columns is None:
269  if self._pf is None:
270  return self._df
271  else:
272  return self._pf.read().to_pandas()
273 
274  if isinstance(columns, dict):
275  columns = self._colsFromDict(columns)
276 
277  if self._pf is None:
278  try:
279  df = self._df[columns]
280  except (AttributeError, KeyError):
281  newColumns = [c for c in columns if c in self.columnIndex]
282  if not newColumns:
283  raise ValueError("None of the requested columns ({}) are available!".format(columns))
284  df = self._df[newColumns]
285  else:
286  pfColumns = self._stringify(columns)
287  try:
288  df = self._pf.read(columns=pfColumns, use_pandas_metadata=True).to_pandas()
289  except (AttributeError, KeyError):
290  newColumns = [c for c in columns if c in self.columnIndex]
291  if not newColumns:
292  raise ValueError("None of the requested columns ({}) are available!".format(columns))
293  pfColumns = self._stringify(newColumns)
294  df = self._pf.read(columns=pfColumns, use_pandas_metadata=True).to_pandas()
295 
296  if droplevels:
297  # Drop levels of column index that have just one entry
298  levelsToDrop = [n for lev, n in zip(df.columns.levels, df.columns.names) if len(lev) == 1]
299 
300  # Prevent error when trying to drop *all* columns
301  if len(levelsToDrop) == len(df.columns.names):
302  levelsToDrop.remove(df.columns.names[-1])
303 
304  df.columns = df.columns.droplevel(levelsToDrop)
305 
306  return df
307 
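
To illustrate `droplevels` with the hypothetical file from the class-level
example: requesting a single dataset and filter leaves only one entry in
those levels, so they are dropped and the result can be indexed by column
name alone; `droplevels=False` keeps the full multi-index:

    parq = MultilevelParquetTable('multilevel_example.parq')

    # droplevels=True (default): 'dataset' and 'filter' each have a single
    # entry in the result, so both levels are dropped.
    df = parq.toDataFrame(columns={'dataset': 'meas', 'filter': 'HSC-G'})
    ra = df['coord_ra']

    # droplevels=False: the full three-level column index is preserved.
    df_full = parq.toDataFrame(columns={'dataset': 'meas', 'filter': 'HSC-G'},
                               droplevels=False)
    ra_full = df_full[('meas', 'HSC-G', 'coord_ra')]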

◆ write()

def lsst.pipe.tasks.parquetTable.ParquetTable.write(self, filename)
inherited
Write pandas dataframe to parquet

Parameters
----------
filename : str
    Path to which to write.

Definition at line 69 of file parquetTable.py.

69  def write(self, filename):
70  """Write pandas dataframe to parquet
71 
72  Parameters
73  ----------
74  filename : str
75  Path to which to write.
76  """
77  if self._df is None:
78  raise ValueError("df property must be defined to write.")
79  table = pyarrow.Table.from_pandas(self._df)
80  pyarrow.parquet.write_table(table, filename)
81 

Member Data Documentation

◆ filename

lsst.pipe.tasks.parquetTable.ParquetTable.filename
inherited

Definition at line 55 of file parquetTable.py.


The documentation for this class was generated from the following file:
parquetTable.py