LSST Applications v26.0.1.rc2
LSST Data Management Base Package
Public Member Functions | Public Attributes | Protected Member Functions | Protected Attributes | List of all members
lsst.pipe.tasks.parquetTable.MultilevelParquetTable Class Reference
Inheritance diagram for lsst.pipe.tasks.parquetTable.MultilevelParquetTable:
lsst.pipe.tasks.parquetTable.ParquetTable

Public Member Functions

 __init__ (self, *args, **kwargs)
 
 columnLevelNames (self)
 
 columnLevels (self)
 
 toDataFrame (self, columns=None, droplevels=True)
 

Public Attributes

 columns
 
 columnLevels
 

Protected Member Functions

 _getColumnIndex (self)
 
 _getColumns (self)
 
 _colsFromDict (self, colDict)
 
 _stringify (self, cols)
 

Protected Attributes

 _columnLevelNames
 

Detailed Description

Wrapper to access dataframe with multi-level column index from Parquet

This subclass of `ParquetTable` is necessary to handle the multi-level index,
because pyarrow provides no convenient way to request specific table subsets
by level from Parquet, as there is with a `pandas.DataFrame`.

Additionally, pyarrow stores multilevel index information in an unusual
way. Pandas stores each column key as a tuple, so that one can access a single
column of a pandas dataframe as `df[('ref', 'HSC-G', 'coord_ra')]`.  However,
pyarrow saves these indices as "stringified" tuples, such that
in order to read this same column from a table written to Parquet, you would
have to do the following:

    pf = pyarrow.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])
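The "stringified" form is just Python's `str()` of the tuple, which can be shown without pyarrow at all (this mirrors the `_stringify` helper documented below):

```python
# Pandas addresses a multi-level column by tuple; pyarrow's Parquet
# metadata records the same column under str(tuple).
col = ("ref", "HSC-G", "coord_ra")
stringified = str(col)
print(stringified)  # "('ref', 'HSC-G', 'coord_ra')"
```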

See also https://github.com/apache/arrow/issues/1771, where we've raised
this issue.

Because multilevel-indexed dataframes are very useful for storing data such as
multiple filters' worth of measurements in the same table, this case deserves
a wrapper to enable easier access; that is what this object provides.  For example,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G',
                  'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

will return just the coordinate columns: the equivalent of calling
`df['meas']['HSC-G'][['coord_ra', 'coord_dec']]` on the full dataframe,
but without having to load the whole frame into memory, because only those
columns are read from disk.  You can also request a sub-table; e.g.,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of `df['meas']['HSC-G']` on the total dataframe.
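For reference, the in-memory equivalence described above can be sketched with pandas alone, using a small hypothetical two-filter table:

```python
import pandas as pd

# Hypothetical table with a three-level column index (dataset/filter/column).
columns = pd.MultiIndex.from_tuples(
    [("meas", "HSC-G", "coord_ra"), ("meas", "HSC-G", "coord_dec"),
     ("meas", "HSC-R", "coord_ra"), ("meas", "HSC-R", "coord_dec")],
    names=["dataset", "filter", "column"],
)
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0]], columns=columns)

# Selecting a sub-table level by level, which toDataFrame(columns=columnDict)
# performs for on-disk data without loading the full frame:
sub = df["meas"]["HSC-G"]
print(list(sub.columns))  # ['coord_ra', 'coord_dec']
```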

Parameters
----------
filename : str, optional
    Path to Parquet file.
dataFrame : pandas.DataFrame, optional
    Existing dataframe to wrap.

Definition at line 157 of file parquetTable.py.

Constructor & Destructor Documentation

◆ __init__()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.__init__ (   self,
*  args,
**  kwargs 
)

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 207 of file parquetTable.py.

def __init__(self, *args, **kwargs):
    super(MultilevelParquetTable, self).__init__(*args, **kwargs)

    self._columnLevelNames = None

Member Function Documentation

◆ _colsFromDict()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._colsFromDict (   self,
  colDict 
)
protected

Definition at line 317 of file parquetTable.py.

def _colsFromDict(self, colDict):
    new_colDict = {}
    for i, lev in enumerate(self.columnLevels):
        if lev in colDict:
            if isinstance(colDict[lev], str):
                new_colDict[lev] = [colDict[lev]]
            else:
                new_colDict[lev] = colDict[lev]
        else:
            new_colDict[lev] = self.columnIndex.levels[i]

    levelCols = [new_colDict[lev] for lev in self.columnLevels]
    cols = product(*levelCols)
    return list(cols)
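The heart of `_colsFromDict` is expanding the dict into the full cross-product of per-level entries via `itertools.product`. A standalone sketch of that expansion, with hypothetical level names and values:

```python
from itertools import product

# Hypothetical column-index levels and the full set of entries per level.
columnLevels = ["dataset", "filter", "column"]
fullIndex = {"dataset": ["meas", "ref"], "filter": ["HSC-G", "HSC-R"],
             "column": ["coord_ra", "coord_dec"]}
# A user-supplied dict; 'filter' is left out, so every filter is included.
colDict = {"dataset": "meas", "column": ["coord_ra", "coord_dec"]}

# Normalize bare strings to one-element lists; missing levels get all entries.
levelCols = []
for lev in columnLevels:
    val = colDict.get(lev, fullIndex[lev])
    levelCols.append([val] if isinstance(val, str) else list(val))

cols = list(product(*levelCols))
print(cols)
# [('meas', 'HSC-G', 'coord_ra'), ('meas', 'HSC-G', 'coord_dec'),
#  ('meas', 'HSC-R', 'coord_ra'), ('meas', 'HSC-R', 'coord_dec')]
```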

◆ _getColumnIndex()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._getColumnIndex (   self)
protected

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 227 of file parquetTable.py.

def _getColumnIndex(self):
    if self._df is not None:
        return super()._getColumnIndex()
    else:
        levelNames = [f["name"] for f in self.pandasMd["column_indexes"]]
        return pd.MultiIndex.from_tuples(self.columns, names=levelNames)

◆ _getColumns()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._getColumns (   self)
protected

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 234 of file parquetTable.py.

def _getColumns(self):
    if self._df is not None:
        return super()._getColumns()
    else:
        columns = self._pf.metadata.schema.names
        n = len(self.pandasMd["column_indexes"])
        pattern = re.compile(", ".join(["'(.*)'"] * n))
        matches = [re.search(pattern, c) for c in columns]
        return [m.groups() for m in matches if m is not None]
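The regex in `_getColumns` recovers tuples from the stringified column names in the Parquet schema; non-matching bookkeeping columns are skipped. A self-contained sketch with hypothetical names:

```python
import re

# Column names as they might appear in Parquet metadata: stringified
# 3-level tuples, plus an index column that should not match.
names = ["('meas', 'HSC-G', 'coord_ra')", "('meas', 'HSC-G', 'coord_dec')",
         "__index_level_0__"]

n = 3  # number of column-index levels
pattern = re.compile(", ".join(["'(.*)'"] * n))  # "'(.*)', '(.*)', '(.*)'"
matches = [pattern.search(c) for c in names]
tuples = [m.groups() for m in matches if m is not None]
print(tuples)  # [('meas', 'HSC-G', 'coord_ra'), ('meas', 'HSC-G', 'coord_dec')]
```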

◆ _stringify()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._stringify (   self,
  cols 
)
protected

Definition at line 332 of file parquetTable.py.

def _stringify(self, cols):
    return [str(c) for c in cols]

◆ columnLevelNames()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevelNames (   self)

Definition at line 213 of file parquetTable.py.

def columnLevelNames(self):
    if self._columnLevelNames is None:
        self._columnLevelNames = {
            level: list(np.unique(np.array(self.columns)[:, i]))
            for i, level in enumerate(self.columnLevels)
        }
    return self._columnLevelNames
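Since `np.unique` on strings returns the sorted unique values, the per-level mapping built here can be reproduced with the standard library alone; a sketch with hypothetical column tuples:

```python
# Hypothetical stored column tuples and their level names.
columns = [("meas", "HSC-G", "coord_ra"), ("meas", "HSC-G", "coord_dec"),
           ("meas", "HSC-R", "coord_ra"), ("meas", "HSC-R", "coord_dec")]
columnLevels = ["dataset", "filter", "column"]

# Equivalent of np.unique per level: sorted unique values at each position.
columnLevelNames = {
    level: sorted({col[i] for col in columns})
    for i, level in enumerate(columnLevels)
}
print(columnLevelNames["filter"])  # ['HSC-G', 'HSC-R']
```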

◆ columnLevels()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels (   self)
Names of levels in column index

Definition at line 222 of file parquetTable.py.

def columnLevels(self):
    """Names of levels in column index
    """
    return self.columnIndex.names

◆ toDataFrame()

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.toDataFrame (   self,
  columns = None,
  droplevels = True 
)
Get table (or specified columns) as a pandas DataFrame

To get specific columns in specified sub-levels:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
              'filter':'HSC-G',
              'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

Parameters
----------
columns : list or dict, optional
    Desired columns.  If `None`, then all columns will be
    returned.  If a list, then the names of the columns must
    be *exactly* as stored by pyarrow; that is, stringified tuples.
    If a dictionary, then the entries of the dictionary must
    correspond to the level names of the column multi-index
    (that is, the `columnLevels` attribute).  Not every level
    must be passed; if any level is left out, then all entries
    in that level will be implicitly included.
droplevels : bool
    If True, drop levels of the column index that have just one entry.

Reimplemented from lsst.pipe.tasks.parquetTable.ParquetTable.

Definition at line 244 of file parquetTable.py.

def toDataFrame(self, columns=None, droplevels=True):
    """Get table (or specified columns) as a pandas DataFrame

    To get specific columns in specified sub-levels:

        parq = MultilevelParquetTable(filename)
        columnDict = {'dataset':'meas',
                      'filter':'HSC-G',
                      'column':['coord_ra', 'coord_dec']}
        df = parq.toDataFrame(columns=columnDict)

    Or, to get an entire subtable, leave out one level name:

        parq = MultilevelParquetTable(filename)
        columnDict = {'dataset':'meas',
                      'filter':'HSC-G'}
        df = parq.toDataFrame(columns=columnDict)

    Parameters
    ----------
    columns : list or dict, optional
        Desired columns.  If `None`, then all columns will be
        returned.  If a list, then the names of the columns must
        be *exactly* as stored by pyarrow; that is, stringified tuples.
        If a dictionary, then the entries of the dictionary must
        correspond to the level names of the column multi-index
        (that is, the `columnLevels` attribute).  Not every level
        must be passed; if any level is left out, then all entries
        in that level will be implicitly included.
    droplevels : bool
        If True drop levels of column index that have just one entry

    """
    if columns is None:
        if self._pf is None:
            return self._df
        else:
            return self._pf.read().to_pandas()

    if isinstance(columns, dict):
        columns = self._colsFromDict(columns)

    if self._pf is None:
        try:
            df = self._df[columns]
        except (AttributeError, KeyError):
            newColumns = [c for c in columns if c in self.columnIndex]
            if not newColumns:
                raise ValueError("None of the requested columns ({}) are available!".format(columns))
            df = self._df[newColumns]
    else:
        pfColumns = self._stringify(columns)
        try:
            df = self._pf.read(columns=pfColumns, use_pandas_metadata=True).to_pandas()
        except (AttributeError, KeyError):
            newColumns = [c for c in columns if c in self.columnIndex]
            if not newColumns:
                raise ValueError("None of the requested columns ({}) are available!".format(columns))
            pfColumns = self._stringify(newColumns)
            df = self._pf.read(columns=pfColumns, use_pandas_metadata=True).to_pandas()

    if droplevels:
        # Drop levels of column index that have just one entry
        levelsToDrop = [n for lev, n in zip(df.columns.levels, df.columns.names) if len(lev) == 1]

        # Prevent error when trying to drop *all* columns
        if len(levelsToDrop) == len(df.columns.names):
            levelsToDrop.remove(df.columns.names[-1])

        df.columns = df.columns.droplevel(levelsToDrop)

    return df
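The `droplevels` step can be illustrated with pandas alone: after selecting a single dataset and filter, those two levels each have one unique entry and are dropped, leaving a flat column index (hypothetical columns):

```python
import pandas as pd

# A selection where 'dataset' and 'filter' each have only one entry left.
columns = pd.MultiIndex.from_tuples(
    [("meas", "HSC-G", "coord_ra"), ("meas", "HSC-G", "coord_dec")],
    names=["dataset", "filter", "column"],
)
df = pd.DataFrame([[1.0, 2.0]], columns=columns)

# Drop every level with a single unique entry, as toDataFrame does.
levelsToDrop = [n for lev, n in zip(df.columns.levels, df.columns.names)
                if len(lev) == 1]
df.columns = df.columns.droplevel(levelsToDrop)
print(list(df.columns))  # ['coord_ra', 'coord_dec']
```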

Member Data Documentation

◆ _columnLevelNames

lsst.pipe.tasks.parquetTable.MultilevelParquetTable._columnLevelNames
protected

Definition at line 210 of file parquetTable.py.

◆ columnLevels

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels

Definition at line 217 of file parquetTable.py.

◆ columns

lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columns

Definition at line 216 of file parquetTable.py.


The documentation for this class was generated from the following file: parquetTable.py