LSST Data Management Base Package
lsst.pipe.tasks.parquetTable.MultilevelParquetTable Class Reference
Inheritance diagram for lsst.pipe.tasks.parquetTable.MultilevelParquetTable:
lsst.pipe.tasks.parquetTable.ParquetTable

Public Member Functions

def __init__ (self, *args, **kwargs)
 
def columnLevelNames (self)
 
def columnLevels (self)
 
def toDataFrame (self, columns=None, droplevels=True)
 
def write (self, filename)
 
def pandasMd (self)
 
def columnIndex (self)
 
def columns (self)
 
def toDataFrame (self, columns=None)
 

Public Attributes

 filename
 

Detailed Description

Wrapper to access a dataframe with a multi-level column index from Parquet

This subclass of `ParquetTable` is necessary to handle the multi-level index
because there is no convenient way to request specific table subsets
by level via Parquet through pyarrow, as there is with a `pandas.DataFrame`.

Additionally, pyarrow stores multilevel index information in a very strange
way. Pandas stores it as a tuple, so that one can access a single column
from a pandas dataframe as `df[('ref', 'HSC-G', 'coord_ra')]`.  However, for
some reason pyarrow saves these indices as "stringified" tuples, such that
in order to read this same column from a table written to Parquet, you would
have to do the following:

    pf = pyarrow.parquet.ParquetFile(filename)
    df = pf.read(columns=["('ref', 'HSC-G', 'coord_ra')"])

See also https://github.com/apache/arrow/issues/1771, where we've raised
this issue.
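
The "stringified" names are just the `str()` representation of the pandas
tuple keys. As a minimal sketch of that conversion (the class performs it
internally via a private `_stringify` helper whose implementation is not
shown here, so treat this as illustrative only):

    # Minimal sketch: turn tuple column keys into the stringified names
    # that pyarrow stores on disk.
    def stringify_columns(columns):
        return [str(c) if isinstance(c, tuple) else c for c in columns]

    cols = [('ref', 'HSC-G', 'coord_ra'), ('ref', 'HSC-G', 'coord_dec')]
    stringify_columns(cols)
    # ["('ref', 'HSC-G', 'coord_ra')", "('ref', 'HSC-G', 'coord_dec')"]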

Multilevel-indexed dataframes are very useful for storing, e.g., multiple
filters' worth of data in the same table, so this case deserves a wrapper
that enables easier access; that's what this object is for.  For example,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G',
                  'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

will return just the coordinate columns; the equivalent of calling
`df['meas']['HSC-G'][['coord_ra', 'coord_dec']]` on the total dataframe,
but without having to load the whole frame into memory---this reads just
those columns from disk.  You can also request a sub-table; e.g.,

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

and this will be the equivalent of `df['meas']['HSC-G']` on the total dataframe.
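
As a more complete, hypothetical end-to-end sketch (file name and values are
invented for illustration; the `dataFrame` keyword follows the Parameters
section below), a multilevel table can be written with `ParquetTable.write`
and read back selectively:

    import numpy as np
    import pandas as pd

    from lsst.pipe.tasks.parquetTable import MultilevelParquetTable, ParquetTable

    # Build a small dataframe with a three-level column index
    # (levels named 'dataset', 'filter', 'column').
    columns = pd.MultiIndex.from_tuples(
        [('meas', 'HSC-G', 'coord_ra'), ('meas', 'HSC-G', 'coord_dec'),
         ('meas', 'HSC-R', 'coord_ra'), ('meas', 'HSC-R', 'coord_dec')],
        names=['dataset', 'filter', 'column'])
    df = pd.DataFrame(np.random.rand(5, 4), columns=columns)

    # Write to Parquet, then read back only the HSC-G coordinates.
    ParquetTable(dataFrame=df).write('multilevel_example.parq')
    parq = MultilevelParquetTable('multilevel_example.parq')
    coords = parq.toDataFrame(columns={'dataset': 'meas',
                                       'filter': 'HSC-G',
                                       'column': ['coord_ra', 'coord_dec']})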

Parameters
----------
filename : str, optional
    Path to Parquet file.
dataFrame : pandas.DataFrame, optional
    DataFrame to wrap directly instead of reading from `filename`.

Definition at line 148 of file parquetTable.py.

Constructor & Destructor Documentation

◆ __init__()

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.__init__(self, *args, **kwargs)

Definition at line 198 of file parquetTable.py.

198  def __init__(self, *args, **kwargs):
199  super(MultilevelParquetTable, self).__init__(*args, **kwargs)
200 
201  self._columnLevelNames = None
202 

Member Function Documentation

◆ columnIndex()

def lsst.pipe.tasks.parquetTable.ParquetTable.columnIndex(self)
inherited
Columns as a pandas Index

Definition at line 91 of file parquetTable.py.

91  def columnIndex(self):
92  """Columns as a pandas Index
93  """
94  if self._columnIndex is None:
95  self._columnIndex = self._getColumnIndex()
96  return self._columnIndex
97 

◆ columnLevelNames()

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevelNames(self)

Definition at line 204 of file parquetTable.py.

204  def columnLevelNames(self):
205  if self._columnLevelNames is None:
206  self._columnLevelNames = {
207  level: list(np.unique(np.array(self.columns)[:, i]))
208  for i, level in enumerate(self.columnLevels)
209  }
210  return self._columnLevelNames
211 
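
For the hypothetical three-level table from the class-level example, this
attribute (it is accessed without parentheses inside the class, so it appears
to be a property) maps each level name to the unique values present at that
level:

    parq = MultilevelParquetTable('multilevel_example.parq')
    parq.columnLevels      # e.g. ['dataset', 'filter', 'column']
    parq.columnLevelNames  # e.g. {'dataset': ['meas'],
                           #       'filter': ['HSC-G', 'HSC-R'],
                           #       'column': ['coord_dec', 'coord_ra']}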

◆ columnLevels()

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.columnLevels(self)
Names of levels in column index

Definition at line 213 of file parquetTable.py.

213  def columnLevels(self):
214  """Names of levels in column index
215  """
216  return self.columnIndex.names
217 

◆ columns()

def lsst.pipe.tasks.parquetTable.ParquetTable.columns(self)
inherited
List of column names (or column index if df is set)

This may either be a list of column names, or a
pandas.Index object describing the column index, depending
on whether the ParquetTable object is wrapping a ParquetFile
or a DataFrame.

Definition at line 105 of file parquetTable.py.

105  def columns(self):
106  """List of column names (or column index if df is set)
107 
108  This may either be a list of column names, or a
109  pandas.Index object describing the column index, depending
110  on whether the ParquetTable object is wrapping a ParquetFile
111  or a DataFrame.
112  """
113  if self._columns is None:
114  self._columns = self._getColumns()
115  return self._columns
116 

◆ pandasMd()

def lsst.pipe.tasks.parquetTable.ParquetTable.pandasMd(self)
inherited

Definition at line 83 of file parquetTable.py.

83  def pandasMd(self):
84  if self._pf is None:
85  raise AttributeError("This property is only accessible if ._pf is set.")
86  if self._pandasMd is None:
87  self._pandasMd = json.loads(self._pf.metadata.metadata[b"pandas"])
88  return self._pandasMd
89 
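
This property exposes the pandas metadata block that pyarrow embeds in the
Parquet file's key-value metadata, parsed from JSON. A rough equivalent using
pyarrow directly (the exact keys present depend on the pandas and pyarrow
versions that wrote the file, so treat this as a sketch):

    import json
    import pyarrow.parquet

    pf = pyarrow.parquet.ParquetFile('multilevel_example.parq')
    pandas_md = json.loads(pf.metadata.metadata[b'pandas'])
    # Keys such as 'columns', 'column_indexes' and 'index_columns' describe
    # how to reconstruct the original pandas column index.
    print(sorted(pandas_md.keys()))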

◆ toDataFrame() [1/2]

def lsst.pipe.tasks.parquetTable.ParquetTable.toDataFrame(self, columns=None)
inherited
Get table (or specified columns) as a pandas DataFrame

Parameters
----------
columns : list, optional
    Desired columns.  If `None`, then all columns will be
    returned.

Definition at line 126 of file parquetTable.py.

126  def toDataFrame(self, columns=None):
127  """Get table (or specified columns) as a pandas DataFrame
128 
129  Parameters
130  ----------
131  columns : list, optional
132  Desired columns. If `None`, then all columns will be
133  returned.
134  """
135  if self._pf is None:
136  if columns is None:
137  return self._df
138  else:
139  return self._df[columns]
140 
141  if columns is None:
142  return self._pf.read().to_pandas()
143 
144  df = self._pf.read(columns=columns, use_pandas_metadata=True).to_pandas()
145  return df
146 
147 
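
A brief, hypothetical sketch of the base-class behaviour: with a plain
(single-level) column index the requested names are passed straight through
to pyarrow, and omitting `columns` returns the whole table:

    from lsst.pipe.tasks.parquetTable import ParquetTable

    parq = ParquetTable('singlelevel_example.parq')   # hypothetical file
    coords = parq.toDataFrame(columns=['coord_ra', 'coord_dec'])
    everything = parq.toDataFrame()                   # all columns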

◆ toDataFrame() [2/2]

def lsst.pipe.tasks.parquetTable.MultilevelParquetTable.toDataFrame(self, columns=None, droplevels=True)
Get table (or specified columns) as a pandas DataFrame

To get specific columns in specified sub-levels:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
              'filter':'HSC-G',
              'column':['coord_ra', 'coord_dec']}
    df = parq.toDataFrame(columns=columnDict)

Or, to get an entire subtable, leave out one level name:

    parq = MultilevelParquetTable(filename)
    columnDict = {'dataset':'meas',
                  'filter':'HSC-G'}
    df = parq.toDataFrame(columns=columnDict)

Parameters
----------
columns : list or dict, optional
    Desired columns.  If `None`, then all columns will be
    returned.  If a list, then the names of the columns must
    be *exactly* as stored by pyarrow; that is, stringified tuples.
    If a dictionary, then the entries of the dictionary must
    correspond to the level names of the column multi-index
    (that is, the `columnLevels` attribute).  Not every level
    must be passed; if any level is left out, then all entries
    in that level will be implicitly included.
droplevels : bool, optional
    If True (the default), drop levels of the column index that have just
    one entry.

Definition at line 235 of file parquetTable.py.

235  def toDataFrame(self, columns=None, droplevels=True):
236  """Get table (or specified columns) as a pandas DataFrame
237 
238  To get specific columns in specified sub-levels:
239 
240  parq = MultilevelParquetTable(filename)
241  columnDict = {'dataset':'meas',
242  'filter':'HSC-G',
243  'column':['coord_ra', 'coord_dec']}
244  df = parq.toDataFrame(columns=columnDict)
245 
246  Or, to get an entire subtable, leave out one level name:
247 
248  parq = MultilevelParquetTable(filename)
249  columnDict = {'dataset':'meas',
250  'filter':'HSC-G'}
251  df = parq.toDataFrame(columns=columnDict)
252 
253  Parameters
254  ----------
255  columns : list or dict, optional
256  Desired columns. If `None`, then all columns will be
257  returned. If a list, then the names of the columns must
258  be *exactly* as stored by pyarrow; that is, stringified tuples.
259  If a dictionary, then the entries of the dictionary must
260  correspond to the level names of the column multi-index
261  (that is, the `columnLevels` attribute). Not every level
262  must be passed; if any level is left out, then all entries
263  in that level will be implicitly included.
264  droplevels : bool
265  If True drop levels of column index that have just one entry
266 
267  """
268  if columns is None:
269  if self._pf is None:
270  return self._df
271  else:
272  return self._pf.read().to_pandas()
273 
274  if isinstance(columns, dict):
275  columns = self._colsFromDict(columns)
276 
277  if self._pf is None:
278  try:
279  df = self._df[columns]
280  except (AttributeError, KeyError):
281  newColumns = [c for c in columns if c in self.columnIndex]
282  if not newColumns:
283  raise ValueError("None of the requested columns ({}) are available!".format(columns))
284  df = self._df[newColumns]
285  else:
286  pfColumns = self._stringify(columns)
287  try:
288  df = self._pf.read(columns=pfColumns, use_pandas_metadata=True).to_pandas()
289  except (AttributeError, KeyError):
290  newColumns = [c for c in columns if c in self.columnIndex]
291  if not newColumns:
292  raise ValueError("None of the requested columns ({}) are available!".format(columns))
293  pfColumns = self._stringify(newColumns)
294  df = self._pf.read(columns=pfColumns, use_pandas_metadata=True).to_pandas()
295 
296  if droplevels:
297  # Drop levels of column index that have just one entry
298  levelsToDrop = [n for lev, n in zip(df.columns.levels, df.columns.names) if len(lev) == 1]
299 
300  # Prevent error when trying to drop *all* columns
301  if len(levelsToDrop) == len(df.columns.names):
302  levelsToDrop.remove(df.columns.names[-1])
303 
304  df.columns = df.columns.droplevel(levelsToDrop)
305 
306  return df
307 
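
To illustrate `droplevels` with the hypothetical file from the class-level
example: requesting a single dataset and filter leaves only one entry in
those levels, so they are dropped and the result can be indexed by column
name alone; `droplevels=False` keeps the full multi-index:

    parq = MultilevelParquetTable('multilevel_example.parq')

    # droplevels=True (default): 'dataset' and 'filter' each have a single
    # entry in the result, so both levels are dropped.
    df = parq.toDataFrame(columns={'dataset': 'meas', 'filter': 'HSC-G'})
    ra = df['coord_ra']

    # droplevels=False: the full three-level column index is preserved.
    df_full = parq.toDataFrame(columns={'dataset': 'meas', 'filter': 'HSC-G'},
                               droplevels=False)
    ra_full = df_full[('meas', 'HSC-G', 'coord_ra')]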

◆ write()

def lsst.pipe.tasks.parquetTable.ParquetTable.write(self, filename)
inherited
Write pandas dataframe to parquet

Parameters
----------
filename : str
    Path to which to write.

Definition at line 69 of file parquetTable.py.

69  def write(self, filename):
70  """Write pandas dataframe to parquet
71 
72  Parameters
73  ----------
74  filename : str
75  Path to which to write.
76  """
77  if self._df is None:
78  raise ValueError("df property must be defined to write.")
79  table = pyarrow.Table.from_pandas(self._df)
80  pyarrow.parquet.write_table(table, filename)
81 

Member Data Documentation

◆ filename

lsst.pipe.tasks.parquetTable.ParquetTable.filename
inherited

Definition at line 55 of file parquetTable.py.


The documentation for this class was generated from the following file:
parquetTable.py