Public Member Functions
	__init__ (self, MatchProbabilisticConfig config)

	match (self, pd.DataFrame catalog_ref, pd.DataFrame catalog_target, np.array select_ref=None, np.array select_target=None, logging.Logger logger=None, int logging_n_rows=None, **kwargs)

Public Attributes
	config

Static Public Attributes
MatchProbabilisticConfig	config

Detailed Description

A probabilistic, greedy catalog matcher.

Parameters
----------
config: `MatchProbabilisticConfig`
    A configuration instance.

Definition at line 414 of file matcher_probabilistic.py.
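
The idea of greedy matching on a chi-squared score can be sketched in a few lines of NumPy. This is a simplified illustration, not the class's implementation: the function `greedy_match`, its arguments, and the precomputed `candidates` lists (standing in for a spatial neighbour query) are all hypothetical. References are visited in descending order of a sort column, and each takes the still-unmatched candidate with the lowest chi-squared per finite measurement.

```python
import numpy as np


def greedy_match(ref_meas, ref_order, target_meas, target_err, candidates, n_finite_min=1):
    """Toy greedy matcher; returns the matched target index per reference (-1 if none).

    candidates[i] lists the target indices that lie near reference i.
    """
    matches = np.full(len(ref_meas), -1, dtype=np.int64)
    taken = set()
    # Visit references in descending order of the sort column (e.g. brightest first)
    for i in np.argsort(-np.asarray(ref_order, dtype=float)):
        free = [j for j in candidates[i] if j not in taken]
        if not free:
            continue
        # One row per free candidate, one column per measurement
        chi = (target_meas[free] - ref_meas[i]) / target_err[free]
        n_finite = np.sum(np.isfinite(chi), axis=1)
        good = n_finite >= n_finite_min
        if not np.any(good):
            continue
        # Chi-squared per finite measurement; ineligible candidates stay at inf
        chisq_red = np.full(len(free), np.inf)
        chisq_red[good] = np.nansum(chi[good] ** 2, axis=1) / n_finite[good]
        best = free[int(np.argmin(chisq_red))]
        matches[i] = best
        taken.add(best)
    return matches


# Two references compete for target 0; the brighter one (reference 1) chooses first
ref_meas = np.array([[10.0, 5.0], [10.2, 5.1]])
ref_order = np.array([1.0, 3.0])
target_meas = np.array([[10.2, 5.1], [10.05, np.nan]])
target_err = np.array([[0.1, 0.1], [0.1, 0.1]])
candidates = [[0, 1], [0, 1]]
matches = greedy_match(ref_meas, ref_order, target_meas, target_err, candidates)
```

Because the matching is greedy, reference 0 is left with target 1 even though target 0 is also nearby: once a target is taken it is never reconsidered.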

Constructor & Destructor Documentation

◆ __init__()

lsst.meas.astrom.matcher_probabilistic.MatcherProbabilistic.__init__	(		self,
		MatchProbabilisticConfig	config )

Definition at line 424 of file matcher_probabilistic.py.

    def __init__(self, config: MatchProbabilisticConfig):
        self.config = config
 

Member Function Documentation

◆ match()

lsst.meas.astrom.matcher_probabilistic.MatcherProbabilistic.match	(		self,
		pd.DataFrame	catalog_ref,
		pd.DataFrame	catalog_target,
		np.array	select_ref = None,
		np.array	select_target = None,
		logging.Logger	logger = None,
		int	logging_n_rows = None,
		**	kwargs )

Match catalogs.

Parameters
----------
catalog_ref : `pandas.DataFrame`
    A reference catalog to match in order of a given column (i.e. greedily).
catalog_target : `pandas.DataFrame`
    A target catalog for matching sources from `catalog_ref`. Must contain measurements with errors.
select_ref : `numpy.array`
    A boolean array of the same length as `catalog_ref` selecting the sources that can be matched.
select_target : `numpy.array`
    A boolean array of the same length as `catalog_target` selecting the sources that can be matched.
logger : `logging.Logger`
    A logger for progress and warning messages. If None, a default logger is used.
logging_n_rows : `int`
    The number of sources to match between progress log messages. If None, no progress messages are logged.
kwargs
    Additional keyword arguments to pass to `format_catalogs`.

Returns
-------
catalog_out_ref : `pandas.DataFrame`
    A catalog of identical length to `catalog_ref`, containing match information for rows selected by
    `select_ref` (including the matching row index in `catalog_target`).
catalog_out_target : `pandas.DataFrame`
    A catalog of identical length to `catalog_target`, containing the indices of matching rows in
    `catalog_ref`.
exceptions : `dict` [`int`, `Exception`]
    A dictionary, keyed by `catalog_ref` row number, of the first exception caught when matching that row.
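
As a rough sketch of how the two `match_row` columns relate, the snippet below pairs up matched rows and checks that the matches are reciprocal. The arrays are made-up stand-ins for the output columns, and the negative sentinel for unmatched rows is an assumption based on the `matches >= 0` test in the implementation.

```python
import numpy as np

# Stand-ins for the `match_row` columns of the two output catalogs.
# Assumption: unmatched rows carry a negative sentinel value.
NO_MATCH = np.iinfo(np.int64).min
ref_match_row = np.array([2, NO_MATCH, 0], dtype=np.int64)
target_match_row = np.array([2, NO_MATCH, 0], dtype=np.int64)

# Each row of `pairs` is (reference row, matched target row)
matched = ref_match_row >= 0
pairs = np.column_stack([np.flatnonzero(matched), ref_match_row[matched]])

# A match should be reciprocal: the target row points back to the same reference row
reciprocal = np.all(target_match_row[pairs[:, 1]] == pairs[:, 0])
```

With real outputs, the same indexing would apply to `catalog_out_ref['match_row']` and `catalog_out_target['match_row']`.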

Definition at line 430 of file matcher_probabilistic.py.

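    def match(
        self,
        catalog_ref: pd.DataFrame,
        catalog_target: pd.DataFrame,
        select_ref: np.array = None,
        select_target: np.array = None,
        logger: logging.Logger = None,
        logging_n_rows: int = None,
        **kwargs
    ):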
        """Match catalogs.
 
        Parameters
        ----------
        catalog_ref : `pandas.DataFrame`
            A reference catalog to match in order of a given column (i.e. greedily).
        catalog_target : `pandas.DataFrame`
            A target catalog for matching sources from `catalog_ref`. Must contain measurements with errors.
        select_ref : `numpy.array`
            A boolean array of the same length as `catalog_ref` selecting the sources that can be matched.
        select_target : `numpy.array`
            A boolean array of the same length as `catalog_target` selecting the sources that can be matched.
        logger : `logging.Logger`
            A logger for progress and warning messages. If None, a default logger is used.
        logging_n_rows : `int`
            The number of sources to match between progress log messages. If None, no progress messages are logged.
        kwargs
            Additional keyword arguments to pass to `format_catalogs`.
 
        Returns
        -------
        catalog_out_ref : `pandas.DataFrame`
            A catalog of identical length to `catalog_ref`, containing match information for rows selected by
            `select_ref` (including the matching row index in `catalog_target`).
        catalog_out_target : `pandas.DataFrame`
            A catalog of identical length to `catalog_target`, containing the indices of matching rows in
            `catalog_ref`.
        exceptions : `dict` [`int`, `Exception`]
            A dictionary, keyed by `catalog_ref` row number, of the first exception caught when matching that row.
        """
        if logger is None:
            logger = logger_default
 
        config = self.config
 
        # Transform any coordinates, if required
        # Note: The returned objects contain the original catalogs, as well as
        # transformed coordinates, and the selection of sources for matching.
        # These might be identical to the arrays passed as kwargs, but that
        # depends on config settings.
        # For the rest of this function, the selection arrays will be used,
        # but the indices of the original, unfiltered catalog will also be
        # output, so some further indexing steps are needed.
        ref, target = config.coord_format.format_catalogs(
            catalog_ref=catalog_ref, catalog_target=catalog_target,
            select_ref=select_ref, select_target=select_target,
            **kwargs
        )
 
        # If no order is specified, take nansum of all flux columns for a 'total flux'
        # Note: it won't actually be a total flux if bands overlap significantly
        # (or it might define a filter with >100% efficiency).
        # Also, this is done on the original dataframe as it's harder to accomplish
        # just with a recarray
        column_order = (
            catalog_ref.loc[ref.extras.select, config.column_ref_order]
            if config.column_ref_order is not None else
            np.nansum(catalog_ref.loc[ref.extras.select, config.columns_ref_flux], axis=1)
        )
        order = np.argsort(column_order if config.order_ascending else -column_order)
 
        n_ref_select = len(ref.extras.indices)
 
        match_dist_max = config.match_dist_max
        coords_spherical = config.coord_format.coords_spherical
        if coords_spherical:
            match_dist_max = np.radians(match_dist_max / 3600.)
 
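        # Convert ra/dec sky coordinates to spherical vectors for accurate distances
        # Note: np.vstack accepts a single sequence of arrays, not two positional
        # arrays, so planar coordinates are stacked into an (n, 2) array instead
        func_convert = _radec_to_xyz if coords_spherical else (lambda x, y: np.column_stack((x, y)))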
        vec_ref, vec_target = (
            func_convert(cat.coord1[cat.extras.select], cat.coord2[cat.extras.select])
            for cat in (ref, target)
        )
 
        # Generate K-d tree to compute distances
        logger.info('Generating cKDTree with match_n_max=%d', config.match_n_max)
        tree_obj = cKDTree(vec_target)
 
        scores, idxs_target_select = tree_obj.query(
            vec_ref,
            distance_upper_bound=match_dist_max,
            k=config.match_n_max,
        )
 
        n_target_select = len(target.extras.indices)
        n_matches = np.sum(idxs_target_select != n_target_select, axis=1)
        n_matched_max = np.sum(n_matches == config.match_n_max)
        if n_matched_max > 0:
            logger.warning(
                '%d/%d (%.2f%%) selected true objects have n_matches=n_match_max(%d)',
                n_matched_max, n_ref_select, 100.*n_matched_max/n_ref_select, config.match_n_max
            )
 
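        # Pre-allocate outputs
        # Note: NaN cannot be stored in an integer array, so unmatched rows are
        # marked with the most negative int64 (the `matches >= 0` test below relies on this)
        row_no_match = np.iinfo(np.int64).min
        target_row_match = np.full(target.extras.n, row_no_match, dtype=np.int64)
        ref_candidate_match = np.zeros(ref.extras.n, dtype=bool)
        ref_row_match = np.full(ref.extras.n, row_no_match, dtype=np.int64)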
        ref_match_count = np.zeros(ref.extras.n, dtype=np.int32)
        ref_match_meas_finite = np.zeros(ref.extras.n, dtype=np.int32)
        ref_chisq = np.full(ref.extras.n, np.nan, dtype=float)
 
        # Need the original reference row indices for output
        idx_orig_ref, idx_orig_target = (np.argwhere(cat.extras.select) for cat in (ref, target))
 
        # Retrieve required columns, including any converted ones (default to original column name)
        columns_convert = config.coord_format.coords_ref_to_convert
        if columns_convert is None:
            columns_convert = {}
        data_ref = ref.catalog[
            [columns_convert.get(column, column) for column in config.columns_ref_meas]
        ].iloc[ref.extras.indices[order]]
        data_target = target.catalog[config.columns_target_meas][target.extras.select]
        errors_target = target.catalog[config.columns_target_err][target.extras.select]
 
        exceptions = {}
        # The kdTree uses len(inputs) as a sentinel value for no match
        matched_target = {n_target_select, }
 
        t_begin = time.process_time()
 
        logger.info('Matching n_indices=%d/%d', len(order), len(ref.catalog))
        for index_n, index_row_select in enumerate(order):
            index_row = idx_orig_ref[index_row_select]
            ref_candidate_match[index_row] = True
            found = idxs_target_select[index_row_select, :]
            # Select match candidates from nearby sources not already matched
            # Note: set lookup is apparently fast enough that this is a few percent faster than:
            # found = [x for x in found[found != n_target_select] if x not in matched_target]
            # ... at least for ~1M sources
            found = [x for x in found if x not in matched_target]
            n_found = len(found)
            if n_found > 0:
                # This is an ndarray of n_found rows x len(data_ref/target) columns
                chi = (
                    (data_target.iloc[found].values - data_ref.iloc[index_n].values)
                    / errors_target.iloc[found].values
                )
                finite = np.isfinite(chi)
                n_finite = np.sum(finite, axis=1)
                # Require some number of finite chi_sq to match
                chisq_good = n_finite >= config.match_n_finite_min
                if np.any(chisq_good):
                    try:
                        chisq_sum = np.zeros(n_found, dtype=float)
                        chisq_sum[chisq_good] = np.nansum(chi[chisq_good, :] ** 2, axis=1)
                        idx_chisq_min = np.nanargmin(chisq_sum / n_finite)
                        ref_match_meas_finite[index_row] = n_finite[idx_chisq_min]
                        ref_match_count[index_row] = len(chisq_good)
                        ref_chisq[index_row] = chisq_sum[idx_chisq_min]
                        idx_match_select = found[idx_chisq_min]
                        row_target = target.extras.indices[idx_match_select]
                        ref_row_match[index_row] = row_target
 
                        target_row_match[row_target] = index_row
                        matched_target.add(idx_match_select)
                    except Exception as error:
                        # Can't foresee any exceptions, but they shouldn't prevent
                        # matching subsequent sources
                        exceptions[index_row] = error
 
            if logging_n_rows and ((index_n + 1) % logging_n_rows == 0):
                t_elapsed = time.process_time() - t_begin
                logger.info(
                    'Processed %d/%d in %.2fs at sort value=%.3f',
                    index_n + 1, n_ref_select, t_elapsed, column_order[order[index_n]],
                )
 
        data_ref = {
            'match_candidate': ref_candidate_match,
            'match_row': ref_row_match,
            'match_count': ref_match_count,
            'match_chisq': ref_chisq,
            'match_n_chisq_finite': ref_match_meas_finite,
        }
        data_target = {
            'match_candidate': target.extras.select if target.extras.select is not None else (
                np.ones(target.extras.n, dtype=bool)),
            'match_row': target_row_match,
        }
 
        for (columns, out_original, out_matched, in_original, in_matched, matches) in (
            (
                self.config.columns_ref_copy,
                data_ref,
                data_target,
                ref,
                target,
                target_row_match,
            ),
            (
                self.config.columns_target_copy,
                data_target,
                data_ref,
                target,
                ref,
                ref_row_match,
            ),
        ):
            matched = matches >= 0
            idx_matched = matches[matched]
 
            for column in columns:
                values = in_original.catalog[column]
                out_original[column] = values
                dtype = in_original.catalog[column].dtype
 
                # Pandas object columns can have mixed types - check for that
                if dtype == object:
                    types = list(set((type(x) for x in values)))
                    if len(types) != 1:
                        raise RuntimeError(f'Column {column} dtype={dtype} has multiple types={types}')
                    dtype = types[0]
 
                value_fill = default_value(dtype)
 
                # Without this, the dtype would be '<U1' for an empty Unicode string
                if dtype == str:
                    dtype = f'<U{max(len(x) for x in values)}'
 
                column_match = np.full(in_matched.extras.n, value_fill, dtype=dtype)
                column_match[matched] = in_original.catalog[column][idx_matched]
                out_matched[f'match_{column}'] = column_match
 
        catalog_out_ref = pd.DataFrame(data_ref)
        catalog_out_target = pd.DataFrame(data_target)
 
        return catalog_out_ref, catalog_out_target, exceptions
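
For spherical coordinates, the listing converts `match_dist_max` from arcseconds to radians and uses it as a Euclidean distance bound on unit vectors. The sketch below illustrates why that works for small separations. The helper `_radec_to_xyz` is not shown on this page, so the function `radec_to_xyz` here is a hypothetical stand-in that assumes inputs in degrees and one output row per source; it is not the library's helper.

```python
import numpy as np


def radec_to_xyz(ra_deg, dec_deg):
    """Convert RA/Dec in degrees to unit vectors, one row per source."""
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    cos_dec = np.cos(dec)
    return np.column_stack([cos_dec * np.cos(ra), cos_dec * np.sin(ra), np.sin(dec)])


# Two sources 1 arcsec apart in declination
xyz = radec_to_xyz(np.array([150.0, 150.0]), np.array([2.0, 2.0 + 1.0 / 3600.0]))

# Straight-line (chord) distance between the unit vectors
chord = np.linalg.norm(xyz[1] - xyz[0])

# The same separation expressed as an angle in radians
angle = np.radians(1.0 / 3600.0)

# chord = 2 sin(angle / 2), which differs from angle by about angle**3 / 24
rel_diff = abs(chord - angle) / angle
```

At arcsecond scales the chord length and the angle agree to far better than any realistic matching tolerance, so a bound in radians can serve directly as the K-d tree distance limit.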

Member Data Documentation

◆ config [1/2]

MatchProbabilisticConfig lsst.meas.astrom.matcher_probabilistic.MatcherProbabilistic.config

static

Definition at line 422 of file matcher_probabilistic.py.

◆ config [2/2]

lsst.meas.astrom.matcher_probabilistic.MatcherProbabilistic.config

Definition at line 428 of file matcher_probabilistic.py.

The documentation for this class was generated from the following file:

/j/snowflake/release/lsstsw/stack/lsst-scipipe-8.0.0/Linux64/meas_astrom/gff1a9f87cc+fa3a7a026e/python/lsst/meas/astrom/matcher_probabilistic.py
