Metric operations

Implementations of various profile evaluation metrics. These can be used directly, but we recommend usage through cytominer_eval.evaluate().

cytominer_eval.evaluate() contains several checks to confirm metrics are properly implemented.

cytominer_eval.operations.grit

Grit describes phenotype strength of replicate profiles along two distinct axes:

  • Similarity to other perturbations that target the same larger group (e.g. gene, MOA)

  • Similarity to control perturbations, which serve as the reference (null) distribution

cytominer_eval.operations.grit.grit(similarity_melted_df: pandas.core.frame.DataFrame, control_perts: List[str], profile_col: str, replicate_group_col: str, replicate_summary_method: str = 'mean') → pandas.core.frame.DataFrame

Calculate grit

Parameters
  • similarity_melted_df (pandas.DataFrame) – a long pandas dataframe output from cytominer_eval.transform.metric_melt

  • control_perts (list) – a list of control perturbations to calculate a null distribution

  • profile_col (str) – the metadata column storing profile ids. The column can have unique or replicate identifiers.

  • replicate_group_col (str) – the metadata column indicating a higher order structure (group) than the profile column. E.g. target gene vs. guide in a CRISPR experiment.

  • replicate_summary_method ({'mean', 'median'}, optional) – how the replicate z-scores, computed relative to the control perturbations, are summarized. Defaults to “mean”.

Returns

A dataframe of grit measurements per perturbation

Return type

pandas.DataFrame
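
The idea behind the score can be illustrated with a minimal sketch in plain Python. It is not the library's implementation: the function name grit_sketch and the use of a population standard deviation are assumptions made for illustration. The similarities of one profile to the other members of its group are z-scored against that profile's similarities to the controls, then summarized with the chosen method.

```python
from statistics import mean, median, pstdev


def grit_sketch(group_similarities, control_similarities, summary="mean"):
    """Z-score group similarities against the control distribution, then summarize.

    Hypothetical helper for illustration; not part of cytominer_eval.
    """
    mu = mean(control_similarities)
    sigma = pstdev(control_similarities)  # population std is an assumption
    z_scores = [(s - mu) / sigma for s in group_similarities]
    summarize = {"mean": mean, "median": median}[summary]
    return summarize(z_scores)


# Similarities of one profile to control profiles (the null distribution)
control_sims = [-0.1, 0.0, 0.1, 0.0]
# Similarities of the same profile to other profiles in its replicate group
group_sims = [0.6, 0.8]

score = grit_sketch(group_sims, control_sims)
print(round(score, 3))
```

A large positive value indicates that the profile resembles its group far more than it resembles the controls.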

cytominer_eval.operations.mp_value

Functions to calculate multidimensional perturbation values (mp-value)

mp-value describes the distance, in dimensionality-reduced space, between a perturbation and a control [1].

References

[1] Hutz, J. et al. “The Multidimensional Perturbation Value: A Single Metric to Measure Similarity and Activity of Treatments in High-Throughput Multidimensional Screens” Journal of Biomolecular Screening, Volume: 18 issue: 4, page(s): 367-377. doi: 10.1177/1087057112469257

cytominer_eval.operations.mp_value.mp_value(df: pandas.core.frame.DataFrame, control_perts: List[str], replicate_id: str, features: List[str], params: dict = {}) → pandas.core.frame.DataFrame

Calculate multidimensional perturbation value (mp-value) [1].

Parameters
  • df (pandas.DataFrame) – profiles with measurements per row and features or metadata per column.

  • control_perts (list) – The control perturbations against which the distances will be computed.

  • replicate_id (str) – The metadata column that identifies control and replicate perturbations.

  • features (list) – columns containing numerical features to be used for the mp-value computation

  • params (dict, optional) – Optional parameters provided. See list of parameters in cytominer_eval.operations.util.default_mp_value_parameters()

Returns

mp-values per perturbation.

Return type

pandas.DataFrame

cytominer_eval.operations.replicate_reproducibility

Functions to calculate replicate reproducibility

cytominer_eval.operations.replicate_reproducibility.replicate_reproducibility(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str], quantile_over_null: float = 0.95, return_median_correlations: bool = False) → float

Summarize pairwise replicate correlations

For a given pairwise similarity matrix, replicate information, and specific options, output a replicate correlation summary.

Parameters
  • similarity_melted_df (pandas.DataFrame) – An elongated symmetrical matrix indicating pairwise correlations between samples. Importantly, it must follow the exact structure of the output from cytominer_eval.transform.transform.metric_melt().

  • replicate_groups (list) – A list of metadata column names in the original profile dataframe to indicate replicate samples.

  • quantile_over_null (float, optional) – A float between 0 and 1 indicating the threshold of nonreplicates to use when reporting percent matching or percent replicating. Defaults to 0.95.

  • return_median_correlations (bool, optional) – If True, also return median pairwise correlations per replicate. Defaults to False.

Returns

The replicate reproducibility of the profiles according to the replicate columns provided. If return_median_correlations = True then the function will return both the metric and a median pairwise correlation pandas.DataFrame.

Return type

{float, (float, pandas.DataFrame)}
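
One reading of the quantile_over_null parameter is sketched below in plain Python: the threshold is a quantile of the non-replicate correlations, and the metric is the fraction of replicate correlations above it. The helper names and the linear-interpolation quantile are assumptions for illustration and may differ from what the library does internally.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a list, for 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def fraction_over_null(replicate_corrs, non_replicate_corrs, quantile_over_null=0.95):
    """Fraction of replicate correlations above the null quantile (hypothetical helper)."""
    threshold = quantile(non_replicate_corrs, quantile_over_null)
    above = sum(1 for c in replicate_corrs if c > threshold)
    return above / len(replicate_corrs)


# Toy null distribution: non-replicate correlations 0.00, 0.01, ..., 1.00
non_replicates = [i / 100 for i in range(101)]
# Toy replicate correlations; two of the four exceed the 95th percentile (about 0.95)
replicates = [0.99, 0.97, 0.5, 0.2]

fraction_replicating = fraction_over_null(replicates, non_replicates)
print(fraction_replicating)
```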

cytominer_eval.operations.precision_recall

Functions to calculate precision and recall at a given k

cytominer_eval.operations.precision_recall.precision_recall(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str], k: int) → pandas.core.frame.DataFrame

Determine the precision and recall at k for all unique replicate groups based on a predefined similarity metric (see cytominer_eval.transform.metric_melt)

Parameters
  • similarity_melted_df (pandas.DataFrame) – An elongated symmetrical matrix indicating pairwise correlations between samples. Importantly, it must follow the exact structure of the output from cytominer_eval.transform.transform.metric_melt().

  • replicate_groups (list) – a list of metadata column names in the original profile dataframe to use as replicate columns.

  • k (int) – the number of top-ranked pairwise comparisons to consider (the k in precision and recall at k).

Returns

precision and recall metrics for all replicate groups given k

Return type

pandas.DataFrame
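
As a minimal sketch of precision and recall at k, the plain-Python example below ranks pairwise comparisons by similarity and inspects the top k. The function name and the input layout (a list of booleans ordered from most to least similar) are hypothetical and chosen only for illustration.

```python
def precision_recall_at_k(ranked_is_replicate, k):
    """Precision and recall at k for one ranked list (hypothetical helper).

    ranked_is_replicate: booleans ordered from most to least similar,
    True where the compared profile belongs to the same replicate group.
    """
    top_k = ranked_is_replicate[:k]
    true_positives = sum(top_k)
    total_replicates = sum(ranked_is_replicate)
    precision = true_positives / k
    recall = true_positives / total_replicates if total_replicates else float("nan")
    return {"k": k, "precision": precision, "recall": recall}


# (similarity, is_replicate) pairs for one query profile
pairs = [(0.9, True), (0.8, False), (0.7, True), (0.4, True), (0.1, False)]
ranked = [is_rep for _, is_rep in sorted(pairs, key=lambda p: p[0], reverse=True)]

result = precision_recall_at_k(ranked, k=2)
print(result)
```

With k=2, one of the two top-ranked comparisons is a replicate (precision 0.5) and one of the three replicates is recovered (recall 1/3).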

cytominer_eval.operations.util

class cytominer_eval.operations.util.MahalanobisEstimator(arr: Union[pandas.core.frame.DataFrame, numpy.ndarray])

Bases: object

Store location and dispersion estimators of the empirical distribution of data provided in an array and allow computation of statistical distances.

Parameters

arr ({pandas.DataFrame, np.ndarray}) – the matrix used to calculate covariance

sigma

Fitted covariance matrix of sklearn.covariance.EmpiricalCovariance()

Type

np.array

mahalanobis(X)

Computes the Mahalanobis distance between the distribution estimated from the input array (self.arr) and the points in the X array as provided

mahalanobis(X: Union[pandas.core.frame.DataFrame, numpy.ndarray]) → numpy.ndarray

Compute the Mahalanobis distance between the empirical distribution described by this object and points in an array X.

Parameters

X ({pandas.DataFrame, np.ndarray}) – A samples-by-features array-like matrix of points whose Mahalanobis distance to the distribution estimated from self.arr is computed

Returns

Mahalanobis distance between each row of X and the distribution described by the fitted sigma

Return type

numpy.ndarray
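
To illustrate what such a distance measures, the plain-Python sketch below handles the two-feature case only, so that the covariance matrix can be inverted by hand. The helper names are hypothetical and not part of cytominer_eval. The sketch assumes a maximum-likelihood covariance (dividing by n) and returns the non-squared distance, both of which may differ from the library's conventions.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of (x, y) rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_2d(rows):
    """Maximum-likelihood covariance entries (sxx, sxy, syy); divides by n."""
    mx, my = mean_vector(rows)
    n = len(rows)
    sxx = sum((x - mx) ** 2 for x, _ in rows) / n
    syy = sum((y - my) ** 2 for _, y in rows) / n
    sxy = sum((x - mx) * (y - my) for x, y in rows) / n
    return sxx, sxy, syy


def mahalanobis_2d(point, rows):
    """Distance from point to the distribution estimated from rows."""
    mx, my = mean_vector(rows)
    sxx, sxy, syy = covariance_2d(rows)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Inverse of [[sxx, sxy], [sxy, syy]] is 1/det * [[syy, -sxy], [-sxy, sxx]]
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)


# Control profiles with two features: mean (1, 1), unit variances, no covariance
controls = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
distance_to_controls = mahalanobis_2d((3.0, 1.0), controls)
print(distance_to_controls)  # 2.0
```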

cytominer_eval.operations.util.assign_replicates(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str]) → pandas.core.frame.DataFrame

Determine which profiles should be considered replicates.

Given an elongated pairwise correlation matrix with metadata annotations, determine how to assign replicate information.

Parameters
  • similarity_melted_df (pandas.DataFrame) – Long pandas DataFrame of annotated pairwise correlations output from cytominer_eval.transform.transform.metric_melt().

  • replicate_groups (list) – a list of metadata column names in the original profile dataframe used to indicate replicate profiles.

Returns

A similarity_melted_df but with added columns indicating whether or not the pairwise similarity metric is comparing replicates or not. Used in most eval operations.

Return type

pandas.DataFrame
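
The logic of flagging replicate pairs can be sketched in plain Python as follows. The column suffixes "_pair_a" and "_pair_b", the output key "group_replicate", and the function name are assumptions made for this illustration; consult the output of metric_melt() for the actual column names.

```python
def flag_replicate_pairs(pair_rows, replicate_groups):
    """Mark each pairwise comparison as replicate or not (hypothetical helper).

    A pair counts as a replicate only if both sides agree on every
    column listed in replicate_groups.
    """
    flagged = []
    for row in pair_rows:
        is_replicate = all(
            row[f"{col}_pair_a"] == row[f"{col}_pair_b"] for col in replicate_groups
        )
        flagged.append({**row, "group_replicate": is_replicate})
    return flagged


# Two pairwise comparisons with assumed (hypothetical) column names
pair_rows = [
    {"Metadata_gene_pair_a": "TP53", "Metadata_gene_pair_b": "TP53", "similarity_metric": 0.9},
    {"Metadata_gene_pair_a": "TP53", "Metadata_gene_pair_b": "MYC", "similarity_metric": 0.1},
]

flagged = flag_replicate_pairs(pair_rows, replicate_groups=["Metadata_gene"])
print([row["group_replicate"] for row in flagged])  # [True, False]
```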

cytominer_eval.operations.util.calculate_grit(replicate_group_df: pandas.core.frame.DataFrame, control_perts: List[str], column_id_info: dict, replicate_summary_method: str = 'mean') → pandas.core.series.Series

Given an elongated pairwise correlation dataframe of replicate groups, calculate grit.

Usage: Designed to be called within a pandas.DataFrame().groupby().apply(). See cytominer_eval.operations.grit.grit().

Parameters
  • replicate_group_df (pandas.DataFrame) – An elongated dataframe storing pairwise correlations of all profiles to a single replicate group.

  • control_perts (list) – The profile_ids that should be considered controls (the reference)

  • column_id_info (dict) – A dictionary of column identifiers noting profile and replicate group ids. This variable is autogenerated in cytominer_eval.transform.util.set_grit_column_info().

  • replicate_summary_method ({'mean', 'median'}, optional) – how the replicate z-scores, computed relative to the control perturbations, are summarized. Defaults to “mean”.

Returns

A return bundle of identifiers (perturbation, group) and results (grit score). The dictionary has keys (“perturbation”, “group”, “grit_score”). “grit_score” will be NaN if no other profiles exist in the defined group.

Return type

dict

cytominer_eval.operations.util.calculate_mahalanobis(pert_df: pandas.core.frame.DataFrame, control_df: pandas.core.frame.DataFrame) → pandas.core.series.Series

Given perturbation and control dataframes, calculate the Mahalanobis distance per perturbation

Usage: Designed to be called within a pandas.DataFrame().groupby().apply(). See cytominer_eval.operations.util.calculate_mp_value().

Parameters
  • pert_df (pandas.DataFrame) – A pandas dataframe of replicate perturbations (samples by features)

  • control_df (pandas.DataFrame) – A pandas dataframe of control perturbations (samples by features). Must have the same feature measurements as pert_df

Returns

The Mahalanobis distance between perturbation and control

Return type

float

cytominer_eval.operations.util.calculate_mp_value(pert_df: pandas.core.frame.DataFrame, control_df: pandas.core.frame.DataFrame, params: dict = {}) → pandas.core.series.Series

Given perturbation and control dataframes, calculate mp-value per perturbation

Usage: Designed to be called within a pandas.DataFrame().groupby().apply(). See cytominer_eval.operations.mp_value.mp_value().

Parameters
  • pert_df (pandas.DataFrame) – A pandas dataframe of replicate perturbations (samples by features)

  • control_df (pandas.DataFrame) – A pandas dataframe of control perturbations (samples by features). Must have the same feature measurements as pert_df

  • params (dict, optional) – the parameters to use when calculating mp value. See cytominer_eval.operations.util.default_mp_value_parameters().

Returns

The mp value for the given perturbation

Return type

float
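
The permutation step behind an empirical p-value can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the library's procedure: the statistic is a one-dimensional absolute difference of means standing in for a Mahalanobis distance in PCA space, and the function names and seed argument are hypothetical. Only the default of 100 permutations mirrors the nb_permutations default.

```python
import random
from statistics import mean


def distance(pert_values, control_values):
    """Stand-in statistic: absolute difference of means (an assumption)."""
    return abs(mean(pert_values) - mean(control_values))


def permutation_p_value(pert_values, control_values, nb_permutations=100, seed=0):
    """Fraction of label shufflings whose distance is at least the observed one."""
    rng = random.Random(seed)
    observed = distance(pert_values, control_values)
    pooled = list(pert_values) + list(control_values)
    n_pert = len(pert_values)
    at_least_as_extreme = 0
    for _ in range(nb_permutations):
        rng.shuffle(pooled)  # randomly reassign perturbation and control labels
        if distance(pooled[:n_pert], pooled[n_pert:]) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / nb_permutations


# Toy data: the perturbation is clearly separated from the controls
pert = [5.1, 4.9, 5.3, 5.0]
control = [0.2, -0.1, 0.0, 0.3, -0.2, 0.1]

p_value = permutation_p_value(pert, control)
print(p_value)
```

A small value suggests that the observed separation is unlikely to arise from a random labeling of the samples.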

cytominer_eval.operations.util.calculate_precision_recall(replicate_group_df: pandas.core.frame.DataFrame, k: int) → pandas.core.series.Series

Given an elongated pairwise correlation dataframe of replicate groups, calculate precision and recall.

Usage: Designed to be called within a pandas.DataFrame().groupby().apply(). See cytominer_eval.operations.precision_recall.precision_recall().

Parameters
  • replicate_group_df (pandas.DataFrame) – An elongated dataframe storing pairwise correlations of all profiles to a single replicate group.

  • k (int) – the number of top-ranked pairwise comparisons to consider (the k in precision and recall at k).

Returns

A return bundle of identifiers (k) and results (precision and recall at k). The dictionary has keys (“k”, “precision”, “recall”).

Return type

dict

cytominer_eval.operations.util.default_mp_value_parameters()

Set the different default parameters used for mp-values.

Returns

A default parameter set with keys: rescale_pca (whether the PCA should be scaled by variance explained) and nb_permutations (how many permutations to calculate empirical p-value). Defaults to True and 100, respectively.

Return type

dict

cytominer_eval.operations.util.get_grit_entry(df: pandas.core.frame.DataFrame, col: str) → str

Helper function to define the perturbation identifier of interest

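Grit must be calculated per unique perturbation identifier. Because the profile column can hold either unique or replicate identifiers, a unique perturbation does not necessarily correspond to a unique profile.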
