Metric operations
Implementations of various profile evaluation metrics. These can be used directly, but we recommend usage through cytominer_eval.evaluate().
cytominer_eval.evaluate() contains several checks to confirm metrics are properly implemented.
cytominer_eval.operations.grit
Grit describes phenotype strength of replicate profiles along two distinct axes:
- Similarity to other perturbations that target the same larger group (e.g. gene, MOA), with respect to:
- Similarity to control perturbations
-
cytominer_eval.operations.grit.grit(similarity_melted_df: pandas.core.frame.DataFrame, control_perts: List[str], profile_col: str, replicate_group_col: str, replicate_summary_method: str = 'mean') → pandas.core.frame.DataFrame
Calculate grit
- Parameters
similarity_melted_df (pandas.DataFrame) – a long pandas dataframe output from cytominer_eval.transform.metric_melt
control_perts (list) – a list of control perturbations to calculate a null distribution
profile_col (str) – the metadata column storing profile ids. The column can have unique or replicate identifiers.
replicate_group_col (str) – the metadata column indicating a higher order structure (group) than the profile column. E.g. target gene vs. guide in a CRISPR experiment.
replicate_summary_method ({'mean', 'median'}, optional) – how replicate z-scores to control perts are summarized. Defaults to “mean”.
- Returns
A dataframe of grit measurements per perturbation
- Return type
pandas.DataFrame
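The parameter descriptions above suggest that grit z-scores a profile's similarities to its own group against its similarities to the controls, then summarizes the z-scores. The following is a minimal sketch of that idea under stated assumptions, not the library's implementation: the function name `grit_sketch`, its arguments, and the choice of a population standard deviation are all hypothetical.

```python
import numpy as np


def grit_sketch(group_sims, control_sims, summary="mean"):
    """Z-score same-group similarities against the control similarities.

    Hypothetical helper for illustration; not part of cytominer_eval.
    """
    group_sims = np.asarray(group_sims, dtype=float)
    control_sims = np.asarray(control_sims, dtype=float)

    # No other profiles in the group: the score is undefined
    if group_sims.size == 0:
        return float("nan")

    # Location and scale of the null (control) distribution.
    # Assumption: population standard deviation (ddof=0).
    mu = control_sims.mean()
    sd = control_sims.std()

    z_scores = (group_sims - mu) / sd
    if summary == "median":
        return float(np.median(z_scores))
    return float(z_scores.mean())


# Similarities of one profile to controls and to its group members
control = [0.0, 0.1, -0.1, 0.2, -0.2]
same_group = [0.8, 0.6]
score = grit_sketch(same_group, control)
empty_score = grit_sketch([], control)
```

A large positive score means the profile resembles its group far more than it resembles the controls. The empty-group case mirrors the documented behavior that the grit score is NaN when no other profiles exist in the group.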
cytominer_eval.operations.mp_value
Functions to calculate multidimensional perturbation values (mp-value)
mp-value describes the distance, in dimensionality-reduced space, between a perturbation and a control [1].
References
[1] Hutz, J. et al. “The Multidimensional Perturbation Value: A Single Metric to Measure Similarity and Activity of Treatments in High-Throughput Multidimensional Screens” Journal of Biomolecular Screening, Volume: 18 issue: 4, page(s): 367-377. doi: 10.1177/1087057112469257
-
cytominer_eval.operations.mp_value.mp_value(df: pandas.core.frame.DataFrame, control_perts: List[str], replicate_id: str, features: List[str], params: dict = {}) → pandas.core.frame.DataFrame
Calculate multidimensional perturbation value (mp-value) [1].
- Parameters
df (pandas.DataFrame) – profiles with measurements per row and features or metadata per column.
control_perts (list) – The control perturbations against which the distances will be computed.
replicate_id (str) – The metadata column that identifies control and replicate perturbations.
features (list) – columns containing numerical features to be used for the mp-value computation
params (dict, optional) – Optional parameters. See cytominer_eval.operations.util.default_mp_value_parameters() for the available keys.
- Returns
mp-values per perturbation.
- Return type
pd.DataFrame
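The core idea of an empirical p-value from label permutations can be sketched as follows. This is a simplified illustration, not the library's method: it skips the dimensionality reduction step entirely, uses a single covariance estimated from the pooled samples, and the name `mp_value_sketch` and its arguments are hypothetical.

```python
import numpy as np


def mp_value_sketch(pert, ctrl, nb_permutations=100, seed=0):
    """Empirical p-value for the distance between perturbation and control.

    Hypothetical helper for illustration; not part of cytominer_eval.
    """
    pert = np.asarray(pert, dtype=float)
    ctrl = np.asarray(ctrl, dtype=float)
    pooled = np.vstack([pert, ctrl])

    # Simplification: one covariance from the pooled samples, so it does not
    # change when labels are shuffled. pinv tolerates a singular covariance.
    inv_cov = np.linalg.pinv(np.cov(pooled, rowvar=False))

    def distance(is_pert):
        # Mahalanobis-style distance between the two group centroids
        diff = pooled[is_pert].mean(axis=0) - pooled[~is_pert].mean(axis=0)
        return float(np.sqrt(max(diff @ inv_cov @ diff, 0.0)))

    labels = np.zeros(len(pooled), dtype=bool)
    labels[: len(pert)] = True
    observed = distance(labels)

    # Null distribution: distances after shuffling the group labels
    rng = np.random.default_rng(seed)
    null = np.array(
        [distance(rng.permutation(labels)) for _ in range(nb_permutations)]
    )
    return float((null >= observed).mean())


data_rng = np.random.default_rng(42)
control = data_rng.normal(0.0, 1.0, size=(20, 3))
shifted = data_rng.normal(5.0, 1.0, size=(20, 3))
p_shifted = mp_value_sketch(shifted, control)
```

A perturbation that sits far from the controls yields a small value, because few shuffled labelings reproduce a centroid distance as large as the observed one.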
cytominer_eval.operations.replicate_reproducibility
Functions to calculate replicate reproducibility
-
cytominer_eval.operations.replicate_reproducibility.replicate_reproducibility(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str], quantile_over_null: float = 0.95, return_median_correlations: bool = False) → float
Summarize pairwise replicate correlations
For a given pairwise similarity matrix, replicate information, and specific options, output a replicate correlation summary.
- Parameters
similarity_melted_df (pandas.DataFrame) – An elongated symmetrical matrix indicating pairwise correlations between samples. Importantly, it must follow the exact structure as output from cytominer_eval.transform.transform.metric_melt().
replicate_groups (list) – A list of metadata column names in the original profile dataframe to indicate replicate samples.
quantile_over_null (float, optional) – A float between 0 and 1 indicating the threshold of nonreplicates to use when reporting percent matching or percent replicating. Defaults to 0.95.
return_median_correlations (bool, optional) – If True, also return median pairwise correlations per replicate. Defaults to False.
- Returns
The replicate reproducibility of the profiles according to the replicate columns provided. If return_median_correlations = True then the function will return both the metric and a median pairwise correlation pandas.DataFrame.
- Return type
{float, (float, pd.DataFrame)}
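The role of quantile_over_null can be illustrated with a short sketch: take the chosen quantile of the non-replicate similarities as a threshold, then report the share of replicate similarities above it. The function name `fraction_over_null` is hypothetical, and whether the library reports a fraction or a percentage is not stated here, so the return scale is an assumption.

```python
import numpy as np


def fraction_over_null(replicate_sims, non_replicate_sims, quantile_over_null=0.95):
    """Share of replicate similarities above a quantile of the null.

    Hypothetical helper for illustration; not part of cytominer_eval.
    """
    replicate_sims = np.asarray(replicate_sims, dtype=float)

    # Threshold defined by the non-replicate (null) distribution
    threshold = np.quantile(non_replicate_sims, quantile_over_null)

    # Fraction of replicate pairs that exceed the threshold
    return float((replicate_sims > threshold).mean())


non_replicates = np.linspace(-0.5, 0.5, 101)
replicates = [0.9, 0.8, 0.3, 0.7]
summary = fraction_over_null(replicates, non_replicates)
```

Raising quantile_over_null makes the threshold stricter, so fewer replicate pairs count as reproducible.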
cytominer_eval.operations.precision_recall
Functions to calculate precision and recall at a given k
-
cytominer_eval.operations.precision_recall.precision_recall(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str], k: int) → pandas.core.frame.DataFrame
Determine the precision and recall at k for all unique replicate groups based on a predefined similarity metric (see cytominer_eval.transform.metric_melt)
- Parameters
similarity_melted_df (pandas.DataFrame) – An elongated symmetrical matrix indicating pairwise correlations between samples. Importantly, it must follow the exact structure as output from cytominer_eval.transform.transform.metric_melt().
replicate_groups (List) – a list of metadata column names in the original profile dataframe to use as replicate columns.
k (int) – the number of top-ranked pairwise comparisons to include when calculating precision and recall.
- Returns
precision and recall metrics for all replicate groups given k
- Return type
pandas.DataFrame
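Precision and recall at k follow the standard retrieval definitions, which can be sketched without any library code. The example below assumes the pairwise comparisons for one group are ranked from most to least similar; the function name and input layout are hypothetical and are not the library's own.

```python
def precision_recall_at_k(pairs, k):
    """Compute precision and recall at k for one group.

    pairs: list of (similarity, is_replicate) tuples.
    Hypothetical helper for illustration; not part of cytominer_eval.
    """
    # Rank comparisons from most to least similar
    ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
    flags = [is_replicate for _, is_replicate in ranked]

    true_positives = sum(flags[:k])
    total_replicates = sum(flags)

    # Precision: share of the top k that are replicates
    precision = true_positives / k
    # Recall: share of all replicates found in the top k
    recall = true_positives / total_replicates if total_replicates else float("nan")
    return {"k": k, "precision": precision, "recall": recall}


pairs = [(0.9, True), (0.2, False), (0.8, True), (0.5, True), (0.6, False), (0.1, False)]
at_3 = precision_recall_at_k(pairs, k=3)
at_2 = precision_recall_at_k(pairs, k=2)
```

With these toy values the three replicate pairs rank first, second, and fourth, so increasing k from 2 to 3 lowers precision while recall stays the same.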
cytominer_eval.operations.util
-
class cytominer_eval.operations.util.MahalanobisEstimator(arr: Union[pandas.core.frame.DataFrame, numpy.ndarray])
Bases: object
Store location and dispersion estimators of the empirical distribution of data provided in an array and allow computation of statistical distances.
- Parameters
arr ({pandas.DataFrame, np.ndarray}) – the matrix used to calculate covariance
-
sigma
Fitted covariance matrix of sklearn.covariance.EmpiricalCovariance()
- Type
np.array
-
mahalanobis(X)
Computes mahalanobis distance between the input array (self.arr) and the X array as provided
-
mahalanobis(X: Union[pandas.core.frame.DataFrame, numpy.ndarray]) → numpy.ndarray
Compute the mahalanobis distance between the empirical distribution described by this object and points in an array X.
- Parameters
X ({pandas.DataFrame, np.ndarray}) – A samples by features array-like matrix for which to compute the mahalanobis distance to the distribution fitted on self.arr
- Returns
Mahalanobis distance between the input array and the original sigma
- Return type
numpy.array
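The estimator's behavior can be approximated with plain NumPy. The sketch below is a stand-in under assumptions, not the class itself: `MahalanobisSketch` is a hypothetical name, the covariance is the maximum-likelihood estimate (dividing by n), and the method is assumed to return unsquared distances.

```python
import numpy as np


class MahalanobisSketch:
    """Store mean and covariance of an array and compute distances to it.

    Hypothetical stand-in for illustration; not part of cytominer_eval.
    """

    def __init__(self, arr):
        arr = np.asarray(arr, dtype=float)
        self.mu = arr.mean(axis=0)
        # Maximum-likelihood covariance (divides by n, not n - 1)
        self.sigma = np.cov(arr, rowvar=False, bias=True)

    def mahalanobis(self, X):
        # Center each row of X on the fitted mean
        diff = np.asarray(X, dtype=float) - self.mu
        inv_sigma = np.linalg.inv(self.sigma)
        # Row-wise quadratic form diff_i^T * inv_sigma * diff_i
        squared = np.einsum("ij,jk,ik->i", diff, inv_sigma, diff)
        return np.sqrt(squared)


reference = [[0, 0], [2, 0], [0, 2], [2, 2]]
estimator = MahalanobisSketch(reference)
distances = estimator.mahalanobis([[1, 1], [4, 5]])
```

The reference points here have mean (1, 1) and identity covariance, so the result reduces to the Euclidean distance from (1, 1).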
-
cytominer_eval.operations.util.assign_replicates(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str]) → pandas.core.frame.DataFrame
Determine which profiles should be considered replicates.
Given an elongated pairwise correlation matrix with metadata annotations, determine how to assign replicate information.
- Parameters
similarity_melted_df (pandas.DataFrame) – Long pandas DataFrame of annotated pairwise correlations output from cytominer_eval.transform.transform.metric_melt().
replicate_groups (list) – a list of metadata column names in the original profile dataframe used to indicate replicate profiles.
- Returns
A similarity_melted_df with added columns indicating whether each pairwise similarity compares replicates. Used in most eval operations.
- Return type
pd.DataFrame
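Replicate assignment on a melted similarity table can be sketched as a column-wise comparison of the two sides of each pair. The column suffixes `_pair_a` and `_pair_b`, the added `_replicate` and `group_replicate` columns, and the function name `mark_replicates` are assumptions made for this illustration; check the output of metric_melt() for the actual layout.

```python
import pandas as pd


def mark_replicates(similarity_melted_df, replicate_groups):
    """Flag pairs whose metadata match in every replicate column.

    Hypothetical helper for illustration; not part of cytominer_eval.
    """
    out = similarity_melted_df.copy()
    flag_cols = []
    for col in replicate_groups:
        flag = f"{col}_replicate"
        # True when both profiles in the pair share this metadata value
        out[flag] = out[f"{col}_pair_a"] == out[f"{col}_pair_b"]
        flag_cols.append(flag)

    # A pair is a replicate only if all replicate columns match
    out["group_replicate"] = out[flag_cols].all(axis=1)
    return out


melted = pd.DataFrame(
    {
        "gene_pair_a": ["A", "A", "B"],
        "gene_pair_b": ["A", "B", "B"],
        "dose_pair_a": [1, 1, 2],
        "dose_pair_b": [1, 1, 1],
        "similarity_metric": [0.9, 0.1, 0.4],
    }
)
annotated = mark_replicates(melted, ["gene", "dose"])
```

Only the first pair matches on both gene and dose, so it is the only one flagged as a replicate of the full group.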
-
cytominer_eval.operations.util.calculate_grit(replicate_group_df: pandas.core.frame.DataFrame, control_perts: List[str], column_id_info: dict, replicate_summary_method: str = 'mean') → pandas.core.series.Series
Given an elongated pairwise correlation dataframe of replicate groups, calculate grit.
Usage: Designed to be called within a pandas.DataFrame().groupby().apply(). See cytominer_eval.operations.grit.grit().
- Parameters
replicate_group_df (pandas.DataFrame) – An elongated dataframe storing pairwise correlations of all profiles to a single replicate group.
control_perts (list) – The profile_ids that should be considered controls (the reference)
column_id_info (dict) – A dictionary of column identifiers noting profile and replicate group ids. This variable is autogenerated in cytominer_eval.transform.util.set_grit_column_info().
replicate_summary_method ({'mean', 'median'}, optional) – how replicate z-scores to control perts are summarized. Defaults to “mean”.
- Returns
A return bundle of identifiers (perturbation, group) and results (grit score). The dictionary has keys (“perturbation”, “group”, “grit_score”). “grit_score” will be NaN if no other profiles exist in the defined group.
- Return type
dict
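The groupby().apply() usage pattern mentioned above works the same way for any per-group helper: the helper receives one group's rows and returns a labeled bundle of results, which pandas stacks into one row per group. The sketch below uses a hypothetical helper and made-up column names rather than calculate_grit itself.

```python
import pandas as pd


def summarize_group(group_df):
    """Return a labeled result bundle for one group of rows.

    Hypothetical helper for illustration; not part of cytominer_eval.
    """
    return pd.Series(
        {
            "n_pairs": len(group_df),
            "mean_similarity": group_df["similarity_metric"].mean(),
        }
    )


pairs = pd.DataFrame(
    {
        "group": ["A", "A", "B"],
        "similarity_metric": [0.8, 0.6, 0.1],
    }
)

# Select the value columns explicitly so the helper never depends on
# whether pandas passes the grouping column into each group.
per_group = pairs.groupby("group")[["similarity_metric"]].apply(summarize_group)
per_group = per_group.reset_index()
```

Each key of the returned Series becomes a column, so the result has one row per group with the grouping key restored by reset_index().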
-
cytominer_eval.operations.util.calculate_mahalanobis(pert_df: pandas.core.frame.DataFrame, control_df: pandas.core.frame.DataFrame) → pandas.core.series.Series
Given perturbation and control dataframes, calculate mahalanobis distance per perturbation
Usage: Designed to be called within a pandas.DataFrame().groupby().apply(). See cytominer_eval.operations.util.calculate_mp_value().
- Parameters
pert_df (pandas.DataFrame) – A pandas dataframe of replicate perturbations (samples by features)
control_df (pandas.DataFrame) – A pandas dataframe of control perturbations (samples by features). Must have the same feature measurements as pert_df
- Returns
The mahalanobis distance between perturbation and control
- Return type
float
-
cytominer_eval.operations.util.calculate_mp_value(pert_df: pandas.core.frame.DataFrame, control_df: pandas.core.frame.DataFrame, params: dict = {}) → pandas.core.series.Series
Given perturbation and control dataframes, calculate mp-value per perturbation
Usage: Designed to be called within a pandas.DataFrame().groupby().apply(). See cytominer_eval.operations.mp_value.mp_value().
- Parameters
pert_df (pandas.DataFrame) – A pandas dataframe of replicate perturbations (samples by features)
control_df (pandas.DataFrame) – A pandas dataframe of control perturbations (samples by features). Must have the same feature measurements as pert_df
params ({dict}, optional) – the parameters to use when calculating mp value. See cytominer_eval.operations.util.default_mp_value_parameters().
- Returns
The mp value for the given perturbation
- Return type
float
-
cytominer_eval.operations.util.calculate_precision_recall(replicate_group_df: pandas.core.frame.DataFrame, k: int) → pandas.core.series.Series
Given an elongated pairwise correlation dataframe of replicate groups, calculate precision and recall.
Usage: Designed to be called within a pandas.DataFrame().groupby().apply(). See cytominer_eval.operations.precision_recall.precision_recall().
- Parameters
replicate_group_df (pandas.DataFrame) – An elongated dataframe storing pairwise correlations of all profiles to a single replicate group.
k (int) – the number of top-ranked pairwise comparisons to include when calculating precision and recall.
- Returns
A return bundle of identifiers (k) and results (precision and recall at k). The dictionary has keys (“k”, “precision”, “recall”).
- Return type
dict
-
cytominer_eval.operations.util.default_mp_value_parameters()
Set the different default parameters used for mp-values.
- Returns
A default parameter set with keys: rescale_pca (whether the PCA should be scaled by variance explained) and nb_permutations (how many permutations to calculate empirical p-value). Defaults to True and 100, respectively.
- Return type
dict
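One way to combine user-supplied params with these defaults is sketched below. The two keys and their default values come from the description above; the helper names `default_params_sketch` and `resolve_params`, and the choice to reject unknown keys, are assumptions for illustration and may differ from how the library handles overrides.

```python
def default_params_sketch():
    """Defaults as described above; a local stand-in, not the library call."""
    return {"rescale_pca": True, "nb_permutations": 100}


def resolve_params(params=None):
    """Overlay user-supplied values on the defaults.

    Hypothetical helper for illustration; not part of cytominer_eval.
    """
    overrides = dict(params or {})
    resolved = default_params_sketch()

    # Assumption: unknown keys are treated as an error rather than ignored
    unknown = set(overrides) - set(resolved)
    if unknown:
        raise ValueError(f"Unknown mp-value parameters: {sorted(unknown)}")

    resolved.update(overrides)
    return resolved


only_defaults = resolve_params()
more_permutations = resolve_params({"nb_permutations": 1000})
```

Building a fresh dictionary on every call avoids sharing state between calls, which matters when a function signature uses a mutable default such as `params: dict = {}`.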
-
cytominer_eval.operations.util.get_grit_entry(df: pandas.core.frame.DataFrame, col: str) → str
Helper function to define the perturbation identifier of interest
Grit must be calculated using unique perturbations. This may or may not mean unique profiles.