Metrics¶
Implementations of various profile evaluation metrics. These can be used directly, but we recommend using them through cytominer_eval.evaluate().
cytominer_eval.evaluate() contains several checks to confirm metrics are properly implemented.
Grit¶
Grit describes the phenotype strength of replicate profiles along two distinct axes: similarity to other perturbations that target the same larger group (e.g. gene, MOA), measured with respect to similarity to control perturbations.
-
cytominer_eval.operations.grit.grit(similarity_melted_df: pandas.core.frame.DataFrame, control_perts: List[str], profile_col: str, replicate_group_col: str, replicate_summary_method: str = 'mean') → pandas.core.frame.DataFrame¶ Calculate grit
- Parameters
similarity_melted_df (pandas.DataFrame) – a long pandas dataframe output from cytominer_eval.transform.metric_melt
control_perts (list) – a list of control perturbations to calculate a null distribution
profile_col (str) – the metadata column storing profile ids. The column can have unique or replicate identifiers.
replicate_group_col (str) – the metadata column indicating a higher order structure (group) than the profile column. E.g. target gene vs. guide in a CRISPR experiment.
replicate_summary_method ({'mean', 'median'}, optional) – how replicate z-scores to control perts are summarized. Defaults to “mean”.
- Returns
A dataframe of grit measurements per perturbation
- Return type
pandas.DataFrame
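The idea behind grit can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `grit_sketch` and the example similarity values are hypothetical, and the use of the population standard deviation for the control (null) distribution is an assumption of this sketch.

```python
from statistics import mean, median, pstdev


def grit_sketch(group_sims, control_sims, summary="mean"):
    """Z-score one profile's similarities to same-group perturbations
    against its similarities to control perturbations, then summarize.

    Hypothetical helper for illustration; assumes control_sims has
    non-zero spread.
    """
    null_mean = mean(control_sims)
    null_std = pstdev(control_sims)  # assumption: population standard deviation
    z_scores = [(s - null_mean) / null_std for s in group_sims]
    return mean(z_scores) if summary == "mean" else median(z_scores)


# Hypothetical similarities for a single profile
control_sims = [0.0, 0.1, -0.1, 0.2, -0.2]  # similarity to control perturbations
group_sims = [0.8, 0.6]                     # similarity to same-group perturbations

grit_value = grit_sketch(group_sims, control_sims)
```

A large positive value indicates that the profile resembles the other members of its group much more than it resembles controls; a value near zero indicates that it cannot be distinguished from the control distribution.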
mp-value¶
Functions to calculate multidimensional perturbation values (mp-value)
mp-value describes the distance, in dimensionality-reduced space, between a perturbation and a control [1].
References
- [1] Hutz, J. et al. "The Multidimensional Perturbation Value: A Single Metric to Measure Similarity and Activity of Treatments in High-Throughput Multidimensional Screens." Journal of Biomolecular Screening, 18(4), 367-377. doi: 10.1177/1087057112469257
-
cytominer_eval.operations.mp_value.mp_value(df: pandas.core.frame.DataFrame, control_perts: List[str], replicate_id: str, features: List[str], params: dict = {}) → pandas.core.frame.DataFrame¶ Calculate multidimensional perturbation value (mp-value) [1].
- Parameters
df (pandas.DataFrame) – profiles with measurements per row and features or metadata per column.
control_perts (list) – The control perturbations against which the distances will be computed.
replicate_id (str) – The metadata identifier marking which column tracks control and replicate perts.
features (list) – columns containing numerical features to be used for the mp-value computation
params (dict, optional) – Optional parameters for the computation. See the list of parameters in cytominer_eval.operations.util.default_mp_value_parameters().
- Returns
mp-values per perturbation.
- Return type
pandas.DataFrame
Replicate reproducibility¶
Functions to calculate replicate reproducibility
-
cytominer_eval.operations.replicate_reproducibility.replicate_reproducibility(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str], quantile_over_null: float = 0.95, return_median_correlations: bool = False) → float¶ Summarize pairwise replicate correlations
For a given pairwise similarity matrix, replicate information, and specific options, output a replicate correlation summary.
- Parameters
similarity_melted_df (pandas.DataFrame) – An elongated symmetrical matrix indicating pairwise correlations between samples. Importantly, it must follow the exact structure output by cytominer_eval.transform.transform.metric_melt().
replicate_groups (list) – A list of metadata column names in the original profile dataframe that indicate replicate samples.
quantile_over_null (float, optional) – A float between 0 and 1 indicating the threshold of nonreplicates to use when reporting percent matching or percent replicating. Defaults to 0.95.
return_median_correlations (bool, optional) – If True, also return median pairwise correlations per replicate. Defaults to False.
- Returns
The replicate reproducibility of the profiles according to the replicate columns provided. If return_median_correlations = True then the function will return both the metric and a median pairwise correlation pandas.DataFrame.
- Return type
{float, (float, pd.DataFrame)}
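The role of quantile_over_null can be illustrated with a small stdlib-only sketch. It is a simplified reading of the description above, not the library's code: the helper names `quantile` and `percent_replicating`, the strict `>` comparison, the linear-interpolation quantile, and the example correlations are all assumptions made for illustration.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers, with 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def percent_replicating(replicate_corrs, nonreplicate_corrs, quantile_over_null=0.95):
    """Fraction of replicate correlations above the chosen quantile
    of the non-replicate (null) correlations."""
    threshold = quantile(nonreplicate_corrs, quantile_over_null)
    above = sum(1 for corr in replicate_corrs if corr > threshold)
    return above / len(replicate_corrs)


# Hypothetical pairwise correlations
nonreplicate_corrs = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0
replicate_corrs = [0.99, 0.97, 0.5, 0.96]

score = percent_replicating(replicate_corrs, nonreplicate_corrs)
```

Here three of the four replicate correlations exceed the 95th percentile of the null correlations, so `score` is 0.75. Raising quantile_over_null makes the threshold stricter and can only lower the score.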
Precision and recall¶
Functions to calculate precision and recall at a given k
-
cytominer_eval.operations.precision_recall.precision_recall(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str], groupby_columns: List[str], k: Union[int, List[int]]) → pandas.core.frame.DataFrame¶ Determine the precision and recall at k for all unique groupby_columns samples based on a predefined similarity metric (see cytominer_eval.transform.metric_melt)
- Parameters
similarity_melted_df (pandas.DataFrame) – An elongated symmetrical matrix indicating pairwise correlations between samples. Importantly, it must follow the exact structure output by cytominer_eval.transform.transform.metric_melt().
replicate_groups (List) – A list of metadata column names in the original profile dataframe to use as replicate columns.
groupby_columns (List of str) – Columns by which the similarity matrix is grouped and by which precision and recall are calculated. For example, if groupby_columns = Metadata_sample, then precision is calculated for each sample. Calculating precision per sample is the default; calculating it at the MOA level is also mathematically valid, but less intuitive to interpret.
k (int or List of ints) – An integer, or list of integers, indicating how many of the top pairwise comparisons to threshold.
- Returns
Precision and recall metrics for all groupby_columns groups at each given k
- Return type
pandas.DataFrame
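As a rough illustration of precision and recall at k for a single sample, consider its neighbours ranked by decreasing similarity, each flagged as a replicate or not. The sketch below is an assumption-laden simplification, not the library's implementation: `precision_recall_at_k` and the example ranking are hypothetical.

```python
def precision_recall_at_k(ranked_is_replicate, k):
    """Precision and recall at k for one sample.

    ranked_is_replicate: booleans for the sample's neighbours, sorted by
    decreasing similarity; True marks a neighbour from the same replicate group.
    """
    true_positives = sum(ranked_is_replicate[:k])
    total_replicates = sum(ranked_is_replicate)
    precision = true_positives / k
    recall = true_positives / total_replicates if total_replicates else 0.0
    return precision, recall


# Hypothetical ranking: 3 replicates among 6 neighbours
ranked = [True, False, True, False, False, True]

# k may be a single integer or several values, as in the k parameter above
results = {k: precision_recall_at_k(ranked, k) for k in (2, 4)}
```

At k = 2 one of the two top neighbours is a replicate (precision 0.5, recall 1/3); at k = 4 two of four are (precision 0.5, recall 2/3). Recall can only grow with k, while precision may rise or fall.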
Enrichment¶
Function to calculate the enrichment score for a given similarity matrix.
-
cytominer_eval.operations.enrichment.enrichment(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str], percentile: Union[float, List[float]]) → pandas.core.frame.DataFrame¶ Calculate the enrichment score. This score is based on the odds ratio from Fisher's exact test. As in the other functions, the closest connections are determined and checked against the replicates. The score effectively measures how much better the distribution of correct connections is than expected at random.
- Parameters
similarity_melted_df (pandas.DataFrame) – An elongated symmetrical matrix indicating pairwise correlations between samples. Importantly, it must follow the exact structure output by cytominer_eval.transform.transform.metric_melt().
replicate_groups (List) – A list of metadata column names in the original profile dataframe to use as replicate columns.
percentile (float or List of floats) – Determines what percentage of top connections is used for the enrichment calculation.
- Returns
percentile, threshold, odds ratio and p value
- Return type
pandas.DataFrame
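The enrichment calculation can be sketched as a 2x2 contingency table (strong vs. weak connection, replicate vs. non-replicate pair) evaluated with Fisher's exact test. The stdlib-only code below is a hedged illustration, not the library's implementation: `enrichment_sketch` and the example pairs are hypothetical, the threshold is passed in directly rather than derived from a percentile, and the one-sided ("greater") alternative is an assumption.

```python
from math import comb


def enrichment_sketch(pairs, threshold):
    """Odds ratio and one-sided Fisher exact p-value for a list of
    (similarity, is_replicate_pair) tuples. Hypothetical helper."""
    v11 = sum(1 for s, rep in pairs if s > threshold and rep)       # strong, replicate
    v12 = sum(1 for s, rep in pairs if s > threshold and not rep)   # strong, non-replicate
    v21 = sum(1 for s, rep in pairs if s <= threshold and rep)      # weak, replicate
    v22 = sum(1 for s, rep in pairs if s <= threshold and not rep)  # weak, non-replicate

    odds_ratio = (v11 * v22) / (v12 * v21) if v12 * v21 else float("inf")

    # P(X >= v11) under the hypergeometric null (assumption: "greater" alternative)
    n_replicate = v11 + v21
    n_strong = v11 + v12
    total = v11 + v12 + v21 + v22
    tail = sum(
        comb(n_replicate, x) * comb(total - n_replicate, n_strong - x)
        for x in range(v11, min(n_replicate, n_strong) + 1)
    )
    p_value = tail / comb(total, n_strong)
    return odds_ratio, p_value


# Hypothetical pairwise similarities with replicate labels
pairs = [
    (0.9, True), (0.8, True), (0.7, False), (0.4, True),
    (0.3, False), (0.2, False), (0.1, False), (0.0, False),
]

odds_ratio, p_value = enrichment_sketch(pairs, threshold=0.6)
```

An odds ratio above 1 means that strong connections are more often replicate pairs than would be expected by chance; the p-value indicates how surprising that excess is under the null.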