Metrics

Implementations of various profile evaluation metrics. These can be used directly, but we recommend calling them through cytominer_eval.evaluate().

cytominer_eval.evaluate() contains several checks to confirm metrics are properly implemented.

Grit

Grit describes the phenotype strength of replicate profiles along two distinct axes:

  • Similarity to other perturbations that target the same larger group (e.g. gene, MOA)

  • Similarity to control perturbations

The first axis is measured with respect to the second.

cytominer_eval.operations.grit.grit(similarity_melted_df: pandas.core.frame.DataFrame, control_perts: List[str], profile_col: str, replicate_group_col: str, replicate_summary_method: str = 'mean') → pandas.core.frame.DataFrame

Calculate grit

Parameters
  • similarity_melted_df (pandas.DataFrame) – a long pandas dataframe output from cytominer_eval.transform.metric_melt

  • control_perts (list) – a list of control perturbations to calculate a null distribution

  • profile_col (str) – the metadata column storing profile ids. The column can have unique or replicate identifiers.

  • replicate_group_col (str) – the metadata column indicating a higher order structure (group) than the profile column. E.g. target gene vs. guide in a CRISPR experiment.

  • replicate_summary_method ({'mean', 'median'}, optional) – how replicate z-scores relative to control perturbations are summarized. Defaults to 'mean'.

Returns

A dataframe of grit measurements per perturbation

Return type

pandas.DataFrame
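
The idea behind grit can be illustrated with a minimal sketch in plain Python. This is not the library's implementation: the function name grit_sketch, the input lists, and the use of the sample standard deviation are all assumptions made for illustration. It z-scores a profile's similarities to same-group perturbations against its similarities to controls, then summarizes the z-scores.

```python
from statistics import mean, median, stdev


def grit_sketch(group_similarities, control_similarities, summary="mean"):
    """Hypothetical helper: z-score same-group similarities against controls.

    group_similarities: similarities between one profile and other
        perturbations in the same larger group (e.g. same gene).
    control_similarities: similarities between the same profile and
        control perturbations (used as the null distribution).
    """
    null_mean = mean(control_similarities)
    # Assumption: sample standard deviation; the library may scale differently.
    null_sd = stdev(control_similarities)
    z_scores = [(s - null_mean) / null_sd for s in group_similarities]
    return mean(z_scores) if summary == "mean" else median(z_scores)


# Toy data: controls are centered on zero, group members are clearly similar.
control_sims = [0.0, 0.1, -0.1, 0.2, -0.2]
group_sims = [0.6, 0.8]
score = grit_sketch(group_sims, control_sims)
print(round(score, 3))
```

A high positive value indicates that the profile resembles other members of its group much more than it resembles controls.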

mp-value

Functions to calculate multidimensional perturbation values (mp-value)

mp-value describes the distance, in dimensionality-reduced space, between a perturbation and a control [1].

References

[1] Hutz, J. et al. “The Multidimensional Perturbation Value: A Single Metric to Measure Similarity and Activity of Treatments in High-Throughput Multidimensional Screens.” Journal of Biomolecular Screening 18(4): 367-377. doi: 10.1177/1087057112469257

cytominer_eval.operations.mp_value.mp_value(df: pandas.core.frame.DataFrame, control_perts: List[str], replicate_id: str, features: List[str], params: dict = {}) → pandas.core.frame.DataFrame

Calculate multidimensional perturbation value (mp-value) [1].

Parameters
  • df (pandas.DataFrame) – profiles with measurements per row and features or metadata per column.

  • control_perts (list) – The control perturbations against which the distances will be computed.

  • replicate_id (str) – The name of the metadata column that identifies each perturbation, including controls and replicates.

  • features (list) – columns containing numerical features to be used for the mp-value computation

  • params (dict, optional) – Optional parameters for the computation. For the list of available parameters, see cytominer_eval.operations.util.default_mp_value_parameters()

Returns

mp-values per perturbation.

Return type

pandas.DataFrame

Replicate reproducibility

Functions to calculate replicate reproducibility

cytominer_eval.operations.replicate_reproducibility.replicate_reproducibility(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str], quantile_over_null: float = 0.95, return_median_correlations: bool = False) → float

Summarize pairwise replicate correlations

For a given pairwise similarity matrix, replicate information, and specific options, output a replicate correlation summary.

Parameters
  • similarity_melted_df (pandas.DataFrame) – An elongated symmetrical matrix indicating pairwise correlations between samples. Importantly, it must have exactly the structure output by cytominer_eval.transform.transform.metric_melt().

  • replicate_groups (list) – A list of metadata column names in the original profile dataframe to indicate replicate samples.

  • quantile_over_null (float, optional) – A float between 0 and 1 indicating the threshold of nonreplicates to use when reporting percent matching or percent replicating. Defaults to 0.95.

  • return_median_correlations (bool, optional) – If True, also return median pairwise correlations per replicate. Defaults to False.

Returns

The replicate reproducibility of the profiles according to the replicate columns provided. If return_median_correlations = True then the function will return both the metric and a median pairwise correlation pandas.DataFrame.

Return type

{float, (float, pd.DataFrame)}
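
The role of quantile_over_null can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: the helper names, the linear-interpolation quantile, and the toy correlation lists are hypothetical. It reports the fraction of replicate correlations that exceed the chosen quantile of the non-replicate (null) correlations.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers (0 <= q <= 1)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def replicate_reproducibility_sketch(replicate_corrs, non_replicate_corrs,
                                     quantile_over_null=0.95):
    """Hypothetical helper: fraction of replicate correlations above the null threshold."""
    threshold = quantile(non_replicate_corrs, quantile_over_null)
    n_above = sum(1 for c in replicate_corrs if c > threshold)
    return n_above / len(replicate_corrs)


# Toy data: pairwise correlations already split into replicate and non-replicate pairs.
non_replicates = [-0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
replicates = [0.9, 0.85, 0.5, 0.95]
result = replicate_reproducibility_sketch(replicates, non_replicates)
print(result)
```

Raising quantile_over_null makes the threshold stricter, so fewer replicate pairs count as reproducible.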

Precision and recall

Functions to calculate precision and recall at a given k

cytominer_eval.operations.precision_recall.precision_recall(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str], groupby_columns: List[str], k: Union[int, List[int]]) → pandas.core.frame.DataFrame

Determine the precision and recall at k for all unique groupby_columns samples based on a predefined similarity metric (see cytominer_eval.transform.metric_melt)

Parameters
  • similarity_melted_df (pandas.DataFrame) – An elongated symmetrical matrix indicating pairwise correlations between samples. Importantly, it must have exactly the structure output by cytominer_eval.transform.transform.metric_melt().

  • replicate_groups (List) – a list of metadata column names in the original profile dataframe to use as replicate columns.

  • groupby_columns (List of str) – Columns by which the similarity matrix is grouped and by which precision and recall are calculated. For example, if groupby_columns = ["Metadata_sample"], then precision is calculated for each sample. Calculating precision by sample is the usual choice, but it is also mathematically valid to calculate it at the MOA level; the result is just less intuitive to interpret.

  • k (List of ints or int) – an integer, or a list of integers, indicating how many of the top-ranked pairwise comparisons to keep when calculating precision and recall.

Returns

Precision and recall metrics for all groupby_columns groups at the given k

Return type

pandas.DataFrame
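
Precision and recall at k for a single query sample can be sketched as follows. This is a minimal plain-Python illustration, not the library's implementation; the function name and the (similarity, is_replicate) input format are assumptions. Neighbors are ranked by similarity, the top k are kept, and replicate hits are counted.

```python
def precision_recall_at_k(neighbors, k):
    """Hypothetical helper for one query sample.

    neighbors: list of (similarity, is_replicate) pairs, one per other sample.
    k: number of top-ranked neighbors to keep.
    """
    ranked = sorted(neighbors, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    hits = sum(1 for _, is_rep in top_k if is_rep)
    total_replicates = sum(1 for _, is_rep in neighbors if is_rep)
    precision = hits / k
    recall = hits / total_replicates if total_replicates else 0.0
    return precision, recall


# Toy data: three true replicates among five neighbors.
neighbors = [(0.9, True), (0.8, False), (0.7, True), (0.4, True), (0.1, False)]
precision, recall = precision_recall_at_k(neighbors, k=2)
print(precision, recall)
```

Increasing k typically raises recall while lowering precision, which is why passing a list of k values can be informative.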

Enrichment

Function to calculate the enrichment score for a given similarity matrix.

cytominer_eval.operations.enrichment.enrichment(similarity_melted_df: pandas.core.frame.DataFrame, replicate_groups: List[str], percentile: Union[float, List[float]]) → pandas.core.frame.DataFrame

Calculate the enrichment score. This score is based on the odds ratio from Fisher's exact test. As with the other functions, the closest connections are determined and checked against the replicates. The score effectively measures how much better the distribution of correct connections is than random.

Parameters
  • similarity_melted_df (pandas.DataFrame) – An elongated symmetrical matrix indicating pairwise correlations between samples. Importantly, it must have exactly the structure output by cytominer_eval.transform.transform.metric_melt().

  • replicate_groups (List) – a list of metadata column names in the original profile dataframe to use as replicate columns.

  • percentile (float or List of floats) – Determines what percentage of top connections is used for the enrichment calculation.

Returns

percentile, threshold, odds ratio and p value

Return type

pandas.DataFrame
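
The contingency table behind the odds ratio can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: the function name, the simple index-based percentile threshold, and the toy data are hypothetical, and the p-value is omitted (in practice a routine such as scipy.stats.fisher_exact returns both the odds ratio and the p-value for a 2x2 table).

```python
def enrichment_odds_ratio(pairs, percentile):
    """Hypothetical helper: odds ratio of replicate vs. strong connections.

    pairs: list of (similarity, is_replicate) for every pairwise comparison.
    percentile: fraction in (0, 1); connections above this percentile of
        similarity count as "top" connections.
    """
    ordered = sorted(s for s, _ in pairs)
    # Assumption: a simple index-based percentile without interpolation.
    threshold = ordered[int(percentile * (len(ordered) - 1))]
    a = sum(1 for s, r in pairs if s > threshold and r)        # top, replicate
    b = sum(1 for s, r in pairs if s > threshold and not r)    # top, non-replicate
    c = sum(1 for s, r in pairs if s <= threshold and r)       # rest, replicate
    d = sum(1 for s, r in pairs if s <= threshold and not r)   # rest, non-replicate
    odds_ratio = (a * d) / (b * c) if b and c else float("inf")
    return threshold, (a, b, c, d), odds_ratio


# Toy data: ten connections, three of which are true replicate pairs.
pairs = [(1.0, True), (0.9, True), (0.8, False), (0.7, False), (0.6, True),
         (0.5, False), (0.4, False), (0.3, False), (0.2, False), (0.1, False)]
threshold, table, odds_ratio = enrichment_odds_ratio(pairs, percentile=0.7)
print(threshold, table, odds_ratio)
```

An odds ratio well above 1 suggests that replicate pairs are over-represented among the strongest connections.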