cytominer_eval
The primary way to use cytominer-eval is through evaluate.py. The operation argument controls which metric to calculate.
evaluate.py
Calculate evaluation metrics from profiling experiments.
This module is the primary entry point for quickly evaluating profile quality.
cytominer_eval.evaluate.evaluate(profiles: pandas.core.frame.DataFrame, features: List[str], meta_features: List[str], replicate_groups: Union[List[str], dict], operation: str = 'replicate_reproducibility', groupby_columns: List[str] = ['Metadata_broad_sample'], similarity_metric: str = 'pearson', replicate_reproducibility_quantile: float = 0.95, replicate_reproducibility_return_median_cor: bool = False, precision_recall_k: Union[int, List[int]] = 10, grit_control_perts: List[str] = ['None'], grit_replicate_summary_method: str = 'mean', mp_value_params: dict = {}, enrichment_percentile: Union[float, List[float]] = 0.99, hitk_percent_list=[2, 5, 10])
Evaluate profile quality and strength.
For a given profile dataframe containing both metadata and feature measurement columns, use this function to calculate profile quality metrics. The function contains all the necessary arguments for specific evaluation operations.
- Parameters
profiles (pandas.DataFrame) – profiles must be a pandas DataFrame with profile samples as rows and profile features as columns. The columns should contain both metadata and feature measurements.
features (list) – A list of strings corresponding to feature measurement column names in the profiles DataFrame. All features listed must be found in profiles.
meta_features (list) – A list of strings corresponding to metadata column names in the profiles DataFrame. All metadata columns listed must be found in profiles.
replicate_groups ({str, list, dict}) – An important variable indicating which metadata columns denote replicate information. All metric operations require replicate profiles. replicate_groups indicates a str or list of columns to use. For operation="grit", replicate_groups is a dict with two keys: "profile_col" and "replicate_group_col". "profile_col" is the column name that stores identifiers for each profile (can be unique), while "replicate_group_col" is the column name indicating higher-order replicate information. For example, "replicate_group_col" can be a gene column in a CRISPR experiment with multiple guides targeting the same genes. See also cytominer_eval.operations.grit() and cytominer_eval.transform.util.check_replicate_groups().
operation ({'replicate_reproducibility', 'precision_recall', 'grit', 'mp_value'}, optional) – The specific evaluation metric to calculate. The default is "replicate_reproducibility".
groupby_columns (List of str) – Only used when operation is 'precision_recall' or 'hitk'. Columns by which the similarity matrix is grouped and by which the operation is calculated. For example, if groupby_columns = ["Metadata_broad_sample"], then precision/recall is calculated for each sample. These columns should be unique or span a unique space; otherwise the precision and hitk results may not be meaningful.
similarity_metric ({'pearson', 'spearman', 'kendall'}, optional) – How to calculate pairwise similarity. The value is passed to pandas.DataFrame.corr(). The default is "pearson".
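The precision/recall idea behind groupby_columns and the top-k cutoff can be illustrated with a small standalone sketch. This is not the library's implementation: the function name precision_recall_at_k and its input format are hypothetical, and it assumes the conventional definitions (precision is the fraction of the top k neighbors that are replicates, recall is the fraction of all replicates that appear in the top k).

```python
def precision_recall_at_k(query_label, neighbors, k):
    """Illustrative precision/recall at k for one query profile.

    neighbors: list of (label, similarity) pairs, one per *other* profile.
    A neighbor counts as relevant when its label equals query_label.
    (Hypothetical helper; not part of cytominer_eval.)
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    # Rank all other profiles from most to least similar
    ranked = sorted(neighbors, key=lambda item: item[1], reverse=True)
    top_k = ranked[:k]

    # Replicates found within the top k, and replicates overall
    hits = sum(1 for label, _ in top_k if label == query_label)
    total_relevant = sum(1 for label, _ in neighbors if label == query_label)

    precision = hits / k
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


# Toy similarities of five other profiles to a query labelled "A"
toy_neighbors = [("A", 0.9), ("B", 0.8), ("A", 0.7), ("C", 0.1), ("A", -0.2)]
precision, recall = precision_recall_at_k("A", toy_neighbors, k=3)
```

With k=3 the top neighbors are two "A" profiles and one "B" profile, so both precision and recall are 2/3 in this toy case.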
- Returns
The resulting evaluation metric. The return is either a single value or a pandas DataFrame summarizing the metric as specified in operation.
- Return type
float, pd.DataFrame
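A minimal sketch of a call, using only the argument names from the signature above. The toy records, the "Metadata_" prefix convention, the Metadata_Well column, the feature names, and the helpers split_columns and run_replicate_reproducibility are all assumptions made for illustration. pandas and cytominer_eval are imported lazily inside the function, so the data helpers work without either package installed.

```python
# Hypothetical toy data: two replicate wells for each of three samples.
RECORDS = [
    {"Metadata_broad_sample": "s1", "Metadata_Well": "A01",
     "feature_a": 1.0, "feature_b": 2.0, "feature_c": 3.0},
    {"Metadata_broad_sample": "s1", "Metadata_Well": "A02",
     "feature_a": 1.1, "feature_b": 2.2, "feature_c": 2.9},
    {"Metadata_broad_sample": "s2", "Metadata_Well": "B01",
     "feature_a": 3.0, "feature_b": 1.0, "feature_c": 2.0},
    {"Metadata_broad_sample": "s2", "Metadata_Well": "B02",
     "feature_a": 2.8, "feature_b": 1.2, "feature_c": 2.1},
    {"Metadata_broad_sample": "s3", "Metadata_Well": "C01",
     "feature_a": 2.0, "feature_b": 3.0, "feature_c": 1.0},
    {"Metadata_broad_sample": "s3", "Metadata_Well": "C02",
     "feature_a": 2.1, "feature_b": 3.3, "feature_c": 0.9},
]


def split_columns(records, meta_prefix="Metadata_"):
    """Split column names into (features, meta_features) by prefix.

    The prefix convention is an assumption of this sketch.
    """
    columns = list(records[0])
    meta = [c for c in columns if c.startswith(meta_prefix)]
    feats = [c for c in columns if not c.startswith(meta_prefix)]
    return feats, meta


def run_replicate_reproducibility(records):
    """Call evaluate() with the default operation on the toy records."""
    import pandas as pd
    from cytominer_eval.evaluate import evaluate  # path as in the signature

    profiles = pd.DataFrame(records)
    features, meta_features = split_columns(records)
    return evaluate(
        profiles=profiles,
        features=features,
        meta_features=meta_features,
        replicate_groups=["Metadata_broad_sample"],
        operation="replicate_reproducibility",
        similarity_metric="pearson",
        replicate_reproducibility_quantile=0.95,
    )
```

Calling run_replicate_reproducibility(RECORDS) should return either a single value or a DataFrame, as described under Returns; no expected value is shown here because the sketch has not been checked against a specific library version.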
- Other Parameters
replicate_reproducibility_quantile ({0.95, …}, optional) – Only used when operation='replicate_reproducibility'. The quantile of the non-replicate pairwise similarity distribution above which a phenotype is considered reproducible. Defaults to 0.95.
replicate_reproducibility_return_median_cor (bool, optional) – Only used when operation='replicate_reproducibility'. If True, then also return pairwise correlations as defined by replicate_groups and similarity_metric.
precision_recall_k (int or list of ints {10, …}, optional) – Only used when operation='precision_recall'. Used to calculate precision and recall considering the top k profiles according to pairwise similarity.
grit_control_perts ({None, …}, optional) – Only used when operation='grit'. Specific profile identifiers used as a reference when calculating grit. The list entries must be found in the replicate_groups[replicate_id] column.
grit_replicate_summary_method ({"mean", "median"}, optional) – Only used when operation='grit'. Defines how the replicate z-scores are summarized. See cytominer_eval.operations.util.calculate_grit().
mp_value_params ({{}, …}, optional) – Only used when operation='mp_value'. Key-value pairs of optional parameters for calculating mp value. See also cytominer_eval.operations.util.default_mp_value_parameters().
enrichment_percentile (float or list of floats, optional) – Only used when operation='enrichment'. Determines the percentage of top connections used for the enrichment calculation.
hitk_percent_list (list or "all") – Only used when operation='hitk'. A list of percentages at which to calculate the percent scores, i.e. the number of indexes below each percentage. If hitk_percent_list == "all", a full dict with the length of classes will be created. Percentages are given as integers, i.e. 50 means 50%. The default is [2, 5, 10].
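The role of replicate_reproducibility_quantile can be sketched in plain Python. This is one plausible reading of the description above, not the library's code: it takes the given quantile of the non-replicate pairwise correlations as a threshold and reports the fraction of replicate pairs whose correlation exceeds it. The helpers pearson, quantile, and fraction_reproducible, and the toy profiles, are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fraction_reproducible(profiles, q=0.95):
    """Fraction of replicate pairs above the q-quantile of non-replicate pairs.

    profiles: list of (group_label, feature_vector) tuples.
    """
    replicate, non_replicate = [], []
    for (group_a, vec_a), (group_b, vec_b) in combinations(profiles, 2):
        target = replicate if group_a == group_b else non_replicate
        target.append(pearson(vec_a, vec_b))
    if not replicate or not non_replicate:
        raise ValueError("need both replicate and non-replicate pairs")
    threshold = quantile(non_replicate, q)
    return sum(r > threshold for r in replicate) / len(replicate)


# Toy profiles: replicates within a group are strongly correlated
toy_profiles = [
    ("A", [1, 2, 3, 4]), ("A", [2, 4, 6, 8.5]),
    ("B", [4, 3, 2, 1]), ("B", [8, 6, 4.5, 2]),
    ("C", [1, 3, 2, 4]), ("C", [2, 6, 4, 8]),
]
score = fraction_reproducible(toy_profiles, q=0.95)
```

In this toy data every replicate pair correlates above 0.99, while no non-replicate pair exceeds roughly 0.81, so all three replicate pairs clear the threshold.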