find_markers

SingleCell.find_markers(cell_type_column, /, *, QC_column='passed_QC', cell_types=None, excluded_cell_types=None, min_detection_rate=0.25, min_fold_change=2, pareto=True, all_genes=False, num_threads=None)

Find “marker genes” that distinguish each cell type from all other cell types. Because both scoring metrics depend only on which counts are non-zero, this function gives the same result regardless of whether it is run before or after normalization.

Marker genes are chosen via an adaptation of the strategy of Fischer and Gillis 2021 (ncbi.nlm.nih.gov/pmc/articles/PMC8571500). For a given cell type, genes are scored based on a) their “detection rate” in that cell type (the fraction of cells of that type that have a non-zero count for that gene), and b) the fold change in detection rate between that cell type and every other cell type. To be considered as markers, genes must also have a detection rate of at least min_detection_rate (25% by default) and a fold change of at least min_fold_change (2-fold by default).
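
The two metrics can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the helper names detection_and_fold_change and passes_thresholds, the toy counts, and the labels are all hypothetical. In particular, the sketch assumes that fold change is the detection rate within the cell type divided by the detection rate among all other cells pooled together; the library may compute it differently.

```python
def detection_and_fold_change(counts, labels, cell_type):
    """Return one (detection_rate, fold_change) pair per gene.

    counts: list of per-cell lists of counts (cells x genes)
    labels: list of cell-type labels, one per cell
    Assumption: fold change compares the target type against all
    other cells pooled together.
    """
    in_type = [row for row, lab in zip(counts, labels) if lab == cell_type]
    others = [row for row, lab in zip(counts, labels) if lab != cell_type]
    stats = []
    for g in range(len(counts[0])):
        # Detection rate: fraction of cells with a non-zero count
        rate_in = sum(row[g] != 0 for row in in_type) / len(in_type)
        rate_out = sum(row[g] != 0 for row in others) / len(others)
        fold = rate_in / rate_out if rate_out > 0 else float("inf")
        stats.append((rate_in, fold))
    return stats


def passes_thresholds(rate, fold, min_detection_rate=0.25, min_fold_change=2):
    """Apply the two minimum thresholds described above."""
    return rate >= min_detection_rate and fold >= min_fold_change


# Toy data: 8 cells x 3 genes, four "T" cells followed by four "B" cells
counts = [
    [3, 0, 1], [1, 0, 2], [2, 1, 0], [0, 0, 0],
    [0, 2, 1], [0, 1, 4], [1, 0, 1], [0, 3, 2],
]
labels = ["T", "T", "T", "T", "B", "B", "B", "B"]

t_stats = detection_and_fold_change(counts, labels, "T")
t_candidates = [g for g, (r, f) in enumerate(t_stats) if passes_thresholds(r, f)]
```

In this toy example only the first gene is detected often enough in "T" cells, and rarely enough elsewhere, to pass both thresholds.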

There is an inherent tradeoff between these two metrics. For instance, candidate marker genes with high enough expression to be expressed in every cell of a given type (i.e. to have a high detection rate) tend to also have at least some expression in other cell types (i.e. a low fold change in detection rate).

Thus, marker genes are selected to optimally trade off between these two metrics: all genes on the Pareto front of the two metrics (i.e. genes for which there is no other gene that does better on both metrics simultaneously) are selected as marker genes.
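
The Pareto-front criterion described above can be sketched in a few lines of pure Python. The function name pareto_front, the gene names, and the metric values are hypothetical, and ties are handled in the simplest way (a gene is excluded only if another gene is strictly better on both metrics); the library's own implementation may differ in such details.

```python
def pareto_front(genes):
    """Return the names of genes on the Pareto front.

    genes: dict mapping gene name -> (detection_rate, fold_change)
    A gene is kept unless some other gene is strictly better on
    both metrics simultaneously.
    """
    front = []
    for name, (rate, fold) in genes.items():
        dominated = any(
            other_rate > rate and other_fold > fold
            for other, (other_rate, other_fold) in genes.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical candidates that already passed both minimum thresholds
candidates = {
    "geneA": (0.95, 2.5),
    "geneB": (0.60, 8.0),
    "geneC": (0.50, 6.0),   # geneB is better on both metrics
    "geneD": (0.30, 20.0),
    "geneE": (0.80, 2.2),   # geneA is better on both metrics
}

markers = pareto_front(candidates)
# Order markers by decreasing fold change
markers_sorted = sorted(markers, key=lambda g: candidates[g][1], reverse=True)
```

Here geneA, geneB and geneD each represent a different point on the tradeoff: geneA is detected in nearly every cell but is not very specific, while geneD is highly specific but detected in fewer cells.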

Note that Fischer and Gillis use AUROC versus log2 fold change in detection rate, instead of detection rate versus fold change in detection rate. However, detection rate is much faster to compute than AUROC, and is a very accurate proxy for AUROC: as Figure 1D in their paper shows, AUROC is almost perfectly correlated with detection rate across marker genes.

Parameters:
  • cell_type_column: SingleCellColumn

    a String, Categorical, Enum, or integer column of obs containing cell-type labels. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array.

  • QC_column: SingleCellColumn | None

    an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will be ignored.

  • cell_types: str | Iterable[str] | int | Iterable[int] | None

    one or more cell types to find markers for; by default, finds markers for all cell types in cell_type_column. Specifying cell_types is exactly equivalent to filtering the result to these cell types, but will be faster when there are many cell types and markers are only desired for a few of them. Can also be used to change the order in which cell types are reported, even if finding markers for all cell types. Mutually exclusive with excluded_cell_types.

  • excluded_cell_types: str | Iterable[str] | int | Iterable[int] | None

    one or more cell types to exclude from marker finding. Mutually exclusive with cell_types.

  • min_detection_rate: int | float

    the minimum detection rate required to select a gene as a marker gene; must be greater than 0 and less than or equal to 1

  • min_fold_change: int | float

    the minimum fold change in detection rate required to select a gene as a marker gene; must be greater than 1

  • pareto: bool

    if True, include only genes on the Pareto front of detection rate and fold change as markers; if False, include all genes that pass the min_detection_rate and min_fold_change thresholds as markers

  • all_genes: bool

    if True, include all genes in the output, not just marker genes. An additional Boolean column will be included to specify which genes are the marker genes. Note that this option does not change which marker genes are selected, only which information is returned.

  • num_threads: int | None

    the number of threads to use for marker-gene finding. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores.

Returns:

By default, a DataFrame with one row per marker gene, with columns:

  • ‘cell_type’: a cell-type name from cell_type_column

  • ‘gene’: a gene symbol from var_names

  • ‘detection_rate’: the gene’s detection rate in that cell type

  • ‘fold_change’: the gene’s fold change in detection rate between that cell type and all other cell types

If all_genes=True, a DataFrame with one row per cell type-gene pair, with those four columns plus one other:

  • ‘marker’: a Boolean column listing whether the gene is a marker for that cell type

If all_genes=False, marker genes within each cell type will be sorted in decreasing order of fold change.

Note

This function may give an incorrect output if the count matrix contains explicit zeros (i.e. if (sc.X.data == 0).any()): this is not checked for, due to speed considerations. In the unlikely event that your dataset contains explicit zeros, remove them by running sc.X.eliminate_zeros() (an in-place operation) first.
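
As a minimal sketch of the check and fix described in this note, the following builds a small SciPy CSR matrix containing one explicit zero. The toy matrix X stands in for sc.X and is hypothetical; the sketch only shows that the number of stored entries can differ from the number of truly non-zero values, which is the kind of discrepancy that could distort detection rates.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 2 x 3 CSR matrix in which one stored entry is an explicit zero
data = np.array([5, 0, 2])
indices = np.array([0, 1, 2])
indptr = np.array([0, 2, 3])
X = csr_matrix((data, indices, indptr), shape=(2, 3))

# The check from the note above
has_explicit_zeros = bool((X.data == 0).any())

# Stored entries include the explicit zero
stored_before = X.nnz

# Remove explicit zeros in place
X.eliminate_zeros()
stored_after = X.nnz
```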

Note

This function may give an incorrect output if the count matrix contains negative values: this is not checked for, due to speed considerations.