qc
- SingleCell.qc(*, custom_filter=None, subset=False, QC_column='passed_QC', max_mito_fraction=0.05, min_genes=100, min_counts=None, nonzero_MALAT1=True, remove_doublets=False, batch_column=None, doublet_fraction=None, num_doublet_genes=500, allow_float=False, overwrite=False, verbose=False, num_threads=None)
Adds a Boolean column to obs indicating which cells passed quality control (QC), or subsets to these cells if subset=True.
By default, filters to cells with ≤5% mitochondrial reads, ≥100 genes detected, and non-zero MALAT1 or Malat1 expression. Can also filter out doublets when remove_doublets=True.
Raises an error if any cell names appear more than once in obs_names (they can be deduplicated with make_obs_names_unique()) or any gene names appear more than once in var_names (they can be deduplicated with make_var_names_unique()).
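The three default filters described above can be sketched in plain NumPy on a dense cells-by-genes count matrix. This is only an illustration of the filtering logic, not the library's implementation: the function name `default_qc_mask`, the `mito_prefix` argument, and its `'MT-'` default (the usual human mitochondrial gene prefix) are assumptions made for this example.

```python
import numpy as np

def default_qc_mask(counts, gene_names, max_mito_fraction=0.05,
                    min_genes=100, mito_prefix="MT-"):
    """Return a Boolean mask of cells passing the three default filters.

    counts is a dense (cells x genes) array of raw counts; gene_names has
    one entry per column. mito_prefix is a hypothetical parameter.
    """
    counts = np.asarray(counts)
    gene_names = np.asarray(gene_names)

    # Fraction of each cell's counts that come from mitochondrial genes.
    # Cells with zero total counts are given a fraction of 0 here.
    total = counts.sum(axis=1)
    is_mito = np.char.startswith(gene_names, mito_prefix)
    mito = counts[:, is_mito].sum(axis=1)
    mito_fraction = np.divide(mito, total,
                              out=np.zeros(len(total)), where=total > 0)

    # Number of genes with a non-zero count in each cell.
    genes_detected = (counts > 0).sum(axis=1)

    # Exactly one gene must be named 'MALAT1' or 'Malat1'.
    is_malat1 = np.isin(gene_names, ["MALAT1", "Malat1"])
    if is_malat1.sum() != 1:
        raise ValueError("expected exactly one gene named 'MALAT1' or 'Malat1'")
    malat1_nonzero = counts[:, is_malat1].ravel() > 0

    return ((mito_fraction <= max_mito_fraction)
            & (genes_detected >= min_genes)
            & malat1_nonzero)

# Toy data: min_genes is lowered to 2 because there are only four genes.
genes = ["MALAT1", "MT-CO1", "GENE1", "GENE2"]
toy_counts = [
    [5, 0, 10, 10],   # passes all three filters
    [0, 0, 10, 10],   # zero MALAT1
    [5, 20, 10, 10],  # mitochondrial fraction 20/45
    [5, 0, 0, 0],     # only one gene detected
]
mask = default_qc_mask(toy_counts, genes, min_genes=2)
print(mask.tolist())  # [True, False, False, False]
```

A mask like this plays the role of the Boolean QC_column; a custom_filter would be combined with it by a further logical AND.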
- Parameters:
custom_filter: SingleCellColumn | None
an optional Boolean column of obs containing a filter to apply on top of the other QC filters; True elements will be kept. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array.
subset: bool
whether to subset to cells passing QC, instead of merely adding a QC_column to obs. This will roughly double memory usage, but speed up subsequent operations.
QC_column: str
the name of a Boolean column to add to obs indicating which cells passed QC, if subset=False. Gives an error if obs already has a column with this name, unless overwrite=True.
max_mito_fraction: int | float | None
if not None, filter to cells with ≤ this fraction of mitochondrial counts (i.e. from genes starting with ‘MT’). The default of 5% matches Seurat’s recommended value.
min_genes: int | None
if not None, filter to cells with ≥ this many genes detected (with non-zero count). The default of 100 matches Scanpy’s recommended value, while Seurat recommends a minimum of 200.
min_counts: int | None
if not None, filter to cells with ≥ this many total counts across all genes. This filter is off by default.
nonzero_MALAT1: bool
if True, filter out cells with 0 expression of the nuclear-expressed lncRNA MALAT1, which likely represent empty droplets or poor-quality cells. There must be exactly one gene in var_names with the name ‘MALAT1’ or ‘Malat1’ to use this filter.
remove_doublets: bool
if True, remove predicted doublets (see find_doublets()). Doublet detection uses the cxds algorithm to score each cell, then thresholds this continuous score to a binary one (doublet versus non-doublet) using a threshold derived from simulated doublets.
batch_column: SingleCellColumn | None
an optional String, Enum, Categorical, or integer column of obs indicating which batch each cell is from. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Only used during doublet detection, which is performed separately for each batch; cells where batch_column is null are collectively treated as a single batch. Set to None if all cells belong to the same sequencing batch. Can only be specified when remove_doublets=True.
doublet_fraction: float | None
an optional fraction of cells (within each batch, if batch_column is specified) to be classified as doublets. If None, automatically detect the threshold via the approach described in find_doublets().
num_doublet_genes: int
the number of highly variable genes, i.e. genes expressed in as close to 50% of cells as possible, to use during doublet detection. This parameter usually has a minimal influence on accuracy as long as it is sufficiently large (in the hundreds), so increasing it further will mainly just increase runtime. If num_doublet_genes is greater than the number of genes in the dataset, all genes will be used.
allow_float: bool
if False, raise an error if self.X.dtype is floating-point (suggesting the user may not be using the raw counts); if True, disable this sanity check. Note that all steps except mitochondrial percent filtering give the same result on normalized counts, so if max_mito_fraction=None were specified (not recommended), this function would give the same result on raw and normalized counts.
overwrite: bool
if False, raise an error if uns[‘QCed’] is True (indicating the dataset has already been QCed) or QC_column is already present in obs; if True, disable these two sanity checks and, when subset=False, overwrite QC_column if present.
verbose: bool
whether to print how many cells were filtered out at each step of the QC process.
num_threads: int | None
the number of threads to use when filtering based on mitochondrial counts and MALAT1 expression, and for doublet detection. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. Does not affect the QCed SingleCell dataset’s num_threads; this will always be the same as the original dataset’s num_threads.
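The gene selection described under num_doublet_genes (genes expressed in as close to 50% of cells as possible) can be illustrated with a short NumPy sketch. The function name `pick_genes_near_half_detection` is hypothetical, and the ranking and tie-breaking shown here are assumptions; the library's actual procedure may differ.

```python
import numpy as np

def pick_genes_near_half_detection(counts, num_doublet_genes=500):
    """Return column indices of the genes whose detection rate is closest
    to 50%, using all genes if fewer than num_doublet_genes exist.

    counts is a dense (cells x genes) array of counts.
    """
    counts = np.asarray(counts)
    num_genes = counts.shape[1]

    # Fraction of cells in which each gene has a non-zero count.
    detection_rate = (counts > 0).mean(axis=0)

    # Rank genes by distance from 50% detection; a stable sort keeps the
    # original column order among ties (an arbitrary choice for this sketch).
    order = np.argsort(np.abs(detection_rate - 0.5), kind="stable")
    return order[:min(num_doublet_genes, num_genes)]

# Toy data: detection rates are 1.0, 0.5 and 0.25 for the three genes.
toy_counts = np.array([
    [3, 1, 0],
    [2, 4, 0],
    [1, 0, 7],
    [5, 0, 0],
])
top_two = pick_genes_near_half_detection(toy_counts, num_doublet_genes=2)
all_genes = pick_genes_near_half_detection(toy_counts, num_doublet_genes=10)
print(top_two.tolist())    # [1, 2]
print(all_genes.tolist())  # [1, 2, 0]
```

The second call shows the capping behaviour: requesting more genes than the dataset contains simply returns every gene.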
- Returns:
A new SingleCell dataset with QC_column added to obs (or subset to QCed cells if subset=True) and uns[‘QCed’] set to True.
- Return type: