find_doublets

SingleCell.find_doublets(*, batch_column, QC_column='passed_QC', doublet_fraction=None, num_genes=500, doublet_column='doublet', doublet_score_column='doublet_score', overwrite=False, num_threads=None)

Find doublets using cxds (co-expression-based doublet scoring).

The standard way to filter out doublets is by specifying remove_doublets=True in qc(). If you did that, do not use this function! This function should only be used in the unusual scenario where you want to find doublets without running any other quality-control steps.

This function gives the same result regardless of whether it is run before or after normalization. The actual expression value does not matter, only whether or not it is zero.
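As a small illustration of why normalization does not matter here, the sketch below binarizes a toy count matrix before and after a simple library-size normalization and checks that the two binary matrices match. The toy data and the normalization are assumptions chosen for illustration, not this library's own normalization code.

```python
import numpy as np

# Toy count matrix (cells x genes); values are made up for illustration.
counts = np.array([[0, 3, 1],
                   [2, 0, 0],
                   [5, 1, 0]], dtype=float)

# Illustrative normalization: scale each cell to 10,000 total counts, then log1p.
normalized = np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)

# Only the zero/nonzero pattern is used, and this normalization keeps
# zeros at zero and nonzero entries nonzero.
binary_before = counts != 0
binary_after = normalized != 0
same = np.array_equal(binary_before, binary_after)
```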

Doublets cannot occur across sequencing batches, so make sure to specify batch_column if your dataset has multiple batches! Doublet detection will be done independently within each batch.

Since the cxds score is continuous, it needs to be converted into a binary classification of doublets versus non-doublets. This problem can be framed as finding a cxds score threshold above which a cell is deemed to be a doublet. To determine this threshold, we simulate doublets by combining the counts from randomly selected pairs of cells, via the following steps:

  1. Sample as many random pairs of cells (with replacement) as there are real cells.

  2. Combine the counts from each pair of cells into a simulated doublet. Because cxds operates on binarized count matrices, we average the two cells’ counts in a binary sense. If a gene is unexpressed in both cells, it is unexpressed in the simulated doublet. If it is expressed in either cell, it is generally deemed to be expressed in the simulated doublet. The one exception is a gene with a count of 1 in one cell and 0 in the other: the average count would be 0.5, so it is randomly chosen to be either expressed or not expressed with equal probability.

  3. Calculate cxds scores for these simulated doublets, based on the coexpression patterns (the S matrix from cxds) learned from the real data.

  4. Take the median cxds score of the simulated doublets as the threshold. In other words, if a real cell has a higher doublet score than the median simulated doublet, we call it a doublet.

Alternatively, specify doublet_fraction to force a specific fraction of cells to be classified as doublets.
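The pairing, binary averaging, and thresholding described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not this function's actual code: the helper names simulate_binary_doublets and call_doublets are hypothetical, the cxds scoring of step 3 is not reproduced (the scores below are placeholders), and reading doublet_fraction as a quantile of the real cells' scores is an assumption.

```python
import numpy as np

def simulate_binary_doublets(counts, rng):
    """Steps 1-2 (hypothetical helper): one binarized simulated doublet per real cell."""
    num_cells = counts.shape[0]
    # Step 1: sample num_cells random pairs of cells, with replacement.
    first = rng.integers(num_cells, size=num_cells)
    second = rng.integers(num_cells, size=num_cells)
    # Step 2: the pair's average count is pair_sum / 2.
    pair_sum = counts[first] + counts[second]
    # An average of 0.5 (counts of 1 and 0) is a coin flip, an average of
    # at least 1 is expressed, and an average of 0 is unexpressed.
    coin_flip = rng.random(pair_sum.shape) < 0.5
    return (pair_sum >= 2) | ((pair_sum == 1) & coin_flip)

def call_doublets(real_scores, simulated_scores, doublet_fraction=None):
    """Step 4, or the doublet_fraction alternative (hypothetical helper)."""
    if doublet_fraction is None:
        # Threshold at the median score of the simulated doublets.
        threshold = np.median(simulated_scores)
    else:
        # Assumed reading: choose the threshold so that roughly this
        # fraction of real cells scores above it.
        threshold = np.quantile(real_scores, 1 - doublet_fraction)
    return real_scores > threshold

rng = np.random.default_rng(0)
toy_counts = rng.poisson(0.7, size=(200, 50))  # made-up counts
simulated = simulate_binary_doublets(toy_counts, rng)

# Placeholder scores standing in for the output of step 3.
real_scores = np.array([1.0, 2.0, 3.0, 10.0])
simulated_scores = np.array([4.0, 5.0, 6.0])
labels = call_doublets(real_scores, simulated_scores)
```

In a multi-batch dataset, the same procedure would be repeated independently within each batch.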

Parameters:
  • batch_column: SingleCellColumn | None

    an optional String, Enum, Categorical, or integer column of obs indicating which batch each cell is from. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Doublet detection will be performed separately for each batch; cells where batch_column is null will collectively be treated as a single batch. Set to None if all cells belong to the same sequencing batch.

  • QC_column: SingleCellColumn | None

    an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will be ignored and have their doublet labels and doublet scores set to null.

  • doublet_fraction: float | None

    an optional fraction of cells (within each batch, if batch_column is specified) to be classified as doublets. If None, automatically detect the threshold via the approach described above.

  • num_genes: int

    the number of highly variable genes, i.e. genes expressed in as close to 50% of cells as possible, to use during doublet detection. This parameter usually has a minimal influence on accuracy as long as it is sufficiently large (in the hundreds), so increasing it further will mainly just increase runtime. If num_genes is greater than the number of genes in the dataset, all genes will be used.

  • doublet_column: str

    the name of a Boolean column to be added to obs containing the doublet labels, i.e. whether each cell is predicted to be a doublet

  • doublet_score_column: str | None

    the name of a column to be added to obs containing each cell’s doublet score. Higher scores indicate greater likelihood of being a doublet. Scores are not normalized and are not comparable across datasets or batches, but are guaranteed to be positive (since they are sums of negative log p-values). Set doublet_score_column=None to not return doublet scores, for a slight memory reduction and speed increase.

  • overwrite: bool

    if True, overwrite doublet_column and/or doublet_score_column if already present in obs, instead of raising an error

  • num_threads: int | None

    the number of threads to use when finding doublets. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores.
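The gene selection behind num_genes can be pictured with the sketch below: for a binarized gene, the variance p(1 - p) is largest when the expressed fraction p is 0.5, so ranking genes by their distance from 50% picks the most variable ones. The helper select_genes and the toy matrix are hypothetical, and tie-breaking in the real function may differ.

```python
import numpy as np

def select_genes(binary, num_genes):
    """Return indices of the genes expressed in closest to 50% of cells."""
    fraction_expressed = binary.mean(axis=0)
    distance_from_half = np.abs(fraction_expressed - 0.5)
    # If num_genes exceeds the number of genes, this slice keeps all of them.
    return np.argsort(distance_from_half, kind="stable")[:num_genes]

# Toy binarized matrix (4 cells x 4 genes); made-up values.
binary = np.array([[1, 1, 0, 1],
                   [1, 0, 0, 1],
                   [1, 1, 0, 0],
                   [1, 0, 0, 0]], dtype=bool)

# Genes 1 and 3 are expressed in exactly half of the cells.
selected = select_genes(binary, num_genes=2)
```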

Returns:

A new SingleCell dataset where obs contains two additional columns: doublet_column (default: 'doublet'), indicating whether each cell is predicted to be a doublet, and doublet_score_column (default: 'doublet_score'), containing each cell’s doublet score. If doublet_score_column is None, only doublet_column is added.

Note

This function’s cxds scores are almost exactly half the original implementation’s, because it avoids double-counting the two genes in each gene pair. Slight deviations from this factor of one-half (usually by less than one part in a million) may occur because this function uses a normal approximation to the binomial p-value to avoid long runtimes on large datasets.
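The factor of one-half can be seen in a toy calculation, assuming a cell's score is a sum of pair scores over the gene pairs co-expressed in that cell. The matrix S below holds made-up values rather than real cxds pair scores, and the sketch shows only the counting argument, not either implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_genes = 6

# Hypothetical symmetric pair-score matrix with a zero diagonal (made-up values).
upper = np.triu(rng.random((num_genes, num_genes)), k=1)
S = upper + upper.T

# One binarized cell: 1 where the gene is expressed, 0 otherwise.
expressed = np.array([1, 0, 1, 1, 0, 1], dtype=float)

# Summing over the full symmetric matrix visits each gene pair twice,
# once as (i, j) and once as (j, i).
score_both_orders = expressed @ S @ expressed
# Summing over the upper triangle visits each gene pair once.
score_each_pair_once = expressed @ upper @ expressed
```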

Note

This function may give an incorrect output if the count matrix contains explicit zeros (i.e. if (sc.X.data == 0).any()): this is not checked for, due to speed considerations. In the unlikely event that your dataset contains explicit zeros, remove them by running sc.X.eliminate_zeros() (an in-place operation) first.
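To show what an explicit zero looks like, the sketch below builds a toy SciPy CSR matrix that stores a zero, applies the check from the note, and then removes the stored zero in place. It assumes sc.X is a SciPy sparse matrix, as the calls in the note suggest; the toy matrix stands in for sc.X.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 2 x 3 matrix that stores one zero explicitly (made-up values).
data = np.array([3, 0, 2])
indices = np.array([0, 1, 2])
indptr = np.array([0, 2, 3])
X = csr_matrix((data, indices, indptr), shape=(2, 3))

had_explicit_zeros = bool((X.data == 0).any())
stored_before = X.nnz  # counts stored entries, including the explicit zero

X.eliminate_zeros()  # in place: drops stored entries whose value is zero

has_explicit_zeros = bool((X.data == 0).any())
stored_after = X.nnz
```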