normalize

SingleCell.normalize(*, QC_column='passed_QC', method='log1pPF', inplace=False, num_threads=None)

Normalize this SingleCell dataset’s counts.

Must be run after hvg() and before pca().

normalize() supports three normalization methods. All three methods normalize each cell independently of the rest and log-transform the counts in some way, but differ in the details.

The simplest approach, method='logCP10k', computes the log of the counts per 10 thousand: normalized_count = log(10000 * count / library_size + 1), where library_size is the sum of the cell’s counts across all genes. It matches the default settings of Seurat’s NormalizeData() function, aside from differences in floating-point error. This method is not recommended because it implicitly assumes an unrealistically large amount of overdispersion, and performs worse in the benchmarks of the papers discussed below.
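As a rough illustration, counts per 10 thousand followed by a log transform can be sketched with NumPy on a small dense array. The function name log_cp10k is hypothetical, the sketch assumes the Seurat-style definition log(10000 * count / library_size + 1), and it assumes every cell has at least one count. It is not the library's implementation, which may store counts differently (for example, as a sparse matrix).

```python
import numpy as np

def log_cp10k(X):
    """Hypothetical dense sketch of logCP10k (not the library's code)."""
    X = np.asarray(X, dtype=np.float64)
    # Library size: total counts per cell (rows are cells, columns are genes)
    library_size = X.sum(axis=1, keepdims=True)
    # Rescale every cell to a total of 10,000, then take log(x + 1)
    return np.log1p(1e4 * X / library_size)

counts = np.array([[1, 3, 6],
                   [20, 0, 20]])
normalized = log_cp10k(counts)
```

Undoing the log with np.expm1(normalized) gives rows that each sum to 10,000, regardless of the cell's original library size.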

The next-simplest approach, method='log1pPF', is the default. Instead of scaling every cell to the same fixed total of 10 thousand, it divides each cell’s counts by X.sum(axis=1) / X.sum(axis=1).mean() before log-transforming. In other words, a cell’s denominator is the cell’s library size, relative to the mean library size across all cells. (By library size, we mean the sum of a cell’s counts across all genes.) This approach of dividing each cell’s counts by its relative library size is sometimes called “proportional fitting” (PF). Ahlmann-Eltze and Huber 2023 recommend using this method instead of normalizing to a fixed total, like method='logCP10k' does. Scanpy’s normalize_total() performs the same proportional-fitting step with a slight variation: by default, it uses the median instead of the mean to define the relative library size.
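The proportional-fitting step described above can be sketched in a few lines of NumPy. This is a minimal dense-array illustration under the assumption that every cell has at least one count; the function name log1p_pf is hypothetical and the code is not taken from the library.

```python
import numpy as np

def log1p_pf(X):
    """Hypothetical dense sketch of log1pPF (not the library's code)."""
    X = np.asarray(X, dtype=np.float64)
    # Library size: total counts per cell (rows are cells, columns are genes)
    library_size = X.sum(axis=1, keepdims=True)
    # Relative library size: each cell's total divided by the mean total
    size_factor = library_size / library_size.mean()
    # Proportional fitting, then log(x + 1)
    return np.log1p(X / size_factor)

counts = np.array([[1, 3, 6],
                   [20, 0, 20]])
normalized = log1p_pf(counts)
```

Here the library sizes are 10 and 40, so the mean is 25. After proportional fitting (before the log), both cells sum to 25.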

The most complex approach, method='PFlog1pPF', takes the output of log1pPF and applies an additional round of proportional fitting after the log-transformation. Booeshaghi et al. 2022 recommend this approach, arguing that log1pPF does not fully normalize for read depth, because the log transform partially undoes the first round of proportional fitting.
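To make the argument concrete, the following NumPy sketch applies proportional fitting, a log transform, and a second round of proportional fitting to a tiny dense matrix. The helper names proportional_fit and pf_log1p_pf are hypothetical. The sketch assumes that the second round rescales each cell's log-transformed values by their total relative to the mean total, which is one reading of the description above rather than a confirmed detail of the library.

```python
import numpy as np

def proportional_fit(X):
    """Divide each row by its total relative to the mean row total."""
    totals = X.sum(axis=1, keepdims=True)
    return X / (totals / totals.mean())

def pf_log1p_pf(X):
    """Hypothetical dense sketch of PFlog1pPF (not the library's code)."""
    X = np.asarray(X, dtype=np.float64)
    logged = np.log1p(proportional_fit(X))  # this intermediate is log1pPF
    return proportional_fit(logged)         # second round of PF

counts = np.array([[1, 3, 6],
                   [20, 0, 20]], dtype=np.float64)
after_log1p_pf = np.log1p(proportional_fit(counts))
after_pf_log1p_pf = pf_log1p_pf(counts)

# Per-cell totals differ after log1pPF but are equal after PFlog1pPF
totals_log1p_pf = after_log1p_pf.sum(axis=1)
totals_pf_log1p_pf = after_pf_log1p_pf.sum(axis=1)
```

In this toy example, the two cells have different totals after log1pPF even though they had equal totals just before the log, which is the effect the second round of proportional fitting removes.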

Parameters:

QC_column: SingleCellColumn | None
an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will still be normalized, but will not count towards the calculation of the mean total count across cells when method is ‘PFlog1pPF’ or ‘log1pPF’. Has no effect when method is ‘logCP10k’.
method: Literal['PFlog1pPF', 'log1pPF', 'logCP10k']
the normalization method to use (see above)
inplace: bool
whether to do in-place normalization. This reduces memory usage, but is only possible for float32 count matrices and will raise an error if the count matrix has any other data type.
num_threads: int | None
the number of threads to use when normalizing. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. Does not affect the normalized SingleCell dataset’s num_threads; this will always be the same as the original dataset’s num_threads.
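The effect of QC_column on the proportional-fitting methods can be illustrated with a small NumPy sketch: every cell is normalized, but only cells passing QC contribute to the mean library size. The function name log1p_pf_with_qc and the passed_qc array are hypothetical stand-ins for illustration, and the code is an assumption about the behavior described above, not the library's implementation.

```python
import numpy as np

def log1p_pf_with_qc(X, passed_qc):
    """Hypothetical sketch: log1pPF where the mean uses only QC-passing cells."""
    X = np.asarray(X, dtype=np.float64)
    passed_qc = np.asarray(passed_qc, dtype=bool)
    library_size = X.sum(axis=1, keepdims=True)
    # Only cells that passed QC contribute to the mean library size
    mean_library_size = library_size[passed_qc].mean()
    # All cells, including those failing QC, are still normalized
    return np.log1p(X / (library_size / mean_library_size))

counts = np.array([[1, 3, 6],
                   [20, 0, 20],
                   [500, 300, 200]])  # third cell fails QC
passed_qc = np.array([True, True, False])
normalized = log1p_pf_with_qc(counts, passed_qc)
```

The third cell's large library size (1000) does not inflate the mean, which stays at 25, yet that cell is still rescaled to the same total as the others.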

Returns:

A new SingleCell dataset with the normalized counts, and uns['normalized'] set to True. When inplace=True, returns the original dataset instead, with its counts normalized in-place.

Return type:

SingleCell