pca
- SingleCell.pca(*others, QC_column='passed_QC', hvg_column='highly_variable', PC_key='pca', num_PCs=50, subspace_size=100, tolerance=1e-06, max_iterations=100, chunk_size=1024, seed=0, match_parallel=False, overwrite=False, verbose=False, num_threads=None)
Compute principal components (PCs) across cells.
Requires normalized counts, so must be run after normalize(). By default, only the highly variable genes from hvg() are used to compute PCs.
Uses approximate singular value decomposition (SVD) via the Implicitly Restarted Lanczos Bidiagonalization Algorithm (IRLBA). Seurat uses a different implementation of the same IRLBA algorithm.
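The constraints on subspace_size and the stopping rule for tolerance, both described in the parameter list, can be sketched in a few lines. This is a minimal illustration and not the library's code: the helper names clip_subspace_size and all_converged are hypothetical, and the strict less-than comparison is an assumption.

```python
def clip_subspace_size(subspace_size, num_PCs, num_cells, num_genes):
    """Validate and clip the Krylov subspace size (hypothetical helper)."""
    if subspace_size < num_PCs:
        raise ValueError('subspace_size must be >= num_PCs')
    # Clip to at most the smaller matrix dimension
    return min(subspace_size, num_cells, num_genes)


def all_converged(residuals, singular_values, tolerance=1e-6):
    """Relative-residual stopping test (assumed form of the comparison)."""
    largest = max(singular_values)
    # A singular value counts as converged when its residual, relative to
    # the largest singular value, falls below the tolerance
    return all(residual / largest < tolerance for residual in residuals)


print(clip_subspace_size(100, 50, num_cells=80, num_genes=2000))  # 80
print(all_converged([1e-9, 5e-8], [30.0, 12.0]))                  # True
print(all_converged([1e-9, 1e-3], [30.0, 12.0]))                  # False
```

With only 80 cells, the requested subspace of 100 is clipped to 80. In the last call, one residual is still too large, so iteration would continue until max_iterations is reached.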
- Parameters:
others: SingleCell
optional SingleCell datasets to jointly compute principal components across, alongside this one.
QC_column: SingleCellColumn | None | Sequence[SingleCellColumn | None]
an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will be ignored and have their PCs set to NaN. When others is specified, QC_column can be a sequence of length 1 + len(others), containing a column, expression, Series, function, or None for each dataset (for self, followed by each dataset in others).
hvg_column: SingleCellColumn | Sequence[SingleCellColumn] | None
an optional Boolean column of var indicating the highly variable genes. Set to None to include all genes. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. When others is specified, hvg_column can be a sequence of length 1 + len(others), containing a column, expression, Series, function, or None for each dataset (for self, followed by each dataset in others).
PC_key: str
the key of obsm where the principal components will be stored
num_PCs: int
the number of top principal components to calculate
subspace_size: int
the size of the Krylov subspace used by IRLBA when calculating PCs. Must be greater than or equal to num_PCs, and about twice num_PCs is recommended. subspace_size will be automatically clipped to at most the minimum of the number of cells and the number of genes.
tolerance: int | float
the relative tolerance (expressed as the ratio of a singular value’s residual to the maximum singular value) required to deem a singular value converged. IRLBA will stop early, before max_iterations iterations, if all singular values have converged.
max_iterations: int
the maximum number of iterations to run IRLBA for, stopping early if all singular values have converged (see tolerance)
chunk_size: int
the number of rows per fixed block in deterministic parallel reductions. Used to parallelize operations like mean and norm that would otherwise be serial. Block boundaries are fixed regardless of thread count, ensuring floating-point identical results.
seed: int
the random seed to use when initializing the PCs, via R’s set.seed() function
match_parallel: bool
if False, use a different order of operations for single-threaded PCA. This gives a moderate (~2x) boost in single-threaded performance, and lower memory usage, at the cost of no longer exactly matching the PCs produced by the multithreaded version (due to differences in floating-point error arising from the different order of operations). When match_parallel=False, pca() will also give slightly different results when run with CSR vs CSC input; when multiple datasets are provided with a mix of CSR and CSC formats, all datasets are converted to the format shared by the most total cells across datasets. If True, exactly match the results of the multithreaded version when num_threads=1. Must be False unless num_threads=1.
overwrite: bool
if True, overwrite PC_key if already present in obsm, instead of raising an error
verbose: bool
whether to print a message when the singular values fail to converge to within tolerance in max_iterations iterations
num_threads: int | None
the number of threads to use for PCA. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. Does not affect the returned SingleCell dataset’s num_threads; this will always be the same as the original dataset’s num_threads.
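The fixed-block reduction described under chunk_size can be illustrated with a small standard-library sketch. It is not the library's implementation: blocked_sum is a hypothetical helper, and a plain sum stands in for the mean and norm computations. The block boundaries depend only on chunk_size, and the partial results are combined in block order, so the thread count cannot change the floating-point result.

```python
from concurrent.futures import ThreadPoolExecutor


def blocked_sum(values, chunk_size=1024, num_threads=1):
    """Sum values in fixed-size blocks (hypothetical helper)."""
    # Block boundaries depend only on chunk_size, never on num_threads
    blocks = [values[start:start + chunk_size]
              for start in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map yields results in block order, regardless of which
        # thread finishes first
        partial_sums = list(pool.map(sum, blocks))
    # Combine the partial sums serially, in a fixed order
    total = 0.0
    for partial in partial_sums:
        total += partial
    return total


values = [0.1 * i for i in range(10_000)]
serial = blocked_sum(values, num_threads=1)
parallel = blocked_sum(values, num_threads=4)
print(serial == parallel)  # True: bitwise-identical results
```

If the blocks were instead sized by dividing the rows evenly among threads, the boundaries would move with the thread count, and the differing order of additions could change the last bits of the result.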
- Returns:
A new SingleCell dataset where obsm contains an additional key, PC_key (default: ‘pca’), containing the top num_PCs principal components. Or, if additional SingleCell dataset(s) are specified via the others argument, a tuple of length 1 + len(others) containing the SingleCell datasets with the PCs added: self, followed by each dataset in others.
- Return type:
SingleCell | tuple[SingleCell, …]
Note
Unlike Seurat’s RunPCA() function, which requires ScaleData() to be run first, this function does not require the data to be scaled beforehand. Instead, it implicitly scales the data to zero mean and unit variance while performing PCA.
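One way implicit scaling can work is sketched below with NumPy. Lanczos bidiagonalization only needs products of the matrix and its transpose with vectors, and both products can be computed for the standardized matrix without ever forming it. This is an illustration under stated assumptions, not the library's code: the toy matrix, the helper names scaled_matvec and scaled_rmatvec, and the use of the sample standard deviation (ddof=1) are all assumptions.

```python
import numpy as np

# Toy "normalized counts": 6 cells x 4 genes (hypothetical data)
X = np.array([[1, 0, 2, 0],
              [0, 3, 0, 1],
              [2, 1, 0, 0],
              [0, 0, 4, 2],
              [3, 2, 1, 0],
              [1, 0, 0, 5]], dtype=float)
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)  # assumption: sample standard deviation


def scaled_matvec(v):
    """Compute Z @ v, where Z = (X - mean) / std, without forming Z."""
    w = v / std
    # Subtracting the scalar mean @ w centers every row at once
    return X @ w - mean @ w


def scaled_rmatvec(u):
    """Compute Z.T @ u without forming Z."""
    return (X.T @ u - mean * u.sum()) / std


# Check against the explicitly standardized matrix
Z = (X - mean) / std
v = np.array([1.0, -2.0, 0.5, 3.0])
u = np.array([0.5, 1.0, -1.0, 2.0, 0.0, -0.5])
print(np.allclose(scaled_matvec(v), Z @ v))     # True
print(np.allclose(scaled_rmatvec(u), Z.T @ u))  # True
```

Because X itself is never modified, a sparse count matrix can stay sparse, whereas explicitly centering it would make it dense.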