pca

SingleCell.pca(*others, QC_column='passed_QC', hvg_column='highly_variable', PC_key='pca', num_PCs=50, subspace_size=100, tolerance=1e-06, max_iterations=100, chunk_size=1024, seed=0, match_parallel=False, overwrite=False, verbose=False, num_threads=None)

Compute principal components (PCs) across cells.

Requires normalized counts, so must be run after normalize(). By default, only the highly variable genes from hvg() are used to compute PCs.

Uses approximate singular value decomposition (SVD) via the Implicitly Restarted Lanczos Bidiagonalization Algorithm (IRLBA). Seurat uses a different implementation of the same IRLBA algorithm.

Parameters:
  • others: SingleCell

    optional SingleCell datasets to jointly compute principal components across, alongside this one.

  • QC_column: SingleCellColumn | None | Sequence[SingleCellColumn | None]

    an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will be ignored and have their PCs set to NaN. When others is specified, QC_column can be a sequence of length 1 + len(others), containing a column, expression, Series, function, or None for each dataset (for self, followed by each dataset in others).

  • hvg_column: SingleCellColumn | Sequence[SingleCellColumn] | None

    an optional Boolean column of var indicating the highly variable genes. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all genes. When others is specified, hvg_column can be a sequence of length 1 + len(others), containing a column, expression, Series, function, or None for each dataset (for self, followed by each dataset in others).

  • PC_key: str

    the key of obsm where the principal components will be stored

  • num_PCs: int

    the number of top principal components to calculate

  • subspace_size: int

    the size of the Krylov subspace used by IRLBA when calculating PCs. Must be greater than or equal to num_PCs, and about twice num_PCs is recommended. subspace_size will be automatically clipped to at most the minimum of the number of cells and the number of genes.

  • tolerance: int | float

    the relative tolerance (expressed as the ratio of a singular value’s residual to the maximum singular value) required to deem a singular value converged. IRLBA will stop early, before max_iterations iterations, if all singular values have converged.

  • max_iterations: int

    the maximum number of iterations to run IRLBA for, stopping early if all singular values have converged (see tolerance)

  • chunk_size: int

    the number of rows per fixed block in deterministic parallel reductions. Used to parallelize operations like mean and norm that would otherwise be serial. Block boundaries are fixed regardless of thread count, ensuring floating-point identical results.

  • seed: int

    the random seed to use when initializing the PCs, via R’s set.seed() function

  • match_parallel: bool

    if False, use a different order of operations for single-threaded PCA. This gives a moderate (~2x) speedup in single-threaded performance and lower memory usage, but the PCs no longer exactly match those produced by the multithreaded version, because the different order of operations changes the floating-point rounding error. When match_parallel=False, pca() will also give slightly different results for CSR versus CSC input; when multiple datasets are provided with a mix of CSR and CSC formats, all datasets are converted to the format shared by the most total cells across datasets. If True, a single-threaded run (num_threads=1) exactly matches the results of the multithreaded version. Must be False unless num_threads=1.

  • overwrite: bool

    if True, overwrite PC_key if already present in obsm, instead of raising an error

  • verbose: bool

    whether to print a message when the singular values fail to converge to within tolerance after max_iterations iterations

  • num_threads: int | None

    the number of threads to use for PCA. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. Does not affect the returned SingleCell dataset’s num_threads; this will always be the same as the original dataset’s num_threads.
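
The determinism described for chunk_size can be illustrated with a minimal pure-Python sketch. It is not this library's implementation: the helper name blocked_sum and the use of ThreadPoolExecutor are assumptions made for illustration. Partial sums are computed over fixed blocks and combined in block order, so the result does not depend on how many threads processed the blocks.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def blocked_sum(values, chunk_size=1024, num_threads=1):
    """Sum `values` using fixed-size blocks (hypothetical helper)."""
    # Block boundaries depend only on chunk_size, never on num_threads.
    blocks = [values[start:start + chunk_size]
              for start in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in block order, regardless of which
        # thread happened to finish first.
        partial_sums = list(pool.map(sum, blocks))
    # Combine the partial sums serially, in a fixed order.
    total = 0.0
    for partial in partial_sums:
        total += partial
    return total


rng = random.Random(0)
data = [rng.uniform(-1e6, 1e6) for _ in range(10_000)]

single = blocked_sum(data, num_threads=1)
multi = blocked_sum(data, num_threads=4)
print(single == multi)  # exact equality, not just approximate agreement
```

Changing chunk_size, by contrast, moves the block boundaries and may change the last few bits of the result, which is why the block size is a fixed parameter rather than something derived from the thread count.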

Returns:

A new SingleCell dataset where obsm contains an additional key, PC_key (default: 'pca'), containing the top num_PCs principal components. Or, if additional SingleCell dataset(s) are specified via the others argument, a tuple of length 1 + len(others) containing the SingleCell datasets with the PCs added: self, followed by each dataset in others.
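
As a rough illustration of the returned layout, the NumPy sketch below shows how an embedding computed only on QC-passing cells could be expanded back to one row per cell, with NaN rows for cells that failed QC (as described for QC_column). The helper expand_with_nan and the array names are hypothetical, and the random matrix merely stands in for real PCs.

```python
import numpy as np


def expand_with_nan(pcs_passing, passed_qc):
    """Place per-cell PCs into a full-size array (hypothetical helper).

    pcs_passing: (number of QC-passing cells, num_PCs) array
    passed_qc:   Boolean array with one entry per cell
    """
    passed_qc = np.asarray(passed_qc, dtype=bool)
    full = np.full((passed_qc.size, pcs_passing.shape[1]), np.nan,
                   dtype=pcs_passing.dtype)
    full[passed_qc] = pcs_passing  # rows for failing cells stay NaN
    return full


passed_qc = np.array([True, False, True, True, False])
# Stand-in for PCs computed on the 3 passing cells (2 PCs in this toy case).
pcs_passing = np.random.default_rng(0).standard_normal((3, 2))

pcs = expand_with_nan(pcs_passing, passed_qc)
print(pcs.shape)                  # one row per cell, passing or not
print(np.isnan(pcs).all(axis=1))  # True only for cells that failed QC
```

Downstream code that consumes the stored PCs therefore needs to mask out or otherwise handle NaN rows when a QC column was used.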

Return type:

SingleCell | tuple[SingleCell, …]

Note

Unlike Seurat’s RunPCA() function, which requires ScaleData() to be run first, this function does not require the data to be scaled beforehand. Instead, it implicitly scales the data to zero mean and unit variance while performing PCA.
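
One way such implicit scaling can work is sketched below with NumPy. This is an illustrative assumption, not a description of this function's internals: products with the standardized matrix are computed from the unscaled matrix plus a correction term, so the dense standardized matrix is never formed. The function names and the use of the population standard deviation (NumPy's default, ddof=0) are choices made for this example.

```python
import numpy as np


def scaled_matvec(X, mean, std, v):
    """Compute Z @ v, where Z is X standardized per column, without forming Z."""
    w = v / std              # fold the per-gene 1/std scaling into v
    return X @ w - mean @ w  # subtract the scalar centering correction


def scaled_rmatvec(X, mean, std, u):
    """Compute Z.T @ u without forming Z."""
    return (X.T @ u - mean * u.sum()) / std


rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(200, 30)).astype(float)  # toy cells x genes matrix
mean, std = X.mean(axis=0), X.std(axis=0)

Z = (X - mean) / std  # explicit standardization, built only for comparison
v = rng.standard_normal(30)
u = rng.standard_normal(200)

print(np.allclose(scaled_matvec(X, mean, std, v), Z @ v))
print(np.allclose(scaled_rmatvec(X, mean, std, u), Z.T @ u))
```

Iterative SVD methods such as IRLBA only need these two matrix-vector products, so a sparse count matrix can stay sparse even though its standardized version would be dense.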