cluster
- SingleCell.cluster(*, QC_column='passed_QC', resolution=1, min_cluster_size=5, shared_neighbors_key='shared_neighbors', neighbors_key='neighbors', cluster_column='cluster', seed=0, overwrite=False, verbose=False, num_threads=None)
Cluster cells into cell types using Leiden clustering.
This function is intended to be run after shared_neighbors(); by default, it uses obsp['shared_neighbors'] as the input to the clustering.
Our Leiden implementation is based on GVE-Leiden, a non-deterministic parallel version of the Leiden algorithm. To ensure deterministic output, our implementation is not parallelized. However, when multiple resolutions are specified, clusterings for all resolutions run in parallel by default.
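The "serial per clustering, parallel across resolutions" design can be pictured with the minimal sketch below. It is not the library's code: resolve_num_threads, toy_clustering and cluster_all are hypothetical names, toy_clustering is a stand-in that just draws seeded random labels, and rejecting values below 1 (other than -1) is an assumption. The thread-count rules follow the description of num_threads in the parameter list, and the output keys follow the naming described for cluster_column.

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor


def resolve_num_threads(num_threads, default_num_threads, num_resolutions):
    """Choose a worker count following the documented num_threads rules."""
    if num_threads is None:
        # Unset: use the dataset default, capped at one worker per resolution
        return min(default_num_threads, num_resolutions)
    if num_threads == -1:
        if num_resolutions == 1:
            raise ValueError('num_threads=-1 needs more than one resolution')
        # All cores, again capped at one worker per resolution
        return min(os.cpu_count() or 1, num_resolutions)
    if num_threads < 1 or num_threads > num_resolutions:
        raise ValueError(
            'num_threads must be between 1 and the number of resolutions')
    return num_threads


def toy_clustering(num_cells, resolution, seed):
    """Hypothetical stand-in for one serial, seeded clustering run."""
    rng = random.Random(seed)  # private generator, no state shared by threads
    num_clusters = max(1, round(2 * resolution))
    return [rng.randrange(num_clusters) for _ in range(num_cells)]


def cluster_all(num_cells, resolutions, seed=0, num_threads=None,
                default_num_threads=1, cluster_column='cluster'):
    """Run one serial clustering per resolution, several at a time."""
    workers = resolve_num_threads(num_threads, default_num_threads,
                                  len(resolutions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whatever order threads finish
        results = list(pool.map(
            lambda r: toy_clustering(num_cells, r, seed), resolutions))
    return {f'{cluster_column}_{i}': labels
            for i, labels in enumerate(results)}
```

Because each task is serial and owns its random generator, cluster_all(10, [0.5, 1, 2], num_threads=2) and the same call with num_threads=3 return identical labels; only the wall-clock time should differ.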
Like Seurat and Scanpy’s Leiden clustering, our implementation optimizes the objective function corresponding to Reichardt and Bornholdt’s Potts model (RBConfigurationVertexPartition in the reference implementation of Leiden clustering, leidenalg). This formulation extends modularity by introducing a resolution parameter; with the default resolution = 1, it is exactly equivalent to maximizing modularity.
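To make the objective concrete, here is a small pure-Python sketch of the Reichardt-Bornholdt quality with a configuration null model, written in its normalized form: the sum over clusters c of e_c / m - resolution * (K_c / (2 * m)) ** 2, where m is the total edge weight, e_c the edge weight inside cluster c, and K_c the total weighted degree of cluster c. The function name rb_quality and the edge-list input format are assumptions for illustration, not part of this library or of leidenalg (which reports the same quantity without the normalization, so the optimal partition is unchanged).

```python
from collections import defaultdict


def rb_quality(edges, labels, resolution=1.0):
    """Partition quality under the Reichardt-Bornholdt configuration model.

    edges: iterable of (i, j, weight) for an undirected graph without
        self-loops, each edge listed once.
    labels: cluster label of each node, indexable by node.
    """
    total_weight = 0.0               # m: sum of all edge weights
    degree_sum = defaultdict(float)  # K_c: total weighted degree of cluster c
    internal = defaultdict(float)    # e_c: edge weight inside cluster c
    for i, j, weight in edges:
        total_weight += weight
        degree_sum[labels[i]] += weight
        degree_sum[labels[j]] += weight
        if labels[i] == labels[j]:
            internal[labels[i]] += weight
    return sum(
        internal[c] / total_weight
        - resolution * (degree_sum[c] / (2 * total_weight)) ** 2
        for c in degree_sum)


# Toy graph: two triangles joined by a single edge (hypothetical data)
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
split = [0, 0, 0, 1, 1, 1]    # one cluster per triangle
lumped = [0, 0, 0, 0, 0, 0]   # everything in one cluster

modularity = rb_quality(edges, split, resolution=1.0)  # 5/14, about 0.357
```

With resolution=1 this is ordinary modularity, and the two-triangle split scores higher than the single lumped cluster (which scores 0). At a small resolution such as 0.2 the ordering flips and the lumped partition wins, which is one way to see why larger resolutions tend to produce more clusters.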
- Parameters:
QC_column: SingleCellColumn | None | Sequence[SingleCellColumn | None]
an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will be ignored and have their cell-type labels set to null.
resolution: int | float | Iterable[int | float]
the key parameter of Leiden clustering. Larger values result in more clusters. If multiple resolutions are specified, the clustering for each resolution is executed in parallel by default.
min_cluster_size: int
the minimum cluster size. Cells in clusters of size less than min_cluster_size will be merged into the cluster of their nearest neighbor that is in a cluster of size ≥ min_cluster_size. Set min_cluster_size=1 to disable this merging. Clusters with only one or two cells may occur if they are disconnected from the rest of the shared nearest neighbor graph.
shared_neighbors_key: str
the key of obsp containing the shared nearest neighbor graph calculated with shared_neighbors(). This graph is the input to the clustering.
neighbors_key: str
the key of obsm containing the nearest neighbors of each cell calculated with neighbors(). This is only used to merge disconnected clusters of size less than min_cluster_size. Not used when min_cluster_size=1.
cluster_column: str | Iterable[str]
the name of an integer column to be added to obs indicating the cell-type labels. If N resolutions are specified, N columns named f'{cluster_column}_0' through f'{cluster_column}_{N - 1}' will be added.
seed: int
the random seed to use when clustering
overwrite: bool
if True, overwrite cluster_column if already present in obs, instead of raising an error
verbose: bool
whether to print details of the Leiden clustering; must be False when running multithreaded
num_threads: int | None
the number of threads to use when clustering. Parallelization takes place across resolutions. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. In both cases, only as many cores will be used as the number of resolutions specified. Specifying num_threads=-1 when only one resolution is specified will raise an error, as will specifying a positive value for num_threads that is greater than the number of resolutions specified. Does not affect the returned SingleCell dataset's num_threads; this will always be the same as the original dataset's num_threads.
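The merging controlled by min_cluster_size and neighbors_key can be pictured with the minimal sketch below. It is not the library's implementation: merge_small_clusters, its plain-list inputs, and the choice to leave a cell unchanged when none of its listed neighbors belongs to a large enough cluster are all assumptions made for illustration.

```python
from collections import Counter


def merge_small_clusters(labels, neighbors, min_cluster_size):
    """Move cells in undersized clusters into a neighbor's cluster.

    labels: one integer cluster label per cell.
    neighbors: for each cell, indices of its nearest neighbors,
        ordered from nearest to farthest.
    """
    sizes = Counter(labels)  # cluster sizes before any merging
    merged = list(labels)
    for cell, label in enumerate(labels):
        if sizes[label] >= min_cluster_size:
            continue  # this cell's cluster is already large enough
        for neighbor in neighbors[cell]:
            # Use the neighbor's original label, so the result does not
            # depend on the order in which cells are processed
            if sizes[labels[neighbor]] >= min_cluster_size:
                merged[cell] = labels[neighbor]
                break
    return merged


# Toy data (hypothetical): cell 6 is a singleton whose nearest neighbor
# is cell 3, which belongs to the three-cell cluster 1
labels = [0, 0, 0, 1, 1, 1, 2]
neighbors = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4], [3, 0]]
merged = merge_small_clusters(labels, neighbors, min_cluster_size=3)
```

Here cell 6 ends up in cluster 1, and calling the function with min_cluster_size=1 returns the labels unchanged, mirroring the documented way to disable merging.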
- Returns:
A new SingleCell dataset where obs[cluster_column] is an Enum column containing an integer cell-type label for each cell ('0', '1', etc.). If N resolutions are specified, obs[f'{cluster_column}_0'] through obs[f'{cluster_column}_{N - 1}'] instead contain the N sets of cell-type labels.
- Return type:
SingleCell
Note
This function may give incorrect output if you specified a custom shared nearest-neighbor graph that a) is non-symmetric (i.e. (shared_neighbors != shared_neighbors.T).nnz is non-zero, where shared_neighbors = sc.obsp[shared_neighbors_key]), b) contains explicit zeros (i.e. (shared_neighbors.data == 0).any() is True), or c) contains negative values. These conditions are not checked for, due to speed considerations. In the unlikely event that your custom shared nearest-neighbor graph contains explicit zeros, remove them by running sc.obsp[shared_neighbors_key].eliminate_zeros() (an in-place operation) first.
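Since these conditions are not checked by the function, a custom graph can be screened beforehand. The sketch below assumes the graph is a SciPy sparse matrix; check_shared_neighbors is a hypothetical helper written for this example, not part of the library, and the toy graph is made-up data.

```python
import numpy as np
from scipy import sparse


def check_shared_neighbors(graph):
    """Report the three unchecked conditions for a SciPy sparse graph."""
    graph = graph.tocsr()
    return {
        'non_symmetric': (graph != graph.T).nnz > 0,
        'explicit_zeros': bool((graph.data == 0).any()),
        'negative_values': bool((graph.data < 0).any()),
    }


# Toy symmetric 3-cell graph that stores two explicit zeros
rows = np.array([0, 1, 1, 2, 0, 2])
cols = np.array([1, 0, 2, 1, 2, 0])
vals = np.array([3.0, 3.0, 2.0, 2.0, 0.0, 0.0])
graph = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))

before = check_shared_neighbors(graph)  # flags the explicit zeros
graph.eliminate_zeros()                 # in place, as recommended above
after = check_shared_neighbors(graph)   # all three flags are now False
```

On a real dataset the same helper would be applied to sc.obsp[shared_neighbors_key] before calling cluster().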