pacmap
- SingleCell.pacmap(*, QC_column='passed_QC', PC_key='pca', neighbors_key='neighbors', distances_key='distances', embedding_key='pacmap', num_neighbors=10, num_extra_neighbors=10, num_mid_near_pairs=5, num_further_pairs=20, num_iterations=(100, 100, 250), learning_rate=1, seed=0, match_parallel=False, overwrite=False, num_threads=None)
Calculate a two-dimensional embedding of this SingleCell dataset suitable for plotting with plot_embedding().
Uses PaCMAP, a relative of UMAP that captures global structure better.
This function is intended to be run after pca() and neighbors(). By default, it uses obsm['pca'], obsm['neighbors'], and obsm['distances'] as the inputs to PaCMAP, and stores the output in obsm['pacmap'] as a len(obs) × 2 NumPy array. It can also be run on Harmony embeddings by running harmonize() and then specifying PC_key='harmony'.
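As a rough illustration of the workflow above, the sketch below wraps the call in a small helper. The helper name `embed_with_pacmap` and its `use_harmony` flag are hypothetical; only the `PC_key`, `embedding_key`, and `seed` arguments come from the signature documented here, and the dataset is assumed to already hold the outputs of pca() and neighbors() (and harmonize(), if Harmony embeddings are requested).

```python
def embed_with_pacmap(sc, use_harmony=False):
    """Run PaCMAP on a dataset that already has PCs and neighbors.

    `sc` is assumed to be a SingleCell dataset exposing the pacmap()
    method documented on this page; this helper itself is hypothetical.
    """
    # Harmony embeddings are selected by pointing PC_key at them.
    pc_key = 'harmony' if use_harmony else 'pca'
    # pacmap() returns a new dataset with the embedding in obsm['pacmap'].
    return sc.pacmap(PC_key=pc_key, embedding_key='pacmap', seed=0)
```

The returned dataset can then be passed on to plot_embedding().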
- Parameters:
QC_column: SingleCellColumn | None
an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will be ignored and have their embeddings set to NaN.
PC_key: str
the key of obsm containing the principal components calculated with pca(), to use as an input for the embedding calculation. Can also be set to the Harmony embeddings calculated by harmonize(), by specifying PC_key='harmony'.
neighbors_key: str
the key of obsm containing the nearest-neighbor indices for each cell, to use as an input for the embedding calculation
distances_key: str
the key of obsm containing the squared Euclidean distance to each nearest neighbor in neighbors_key, to use as an input for the embedding calculation
embedding_key: str
the key of obsm where the embeddings will be stored
num_neighbors: int
the number of nearest neighbors in the original high-dimensional space to consider for each point. Higher values focus on preserving the broader topological structure of local neighborhoods, potentially merging close clusters. Lower values prioritize the very fine-grained local structure, which can reveal intricate patterns but may also fragment larger clusters.
num_extra_neighbors: int
the number of extra nearest neighbors (on top of num_neighbors) to search for initially, before pruning to the num_neighbors of these num_neighbors + num_extra_neighbors cells with the smallest scaled distances. For a pair of cells i and j, the scaled distance between i and j is their squared Euclidean distance, divided by i’s average Euclidean distance to its 3rd, 4th, and 5th nearest neighbors, divided by j’s average Euclidean distance to its 3rd, 4th, and 5th nearest neighbors. Must be a non-negative integer. Defaults to 10, instead of PaCMAP’s original default of 50. neighbors_key and distances_key must contain at least num_neighbors + num_extra_neighbors nearest neighbors.
num_mid_near_pairs: int
the number of moderately close cells (not nearest neighbors) to sample for each cell, used to attract distinct local neighborhoods together. Higher values add more “scaffolding” to preserve the large-scale global structure and the relationships between clusters. Lower values reduce this effect, allowing local structures to be placed more independently of one another.
num_further_pairs: int
the number of distant cells to sample for each cell, used to create repulsive forces that prevent crowding and shape the final layout. Higher values increase this repulsive force, leading to a more spread-out embedding with clearer separation between clusters. Lower values reduce the force, which can result in a more compact layout where clusters may be closer or overlap.
num_iterations: int | tuple[int, int, int]
the number of iterations to run PaCMAP for. Can be a length-3 tuple of the number of iterations for each of the 3 stages of optimization, or a single integer of the number of iterations for the third stage (in which case the number of iterations for the first two stages will be set to 100).
learning_rate: int | float
the learning rate of the Adam optimizer for PaCMAP
seed: int
the random seed to use for PaCMAP
match_parallel: bool
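if True, make single-threaded PaCMAP use the same order of operations as the multithreaded version, so that the embedding exactly matches the one produced with multiple threads. If False (the default), single-threaded PaCMAP uses a different order of operations. This gives a modest (~15%) boost in single-threaded performance at the cost of no longer exactly matching the embedding produced by the multithreaded version (due to differences in floating-point error arising from the different order of operations). Must be False unless num_threads=1.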
overwrite: bool
if True, overwrite embedding_key if already present in obsm, instead of raising an error
num_threads: int | None
the number of threads to use when running PaCMAP. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. Does not affect the returned SingleCell dataset’s num_threads; this will always be the same as the original dataset’s num_threads.
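The scaled-distance pruning described under num_extra_neighbors can be sketched in NumPy as follows. This is an illustration of the formula as worded above, not the library's implementation. The function names are hypothetical, and the sketch assumes that column 0 of the neighbor arrays is each cell's nearest neighbor other than itself, so that the 3rd, 4th, and 5th nearest neighbors sit in columns 2 to 4.

```python
import numpy as np

def scaled_distances(neighbors, sq_distances):
    """Scaled distance from each cell to each of its candidate neighbors.

    neighbors:    (num_cells, k) integer array of neighbor indices
    sq_distances: (num_cells, k) squared Euclidean distances, sorted ascending
    Assumes column 0 is the nearest non-self neighbor (an assumption).
    """
    # Average Euclidean (not squared) distance to the 3rd-5th nearest neighbors.
    sigma = np.sqrt(sq_distances[:, 2:5]).mean(axis=1)
    # Squared distance divided by cell i's scale and by neighbor j's scale.
    return sq_distances / (sigma[:, None] * sigma[neighbors])

def prune_neighbors(neighbors, sq_distances, num_neighbors):
    """Keep the num_neighbors candidates with the smallest scaled distances."""
    scaled = scaled_distances(neighbors, sq_distances)
    keep = np.argsort(scaled, axis=1, kind='stable')[:, :num_neighbors]
    return np.take_along_axis(neighbors, keep, axis=1)
```

With the defaults, `neighbors` and `sq_distances` would each have at least num_neighbors + num_extra_neighbors = 20 columns, and `prune_neighbors(..., num_neighbors=10)` would keep 10 of them per cell.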
- Returns:
A new SingleCell dataset with the PaCMAP embedding stored in obsm[embedding_key].
- Return type:
SingleCell
Note
PaCMAP’s original implementation assumes generic input data, so it initializes the embedding by standardizing the input data, running PCA on it, and taking the first two PCs. Because our input data is already PCs (or harmonized PCs), we avoid redundant calculations by omitting this step and directly initializing the embedding with the first two columns of our input data, i.e. the first two PCs.
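The shortcut in this note, together with the NaN handling described for QC_column, might look like the following in NumPy. Both function names are hypothetical and the code is only a sketch of the idea: any scaling or further processing the library applies to the initial layout is not shown.

```python
import numpy as np

def init_from_pcs(pcs):
    """Initial 2D layout: the first two columns of the (harmonized) PCs."""
    return np.asarray(pcs, dtype=float)[:, :2].copy()

def expand_to_all_cells(embedding, passed_qc):
    """Scatter embeddings of QC-passing cells into a len(obs) x 2 array.

    Rows for cells that failed QC are left as NaN.
    """
    passed_qc = np.asarray(passed_qc, dtype=bool)
    full = np.full((len(passed_qc), 2), np.nan)
    full[passed_qc] = embedding
    return full
```

Here `init_from_pcs` would be applied only to the rows of the PC matrix that pass QC, and `expand_to_all_cells` restores the original cell order afterwards.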