harmonize#
- SingleCell.harmonize(*others, QC_column='passed_QC', batch_column=None, PC_key='pca', Harmony_key='harmony', num_clusters=None, max_iterations=10, num_kmeans_iterations=25, kmeans_tolerance=0.01, kmeans_barbar=False, num_init_iterations=5, oversampling_factor=1, chunk_size_kmeans=None, max_clustering_iterations=5, block_proportion=0.05, tolerance=0.01, early_stopping=False, clustering_tolerance=0.001, theta=2, tau=0, alpha=0.2, sigma=0.1, chunk_size_Harmony=None, seed=0, original=False, overwrite=False, verbose=False, num_threads=None)[source]#
Harmonize this SingleCell dataset with other datasets, or harmonize multiple batches of the same dataset, with Harmony2.
Our implementation differs from the original Harmony2 implementation in three key ways:
First, we parallelize Harmony via an innovative nested block strategy. The original implementation partitions cells randomly into blocks, each containing a fixed fraction (block_proportion, 5% by default) of the total cells. It iterates over each block, subtracting the contribution of the cells in the block to the observed and expected cluster-batch co-occurence matrices O and E, re-calculating the soft-clustering assignments R based on the residual O and E, then adding back the contribution of the cells in the block to O and E based on the new R. This approach resists straightforward parallelization because updates to R for future blocks depend on updates to O and E from previous blocks, leading to convergence failure if naively parallelized. Instead, we parallelize within blocks by dividing them into chunks of chunk_size cells (512 by default). We process each chunk in parallel, subtracting only the O and E contributions of the chunk itself, updating R for the chunk, and, after and, after processing all chunks in the block, add back the O and E contributions for all chunks based on the updated R. Updating O and E at the end of each block (rather than after processing every cell in the dataset) ensures convergence, while the inner chunking enables parallelization without disrupting convergence. To use the original implementation’s chunkless strategy for updating R, O, and E, specify original=True, num_threads=1.
Second, we reduce the default number of clustering iterations (Harmony’s inner loop) from 20 to 5, but always complete all 5 iterations without early stopping based on convergence. In practice, the largest changes to the Harmony objective function result from updating the PCs (outer loop), not updating the soft-clustering assignments (inner loop). Our implementation makes updating the PCs sufficiently fast that there are rapidly diminishing returns from performing lots of clustering updates, versus just skipping directly to updating the PCs after a few clustering iterations. By skipping convergence checks, we reduce the number of objective function evaluations (which are expensive) from once per inner iteration to once per outer iteration. To use the original implementation’s default of 20 iterations with early stopping, specify early_stopping=True, max_clustering_iterations=20.
Third, we use random k-means initialization, like version 1 of Harmony, rather than Harmony2’s kmeans++ initialization. We implement a parallel version of kmeans++ called k-means||, but find it does not improve the accuracy of label transfer relative to random initialization and increases runtime. kmeans|| initialization can be enabled with kmeans_barbar=True.
- Parameters:
others: SingleCell
the other SingleCell datasets to harmonize this one with. Can be omitted if harmonizing between batches of the same dataset, but then batch_column must be specified.
QC_column: SingleCellColumn | None | Sequence[SingleCellColumn | None]
an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will be ignored and have their Harmony embeddings set to NaN. When others is specified, QC_column can be a length-1 + len(others) sequence of columns, expressions, Series, functions, or None for each dataset (for self, followed by each dataset in others).
batch_column: SingleCellColumn | None | Sequence[SingleCellColumn | None]
an optional String, Enum, Categorical, or integer column of obs indicating which batch each cell is from. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Each batch will be treated as if it were a distinct dataset; this is exactly equivalent to splitting the dataset with split_by(batch_column) and then passing each of the resulting datasets to harmonize(). Set to None to treat each dataset as having a single batch. When others is specified, batch_column may be a length-1 + len(others) sequence of columns, expressions, Series, functions, or None for each dataset (for self, followed by each dataset in others).
PC_key: str
the key of obsm containing the principal components calculated with pca(), to use as the input to Harmony
Harmony_key: str
the key of obsm where the Harmony embeddings will be stored; will be added in-place to both self and each of the datasets in others!
num_clusters: int | None
the number of clusters used in the Harmony algorithm, including in the initial k-means clustering. If not specified, use 100 clusters if ≥3000 cells, 1 cluster if ≤30 cells, and round(num_cells / 30) if between 30 and 3000 cells.
max_iterations: int
the maximum number of iterations to run Harmony for, if convergence is not achieved. Defaults to 10, like the original Harmony R package, harmony-pytorch, and harmonypy. Set to None to use as many iterations as necessary to achieve convergence.
num_kmeans_iterations: int
the maximum number of iterations of k-means clustering to perform before starting the nearest-neighbor search, stopping early if a relative convergence of kmeans_tolerance is reached. Defaults to 25, like the original Harmony R package, harmony-pytorch, and harmonypy. However, unlike these packages, only one initialization is tried rather than 10 to reduce runtime.
kmeans_tolerance: int | float
the relative change in inertia (the sum of squared distances from each cell to its assigned centroid) used to determine whether to stop optimizing the k-means clustering before num_kmeans_iterations iterations
kmeans_barbar: bool
whether to use k-means|| initialization (a parallel version of k-means++) to initialize the k-means clustering centroids, instead of random initialization. This is more accurate but takes considerably longer for large datasets and num_clusters.
num_init_iterations: int
the number of k-means|| iterations used to initialize the k-means clustering that constitutes the first step of the nearest-neighbor search. k-means|| is a parallel version of the widely used k-means++ initialization scheme for k-means clustering. The default value of 5 is recommended by the k-means|| paper. Only used when kmeans_barbar=True.
oversampling_factor: int | float
the number of candidate centroids selected, on average, at each of the num_init_iterations iterations of k-means||, as a multiple of num_clusters. The default value of 1 is the midpoint (in log space) of the values explored by the k-means|| paper, namely 0.1 to 10. The total number of candidate centroids selected, on average, will be oversampling_factor * num_clusters + 1, from which the final num_clusters centroids will then be selected via k-means++. Only used when kmeans_barbar=True.
chunk_size_kmeans: int | None
the chunk size to use for distance calculations in the initial k-means clustering. Setting this to a power of 2 is recommended. Defaults to min(4096, total number of cells).
max_clustering_iterations: int
the number of iterations to run the clustering step within each Harmony iteration for, or the maximum number of iterations if early_stopping=True. Unlike the original Harmony algorithm, convergence of the clustering is not checked unless early_stopping=True. Defaults to 5 iterations; this differs from the 20 used by the original harmony R package and harmonypy and the 200 iterations used by harmony-pytorch, which do have convergence checks. Must be 4 or more when early_stopping=True, since Harmony’s clustering convergence check requires knowing the errors from the past 3 iterations.
block_proportion: int | float
the proportion of cells to use in each batch update in the clustering step; must be greater than zero and less than or equal to 1
tolerance: int | float
the relative tolerance used to determine whether to stop optimizing the Harmony embeddings before max_iterations iterations
early_stopping: bool
whether to stop clustering before max_clustering_iterations iterations if convergence is reached, like in the original Harmony algorithm
clustering_tolerance: int | float
the relative tolerance used to determine whether to stop clustering before max_clustering_iterations iterations. Only used when early_stopping=True.
theta: int | float
the weight of the diversity penalty term in the Harmony objective function; must be non-negative. Larger values result in more diverse clusters.
tau: int | float
the discounting factor on theta; must be non-negative. By default, tau = 0, so there is no discounting.
alpha: float
the scaling factor for the ridge regression penalty used when correcting the principal components to get the Harmony embeddings; must be greater than 0 and less than 1. The ridge penalty lambda is determined by alpha and the expected number of cells, assuming independence between batches and clusters: lambda = alpha * expected number of cells. Smaller values result in more aggressive correction.
sigma: int | float
the weight of the entropy term in the Harmony objective function; must be positive
chunk_size_Harmony: int | None
the chunk size to use for Harmony. Setting this to a power of 2 is recommended. Defaults to min(256, total number of cells). Not used when original=True.
seed: int
the random seed to use for the initial k-means clustering
original: bool
if True, use the original Harmony algorithm’s blocking strategy, rather than our nested chunks-within-blocks strategy. original=True requires num_threads=1. This gives lower memory usage and closer correspondence to the original algorithm, at the cost of a) a moderate (~20-25%) degradation in performance and b) no longer matching the Harmony embeddings produced by the multithreaded version. If True, exactly match the results of the multithreaded version when num_threads=1. Must be False unless num_threads=1.
overwrite: bool
if True, overwrite Harmony_key if already present in obsm, instead of raising an error
verbose: bool
whether to print details of the harmonization process
num_threads: int | None
the number of threads to use when concatenating principal components and batch/dataset labels across datasets, for the initial k-means clustering, and for the matrix and matrix-vector multiplications within Harmony. Set num_threads=-1 to use all available cores, as determined by
os.cpu_count(), or leave unset to use self.num_threads cores. Does not affect the returned SingleCell dataset’s num_threads; this will always be the same as the original dataset’s num_threads. num_threads will be capped to 64 when running with Scipy linked against OpenBLAS (see warning below).
- Returns:
A length-1 + len(others) tuple of SingleCell datasets with the Harmony embeddings stored in obsm[Harmony_key]: self, followed by each dataset in others. Or, if others is omitted, a single SingleCell dataset with the Harmony embeddings stored in obsm[Harmony_key].
- Return type:
SingleCell | tuple[SingleCell, …]
Warning
If you installed Scipy via pip, it will be linked against OpenBLAS, and harmonize() will be limited to 64 threads due to the limitations of OpenBLAS. To use more than 64 threads, install Scipy linked against MKL BLAS. This is done automatically when installing brisc via conda, but you can also do it manually via conda install “libblas=*=*mkl” scipy.