neighbors#

SingleCell.neighbors(*, QC_column='passed_QC', PC_key='pca', neighbors_key='neighbors', distances_key='distances', num_neighbors=20, num_clusters=None, num_clusters_searched=None, num_kmeans_iterations=2, kmeans_tolerance=0.01, kmeans_barbar=False, num_init_iterations=5, oversampling_factor=1, chunk_size_kmeans=None, chunk_size_search=None, seed=0, overwrite=False, verbose=False, num_threads=None)[source]#

Calculate the num_neighbors nearest neighbors of each cell.

neighbors() is intended to be run after pca(); by default, it uses obsm[‘pca’] as the input to the nearest-neighbors calculation. neighbors() must be re-run if the dataset is subset; not doing so will raise an error. A cell is not considered its own nearest neighbor.

neighbors() is based on the IndexIVFFlat search strategy from the widely-used faiss nearest-neighbor search library. It works in two main steps:

  1. Perform k-means clustering to subdivide the dataset into num_clusters clusters. For large datasets, the number of clusters is four times the square root of the number of cells, rounding up; for small datasets, 1% of the number of cells.

  2. Find each cell’s num_clusters_searched (default: 64) nearest cluster centroids (one of which is the cell’s own cluster), then search these clusters exhaustively for nearest-neighbor candidates.

The key optimization: instead of going one cell at a time and searching its 64 nearest clusters, it group cells into batches of chunk_size_search (default: 256), enumerates which clusters are one of the 64 nearest to at least one cell in the batch, then runs the searches across each of these clusters, one cluster at a time. Since cells in a block will tend to share many of their nearest 64 clusters, this amortizes the cost of streaming the cluster’s cells from memory across all the cells in the block that need them.

The key parameters that affect accuracy and runtime are num_clusters and num_clusters_searched. Runtime goes up about linearly with the fraction of cells searched, which is roughly num_clusters_searched / num_clusters. Accuracy will also increase, but with strongly diminishing returns. We recommend increasing num_clusters_searched (e.g. to 128) if greater accuracy is desired, and decreasing it (e.g. to 32) if greater speed is desired.

Parameters:
  • QC_column: SingleCellColumn | None

    an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will be ignored and have their nearest neighbors set to -1.

  • PC_key: str

    the key of obsm containing the principal components calculated with pca(), to use as the input for the nearest-neighbors calculation

  • neighbors_key: str

    the key of obsm where the nearest neighbors will be stored

  • distances_key: str

    the key of obsm where the squared Euclidean distance to each nearest neighbor will be stored

  • num_neighbors: int

    the number of nearest neighbors to report for each cell; must be less than or equal to the number of cells

  • num_clusters: int | None

    the number of k-means clusters to use during the nearest-neighbor search. Must be at least num_neighbors + 1 (so that enough nearest-neighbor candidates are found even in the worst-case scenario where every cluster contains only one cell) and at most the number of cells. If None, will be set to ceil(min(4 * sqrt(num_cells), num_cells / 100)) clusters, i.e. the minimum of four times the square root of the number of cells and 1% of the number of cells, rounding up (and ensuring at least num_neighbors + 1 clusters). The core of the heuristic, 4 * sqrt(num_cells), is the low end of the range recommended by faiss, 4 to 16 times the square root. However, faiss also recommends using between 39 and 256 data points per centroid when training the k-means clustering used in the k-nearest neighbors search. To avoid going below 39, we switch to using num_cells / 100 centroids for small datasets (fewer than 640,000 cells), since 100 is the midpoint of 39 and 256 in log space.

  • num_clusters_searched: int | None

    the number of a cell’s nearest clusters to search. Must be at least num_neighbors + 1 (so that enough nearest-neighbor candidates are found even in the worst-case scenario where every cluster contains only one cell) and at most num_clusters. Defaults to 64, or num_neighbors if larger than 64, or num_clusters if smaller than 64.

  • num_kmeans_iterations: int

    the maximum number of iterations of k-means clustering to perform before starting the nearest-neighbor search, stopping early if a relative convergence of kmeans_tolerance is reached

  • kmeans_tolerance: int | float

    the relative change in inertia (the sum of squared distances from each cell to its assigned centroid) used to determine whether to stop optimizing the k-means clustering before num_kmeans_iterations iterations

  • kmeans_barbar: bool

    whether to use k-means|| initialization (a parallel version of k-means++) to initialize the k-means clustering centroids, instead of random initialization. This is more accurate but takes considerably longer for large datasets and num_clusters.

  • num_init_iterations: int

    the number of k-means|| iterations used to initialize the k-means clustering that constitutes the first step of the nearest-neighbor search. k-means|| is a parallel version of the widely used k-means++ initialization scheme for k-means clustering. The default value of 5 is recommended by the k-means|| paper. Only used when kmeans_barbar=True.

  • oversampling_factor: int | float

    the number of candidate centroids selected, on average, at each of the num_init_iterations iterations of k-means||, as a multiple of num_clusters. The default value of 1 is the midpoint (in log space) of the values explored by the k-means|| paper, namely 0.1 to 10. The total number of candidate centroids selected, on average, will be oversampling_factor * num_clusters + 1, from which the final num_clusters centroids will then be selected via k-means++. Only used when kmeans_barbar=True.

  • chunk_size_kmeans: int | None

    the chunk size used for distance calculations during k-means clustering, and also during the per-query centroid ranking step of the nearest-neighbor search. Setting this to a power of 2 is recommended. Defaults to min(4096, num_cells).

  • seed: int

    the random seed to use when finding nearest neighbors

  • overwrite: bool

    if True, overwrite neighbors_key if already present in obsm, instead of raising an error

  • verbose: bool

    whether to print details of the nearest-neighbor search

  • num_threads: int | None

    the number of threads to use when finding nearest neighbors. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. Does not affect the returned SingleCell dataset’s num_threads; this will always be the same as the original dataset’s num_threads. num_threads will be capped to 64 when running with Scipy linked against OpenBLAS (see warning below).

Returns:

A new SingleCell dataset with the indices of each cell’s nearest neighbors - not counting the cell itself - stored in obsm[neighbors_key] as a len(obs) × num_neighbors NumPy array, where obsm[neighbors_key][i, j] stores the index of the i`th cell’s `j + 1`th nearest neighbor. (This differs from Scanpy and Seurat, which use a less compact sparse matrix representation instead.) For instance, if `num_neighbors=2 and the 0th cell’s nearest neighbors are the 4th cell and the 6th cell, then obsm[neighbors_key][0] will be np.array([4, 6]). Note that if QC_column is not None, these integer indices are with respect to QCed cells, not all cells. The squared Euclidean distance to each nearest neighbor will be stored in obsm[distances_key] as an array of the same shape as obsm[neighbors_key].

Return type:

SingleCell

Warning

If you installed Scipy via pip, it will be linked against OpenBLAS, and neighbors() will be limited to 64 threads due to the limitations of OpenBLAS. To use more than 64 threads, install Scipy linked against MKL BLAS. This is done automatically when installing brisc via conda, but you can also do it manually via conda install “libblas=*=*mkl” scipy.