label_transfer_from

SingleCell.label_transfer_from(other, original_cell_type_column, *, QC_column='passed_QC', other_QC_column='passed_QC', Harmony_key='harmony', cell_type_column='cell_type', confidence_column=None, next_best=False, next_best_cell_type_column=None, next_best_confidence_column=None, num_neighbors=20, num_clusters=None, num_clusters_searched=None, num_kmeans_iterations=2, kmeans_tolerance=0.01, kmeans_barbar=False, num_init_iterations=5, oversampling_factor=1, chunk_size_kmeans=None, chunk_size_search=None, seed=0, overwrite=False, verbose=True, num_threads=None)

Transfer cell-type labels from another dataset to this one, using the two datasets’ Harmony embeddings from harmonize().

For each cell in self, the transferred cell-type label is the most common cell-type label among the num_neighbors cells in other with the nearest Harmony embeddings. The cell-type confidence is the fraction of these neighbors that share this most common cell-type label.

The nearest-neighbor search is conducted using the same method as neighbors(), with one crucial difference: whereas neighbors() searches for a cell’s nearest neighbors in its own dataset, this function searches for a cell’s nearest neighbors in another dataset, i.e. other.
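
The voting rule described above can be illustrated with a minimal sketch in plain Python. It is not the library's implementation: `transfer_label` and `reference_counts` are hypothetical names, and the neighbor labels are assumed to have already been found by the nearest-neighbor search. Ties are broken by overall label frequency in the reference, following the tiebreak mentioned in the parameter descriptions.

```python
from collections import Counter


def transfer_label(neighbor_labels, reference_counts):
    """Return (label, confidence) for one query cell.

    neighbor_labels: cell-type labels of the cell's nearest neighbors in
        the reference dataset (assumed to be precomputed).
    reference_counts: Counter of labels across the whole reference,
        used here only to break ties (hypothetical helper input).
    """
    votes = Counter(neighbor_labels)
    # Rank by number of votes, then by overall frequency in the reference
    best = max(votes,
               key=lambda label: (votes[label], reference_counts[label]))
    # Confidence is the fraction of neighbors sharing the winning label
    return best, votes[best] / len(neighbor_labels)


reference_labels = ['T cell'] * 50 + ['B cell'] * 30 + ['NK cell'] * 20
reference_counts = Counter(reference_labels)
neighbors = ['T cell', 'T cell', 'NK cell', 'T cell', 'B cell']
label, confidence = transfer_label(neighbors, reference_counts)
print(label, confidence)  # T cell 0.6
```

With five neighbors, every confidence is a multiple of 1 / 5, which mirrors the note on num_neighbors in the parameter list.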

Parameters:
  • other : SingleCell

    the dataset to transfer cell-type labels from

  • original_cell_type_column : SingleCellColumn

    a String, Enum, Categorical, or integer column of other.obs containing cell-type labels. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in other and returns a polars Series or 1D NumPy array.

  • QC_column : SingleCellColumn | None

    an optional Boolean column of self.obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in self and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will have their cell-type labels and confidences set to null.

  • other_QC_column : SingleCellColumn | None

    an optional Boolean column of other.obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in other and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will be ignored during the label transfer.

  • Harmony_key : str

    the key of self.obsm and other.obsm containing the Harmony embeddings for each dataset

  • cell_type_column : str

    the name of a column to be added to self.obs indicating each cell’s most likely cell type, i.e. the most common cell-type label among the cell’s num_neighbors nearest neighbors in other

  • confidence_column : str | None

    the name of a column to be added to self.obs indicating each cell’s cell-type confidence, i.e. the fraction of the cell’s num_neighbors nearest neighbors in other that share the most common cell-type label. If multiple cell types are equally common among the nearest neighbors, tiebreak based on which of them is most common in original_cell_type_column. If None, defaults to f'{cell_type_column}_confidence'.

  • next_best : bool

    whether to also compute each cell’s second-most likely cell type and confidence, or just its most likely

  • next_best_cell_type_column : str | None

    the name of a column to be added to self.obs indicating each cell’s second-most likely cell type, i.e. the second-most common cell-type label among the cell’s num_neighbors nearest neighbors in other. If None, defaults to f'next_best_{cell_type_column}'. Can only be specified when next_best=True.

  • next_best_confidence_column : str | None

    the name of a column to be added to self.obs indicating the confidence of each cell’s second-most likely cell type, i.e. the fraction of the cell’s num_neighbors nearest neighbors in other that share the second-most common cell-type label. If multiple cell types are equally common among the nearest neighbors, tiebreak based on which of them is most common in other. If None, defaults to f'next_best_{cell_type_column}_confidence'. Can only be specified when next_best=True.

  • num_neighbors : int

    the number of nearest neighbors to use when determining a cell’s label. All cell-type confidences will be multiples of 1 / num_neighbors.

  • num_clusters : int | None

    the number of k-means clusters to use during the nearest-neighbor search. Must be less than the number of cells. If None, will be set to ceil(min(4 * sqrt(num_cells), num_cells / 100)) clusters, i.e. the minimum of four times the square root of the number of cells in other and 1% of the number of cells in other, rounding up. The core of the heuristic, 4 * sqrt(num_cells), is the low end of the range recommended by faiss, 4 to 16 times the square root. However, faiss also recommends using between 39 and 256 data points per centroid when training the k-means clustering used in the k-nearest neighbors search. To avoid going below 39, we switch to using num_cells / 100 centroids for small datasets (fewer than 160,000 cells, the point at which 4 * sqrt(num_cells) equals num_cells / 100), since 100 is the midpoint of 39 and 256 in log space. For datasets of at least 10,000 cells, clusters with fewer than 100 cells will be merged into the adjacent cluster with the nearest centroid, so the actual number of clusters used may be smaller.

  • num_clusters_searched : int | None

    the number of a cell’s nearest clusters to search; must be between 1 and num_clusters. Defaults to min(64, num_clusters).

  • num_kmeans_iterations : int

    the maximum number of iterations of k-means clustering to perform before starting the nearest-neighbor search, stopping early if a relative convergence of kmeans_tolerance is reached

  • kmeans_tolerance : int | float

    the relative change in inertia (the sum of squared distances from each cell to its assigned centroid) used to determine whether to stop optimizing the k-means clustering before num_kmeans_iterations iterations

  • kmeans_barbar : bool

    whether to use k-means|| initialization (a parallel version of k-means++) to initialize the k-means clustering centroids, instead of random initialization. This is more accurate but takes considerably longer when the dataset or num_clusters is large.

  • num_init_iterations : int

    the number of k-means|| iterations used to initialize the k-means clustering that constitutes the first step of the nearest-neighbor search. k-means|| is a parallel version of the widely used k-means++ initialization scheme for k-means clustering. The default value of 5 is recommended by the k-means|| paper. Only used when kmeans_barbar=True.

  • oversampling_factor : int | float

    the number of candidate centroids selected, on average, at each of the num_init_iterations iterations of k-means||, as a multiple of num_clusters. The default value of 1 is the midpoint (in log space) of the values explored by the k-means|| paper, namely 0.1 to 10. The total number of candidate centroids selected, on average, will be num_init_iterations * oversampling_factor * num_clusters + 1, from which the final num_clusters centroids will then be selected via k-means++. Only used when kmeans_barbar=True.

  • chunk_size_kmeans : int | None

    the chunk size used for distance calculations during k-means clustering, and also during the per-query centroid ranking step of the nearest-neighbor search. Setting this to a power of 2 is recommended. Defaults to min(4096, number of cells in other).

  • chunk_size_search : int | None

    the chunk size used to group query cells together during the nearest-neighbor search. Overly small values will tend to increase runtime by reducing the reuse of information during the search, whereas overly large values will lead to excessive memory use. Defaults to min(256, number of cells in other).

  • seed : int

    the random seed to use when finding nearest neighbors

  • overwrite : bool

    if True, overwrite cell_type_column and/or confidence_column if already present in this dataset’s obs, instead of raising an error

  • verbose : bool

    whether to print details of the nearest-neighbor search

  • num_threads : int | None

    the number of threads to use for the nearest-neighbor search and label transfer. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. Does not affect the returned SingleCell dataset’s num_threads; this will always be the same as the original dataset’s num_threads. num_threads will be capped to 64 when running with SciPy linked against OpenBLAS (see warning below).
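
The default for num_clusters described in the parameter list can be reproduced with a short sketch. The function name `default_num_clusters` is hypothetical and only mirrors the documented formula; it does not model the later merging of small clusters, so the number of clusters actually used may be lower.

```python
import math


def default_num_clusters(num_cells: int) -> int:
    """Documented heuristic: ceil(min(4 * sqrt(num_cells), num_cells / 100))."""
    return math.ceil(min(4 * math.sqrt(num_cells), num_cells / 100))


# For a small reference, the num_cells / 100 term is the smaller one
print(default_num_clusters(10_000))     # 100
# For a large reference, the 4 * sqrt(num_cells) term is the smaller one
print(default_num_clusters(1_000_000))  # 4000
```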

Returns:

self, with cell_type_column (the transferred cell-type labels) and confidence_column (the cell-type confidences) added to obs. If next_best=True, next_best_cell_type_column and next_best_confidence_column, containing the second-most likely cell type and its confidence, are added as well.

Return type:

SingleCell
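
As a rough illustration of which columns end up in obs, the sketch below resolves the documented default names. `output_columns` is a hypothetical helper rather than part of the library, and raising ValueError for next-best names passed without next_best=True is an assumption about how that restriction might be enforced.

```python
def output_columns(cell_type_column='cell_type', confidence_column=None,
                   next_best=False, next_best_cell_type_column=None,
                   next_best_confidence_column=None):
    """List the obs columns a call would add, using the documented defaults."""
    if not next_best and (next_best_cell_type_column is not None or
                          next_best_confidence_column is not None):
        # Assumption: the real error type and message may differ
        raise ValueError('next-best column names require next_best=True')
    if confidence_column is None:
        confidence_column = f'{cell_type_column}_confidence'
    columns = [cell_type_column, confidence_column]
    if next_best:
        if next_best_cell_type_column is None:
            next_best_cell_type_column = f'next_best_{cell_type_column}'
        if next_best_confidence_column is None:
            next_best_confidence_column = \
                f'next_best_{cell_type_column}_confidence'
        columns += [next_best_cell_type_column, next_best_confidence_column]
    return columns


print(output_columns())
# ['cell_type', 'cell_type_confidence']
print(output_columns(next_best=True))
# ['cell_type', 'cell_type_confidence', 'next_best_cell_type',
#  'next_best_cell_type_confidence']
```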

Warning

If you installed SciPy via pip, it will be linked against OpenBLAS, and label_transfer_from() will be limited to 64 threads due to the limitations of OpenBLAS. To use more than 64 threads, install SciPy linked against MKL BLAS. This is done automatically when installing brisc via conda, but you can also do it manually via conda install "libblas=*=*mkl" scipy.