subsample_obs

SingleCell.subsample_obs(*, n=None, fraction=None, QC_column='passed_QC', by_column=None, subsample_column=None, seed=0, overwrite=False, num_threads=None)

Subsample a specific number or fraction of cells.

Parameters:
  • n: int | None

    the number of cells to return; mutually exclusive with fraction

  • fraction: int | float | None

    the fraction of cells to return; mutually exclusive with n

  • QC_column: SingleCellColumn | None

    an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will not be selected when subsampling, and will not count towards the denominator of fraction; QC_column will not appear in the returned SingleCell dataset, since it would be redundant.

  • by_column: SingleCellColumn | None

    an optional String, Enum, Categorical, or integer column of obs to subsample by. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Specifying by_column ensures that the same fraction of cells is subsampled for each value of by_column. When combined with n, the per-group counts are adjusted so that the total number of sampled cells is exactly n: some of the smallest groups may be oversampled by one cell, or some of the largest groups may be undersampled by one cell. Can contain null entries: the corresponding cells will not be included in the result.

  • subsample_column: str | None

    an optional name of a Boolean column to add to obs indicating the subsampled cells; if None, subset to these cells instead

  • seed: int

    the random seed to use when subsampling

  • overwrite: bool

    if True, overwrite subsample_column if already present in obs, instead of raising an error. Must be False when subsample_column is None.

  • num_threads: int | None

    the number of threads to use when subsetting X to the sampled cells. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(). By default (num_threads=None), use self.num_threads cores. Does not affect the subsampled SingleCell dataset’s num_threads; this will always be the same as the original dataset’s num_threads. Can only be specified when subsample_column is None.
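The interaction between QC_column, by_column, and n can be illustrated with a small NumPy sketch. This is not the library's implementation: the helper names stratified_counts and subsample_mask are hypothetical, and the largest-remainder rounding used to reach exactly n is an assumption that may differ from the rule the library applies. The sketch only shows the documented behavior: cells failing QC or with a null group label are never selected, each group contributes roughly the same fraction of its cells, and the total is exactly n.

```python
import numpy as np


def stratified_counts(group_sizes, n):
    """Hypothetical helper: split n across groups in proportion to their size.

    Uses largest-remainder rounding so the counts sum to exactly n.
    The library's actual rounding rule may differ.
    """
    sizes = np.asarray(group_sizes, dtype=np.int64)
    total = sizes.sum()
    if not 0 < n <= total:
        raise ValueError("n must be between 1 and the number of eligible cells")
    exact = sizes * n / total                 # ideal fractional count per group
    counts = np.floor(exact).astype(np.int64)
    shortfall = int(n - counts.sum())         # cells still unassigned after flooring
    # Hand one extra cell to the groups with the largest fractional parts
    order = np.argsort(-(exact - counts), kind="stable")
    counts[order[:shortfall]] += 1
    return counts


def subsample_mask(labels, passed_qc, n, seed=0):
    """Hypothetical helper: Boolean mask marking n cells, stratified by label."""
    labels = np.asarray(labels, dtype=object)
    not_null = np.array([label is not None for label in labels])
    # Eligible cells passed QC and have a non-null group label
    eligible_idx = np.flatnonzero(np.asarray(passed_qc, dtype=bool) & not_null)
    _, inverse, sizes = np.unique(
        labels[eligible_idx].astype(str), return_inverse=True, return_counts=True
    )
    rng = np.random.default_rng(seed)
    mask = np.zeros(len(labels), dtype=bool)
    for group, count in enumerate(stratified_counts(sizes, n)):
        candidates = eligible_idx[inverse == group]
        mask[rng.choice(candidates, size=count, replace=False)] = True
    return mask


# Toy data: three groups, five cells with a null label, two cells failing QC
labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 20 + [None] * 5
passed_qc = np.ones(len(labels), dtype=bool)
passed_qc[[0, 50]] = False
mask = subsample_mask(labels, passed_qc, n=10)
```

In this toy example, mask plays the role of the Boolean column that subsample_column would add to obs.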

Returns:

A new SingleCell dataset subset to the subsampled cells, or if subsample_column is specified, the full dataset with subsample_column added to obs. If QC_column is a string and a QC column exists in the original dataset, it will be removed from the subsampled dataset, since all subsampled cells pass QC and it would be redundant.

Return type:

SingleCell
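The two return modes, and the way QC affects the denominator of fraction, can be sketched in plain NumPy. This is an illustration under stated assumptions rather than the library's code: the variable names are hypothetical, and rounding fraction times the number of QC-passing cells with round() is an assumption about a detail the documentation does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)  # plays the role of seed=0

# Toy QC column: 8 cells, 6 of which passed QC
passed_qc = np.array([True, True, False, True, True, False, True, True])
fraction = 0.5

# Only QC-passing cells are eligible, and only they count towards the
# denominator of fraction: 0.5 * 6 = 3 cells, not 0.5 * 8 = 4
eligible = np.flatnonzero(passed_qc)
n_sampled = round(fraction * len(eligible))  # rounding rule is an assumption
chosen = rng.choice(eligible, size=n_sampled, replace=False)

# Mode 1 (subsample_column given): keep every cell and add a Boolean flag
flag = np.zeros(len(passed_qc), dtype=bool)
flag[chosen] = True

# Mode 2 (subsample_column=None): keep only the chosen cells.
# Sorting to preserve the original cell order is an assumption of this sketch.
subset_idx = np.sort(chosen)
```

With the documented signature, the first mode corresponds to a call such as dataset.subsample_obs(fraction=0.5, subsample_column='subsampled') and the second to dataset.subsample_obs(fraction=0.5), where dataset and the column name 'subsampled' are placeholder names.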