subsample_var
- SingleCell.subsample_var(*, n=None, fraction=None, by_column=None, subsample_column=None, seed=0, overwrite=False, num_threads=None)
Subsample a specific number or fraction of genes.
- Parameters:
n: int | None
the number of genes to return; mutually exclusive with fraction
fraction: int | float | None
the fraction of genes to return; mutually exclusive with n
by_column: SingleCellColumn | None
an optional String, Enum, Categorical, or integer column of var to subsample by. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Specifying by_column ensures that the same fraction of genes is subsampled for each value of by_column. When combined with n, to make the total number of subsampled genes exactly n, some of the smallest groups may be oversampled by one element, or some of the largest groups may be undersampled by one element. Can contain null entries: the corresponding genes will not be included in the result.
subsample_column: str | None
an optional name of a Boolean column to add to var indicating the subsampled genes; if None, subset to these genes instead
seed: int
the random seed to use when subsampling
overwrite: bool
if True, overwrite subsample_column if already present in var, instead of raising an error. Must be False when subsample_column is None.
num_threads: int | None
the number of threads to use when subsetting X to the sampled genes. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(). By default (num_threads=None), use self.num_threads cores. Does not affect the subsampled SingleCell dataset’s num_threads; this will always be the same as the original dataset’s num_threads. Can only be specified when subsample_column is None.
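The interaction between by_column and n described above can be illustrated with a small standalone sketch. It is not the library's implementation: the rounding rule (round each group's proportional share, then adjust the smallest or largest groups by one until the total is exactly n) is an assumption modelled on the parameter description, and the names allocate_counts and stratified_subsample are hypothetical helpers that use only the Python standard library.

```python
import random
from collections import defaultdict


def allocate_counts(group_sizes, n):
    """Split n across groups in proportion to their size, summing to exactly n.

    Assumed rule (not taken from the library): round each group's share, then
    add one to the smallest groups or remove one from the largest groups
    until the counts sum to n.
    """
    total = sum(group_sizes.values())
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the number of labelled genes")
    if total == 0:
        return {g: 0 for g in group_sizes}
    counts = {g: round(size * n / total) for g, size in group_sizes.items()}
    diff = n - sum(counts.values())
    # Smallest groups first when adding, largest groups first when removing.
    order = sorted(group_sizes, key=group_sizes.get, reverse=diff < 0)
    step = 1 if diff > 0 else -1
    while diff != 0:
        for g in order:
            if diff == 0:
                break
            new_count = counts[g] + step
            if 0 <= new_count <= group_sizes[g]:
                counts[g] = new_count
                diff -= step
    return counts


def stratified_subsample(labels, n, seed=0):
    """Return sorted indices of n items, sampled proportionally per label."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        if label is not None:  # null entries are never selected
            groups[label].append(index)
    counts = allocate_counts({g: len(ix) for g, ix in groups.items()}, n)
    chosen = []
    for g, indices in groups.items():
        chosen.extend(rng.sample(indices, counts[g]))
    return sorted(chosen)
```

With three groups of three genes each and n=4, every group's share rounds to 1, so one group receives an extra gene to reach the requested total; with n=5, every share rounds to 2 and one group gives one back.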
- Returns:
A new SingleCell dataset subset to the subsampled genes, or if subsample_column is specified, the full dataset with subsample_column added to var.
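As a hedged usage sketch, the calls below exercise the two return modes using only the keyword arguments listed in the signature. The wrapper function subsample_var_examples, the variable sc (assumed to be an already-loaded SingleCell dataset), and the column names "gene_type" and "subsampled" are illustrative assumptions, not part of the documented API.

```python
def subsample_var_examples(sc):
    """Illustrative calls; `sc` is assumed to be a loaded SingleCell dataset."""
    # Subset mode: returns a new dataset restricted to 2,000 random genes.
    subset = sc.subsample_var(n=2000, seed=0)

    # Stratified subset: keep 10% of the genes within each value of a var
    # column. "gene_type" is a hypothetical column name.
    stratified = sc.subsample_var(fraction=0.1, by_column="gene_type")

    # Flag mode: keep every gene and add a Boolean column to var instead.
    # overwrite=True is only allowed because subsample_column is given, and
    # num_threads is left unset because it applies only when subsetting.
    flagged = sc.subsample_var(
        n=2000, subsample_column="subsampled", overwrite=True
    )
    return subset, stratified, flagged
```

Each call passes either n or fraction, never both, since the two are mutually exclusive.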
- Return type: