subsample_obs
- Pseudobulk.subsample_obs(*, n=None, fraction=None, cell_types=None, excluded_cell_types=None, by_column=None, subsample_column=None, seed=0, overwrite=False)
Subsample a specific number or fraction of samples.
- Parameters:
n : int | None
the number of samples to return; mutually exclusive with fraction
cell_types : str | Iterable[str] | None
one or more cell types to operate on; if None, operate on all cell types. Mutually exclusive with excluded_cell_types.
excluded_cell_types : str | Iterable[str] | None
one or more cell types to exclude from the operation; mutually exclusive with cell_types
fraction : int | float | None
the fraction of samples to return; mutually exclusive with n
by_column : str | Expr | Series | ndarray | Callable[[Pseudobulk, str], Series | ndarray] | None | dict[str, str | Expr | Series | ndarray | Callable[[Pseudobulk, str], Series | ndarray] | None]
an optional String, Enum, Categorical, or integer column of obs to subsample by. Can be None, a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this Pseudobulk dataset and a cell type and returns a polars Series or 1D NumPy array. Can also be a dictionary mapping cell-type names to any of the above; each cell type in this Pseudobulk dataset must be present. Specifying by_column ensures that the same fraction of samples with each value of by_column is subsampled. When combined with n, to make the total number of samples exactly n, some of the smallest groups may be oversampled by one element, or some of the largest groups may be undersampled by one element. Can contain null entries: the corresponding samples will not be included in the result.
subsample_column : str | None
an optional name of a Boolean column to add to obs indicating the subsampled samples; if None, subset to these samples instead
seed : int
the random seed to use when subsampling
overwrite : bool
if True, overwrite subsample_column if already present in obs, instead of raising an error. Must be False when subsample_column is None.
- Returns:
A new Pseudobulk dataset subset to the subsampled samples, or if subsample_column is specified, the full dataset with subsample_column added to obs.
- Return type:
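
Example (illustrative sketch only): the interaction between n and by_column described above can be pictured with a small standalone function. This is not the library's implementation. The function name stratified_subsample_mask is hypothetical, and the rounding rule used here (the largest-remainder method) is an assumption that may differ from how subsample_obs decides which groups to oversample or undersample by one element. The sketch returns a Boolean mask, loosely mirroring what subsample_column would hold, and treats None labels like null entries that are never selected.

```python
import random
from collections import defaultdict


def stratified_subsample_mask(labels, n, seed=0):
    """Hypothetical helper: pick exactly n items, spread proportionally
    across the groups in `labels`. Items labelled None are never picked."""
    # Collect the positions belonging to each group, skipping null labels.
    positions = defaultdict(list)
    for i, label in enumerate(labels):
        if label is not None:
            positions[label].append(i)

    total = sum(len(idx) for idx in positions.values())
    if not 0 <= n <= total:
        raise ValueError(f"n must be between 0 and {total}, got {n}")

    # Each group's ideal share is n * size / total. Take the floor with
    # integer arithmetic and remember the remainder for the next step.
    quotas = {}
    remainders = []
    for label, idx in positions.items():
        quotas[label] = n * len(idx) // total
        remainders.append((n * len(idx) % total, label))

    # Largest-remainder rounding (an assumption, see the note above):
    # give one extra item to the groups with the biggest remainders
    # until the quotas sum to exactly n.
    leftover = n - sum(quotas.values())
    remainders.sort(key=lambda pair: pair[0], reverse=True)
    for _, label in remainders[:leftover]:
        quotas[label] += 1

    # Draw each group's quota without replacement, using a seeded
    # generator so the result is reproducible.
    rng = random.Random(seed)
    mask = [False] * len(labels)
    for label, idx in positions.items():
        for i in rng.sample(idx, quotas[label]):
            mask[i] = True
    return mask


# Six samples in group "a", three in group "b", one with a null label.
labels = ["a"] * 6 + ["b"] * 3 + [None]
mask = stratified_subsample_mask(labels, n=4, seed=0)

# Subsetting with the mask keeps 3 samples from "a" and 1 from "b".
subset = [label for label, keep in zip(labels, mask) if keep]
```

Here the ideal shares are 4 × 6/9 ≈ 2.67 for "a" and 4 × 3/9 ≈ 1.33 for "b", so one group must be rounded up and the other down for the total to be exactly 4. Keeping the mask corresponds to passing subsample_column, and filtering by it corresponds to leaving subsample_column as None.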