pseudobulk

SingleCell.pseudobulk(ID_column, cell_type_column, /, *, QC_column='passed_QC', cell_types=None, excluded_cell_types=None, additional_obs=None, include_nulls=False, sort_genes=False, num_threads=None, verbose=False)

Pseudobulk a SingleCell dataset with sample ID and cell type columns, yielding a Pseudobulk dataset.

Operates on raw counts, so cannot be run after normalize(). Must be run after qc().

Counts from cells with the same pair of values in ID_column and cell_type_column will be summed, gene by gene, into a single pseudobulk profile. Cells with null in either column are excluded, unless include_nulls=True.
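
The aggregation rule can be illustrated without the library. The following is a minimal pure-Python sketch, not the library's implementation: pseudobulk_counts is a hypothetical helper, the toy data are invented, and None stands in for null.

```python
from collections import defaultdict


def pseudobulk_counts(counts, sample_ids, cell_types, include_nulls=False):
    """Sum per-cell count vectors over cells sharing (sample ID, cell type).

    counts: list of per-cell lists of gene counts (cells x genes)
    sample_ids, cell_types: per-cell labels; None stands in for null
    """
    num_genes = len(counts[0]) if counts else 0
    sums = defaultdict(lambda: [0] * num_genes)
    for row, sample, cell_type in zip(counts, sample_ids, cell_types):
        # Skip cells with a null label unless nulls are explicitly included
        if not include_nulls and (sample is None or cell_type is None):
            continue
        total = sums[(sample, cell_type)]
        for gene, value in enumerate(row):
            total[gene] += value
    return dict(sums)


# Four cells, three genes; the last cell has a null sample ID
counts = [[1, 0, 2], [3, 1, 0], [0, 5, 1], [2, 2, 2]]
sample_ids = ['S1', 'S1', 'S2', None]
cell_types = ['T cell', 'T cell', 'B cell', 'T cell']

result = pseudobulk_counts(counts, sample_ids, cell_types)
print(result)
# {('S1', 'T cell'): [4, 1, 2], ('S2', 'B cell'): [0, 5, 1]}
```

With include_nulls=True, the fourth cell would form its own (None, 'T cell') group instead of being dropped.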

You can run this function multiple times at different cell type resolutions by setting a different cell_type_column each time, then combining the results afterwards with the | operator (assuming none of the cell types overlap between the two resolutions):

pb_broad = sc.pseudobulk('ID', 'broad_cell_type')
pb_fine = sc.pseudobulk('ID', 'fine_grained_cell_type')
pb = pb_broad | pb_fine
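
To see why the cell types must not overlap, it may help to picture each Pseudobulk field as a dictionary keyed by cell type. The sketch below uses plain Python dictionaries as stand-ins; combine_by_cell_type and the cell type names are hypothetical, and raising on overlap is a choice made for this sketch rather than a description of how the library's | operator behaves.

```python
def combine_by_cell_type(broad, fine):
    """Merge two per-cell-type dictionaries, refusing overlapping keys."""
    overlap = sorted(broad.keys() & fine.keys())
    if overlap:
        # A plain dict union would silently keep only one of the two entries
        raise ValueError(f'cell types present at both resolutions: {overlap}')
    return broad | fine  # dict union, Python 3.9+


# Stand-ins for the per-cell-type contents of two Pseudobulk datasets
broad = {'Neuron': 'neuron pseudobulk'}
fine = {'Excitatory': 'excitatory pseudobulk',
        'Inhibitory': 'inhibitory pseudobulk'}

combined = combine_by_cell_type(broad, fine)
print(list(combined))  # ['Neuron', 'Excitatory', 'Inhibitory']
```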

Parameters:

ID_column: SingleCellColumn
a String, Enum, Categorical, or integer column of obs containing sample IDs. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array.
cell_type_column: SingleCellColumn
a String, Enum, Categorical, or integer column of obs containing cell-type labels. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. If cell_type_column is an integer column, the cell types in the Pseudobulk dataset will be coerced to strings.
QC_column: SingleCellColumn | None
an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Set to None to include all cells. Cells failing QC will be excluded from the pseudobulk.
cell_types: str | Iterable[str] | int | Iterable[int] | None
one or more cell types to pseudobulk; by default, pseudobulks all cell types in cell_type_column. Specifying cell_types is exactly equivalent to filtering the result to these cell types, but will be faster when there are many cell types and pseudobulks are only desired for a few of them. Can also be used to change the order in which cell types appear in the resulting Pseudobulk dataset, even if pseudobulking all cell types. Mutually exclusive with excluded_cell_types and include_nulls.
excluded_cell_types: str | Iterable[str] | int | Iterable[int] | None
one or more cell types to exclude from pseudobulking. Mutually exclusive with cell_types and include_nulls.
additional_obs: DataFrame | None
an optional DataFrame of additional sample-level covariates, which will be joined to the pseudobulk’s obs for each cell type
include_nulls: bool
whether to include cells with null values in ID_column and/or cell_type_column in the pseudobulk. If include_nulls=True, null will be treated just like any other value. This means that, for instance, all cells from a given cell type that have null as the sample ID will be pseudobulked together, as will all cells from a given sample ID that have null as the cell type. Mutually exclusive with cell_types and excluded_cell_types.
sort_genes: bool
whether to sort genes in alphabetical order in the pseudobulk; by default, genes appear in the same order as in the SingleCell dataset
num_threads: int | None
the number of threads to use when pseudobulking; parallelism happens across {sample, cell type} pairs (or just samples, if cell_type_column is None). Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. For count matrices stored in the usual CSR format, specifying more threads than the number of such pairs will therefore not provide additional speedup. Does not affect the Pseudobulk dataset’s num_threads; this will always be the same as the SingleCell dataset’s num_threads.
verbose: bool
whether to print the number of cells excluded when include_nulls=False (and neither cell_types nor excluded_cell_types is specified)
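
As a usage sketch of the keyword-only parameters above, the function below calls pseudobulk() three ways. It uses only the arguments documented in the signature; pseudobulk_variants is a hypothetical wrapper, the column names are taken from the earlier example, and the cell type labels ('Excitatory', 'Inhibitory', 'Unknown') are invented for illustration.

```python
def pseudobulk_variants(sc):
    """Call pseudobulk() three ways on `sc`, assumed to be a SingleCell
    dataset that has been through qc() and has not been normalized."""
    # Only two cell types, appearing in this order (hypothetical labels)
    pb_subset = sc.pseudobulk('ID', 'broad_cell_type',
                              cell_types=['Excitatory', 'Inhibitory'])

    # Every cell type except one (hypothetical label)
    pb_without = sc.pseudobulk('ID', 'broad_cell_type',
                               excluded_cell_types='Unknown')

    # Keep cells with a null sample ID or cell type; this cannot be
    # combined with cell_types or excluded_cell_types
    pb_nulls = sc.pseudobulk('ID', 'broad_cell_type', include_nulls=True,
                             sort_genes=True, num_threads=-1)

    return pb_subset, pb_without, pb_nulls
```

Note that ID_column and cell_type_column are positional-only, so they are passed without keywords.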

Returns:

A Pseudobulk dataset with X (the pseudobulked counts), obs (metadata per sample), and var (metadata per gene) fields, each of which is a dictionary keyed by cell type. The columns of each cell type’s obs will be:

ID_column
'num_cells' (the number of cells for that sample and cell type)

followed by whichever columns of the SingleCell dataset’s obs are constant within each sample (that is, take a single value across all of a sample’s cells). var will be identical to the SingleCell dataset’s var.
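
The layout of each cell type's obs can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: sample_level_obs is a hypothetical helper, the toy obs rows are invented, and "constant" is read here as "takes a single value among the cells of each sample".

```python
def sample_level_obs(obs_rows, id_column):
    """Build per-sample obs rows from per-cell obs rows of one cell type.

    obs_rows: list of dicts, one per cell, all with the same keys
    """
    # Group cells by sample ID, preserving first-seen order
    groups = {}
    for row in obs_rows:
        groups.setdefault(row[id_column], []).append(row)

    # Keep only columns with exactly one distinct value within every sample
    other_columns = [c for c in obs_rows[0] if c != id_column] if obs_rows else []
    constant = [
        c for c in other_columns
        if all(len({r[c] for r in rows}) == 1 for rows in groups.values())
    ]

    return [
        {id_column: sample, 'num_cells': len(rows),
         **{c: rows[0][c] for c in constant}}
        for sample, rows in groups.items()
    ]


obs_rows = [
    {'ID': 'S1', 'age': 60, 'num_genes': 1500},
    {'ID': 'S1', 'age': 60, 'num_genes': 2100},
    {'ID': 'S2', 'age': 72, 'num_genes': 1800},
]

sample_obs = sample_level_obs(obs_rows, 'ID')
print(sample_obs)
# [{'ID': 'S1', 'num_cells': 2, 'age': 60}, {'ID': 'S2', 'num_cells': 1, 'age': 72}]
```

Here age is carried over because it is a sample-level covariate, while num_genes varies between the cells of S1 and is dropped.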

Return type:

Pseudobulk

Note

This function may give incorrect output if the count matrix contains negative values; for speed, this is not checked.