Pseudobulk

A pseudobulked single-cell dataset resulting from calling pseudobulk() on a SingleCell dataset. Has slots for:

X: a dict of NumPy arrays of counts per sample and gene for each cell type
obs: a dict of polars DataFrames of sample metadata for each cell type
var: a dict of polars DataFrames of gene metadata for each cell type
num_threads: the default number of threads to use for operations on the dataset that support multithreading (which can be overridden by individual functions)

as well as obs_names and var_names, aliases for a dict of obs[:, 0] and var[:, 0] for each cell type.

In many ways, Pseudobulk datasets behave like dictionaries:

pb1 | pb2 combines pseudobulks with non-overlapping cell types into one big pseudobulk
cell_type in pb tests whether cell_type is a cell type in the pseudobulk
for cell_type in pb: and for cell_type in pb.keys(): yield the cell type names
for X, obs, var in pb.values(): yields each cell type’s X, obs, and var
for cell_type, (X, obs, var) in pb.items(): yields both the name and the X, obs, and var for each cell type

There are also custom iterators if you just want one field per cell type:

for X in pb.iter_X(): yields just the X for each cell type
for obs in pb.iter_obs(): yields just the obs for each cell type
for var in pb.iter_var(): yields just the var for each cell type
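
The dictionary-style behaviour and per-field iterators listed above can be mimicked with a small stand-in class. This is only an illustrative sketch: `ToyPseudobulk` is a hypothetical name, it stores plain Python lists rather than NumPy arrays and polars DataFrames, it returns lists where the real methods return dictionary views, and raising `ValueError` on overlapping cell types is an assumption rather than documented behaviour.

```python
class ToyPseudobulk:
    """Hypothetical stand-in for illustration; not brisc's implementation."""

    def __init__(self, X, obs, var):
        # Each argument is a {cell type: value} dict sharing the same keys.
        if not (X.keys() == obs.keys() == var.keys()):
            raise ValueError("X, obs and var must have the same cell types")
        self.X, self.obs, self.var = dict(X), dict(obs), dict(var)

    def __contains__(self, cell_type):
        return cell_type in self.X

    def __iter__(self):
        # Iterating yields the cell type names, like iterating over a dict.
        return iter(self.X)

    def keys(self):
        return self.X.keys()

    def values(self):
        # One (X, obs, var) tuple per cell type (a list here, not a view).
        return [(self.X[ct], self.obs[ct], self.var[ct]) for ct in self.X]

    def items(self):
        return [(ct, (self.X[ct], self.obs[ct], self.var[ct]))
                for ct in self.X]

    def iter_X(self):
        return iter(self.X.values())

    def iter_obs(self):
        return iter(self.obs.values())

    def iter_var(self):
        return iter(self.var.values())

    def __or__(self, other):
        # Assumption: overlapping cell types are rejected.
        overlap = self.X.keys() & other.X.keys()
        if overlap:
            raise ValueError(f"overlapping cell types: {sorted(overlap)}")
        return ToyPseudobulk({**self.X, **other.X},
                             {**self.obs, **other.obs},
                             {**self.var, **other.var})


# Toy data: rows of X are samples, columns are genes.
pb1 = ToyPseudobulk(X={"Astrocyte": [[5, 0], [3, 2]]},
                    obs={"Astrocyte": ["sample1", "sample2"]},
                    var={"Astrocyte": ["GeneA", "GeneB"]})
pb2 = ToyPseudobulk(X={"Microglia": [[1, 4]]},
                    obs={"Microglia": ["sample1"]},
                    var={"Microglia": ["GeneA", "GeneB"]})
pb = pb1 | pb2

for cell_type, (X, obs, var) in pb.items():
    print(cell_type, len(obs), "samples,", len(var), "genes")
```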

class brisc.Pseudobulk(source=None, /, *, X=None, obs=None, var=None, num_threads=None)

Parameters:

source: str | Path | None
a directory to load a saved Pseudobulk dataset from (see save()). Mutually exclusive with X, obs, and var.
X: dict[str, ndarray[dtype[floating]]] | None
a {cell type: NumPy array} dictionary of counts or log CPMs. Mutually exclusive with source.
obs: dict[str, DataFrame] | None
a {cell type: polars DataFrame} dict of metadata per sample, when X is a dictionary. The first column must be String, Enum, Categorical, or integer. Mutually exclusive with source.
var: dict[str, DataFrame] | None
a {cell type: polars DataFrame} dict of metadata per gene, when X is a dictionary. The first column must be String, Enum, Categorical, or integer. Mutually exclusive with source.
num_threads: int | None
the default number of threads to use for all subsequent operations on this Pseudobulk dataset. By default (num_threads=None), use all available cores, as determined by os.cpu_count().
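
The argument rules above (source versus X, obs, and var, plus the num_threads default) can be summarised in a short validation sketch. The helper name `check_pseudobulk_args`, the choice of `ValueError`, the requirement that X, obs, and var are all supplied together with matching cell types, and the fallback to one thread when `os.cpu_count()` returns `None` are assumptions made for illustration, not documented behaviour of the class.

```python
import os


def check_pseudobulk_args(source=None, /, *, X=None, obs=None, var=None,
                          num_threads=None):
    """Hypothetical validator mirroring the documented argument rules."""
    parts = (X, obs, var)
    if source is not None:
        # Documented: source is mutually exclusive with X, obs, and var.
        if any(part is not None for part in parts):
            raise ValueError("source is mutually exclusive with X, obs, var")
        mode = "load"
    else:
        # Assumption: without source, all three dicts must be given...
        if any(part is None for part in parts):
            raise ValueError("X, obs and var are all required without source")
        # ...and must describe the same cell types.
        if not (X.keys() == obs.keys() == var.keys()):
            raise ValueError("X, obs and var must share the same cell types")
        mode = "build"
    if num_threads is None:
        # Documented default: all available cores via os.cpu_count().
        # Falling back to 1 when it returns None is an assumption.
        num_threads = os.cpu_count() or 1
    return mode, num_threads


print(check_pseudobulk_args("saved_pseudobulk", num_threads=4))
```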

Analysis

`qc`	Subset each cell type to samples passing quality control (QC).
`library_size`	Calculate normalization factor-adjusted library sizes for each sample in each cell type, via the approach of edgeR's `calcNormFactors()`.
`cpm`	Calculate counts per million for each cell type.
`log_cpm`	Calculate log counts per million for each cell type.
`regress_out`	Regress out covariates from obs.
`de`	Perform differential expression (DE) on a Pseudobulk dataset with limma-voom.
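
As a reminder of what counts per million means, here is a minimal plain-Python sketch of the textbook formula, count × 10^6 / library size, applied per sample. It is not the library's implementation: the function name `toy_cpm` is hypothetical, and whether the `cpm` method uses raw row sums or the normalization factor-adjusted library sizes from `library_size` is not stated in this section, so the sketch accepts either.

```python
def toy_cpm(counts, library_sizes=None):
    """Hypothetical helper: counts per million for a samples x genes table.

    counts: list of rows, one row of per-gene counts per sample.
    library_sizes: optional per-sample library sizes; defaults to row sums.
    """
    if library_sizes is None:
        library_sizes = [sum(row) for row in counts]
    if any(size <= 0 for size in library_sizes):
        raise ValueError("every sample needs a positive library size")
    return [[count * 1e6 / size for count in row]
            for row, size in zip(counts, library_sizes)]


# Two samples, three genes.
counts = [[10, 30, 60],
          [0, 5, 15]]
result = toy_cpm(counts)
print(result)
```

With the default library sizes, each row of the result sums to one million.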

I/O

`save`	Save a Pseudobulk dataset to directory (which must not exist unless overwrite=True, and will be created) with three files per cell type: the X at f'{cell_type}.X.npy', the obs at f'{cell_type}.obs.parquet', and the var at f'{cell_type}.var.parquet'.
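
The file layout described for `save` can be spelled out with `pathlib`. The helper below is a hypothetical illustration that only computes the expected paths from the three documented filename patterns; it does not write anything and is not part of the library.

```python
from pathlib import Path


def expected_save_files(directory, cell_types):
    """Hypothetical helper: map each cell type to its three expected files."""
    directory = Path(directory)
    return {
        cell_type: (
            directory / f"{cell_type}.X.npy",        # counts matrix
            directory / f"{cell_type}.obs.parquet",  # sample metadata
            directory / f"{cell_type}.var.parquet",  # gene metadata
        )
        for cell_type in cell_types
    }


layout = expected_save_files("my_pseudobulk", ["Astrocyte", "Microglia"])
for cell_type, paths in layout.items():
    print(cell_type, [path.name for path in paths])
```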

Properties

`X`	A dictionary of count matrices for each cell type, as NumPy arrays.
`obs`	A dictionary of Polars DataFrames of sample-level metadata for each cell type.
`var`	A dictionary of Polars DataFrames of gene-level metadata for each cell type.
`obs_names`	A shortcut to access the first column of obs for each cell type.
`var_names`	A shortcut to access the first column of var for each cell type.
`num_threads`	The default number of threads used for this Pseudobulk dataset's operations.
`shape`	A dictionary mapping each cell type to a length-2 tuple, where the first element is the number of samples and the second is the number of genes.

Data access

`sample`	Get the row of X[cell_type] corresponding to a single sample, based on the sample's name in obs_names.
`gene`	Get the column of X[cell_type] corresponding to a single gene, based on the gene's name in var_names.
`peek_obs`	Print a row of obs (the first row, by default) for a cell type (the first cell type, by default) with each column on its own line.
`peek_var`	Print a row of var (the first row, by default) for a cell type (the first cell type, by default) with each column on its own line.

Dictionary interface

`keys`	Get a KeysView (like you would get from dict.keys()) of this Pseudobulk dataset's cell types.
`values`	Get a ValuesView (like you would get from dict.values()) of (X, obs, var) tuples for each cell type in this Pseudobulk dataset.
`items`	Get an ItemsView (like you would get from dict.items()) of (cell_type, (X, obs, var)) tuples for each cell type in this Pseudobulk dataset.
`iter_X`	Iterate over each cell type's X.
`iter_obs`	Iterate over each cell type's obs.
`iter_var`	Iterate over each cell type's var.
`__contains__`	Check if this Pseudobulk dataset contains the specified cell type.
`__or__`	Combine the cell types of this Pseudobulk dataset with another.
`__eq__`	Test for equality with another Pseudobulk dataset.

Manipulation

`set_obs_names`	Set a column as the new first column of obs, i.e. the obs_names.
`set_var_names`	Set a column as the new first column of var, i.e. the var_names.
`set_num_threads`	Return a new Pseudobulk dataset with a different default number of threads.
`filter_obs`	Equivalent to `df.filter()` from polars, but applied to both obs and X for each cell type.
`filter_var`	Equivalent to `df.filter()` from polars, but applied to both var and X for each cell type.
`select_obs`	Equivalent to `df.select()` from polars, but applied to each cell type's obs.
`select_var`	Equivalent to `df.select()` from polars, but applied to each cell type's var.
`select_cell_types`	Create a new Pseudobulk dataset subset to the cell type(s) in cell_types and more_cell_types.
`with_columns_obs`	Equivalent to `df.with_columns()` from polars, but applied to each cell type's obs.
`with_columns_var`	Equivalent to `df.with_columns()` from polars, but applied to each cell type's var.
`drop_obs`	Create a new Pseudobulk dataset with columns and more_columns removed from obs.
`drop_var`	Create a new Pseudobulk dataset with columns and more_columns removed from var.
`drop_cell_types`	Create a new Pseudobulk dataset with cell_types and more_cell_types removed.
`rename_obs`	Create a new Pseudobulk dataset with column(s) of obs renamed for each cell type.
`rename_var`	Create a new Pseudobulk dataset with column(s) of var renamed for each cell type.
`rename_cell_types`	Create a new Pseudobulk dataset with cell type(s) renamed.
`cast_X`	Cast each cell type's X to the specified data type.
`cast_obs`	Cast column(s) of each cell type's obs to the specified data type(s).
`cast_var`	Cast column(s) of each cell type's var to the specified data type(s).
`join_obs`	Left-join each cell type's obs with another DataFrame, using the same logic as `df.join()`.
`join_var`	Left-join each cell type's var with another DataFrame, using the same logic as `df.join()`.
`subsample_obs`	Subsample a specific number or fraction of samples.
`subsample_var`	Subsample a specific number or fraction of genes.
`split_by_cell_type`	Split this Pseudobulk dataset into a tuple of Pseudobulk datasets with one cell type each.
`concat_obs`	Concatenate one or more other Pseudobulk datasets with this one, sample-wise.
`concat_var`	Concatenate one or more other Pseudobulk datasets with this one, gene-wise.
`copy`	Make a copy of this Pseudobulk dataset.
`to_df`	Convert this Pseudobulk object to a polars DataFrame, with one row per (sample, cell type) pair and one column per gene.
`map_X`	Apply a function to each cell type's X.
`map_obs`	Apply a function to each cell type's obs.
`map_var`	Apply a function to each cell type's var.
`pipe`	Apply a function to a Pseudobulk dataset.
`pipe_X`	Apply a function to a Pseudobulk dataset's X.
`pipe_obs`	Apply a function to a Pseudobulk dataset's obs.
`pipe_var`	Apply a function to a Pseudobulk dataset's var.
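
The phrase "applied to both obs and X" in the `filter_obs` entry means that a row filter on the sample metadata also drops the matching rows of the counts matrix, so the two stay aligned. The sketch below illustrates that idea with plain Python lists and dicts. The helper `filter_samples` and its predicate argument are hypothetical; the real method takes polars-style filter arguments and operates on polars DataFrames and NumPy arrays.

```python
def filter_samples(X, obs, keep):
    """Hypothetical helper: keep samples whose obs row satisfies `keep`.

    X:   {cell type: list of per-sample count rows}
    obs: {cell type: list of per-sample metadata dicts, same order as X}
    """
    X_out, obs_out = {}, {}
    for cell_type in X:
        # One boolean per sample, computed from the metadata only.
        mask = [keep(row) for row in obs[cell_type]]
        # The same mask is applied to obs and to the rows of X.
        obs_out[cell_type] = [row for row, m in zip(obs[cell_type], mask) if m]
        X_out[cell_type] = [row for row, m in zip(X[cell_type], mask) if m]
    return X_out, obs_out


X = {"Astrocyte": [[5, 0], [3, 2], [7, 1]]}
obs = {"Astrocyte": [{"ID": "s1", "age": 70},
                     {"ID": "s2", "age": 45},
                     {"ID": "s3", "age": 82}]}

X_filtered, obs_filtered = filter_samples(X, obs, lambda row: row["age"] >= 65)
print([row["ID"] for row in obs_filtered["Astrocyte"]], X_filtered["Astrocyte"])
```

`filter_var` follows the same idea along the other axis: the mask comes from var and is applied to the columns of X.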