qc
- Pseudobulk.qc(group_column, /, *, custom_filter=None, min_samples=2, min_cells=10, max_standard_deviations=3, min_nonzero_fraction=0.8, cell_types=None, excluded_cell_types=None, error_if_negative_counts=True, allow_float=False, verbose=False)
Subsets each cell type to samples passing quality control (QC). If samples fall into discrete groups (e.g. disease cases versus controls), these should be specified via the group_column argument.
Filters, in order, to:
- samples that pass the custom_filter (if specified), have non-missing values for group_column (if specified), and have at least min_cells cells of that type (default: 10)
- samples where the number of genes with 0 counts is at most max_standard_deviations standard deviations above the mean (default: 3)
- genes with at least 1 count in 100 * min_nonzero_fraction% (default: 80%) of samples (in every group, if group_column is specified)
If at any point during this filtering process, there are fewer than min_samples (default: 2) samples (in any group, if group_column is specified), or group_column is specified and all samples have the same value of group_column, the cell type is filtered out entirely.
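The sample-level steps above can be sketched in NumPy for a single cell type. This is a minimal illustration, not the library's implementation: the function name `qc_sample_mask`, its inputs (`X` as a samples × genes count matrix, `n_cells`, `group`), the use of the population standard deviation, and the choice to count groups among the samples still passing are all assumptions.

```python
import numpy as np

def qc_sample_mask(X, n_cells, group=None, *, min_samples=2, min_cells=10,
                   max_standard_deviations=3):
    """Hypothetical sketch of the sample-level QC steps for one cell type.

    X: samples x genes count matrix; n_cells: cells per sample;
    group: optional object array of labels, with None for missing values.
    Returns a Boolean mask of kept samples, or None if the cell type
    would be filtered out entirely.
    """
    def too_few(mask):
        if group is None:
            return mask.sum() < min_samples
        # Assumption: groups are taken from the samples still passing
        labels, counts = np.unique(group[mask], return_counts=True)
        # Fewer than two distinct groups also drops the cell type
        return len(labels) < 2 or counts.min() < min_samples

    mask = np.ones(X.shape[0], dtype=bool)
    # Step 1: non-missing group label and at least min_cells cells
    if group is not None:
        mask &= np.array([g is not None for g in group])
    if min_cells is not None:
        mask &= n_cells >= min_cells
    if too_few(mask):
        return None
    # Step 2: drop samples with unusually many zero-count genes
    if max_standard_deviations is not None:
        n_zero = (X == 0).sum(axis=1)
        # Assumption: mean and population std over samples passing step 1
        mean, std = n_zero[mask].mean(), n_zero[mask].std()
        mask &= n_zero <= mean + max_standard_deviations * std
        if too_few(mask):
            return None
    return mask
```

Returning `None` here stands in for "the cell type is filtered out entirely"; the real method instead returns a new Pseudobulk dataset without that cell type.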
- Parameters:
group_column: PseudobulkColumn | None | dict[str, PseudobulkColumn | None]
an optional String, Categorical, Enum, Boolean, or integer column of obs with sample group information, e.g. which samples are disease cases and which are controls. If specified, the min_nonzero_fraction and min_samples filters must pass for every group, rather than merely passing for the dataset as a whole. Set to None if samples do not fall into discrete groups. Can be None, a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this Pseudobulk dataset and a cell type and returns a polars Series or 1D NumPy array. Or, a dictionary mapping cell-type names to any of the above; each cell type in this Pseudobulk dataset must be present. Can contain null entries: the corresponding samples will be deemed to fail QC.
custom_filter: PseudobulkColumn | None | dict[str, PseudobulkColumn | None]
an optional Boolean column of obs containing a filter to apply on top of the other QC filters; True elements will be kept. Can be None, a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this Pseudobulk dataset and a cell type and returns a polars Series or 1D NumPy array. Or, a dictionary mapping cell-type names to any of the above; each cell type in this Pseudobulk dataset must be present.
min_samples: int
filter to cell types with at least this many samples in every group, or with at least this many total samples if group_column is None
min_cells: int | None
if not None, filter to samples with ≥ this many cells of each cell type
max_standard_deviations: int | float | None
if not None, filter to samples where the number of genes with 0 counts is at most this many standard deviations above the mean
min_nonzero_fraction: int | float | None
if not None, filter to genes with at least one count in this fraction of samples in each group, or if group_column is None, at least one count in this fraction of samples overall. Note: min_nonzero_fraction=0 filters out only genes with all-zero counts, while min_nonzero_fraction=None does not filter out any genes.
cell_types: str | Iterable[str] | None
one or more cell types to QC; if None, QC all cell types. Mutually exclusive with excluded_cell_types.
excluded_cell_types: str | Iterable[str] | None
one or more cell types to exclude from QC; mutually exclusive with cell_types
error_if_negative_counts: bool
if True, raise an error if any counts are negative
allow_float: bool
if False, raise an error if self.X.dtype is floating-point (suggesting the user may not be using the raw counts); if True, disable this sanity check
verbose: bool
whether to print how many samples and genes were filtered out at each step of the QC process
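To illustrate how min_nonzero_fraction interacts with group_column, and how min_nonzero_fraction=0 differs from None, here is a minimal NumPy sketch of the gene-level filter. The function name `qc_gene_mask` and its inputs are hypothetical, and the rule that a gene needs at least one nonzero sample (so that 0 removes only all-zero genes) is an assumption about how the documented behaviour might be achieved, not the library's code.

```python
import numpy as np

def qc_gene_mask(X, group=None, *, min_nonzero_fraction=0.8):
    """Hypothetical sketch: Boolean mask of genes kept by the
    min_nonzero_fraction filter. X: samples x genes counts;
    group: optional array with one label per sample."""
    if min_nonzero_fraction is None:
        # None disables gene filtering entirely
        return np.ones(X.shape[1], dtype=bool)

    def passes(rows):
        nonzero = X[rows] > 0
        # Assumption: also require at least one nonzero sample, so that
        # min_nonzero_fraction=0 removes only all-zero genes
        return nonzero.any(axis=0) & \
            (nonzero.mean(axis=0) >= min_nonzero_fraction)

    if group is None:
        return passes(np.ones(X.shape[0], dtype=bool))
    # With groups, the threshold must hold within every group
    return np.logical_and.reduce(
        [passes(group == g) for g in np.unique(group)])
```

In this sketch, a gene expressed in only one group fails the grouped filter even when it would pass on the dataset as a whole, which mirrors the "in every group" wording above.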
- Returns:
A new Pseudobulk dataset with each cell type’s X, obs and var subset to samples and genes passing QC.
- Return type:
Pseudobulk
Note
This function may give an incorrect output if the count matrix contains negative values. When error_if_negative_counts=False, negative values are not checked for.
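The input sanity checks described by error_if_negative_counts and allow_float could look roughly like the sketch below. The helper name `check_counts`, the exception types and the messages are assumptions chosen for illustration; the source does not say which exceptions the library raises.

```python
import numpy as np

def check_counts(X, *, error_if_negative_counts=True, allow_float=False):
    """Hypothetical sketch of the sanity checks on the count matrix X."""
    # A floating-point dtype suggests X may not hold raw counts
    if not allow_float and np.issubdtype(X.dtype, np.floating):
        raise TypeError('X is floating-point; pass allow_float=True '
                        'to disable this check')
    # Negative counts may lead to incorrect QC output
    if error_if_negative_counts and (X < 0).any():
        raise ValueError('X contains negative counts')
```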