de#

Pseudobulk.de(formula, /, *, coefficient=None, contrasts=None, group=None, categorical_columns=None, ordinal_columns=None, cell_types=None, excluded_cell_types=None, library_size_column='library_size', strict=False, robust=False, return_voom_info=True, allow_float=False, verbose=False)[source]#

Perform differential expression (DE) on a Pseudobulk dataset with limma-voom. The DE design is specified via an R formula string that references columns from obs.

Must be run after library_size(). Requires raw counts, so must not be run after cpm() or log_cpm().

To avoid confounding due to library size or the number of cells included in the pseudobulk, we strongly recommend including ‘+ log2(num_cells) + log2(library_size)’ to the formula, where ‘library_size’ is a column of obs that will be added when running library_size() before this function.

By default, DE is reported with respect to the first non-intercept column of the design matrix, which is usually just the first term in the formula. This can be changed via the coefficient and contrasts arguments.

By default, voomByGroup is used instead of voom when reporting DE with respect to a categorical variable, e.g. when comparing disease cases to healthy controls. The unique values of coefficient are used as the groups. This can be changed via the group argument.

The design matrix is constructed via the model.matrix() R function. String, Categorical, and Enum columns of obs referenced in formula are converted to unordered factors, which by default are one-hot encoded into N - 1 columns of the design matrix, where N is the number of unique values (contr.treatment in R). Use the ordinal_columns argument to treat specific columns as ordered factors (i.e. ordinal variables) instead. Use the categorical_columns argument to treat specific integer columns as unordered factors (i.e. categorical variables).

The way that ordered and unordered factors are encoded can also be changed globally. For example, to use Helmert contrasts for ordered factors:

from ryp import r
r('options(contrasts=c(unordered="contr.treatment", '
  '                    ordered="contr.helmert"))')

To view the current value of the contrasts option, use:

from ryp import r
r('getOption("contrasts")')
Parameters:
  • formula: str | dict[str, str]

    a string representation of an R formula specifying the DE design in terms of columns of obs, e.g. ‘~ disease_status + age + sex’. Will be converted into an R formula object with R’s as.formula() function and then expanded into a design matrix with R’s model.matrix() function. Must begin with a tilde (~). May also be a dictionary mapping cell-type names to formulas; each cell type in this Pseudobulk dataset must be present.

  • coefficient: str | int | Iterable[str | int] | dict[str, str | int | Iterable[str | int] | None] | None

    the name or 0-based index of a coefficient in the design matrix to report DE with respect to, or a sequence of names or indices to report DE with respect to multiple coefficients. Or, a dictionary mapping cell-type names to any of the above, or to None to use the default for that cell type; each cell type in this Pseudobulk dataset must be present. Negative indices work in the usual Python way. If not specified, or set to None for a given cell type, defaults to the first non-intercept column of the design matrix: column 1 if the formula includes an intercept (the usual case), or column 0 if it does not (e.g. when the formula contains ~0 or -1). Mutually exclusive with contrasts.

  • contrasts: dict[str, str] | None | dict[str, dict[str, str] | None]

    an optional dictionary mapping contrast names to string representations of R formulas specifying contrasts between names of columns in the design matrix (e.g. {‘DrugA_vs_Control’: ‘DrugA - Control’}); the contrast names (keys of the dictionary) will appear in the ‘Coefficient’ column of the output DE object. Or, a dictionary mapping cell-type names to these dictionaries; each cell type in this Pseudobulk dataset must be present. If specified, DE will be performed with respect to each contrast by running limma’s makeContrasts() and contrasts.fit() functions after lmFit(). Mutually exclusive with coefficient. Notably, contrasts can be specified that involve column names that would not be valid R variable names, and backticks are optional: {‘CD8 vs CD4’: ‘CD8+ T-cells - CD4+ T-cells’} is valid even though the two column names ‘CD8+ T-cells’ and ‘CD4+ T-cells’ are not escaped with backticks.

  • group: False | PseudobulkColumn | None | dict[str, False | str | Expr | Series | ndarray | None]

    if group=False, force the use of voom instead of voomByGroup. If group=None, group on the unique combinations of values of the categorical columns of obs referenced in coefficient, or the columns of obs referenced in contrasts. Here, categorical columns are those that are String, Categorical, Enum, or Boolean and not specified in ordinal_columns, or those that are integer and specified in categorical_columns. If group is a column, force the use of voomByGroup and group on the unique values of that column. group can also be a dictionary mapping cell-type names to False, None, or a column for each cell type. When using voomByGroup, the same groups are also used as the group argument to calcNormFactors() when normalizing by library size. All groups must have at least two samples.

  • categorical_columns: str | Iterable[str] | None | dict[str, str | Iterable[str] | None]

    one or more names of integer columns of obs to treat as categorical (i.e. convert to unordered factors), or a dictionary mapping cell-type names to names of integer columns

  • ordinal_columns: str | Iterable[str] | None | dict[str, str | Iterable[str] | None]

    one or more names of integer, String, Categorical, or Enum columns of obs to treat as ordinal (i.e. convert to ordered factors), or a dictionary mapping cell-type names to names of such columns. By default, ordered factors are assumed to have equally spaced levels and are expanded into N - 1 columns in the design matrix, where each column represents a polynomial term of increasing degree (linear, quadratic, cubic, etc.) calculated from these equally spaced levels (contr.poly in R).

  • cell_types: str | Iterable[str] | None

    one or more cell types to test for DE; if None, test all cell types. If specified and using dictionaries for formula, coefficient, contrasts, or group, the keys of these dictionaries must match the cell types specified here. Mutually exclusive with excluded_cell_types.

  • excluded_cell_types: str | Iterable[str] | None

    one or more cell types to exclude when testing for DE. If specified and using dictionaries for formula, coefficient, contrasts, or group, the keys of these dictionaries must also exclude these cell types. Mutually exclusive with cell_types.

  • library_size_column: PseudobulkColumn

    a floating-point column of obs containing each sample’s library size. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this Pseudobulk dataset and a cell type and returns a polars Series or 1D NumPy array. Or, a dictionary mapping cell-type names to any of the above; each cell type in this Pseudobulk dataset must be present.

  • strict: bool

    whether to raise an error if the design matrix does not have more rows than columns and/or is rank-deficient for any cell type. If strict=False, cell types where this is the case will be skipped, and an error will only be raised if every cell type is skipped.

  • robust: bool

    whether to specify robust=True in limma’s eBayes() function. You may wish to specify this if your dataset contains outliers.

  • return_voom_info: bool

    whether to include the voom weights and voom plot data in the returned DE object; set to False for reduced runtime if you do not need to use the voom weights or generate voom plots

  • allow_float: bool

    if False, raise an error if self.X.dtype is floating-point (suggesting the user may not be using the raw counts, e.g. due to accidentally having run log_cpm() already); if True, disable this sanity check

  • verbose: bool

    whether to print out details of the DE estimation

Returns:

A DE object with a table attribute containing a polars DataFrame of the DE results, with columns:

  • cell_type: the cell type in which DE was tested

  • coefficient: the coefficient (or contrast) for which DE was

    tested

  • gene: the gene for which DE was tested

  • logFC: the log2 fold change of the gene, i.e. its effect size

  • SE: the standard error of the effect size

  • LCI: the lower 95% confidence interval of the effect size

  • UCI: the upper 95% confidence interval of the effect size

  • AveExpr: the gene’s average expression in this cell type, in log

    CPM

  • p: the DE p-value

  • Bonferroni: the Bonferroni-corrected DE p-value

  • FDR: the FDR q-value for the DE

If return_voom_info=True, the DE object also includes a voom_weights attribute containing a {cell_type: DataFrame} dictionary of voom weights, and a voom_plot_data attribute containing a {cell_type: DataFrame} dictionary of info necessary to construct a voom plot with DE.plot_voom().

Return type:

DE