regress_out

Pseudobulk.regress_out(formula, /, *, categorical_columns=None, ordinal_columns=None, cell_types=None, excluded_cell_types=None, error_if_int=True, verbose=False, num_threads=None)

Regress out covariates from obs. Must be run after log_cpm().

To avoid confounding by library size or by the number of cells included in each pseudobulk, we strongly recommend adding `'+ log2(num_cells) + log2(library_size)'` to the `formula`, where `'library_size'` is a column of `obs` that can be added by running `library_size()` before this function.
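As a rough illustration of what regressing out these covariates means, the sketch below residualizes one gene's log-CPM values on an intercept plus log2(num_cells) and log2(library_size), using ordinary least squares in plain Python. It is a conceptual sketch only: the `residualize` helper and the sample values are hypothetical, and this page does not specify how regress_out() computes its result internally (for example, how the intercept is handled).

```python
import math

def residualize(y, covariates):
    """Residuals of y after an OLS fit on an intercept plus covariates.

    `covariates` is a list of columns, each the same length as `y`.
    """
    n = len(y)
    # Design matrix: intercept column followed by one column per covariate
    X = [[1.0] + [col[i] for col in covariates] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations: [X'X | X'y]
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))]
         for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [u - factor * v for u, v in zip(A[r], A[c])]
    beta = [A[a][p] / A[a][a] for a in range(p)]
    # Subtract the fitted values to obtain the residuals
    return [y[i] - sum(X[i][a] * beta[a] for a in range(p)) for i in range(n)]

# Hypothetical values for five pseudobulk samples of one cell type
num_cells = [50, 100, 200, 400, 800]
library_size = [1e5, 4e5, 2e5, 8e5, 1.6e6]
log_cpm = [3.1, 4.0, 3.9, 5.2, 5.8]  # one gene

log2_cells = [math.log2(v) for v in num_cells]
log2_lib = [math.log2(v) for v in library_size]
residuals = residualize(log_cpm, [log2_cells, log2_lib])
```

By construction, the residuals are orthogonal to the intercept and to both covariate columns, so no linear trend in either covariate remains in this gene's values.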

The design matrix is constructed via the model.matrix() R function. String, Categorical, and Enum columns of obs referenced in formula are converted to unordered factors, which by default are one-hot encoded into N - 1 columns of the design matrix, where N is the number of unique values (contr.treatment in R). Use the ordinal_columns argument to treat specific columns as ordered factors (i.e. ordinal variables) instead. Use the categorical_columns argument to treat specific integer columns as unordered factors (i.e. categorical variables).
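To make the N - 1 encoding concrete, here is a minimal pure-Python sketch of treatment coding in the style of contr.treatment. The `treatment_columns` helper and the example values are hypothetical. Levels are sorted alphabetically here, as R's factor() does by default for character vectors, which may differ from the level order the library passes to R. The intercept column that model.matrix() adds for a formula such as '~ x' is omitted.

```python
def treatment_columns(values):
    """Encode an unordered factor as N - 1 indicator columns."""
    levels = sorted(set(values))
    # The first level is the reference and gets no column of its own
    reference, others = levels[0], levels[1:]
    rows = [[1 if v == level else 0 for level in others] for v in values]
    return reference, others, rows

reference, column_levels, design = treatment_columns(
    ["control", "case", "control", "other"])
```

With three unique values, each sample is encoded in two columns, and samples at the reference level are all zeros.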

The way that ordered and unordered factors are encoded can also be changed globally. For example, to use Helmert contrasts for ordered factors:

from ryp import r
r('options(contrasts=c(unordered="contr.treatment", '
  '                    ordered="contr.helmert"))')

To view the current value of the contrasts option, use:

from ryp import r
r('getOption("contrasts")')

Parameters:
  • formula: str | dict[str, str]

    a string representation of an R formula specifying the design matrix to regress out in terms of columns of obs, e.g. '~ disease_status + age + sex'. Must begin with a tilde (~). The string is converted into an R formula object with R’s as.formula() function and then expanded into a design matrix with R’s model.matrix() function. May also be a dictionary mapping cell-type names to formulas, in which case every cell type in this Pseudobulk dataset must be present in the dictionary.

  • categorical_columns: str | Iterable[str] | None | dict[str, str | Iterable[str] | None]

    one or more names of integer columns of obs to treat as categorical (i.e. convert to unordered factors), or a dictionary mapping cell-type names to names of integer columns

  • ordinal_columns: str | Iterable[str] | None | dict[str, str | Iterable[str] | None]

    one or more names of integer, String, Categorical, or Enum columns of obs to treat as ordinal (i.e. convert to ordered factors), or a dictionary mapping cell-type names to names of such columns. By default, ordered factors are assumed to have equally spaced levels and are expanded into N - 1 columns in the design matrix, where each column represents a polynomial term of increasing degree (linear, quadratic, cubic, etc.) calculated from these equally spaced levels (contr.poly in R).

  • cell_types: str | Iterable[str] | None

    one or more cell types to regress the covariates out of; if None, regress covariates out of all cell types. Mutually exclusive with excluded_cell_types.

  • excluded_cell_types: str | Iterable[str] | None

    one or more cell types to exclude when regressing out covariates; mutually exclusive with cell_types

  • error_if_int: bool

    if True, raise an error if self.X.dtype is integer (indicating the user may not have run log_cpm() yet)

  • verbose: bool

    whether to print out details of the regressing-out process

  • num_threads: int | None

    the number of threads to use when regressing out. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. Does not affect the returned Pseudobulk dataset’s num_threads; this will always be the same as the original dataset’s num_threads.
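The polynomial encoding described under ordinal_columns can be sketched in pure Python. The hypothetical `poly_contrasts` helper below builds orthonormal polynomial columns over N equally spaced levels by Gram-Schmidt orthogonalization, which should reproduce the values of R's contr.poly for small N. It is an illustration of the encoding, not the code that the library or R runs.

```python
import math

def poly_contrasts(n_levels):
    """Return n_levels rows of n_levels - 1 orthonormal polynomial contrasts."""
    # Equally spaced, centered scores for the levels
    scores = [i - (n_levels - 1) / 2 for i in range(n_levels)]
    # Start from the normalized constant column; it is used only for
    # orthogonalization and is not part of the returned contrasts
    basis = [[1 / math.sqrt(n_levels)] * n_levels]
    for degree in range(1, n_levels):
        col = [s ** degree for s in scores]
        # Remove the components along all lower-degree columns
        for b in basis:
            proj = sum(c * v for c, v in zip(col, b))
            col = [c - proj * v for c, v in zip(col, b)]
        norm = math.sqrt(sum(c * c for c in col))
        basis.append([c / norm for c in col])
    contrasts = basis[1:]  # linear, quadratic, cubic, ...
    # Transpose so that each row corresponds to one level
    return [[column[i] for column in contrasts] for i in range(n_levels)]

rows = poly_contrasts(3)
```

For three levels, the linear column is approximately (-0.707, 0, 0.707) and the quadratic column is approximately (0.408, -0.816, 0.408).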

Returns:

A new Pseudobulk dataset with covariates regressed out.

Return type:

Pseudobulk