library_size

Pseudobulk.library_size(*, library_size_column='library_size', cell_types=None, excluded_cell_types=None, logratio_trim=0.3, sum_trim=0.05, A_cutoff=-10000000000.0, allow_float=False, overwrite=False, num_threads=None)

Calculate normalization factor-adjusted library sizes for each sample in each cell type, via the approach of edgeR’s calcNormFactors().

Uses the same method as calcNormFactors() with the default method="TMM". However, results differ from edgeR because of a floating-point bug in edgeR’s calcNormFactors() implementation.

When calculating logR, the log2 ratio of count / library_size for a gene between a particular sample and a “reference” sample, the numerator and denominator of the ratio each involve a division by their sample’s library size. In principle, these divisions are equivalent to multiplying every gene’s count ratio by the same constant, namely the ratio of the two samples’ library sizes. In practice, even if two genes have the same count ratio between the two samples, they may have slightly different count / library_size ratios due to floating-point roundoff, so the genes are erroneously assigned different logR ranks instead of being treated as tied.

Our implementation fixes this bug by changing the order of operations so that the library size ratio is calculated first, then multiplied by the count ratio. Because this bug affects which genes are included in the trimmed mean, its impact can be relatively large, sometimes leading to a >1% error in edgeR’s estimated library size relative to our correct implementation.
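The effect of the order of operations can be illustrated with a minimal sketch. This is not the library's code: the function names, counts, and library sizes below are hypothetical, chosen so that every gene has exactly the same count ratio between the two samples. Computing the count ratio first yields bitwise-identical logR values for such genes, whereas dividing each count by its library size first may or may not, depending on roundoff.

```python
import math

def log_ratios_divide_first(obs, ref, lib_obs, lib_ref):
    # Normalize each count by its own library size, then take the ratio.
    return [math.log2((o / lib_obs) / (r / lib_ref)) for o, r in zip(obs, ref)]

def log_ratios_ratio_first(obs, ref, lib_obs, lib_ref):
    # Compute the library size ratio once, then scale each count ratio by it.
    lib_ratio = lib_ref / lib_obs
    return [math.log2((o / r) * lib_ratio) for o, r in zip(obs, ref)]

# Hypothetical counts: every gene has exactly 3x the reference count.
ref_counts = list(range(1, 201))
obs_counts = [3 * r for r in ref_counts]
lib_obs, lib_ref = 1_234_567, 7_654_321  # arbitrary illustrative library sizes

divide_first = log_ratios_divide_first(obs_counts, ref_counts, lib_obs, lib_ref)
ratio_first = log_ratios_ratio_first(obs_counts, ref_counts, lib_obs, lib_ref)

# Genes with equal count ratios should tie; count the distinct values produced.
print("distinct logR values, divide first:", len(set(divide_first)))
print("distinct logR values, ratio first: ", len(set(ratio_first)))
```

Because floating-point division is correctly rounded, equal count ratios always produce the same float, so the ratio-first version is guaranteed to report a single distinct value here.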

Does not support the lib.size and refColumn arguments to calcNormFactors(); both are treated as NULL (the default), so library sizes and the reference sample are always calculated internally. The doWeighting argument is also not supported and is assumed to be TRUE (the default), so asymptotic binomial precision weights will always be used.
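As a rough illustration of what precision weighting means, the sketch below computes a weighted mean of logR values, using the inverse of an asymptotic variance estimate as each gene's weight. The variance formula follows edgeR's TMM code as commonly described; the function name is hypothetical, trimming is omitted for brevity, and whether this library computes the weights in exactly this way is an assumption.

```python
import math

def weighted_mean_log_ratio(obs, ref, lib_obs, lib_ref):
    """Precision-weighted mean of log2 ratios (no trimming; illustrative only)."""
    lib_ratio = lib_ref / lib_obs  # library size ratio, computed once
    numerator = 0.0
    denominator = 0.0
    for o, r in zip(obs, ref):
        if o == 0 or r == 0:
            continue  # the log ratio is undefined for zero counts
        log_r = math.log2((o / r) * lib_ratio)
        # Asymptotic variance estimate; low-count genes get a large variance.
        variance = (lib_obs - o) / lib_obs / o + (lib_ref - r) / lib_ref / r
        numerator += log_r / variance
        denominator += 1.0 / variance
    if denominator == 0.0:
        return 0.0  # no usable genes: fall back to no adjustment
    return numerator / denominator

# A low-count gene with a 2-fold difference barely moves the mean,
# because the high-count gene carries almost all of the weight.
print(weighted_mean_log_ratio([1, 1000], [2, 1000], 10_000, 10_000))
```

The normalization factor would then be 2 raised to this weighted mean, computed over the genes that survive trimming.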

Parameters:
  • library_size_column: str

    the name of a floating-point column to add to obs containing each sample’s library size

  • cell_types: str | Iterable[str] | None

    one or more cell types to calculate library sizes for; if None, calculate library sizes for all cell types. Mutually exclusive with excluded_cell_types.

  • excluded_cell_types: str | Iterable[str] | None

    one or more cell types to exclude when calculating library sizes; mutually exclusive with cell_types

  • logratio_trim: int | float

    the amount of trim to use on log-ratios (“M” values); must be greater than 0 and less than 1

  • sum_trim: int | float

    the amount of trim to use on the combined absolute levels (“A” values); must be greater than 0 and less than 1

  • A_cutoff: int | float

    the cutoff on “A” values to use before trimming

  • allow_float: bool

    if False, raise an error if self.X.dtype is floating-point (suggesting the user may not be using the raw counts); if True, disable this sanity check

  • overwrite: bool

    if True, overwrite library_size_column if already present in obs, instead of raising an error

  • num_threads: int | None

    the number of threads to use when calculating library sizes. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. Does not affect the returned Pseudobulk dataset’s num_threads; this will always be the same as the original dataset’s num_threads.
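The roles of logratio_trim and sum_trim can be sketched as follows, assuming edgeR's rank-based convention: a gene is kept only if its rank among the “M” values and its rank among the “A” values both fall inside the untrimmed middle range, with tied values sharing an average rank. The helper names are hypothetical, and this library's implementation may differ in detail.

```python
import math

def average_ranks(values):
    # 1-based ranks; tied values share the mean of the ranks they span.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def trim_mask(values, trim):
    # Keep values whose rank lies in [low, high] after trimming both tails.
    n = len(values)
    low = math.floor(n * trim) + 1
    high = n + 1 - low
    return [low <= rank <= high for rank in average_ranks(values)]

def tmm_keep(log_ratios, abs_levels, logratio_trim=0.3, sum_trim=0.05):
    # A gene survives only if it passes both the "M" and the "A" trim.
    keep_m = trim_mask(log_ratios, logratio_trim)
    keep_a = trim_mask(abs_levels, sum_trim)
    return [m and a for m, a in zip(keep_m, keep_a)]

# With 20 distinct values and a trim of 0.25, ranks 6 through 15 are kept.
print(sum(trim_mask(list(range(20)), 0.25)))
```

Because the mask depends on ranks, values that should tie but differ by roundoff receive different ranks, which can change which genes fall inside the kept range.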

Returns:

A new Pseudobulk dataset where obs[library_size_column] contains the norm factor-corrected library sizes for each cell type: raw library sizes (column sums) times norm factors.
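The final step, raw library sizes times normalization factors, can be sketched in a few lines. The function name and inputs are hypothetical. The rescaling of the factors so that their geometric mean is 1 follows edgeR's calcNormFactors(); that this library applies the same rescaling is an assumption.

```python
import math

def effective_library_sizes(counts_by_sample, norm_factors):
    """Raw library size times normalization factor, per sample.

    counts_by_sample: one list of gene counts per sample (each list is one
    sample's column in a genes x samples layout).
    norm_factors: one unscaled normalization factor per sample.
    """
    # Rescale the factors so that their geometric mean is 1 (edgeR convention).
    log_mean = sum(math.log(f) for f in norm_factors) / len(norm_factors)
    scaled = [f / math.exp(log_mean) for f in norm_factors]
    # Raw library size: the sum of each sample's counts.
    raw = [sum(counts) for counts in counts_by_sample]
    return [size * factor for size, factor in zip(raw, scaled)]

# Two hypothetical samples with raw library sizes of 100 and 50.
print(effective_library_sizes([[10, 20, 70], [5, 5, 40]], [4.0, 1.0]))
```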

Return type:

Pseudobulk