hvg

SingleCell.hvg(*others, QC_column='passed_QC', batch_column=None, num_genes=2000, min_cells=3, exclude=None, flavor='seurat_v3', span=0.3, hvg_column='highly_variable', rank_column='highly_variable_rank', overwrite=False, verbose=False, num_threads=None)

Select highly variable genes using the same approach as Seurat.

Operates on raw counts, so must be run before normalize() (but after qc()). When run with multiple datasets, only considers genes present in every dataset.

By default, uses the same approach as Seurat’s FindVariableFeatures() function, and Scanpy’s highly_variable_genes() function with the flavor argument set to the non-default value ‘seurat_v3’.

The general idea is that since genes with higher mean expression tend to have higher variance in expression (because they have more non-zero values), we want to select genes that have a high variance relative to their mean expression. Otherwise, we’d only be picking highly expressed genes! To correct for the mean-variance relationship, we fit a LOESS curve to the mean-variance trend.
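The idea of ranking genes by variance relative to their mean can be sketched in plain Python. This is a simplified illustration, not this function's implementation: it swaps the LOESS fit for an ordinary least-squares line in log10 space, skips the other steps of the Seurat approach, and uses made-up gene names and counts. It assumes every gene has a positive mean and variance.

```python
import math

def mean_and_variance(counts):
    """Return the mean and sample variance of one gene's counts across cells."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return mean, var

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def rank_by_relative_variance(counts_by_gene):
    """Rank genes by observed variance divided by the variance the trend predicts."""
    stats = {g: mean_and_variance(c) for g, c in counts_by_gene.items()}
    log_means = [math.log10(m) for m, _ in stats.values()]
    log_vars = [math.log10(v) for _, v in stats.values()]
    # Stand-in for the LOESS curve: a straight line through the log-log trend.
    slope, intercept = fit_line(log_means, log_vars)
    ratio = {}
    for g, (m, v) in stats.items():
        expected = 10 ** (slope * math.log10(m) + intercept)
        ratio[g] = v / expected
    return sorted(ratio, key=ratio.get, reverse=True)

# Hypothetical raw counts for four genes across four cells.
counts_by_gene = {
    "A": [0, 1, 1, 2],
    "B": [7, 9, 11, 13],
    "C": [85, 95, 105, 115],
    "D": [0, 5, 15, 20],
}

raw_variance = {g: mean_and_variance(c)[1] for g, c in counts_by_gene.items()}
most_variable_raw = max(raw_variance, key=raw_variance.get)
ranking = rank_by_relative_variance(counts_by_gene)

print(most_variable_raw)  # C: the most highly expressed gene has the largest raw variance
print(ranking[0])         # D: the gene whose variance is largest relative to the trend
```

Ranking by raw variance would pick the highly expressed gene C, whereas ranking against the fitted trend picks D, whose variance is unusually high for its mean.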

Parameters:

others: SingleCell
optional SingleCell datasets to jointly compute highly variable genes across, alongside this one. Each dataset will be treated as a separate batch. If batch_column is not None, each dataset AND each distinct value of batch_column within each dataset will be treated as a separate batch. Variances will be computed per batch and then aggregated (see flavor) across batches.
QC_column: SingleCellColumn | None | Sequence[SingleCellColumn | None]
an optional Boolean column of obs indicating which cells passed QC. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Cells failing QC will be ignored. Set to None to include all cells. When others is specified, QC_column can be a sequence of length 1 + len(others) containing a column, expression, Series, function, or None for each dataset (for self, followed by each dataset in others).
batch_column: SingleCellColumn | None | Sequence[SingleCellColumn | None]
an optional String, Enum, Categorical, or integer column of obs indicating which batch each cell is from. Can be a column name, a polars expression, a polars Series, a 1D NumPy array, or a function that takes in this SingleCell dataset and returns a polars Series or 1D NumPy array. Each batch will be treated as if it were a distinct dataset; this is equivalent to splitting the dataset with split_by(batch_column) and then passing each of the resulting datasets to hvg(), with one exception: the min_cells filter will always be calculated per-dataset rather than per-batch. Variances will be computed per batch and then aggregated (see flavor) across batches. Set to None to treat each dataset as having a single batch. When others is specified, batch_column can be a sequence of length 1 + len(others) containing a column, expression, Series, function, or None for each dataset (for self, followed by each dataset in others).
num_genes: int
the number of highly variable genes to select. The default of 2000 matches Seurat and Scanpy’s recommended value. Fewer than num_genes genes will be selected if not enough genes have non-zero count in ≥ min_cells cells (or when min_cells is None, if not enough genes are present).
min_cells: int | None
if not None, filter to genes detected (with non-zero count) in ≥ this many cells in every dataset, before calculating highly variable genes. The default value of 3 matches Seurat and Scanpy’s recommended value. Note that genes with zero variance in any dataset will always be filtered out, even if min_cells is 0.
exclude: str | Iterable[str] | None
one or more optional case-insensitive regular expressions matching genes to exclude from the highly variable gene calculation. For instance, to exclude mitochondrial genes (starting with ‘MT-’) and ribosomal genes (starting with ‘RPL’, ‘RPS’, ‘MRPL’, or ‘MRPS’), specify exclude=('^MT-', '^RPL', '^RPS', '^MRPL', '^MRPS').
flavor: Literal['seurat_v3', 'seurat_v3_paper']
the highly variable gene algorithm to use. Must be either ‘seurat_v3’ or ‘seurat_v3_paper’, both of which match the algorithms with the same name in Scanpy. Both algorithms select genes based on two criteria: 1) which genes are ranked as most variable (taking the median of the ranks across batches where the gene is among the top num_genes highly variable genes) and 2) the number of batches in which a gene is ranked among the top num_genes in variability. ‘seurat_v3’ ranks genes by 1) and uses 2) to tiebreak, whereas ‘seurat_v3_paper’ ranks genes by 2) and uses 1) to tiebreak. When there is only one batch, both algorithms are the same and only rank based on 1).
span: int | float
the span of the LOESS fit; higher values will lead to more smoothing
hvg_column: str
the name of a Boolean column to be added to (each dataset’s) var indicating the highly variable genes
rank_column: str
the name of an integer column to be added to (each dataset’s) var with the rank of each highly variable gene’s variance (1 = highest variance, 2 = next-highest, etc.); will be null for non-highly variable genes. In the very unlikely event of ties, the gene that appears first in var will get the lowest rank.
overwrite: bool
if True, overwrite hvg_column and/or rank_column if already present in var, instead of raising an error
verbose: bool
whether to print the number of genes present in every dataset, when jointly computing highly variable genes across multiple datasets
num_threads: int | None
the number of threads to use when finding highly variable genes. Set num_threads=-1 to use all available cores, as determined by os.cpu_count(), or leave unset to use self.num_threads cores. Does not affect the num_threads of the returned SingleCell dataset(s); this will always be the same as the num_threads of the original dataset(s).
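As a rough illustration of what the example patterns for exclude would match, the sketch below applies them with Python's re module to a hypothetical list of gene names. That this function evaluates the patterns with the same semantics as re.search with re.IGNORECASE is an assumption; the source only states that they are case-insensitive regular expressions.

```python
import re

# The patterns from the exclude example above.
patterns = ("^MT-", "^RPL", "^RPS", "^MRPL", "^MRPS")

# Combine the patterns into one case-insensitive regular expression.
combined = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)

# Hypothetical gene names, chosen to show matches and near-misses.
genes = ["MT-CO1", "MTOR", "RPL13", "RPS6", "RPS6KA1", "mrpl12", "MRPS18A", "ACTB"]

kept = [g for g in genes if not combined.search(g)]
print(kept)  # ['MTOR', 'ACTB']
```

MTOR is kept because ‘^MT-’ requires the hyphen. RPS6KA1, which encodes a kinase rather than a ribosomal protein, is excluded by the prefix pattern ‘^RPS’, so stricter patterns may be preferable when such genes matter.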

Returns:

A new SingleCell dataset where var contains two additional columns: a Boolean column, hvg_column (default: ‘highly_variable’), indicating the num_genes most highly variable genes, and an integer column, rank_column (default: ‘highly_variable_rank’), giving the (one-based) rank of each highly variable gene’s variance, with null values for non-highly variable genes. Or, if additional SingleCell dataset(s) are specified via the others argument, a tuple of length 1 + len(others) containing SingleCell datasets with these two columns added: self, followed by each dataset in others.

Return type:

SingleCell | tuple[SingleCell, …]

Note

This function may give an incorrect output if the count matrix contains explicit zeros (i.e. if (sc.X.data == 0).any()): this is not checked for, due to speed considerations. In the unlikely event that your dataset contains explicit zeros, remove them by running sc.X.eliminate_zeros() (an in-place operation) first.
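To make "explicit zeros" concrete, here is a pure-Python stand-in for the CSR layout (data, indices, indptr) used by SciPy sparse matrices. The helper eliminate_zeros below is a hypothetical re-implementation that returns new lists rather than modifying anything in place. Counting stored entries per gene is only one plausible way in which stored zeros could distort a result; the source does not say which step is affected.

```python
def eliminate_zeros(data, indices, indptr):
    """Return new CSR arrays with explicitly stored zeros dropped."""
    new_data, new_indices, new_indptr = [], [], [0]
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            if data[k] != 0:
                new_data.append(data[k])
                new_indices.append(indices[k])
        new_indptr.append(len(new_data))
    return new_data, new_indices, new_indptr

def stored_entries_per_column(indices, num_columns):
    """Count how many entries are stored for each column (gene)."""
    counts = [0] * num_columns
    for j in indices:
        counts[j] += 1
    return counts

# Toy 3-cell x 2-gene count matrix with two explicitly stored zeros:
#   cell 0: gene 0 = 2, gene 1 = 0 (stored)
#   cell 1: gene 0 = 0 (stored), gene 1 = 3
#   cell 2: gene 0 = 1
data = [2, 0, 0, 3, 1]
indices = [0, 1, 0, 1, 0]
indptr = [0, 2, 4, 5]

has_explicit_zeros = any(v == 0 for v in data)
stored_before = stored_entries_per_column(indices, 2)

clean_data, clean_indices, clean_indptr = eliminate_zeros(data, indices, indptr)
stored_after = stored_entries_per_column(clean_indices, 2)

print(has_explicit_zeros)  # True
print(stored_before)       # [3, 2]
print(stored_after)        # [2, 1]
```

Before cleaning, the number of stored entries overstates how many cells have a non-zero count for each gene; after cleaning, the two agree.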

Note

This function may give an incorrect output if the count matrix contains negative values: this is not checked for, due to speed considerations.

Note

This function may not give identical results to Seurat and Scanpy. It avoids floating-point summation, which makes it more numerically stable than Seurat’s and Scanpy’s calculations. If multiple genes are tied as the num_genes-th most highly variable gene in a batch or dataset, this function includes all of them, whereas Seurat and Scanpy arbitrarily pick one (or a subset) of them. Also, when selecting the final list of highly variable genes, this function breaks ties using the ordering from a stable sort, instead of the unstable sort used by Seurat and Scanpy.
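The two ranking criteria described under flavor, together with stable-sort tie-breaking, can be sketched as follows. This is a minimal illustration under stated assumptions, not this function's implementation: select_hvgs and the per-batch ranks are hypothetical, None marks a batch where the gene is not among the top num_genes, and the inclusion of tied genes described above is not modelled.

```python
from statistics import median

def select_hvgs(ranks_by_gene, num_genes, flavor="seurat_v3"):
    """Pick genes from per-batch ranks (None = not in that batch's top num_genes)."""
    summary = {}
    for gene, ranks in ranks_by_gene.items():
        top_ranks = [r for r in ranks if r is not None]
        if top_ranks:
            # Criterion 1: median rank over batches where the gene is in the top.
            # Criterion 2: number of batches where the gene is in the top.
            summary[gene] = (median(top_ranks), len(top_ranks))

    def seurat_v3_key(gene):
        median_rank, num_batches = summary[gene]
        return (median_rank, -num_batches)  # rank by 1), tiebreak by 2)

    def seurat_v3_paper_key(gene):
        median_rank, num_batches = summary[gene]
        return (-num_batches, median_rank)  # rank by 2), tiebreak by 1)

    if flavor == "seurat_v3":
        key = seurat_v3_key
    elif flavor == "seurat_v3_paper":
        key = seurat_v3_paper_key
    else:
        raise ValueError(f"unknown flavor: {flavor!r}")

    # sorted() is stable, so genes that remain tied keep their input order.
    return sorted(summary, key=key)[:num_genes]

# Hypothetical ranks of four genes in three batches, with num_genes=3.
ranks = {
    "GeneA": [1, None, None],
    "GeneB": [2, 1, 2],
    "GeneC": [3, 2, 3],
    "GeneD": [None, 3, 1],
}

print(select_hvgs(ranks, 3, "seurat_v3"))        # ['GeneA', 'GeneB', 'GeneD']
print(select_hvgs(ranks, 3, "seurat_v3_paper"))  # ['GeneB', 'GeneC', 'GeneD']
```

In this toy example, GeneA is highly ranked in only one batch: ‘seurat_v3’ selects it on the strength of its median rank, while ‘seurat_v3_paper’ prefers GeneC, which is in the top three in every batch.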