SingleCell
A single-cell dataset. Has slots for:
X: a scipy sparse array of counts per cell and gene
obs: a polars DataFrame of cell metadata
var: a polars DataFrame of gene metadata
obsm: a dictionary of NumPy arrays and polars DataFrames of cell metadata
varm: a dictionary of NumPy arrays and polars DataFrames of gene metadata
uns: a dictionary of scalars (strings, numbers or Booleans) or NumPy arrays, or nested dictionaries thereof
num_threads: the default number of threads to use for operations on the dataset that support multithreading (which can be overridden by individual functions)
as well as obs_names and var_names, aliases for obs[:, 0] and var[:, 0].
- class brisc.SingleCell(source=None, /, *, X=None, obs=None, var=None, obsm=None, varm=None, obsp=None, varp=None, uns=None, X_key=None, assay=None, obs_columns=None, var_columns=None, num_threads=-1)
Load a SingleCell dataset from a file, or create one from an in-memory AnnData object or count matrix + metadata.
SingleCell supports reading and writing files from each of the three major single-cell ecosystems:
scverse/Scanpy AnnData (.h5ad)
Seurat (.rds and .h5Seurat)
Bioconductor SingleCellExperiment (.rds)
as well as raw 10x data files (.h5 or .mtx/.mtx.gz).
By default, when an AnnData object, .h5ad file, .h5Seurat file, or .rds file contains both raw and normalized counts, only the raw counts will be loaded. To load normalized counts instead, use the X argument (for AnnData objects) or X_key argument (for files).
- Parameters:
source: str | Path | 'AnnData' | None
a filename or AnnData object, or None if specifying X, obs, and var instead. Supported file formats are scverse/Scanpy AnnData (.h5ad), Seurat (.rds and .h5Seurat), Bioconductor SingleCellExperiment (.rds), and raw 10x data files (.h5 or .mtx/.mtx.gz). If source is a 10x .mtx/.mtx.gz filename, barcodes.tsv/barcodes.tsv.gz and features.tsv/features.tsv.gz are assumed to be in the same directory (with the ungzipped versions used preferentially), unless custom paths to these files are specified via the obs and/or var arguments.
X: csr_array | csc_array | csr_matrix | csc_matrix | False | None
If source is None, the data as a sparse array or matrix (with rows = cells, columns = genes). If source is an AnnData object, an optional sparse array or matrix to use as X. By default, X will be loaded from source.layers['UMIs'] or source.raw.X if present and source.X otherwise. If X is None when source is None, or False when source is a filename, do not store any data in X and set it to None. This helps save memory, but the resulting dataset cannot be saved, converted to another format, or used to run analyses that require X.
obs: DataFrame | None
a polars DataFrame of metadata for each cell (row of X), or None if specifying source instead. Or, if source is a 10x .mtx/.mtx.gz filename, an optional filename for cell-level metadata, which is otherwise assumed to be at barcodes.tsv (or barcodes.tsv.gz) in the same directory as the .mtx/.mtx.gz file.
var: DataFrame | None
a polars DataFrame of metadata for each gene (column of X), or None if specifying source instead. Or, if source is a 10x .mtx/.mtx.gz filename, an optional filename for gene-level metadata, which is otherwise assumed to be at features.tsv (or features.tsv.gz) in the same directory as the .mtx/.mtx.gz file.
obsm: dict[str, ndarray | DataFrame] | False | None
an optional dictionary mapping string names to NumPy arrays and polars DataFrames of metadata for each cell, or False to skip loading obsm when reading .h5ad and .h5Seurat files
varm: dict[str, ndarray | DataFrame] | False | None
an optional dictionary mapping string names to NumPy arrays and polars DataFrames of metadata for each gene, or False to skip loading varm when reading .h5ad files
obsp: dict[str, csr_array | csc_array | csr_matrix | csc_matrix] | False | None
an optional dictionary mapping string names to sparse arrays or matrices containing pairwise cell-cell information like nearest-neighbors graphs, or False to skip loading obsp when reading .h5ad and .h5Seurat files
varp: dict[str, csr_array | csc_array | csr_matrix | csc_matrix] | False | None
an optional dictionary mapping string names to sparse arrays or matrices containing pairwise gene-gene information, or False to skip loading varp when reading .h5ad files
uns: UnsDict | False | None
an optional dictionary mapping string names to unstructured metadata - scalars (strings, numbers or Booleans), NumPy arrays, or nested dictionaries thereof - or False to skip loading uns when reading .h5ad or .h5Seurat files
X_key: str | None
if source is an AnnData .h5ad, Seurat .rds or .h5Seurat filename, or SingleCellExperiment .rds filename, the location within source to use as X:
If source is an .h5ad filename, the name of the key in the .h5ad file to use as X. If None, defaults to 'layers/UMIs' (i.e. self.layers['UMIs'] in Scanpy) or 'raw/X' (i.e. self.raw.X in Scanpy) if present, otherwise 'X'. Tip: SingleCell.ls(h5ad_file) shows the structure of an .h5ad file without loading it, allowing you to figure out which key to use as X.
If source is a Seurat .rds or .h5Seurat filename, the layer within the active assay (or the assay specified by the assay argument, if not None) to use as X. Set to 'data' to load the normalized counts, or 'scale.data' to load the normalized and scaled counts, if available. If None, defaults to 'counts'.
If source is a SingleCellExperiment .rds filename, the element within @assays@data to use as X. Set to 'logcounts' to load the normalized counts, if available. If None, defaults to 'counts'.
assay: str | None
if source is a Seurat .rds or .h5Seurat/.h5seurat filename, the name of the assay within the Seurat object to load data from. Defaults to the Seurat object's active.assay attribute (usually 'RNA').
obs_columns: str | Iterable[str]
if source is an .h5ad or .h5Seurat filename, the columns of obs to load. If not specified, load all columns. Specifying only a subset of columns can speed up reading. Not supported for .h5 files, since they only have a single obs column ('barcodes'), nor for Seurat and SingleCellExperiment .rds files, since .rds files do not support partial loading.
var_columns: str | Iterable[str]
if source is an .h5ad, .h5, or .h5Seurat filename, the columns of var to load. If not specified, load all columns. Specifying only a subset of columns can speed up reading. Not supported for Seurat and SingleCellExperiment .rds files, since the .rds file format does not support partial loading.
num_threads: int
the number of threads to use when reading .h5ad and .h5 files, and the default number of threads to use for all subsequent operations on this SingleCell dataset. Also sets the number of threads for this SingleCell dataset's count matrix, if present. By default (num_threads=-1), use all available cores, as determined by os.cpu_count().
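The default lookup order described for X_key on .h5ad files can be sketched in a few lines. This is only an illustration of the documented rule, not brisc's actual code: the helper name default_x_key is hypothetical, and giving 'layers/UMIs' priority over 'raw/X' when both exist is an assumption based on the order in which the two are listed.

```python
def default_x_key(available_keys):
    """Pick a default X_key for an .h5ad file (hypothetical helper).

    available_keys: the set of key paths present in the file, e.g. as
    read off the output of SingleCell.ls().
    """
    # Assumption: candidates are tried in the order the documentation
    # lists them, falling back to 'X' when neither is present.
    for candidate in ("layers/UMIs", "raw/X"):
        if candidate in available_keys:
            return candidate
    return "X"


print(default_x_key({"X", "obs", "var"}))                   # X
print(default_x_key({"X", "raw/X", "obs", "var"}))          # raw/X
print(default_x_key({"X", "raw/X", "layers/UMIs", "obs"}))  # layers/UMIs
```

Passing X_key explicitly bypasses this fallback entirely, which is the safer choice when a file stores counts under a non-standard key.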
Examples
Load an .h5ad file:
>>> sc = SingleCell('data.h5ad')
Load an .h5ad file where raw counts are stored in a non-default location, adata.raw.counts (use SingleCell.ls('data.h5ad') to inspect its structure before loading):
>>> sc = SingleCell('data.h5ad', X_key='raw/counts')
Load only selected metadata columns from an .h5ad file to reduce loading time and memory usage:
>>> sc = SingleCell('data.h5ad',
...                 obs_columns=['cell_type', 'batch'],
...                 var_columns=['gene_symbol'])
Skip loading the count matrix to minimize memory usage (note: dataset cannot be saved, converted, or used for analyses that require X):
>>> sc = SingleCell('large_data.h5ad', X=False)
Load a Seurat .h5Seurat file:
>>> sc = SingleCell('seurat_obj.h5Seurat')
Load a Seurat .rds file:
>>> sc = SingleCell('seurat_obj.rds')
Load a Bioconductor SingleCellExperiment .rds file and use log-normalized counts:
>>> sc = SingleCell('sce_obj.rds', X_key='logcounts')
Load raw 10x Genomics data from an .h5 file:
>>> sc = SingleCell('matrix.h5')
Load raw 10x Genomics data from an .mtx.gz file, with barcodes and features stored in the same directory under the usual names barcodes.tsv.gz and features.tsv.gz:
>>> sc = SingleCell('matrix.mtx.gz')
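The source parameter states that the barcodes and features files are looked up next to the .mtx/.mtx.gz file, with the ungzipped version used preferentially. A minimal sketch of that lookup rule, using only the standard library, might look like the following. The helper find_10x_sidecar is hypothetical and is not part of brisc; the real loader may behave differently in edge cases.

```python
import tempfile
from pathlib import Path


def find_10x_sidecar(mtx_path, stem):
    """Return the path of stem.tsv or stem.tsv.gz next to mtx_path.

    Hypothetical helper: prefers the ungzipped file, as the docs
    describe, and returns None if neither file exists.
    """
    directory = Path(mtx_path).parent
    for name in (f"{stem}.tsv", f"{stem}.tsv.gz"):
        candidate = directory / name
        if candidate.exists():
            return candidate
    return None


# Build a throwaway 10x-style directory of empty files to exercise the rule.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for name in ("matrix.mtx.gz", "barcodes.tsv.gz",
                 "features.tsv", "features.tsv.gz"):
        (tmp / name).touch()
    mtx = tmp / "matrix.mtx.gz"
    barcodes_name = find_10x_sidecar(mtx, "barcodes").name
    features_name = find_10x_sidecar(mtx, "features").name
    missing = find_10x_sidecar(mtx, "genes")

print(barcodes_name)  # barcodes.tsv.gz (only the gzipped file exists)
print(features_name)  # features.tsv (ungzipped version preferred)
print(missing)        # None
```

When the files live elsewhere or have other names, the obs and var arguments accept explicit paths instead.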
Manually create a SingleCell dataset from an in-memory sparse matrix and metadata:
>>> import polars as pl
>>> from scipy.sparse import csr_array
>>> X = csr_array([[1, 0, 3], [0, 2, 0]])
>>> obs = pl.DataFrame({'cell_id': ['cell1', 'cell2']})
>>> var = pl.DataFrame({'gene_id': ['g1', 'g2', 'g3']})
>>> sc = SingleCell(X=X, obs=obs, var=var)
Notes
To avoid substantial overhead, both ordered and unordered categorical columns of obs and var will be loaded as polars Enums rather than polars Categoricals.
SingleCell does not support dense matrices, which are highly memory-inefficient for single-cell data. Passing a NumPy array as the X argument will give an error: convert it to a sparse array first with csr_array(numpy_array). However, when loading from disk or converting from other formats, dense matrices will be automatically converted to sparse matrices.
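To make the memory argument concrete, here is a back-of-the-envelope comparison of dense and CSR storage. The dataset size, the 5% non-zero density, and the 4-byte values and indices are illustrative assumptions, not measurements; the function names are hypothetical.

```python
def dense_bytes(num_cells, num_genes, itemsize=4):
    """Bytes needed for a dense cells x genes matrix."""
    return num_cells * num_genes * itemsize


def csr_bytes(num_cells, num_nonzero, itemsize=4, index_size=4):
    """Bytes needed for the three CSR arrays.

    data and indices hold one entry per non-zero value; indptr holds
    num_cells + 1 row offsets.
    """
    return num_nonzero * (itemsize + index_size) + (num_cells + 1) * index_size


# Assumed example: 100,000 cells x 30,000 genes with 5% non-zero entries.
num_cells, num_genes = 100_000, 30_000
num_nonzero = num_cells * num_genes * 5 // 100

dense = dense_bytes(num_cells, num_genes)
sparse = csr_bytes(num_cells, num_nonzero)

print(f"dense: {dense / 1e9:.1f} GB")   # dense: 12.0 GB
print(f"CSR:   {sparse / 1e9:.1f} GB")  # CSR:   1.2 GB
```

Under these assumptions the sparse representation is roughly ten times smaller, and the gap widens as the fraction of non-zero entries falls.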
Reading and writing .rds files requires the ryp Python-R bridge. To create a SingleCell dataset from an in-memory Seurat or SingleCellExperiment object in the ryp R workspace, use from_seurat() or from_sce().
Reading and writing loom files is not supported because SingleCell only supports sparse count matrices, while loom only supports dense matrices. Using loom files for SingleCell data is not recommended due to this wastefulness. If necessary, loom files can be loaded with SingleCell(scanpy.read_loom(loom_filename)). scanpy.read_loom() implicitly converts the counts to a sparse matrix by default.
Analysis
Adds quality-control metrics to obs for each cell: the sum of counts across all genes (num_counts), the number of genes with non-zero expression (num_genes), and the fraction of counts that are mitochondrial (mito_fraction).
Adds a Boolean column to obs indicating which cells passed quality control (QC), or subsets to these cells if subset=True.
Skips QC, but allows the dataset to be used by downstream functions that require QCed data.
Find doublets using cxds (co-expression-based doublet scoring).
Get a DataFrame of sample-level covariates, i.e. the columns of obs that are the same for all cells within each sample.
Pseudobulk a SingleCell dataset with sample ID and cell type columns, yielding a Pseudobulk dataset.
Select highly variable genes using the same approach as Seurat.
Normalize this SingleCell dataset's counts.
Compute principal components (PCs) across cells.
Calculate the num_neighbors nearest neighbors of each cell.
Calculate the shared nearest neighbor graph of this dataset's cells.
Harmonize this SingleCell dataset with other datasets, or harmonize multiple batches of the same dataset, with Harmony2.
Cluster cells into cell types using Leiden clustering.
Transfer cell-type labels from another dataset to this one, using the two datasets' Harmony embeddings from harmonize().
Calculate a two-dimensional embedding of this SingleCell dataset with UMAP (Uniform Manifold Approximation and Projection), suitable for plotting with plot_embedding().
Calculate a two-dimensional embedding of this SingleCell dataset suitable for plotting with plot_embedding().
Calculate a two-dimensional embedding of this SingleCell dataset suitable for plotting with plot_embedding().
Find "marker genes" that distinguish each cell type from all other cell types.
Plot a heatmap of the count of each combination of two categorical columns, x and y.
Make a dot plot of a set of marker genes of interest across cell types.
Plot a UMAP embedding created with umap().
Plot a PaCMAP embedding created with pacmap().
Plot a LocalMAP embedding created with localmap().
Plot the specified 2D embedding.
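The three quality-control metrics named at the top of this section (num_counts, num_genes and mito_fraction) are simple per-cell summaries. The sketch below computes them on a toy count matrix to show what each one means. It is not brisc's implementation: it uses dense Python lists purely for readability, the qc_metrics function here is a stand-in, identifying mitochondrial genes by an 'MT-' name prefix is an assumption, and returning 0.0 for a cell with no counts is a choice made for this sketch.

```python
def qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Compute per-cell QC summaries for a toy count matrix.

    counts: one list per cell, each holding one count per gene.
    gene_names: gene names, in the same order as the columns of counts.
    """
    # Assumption: mitochondrial genes are those whose name starts with MT-.
    is_mito = [name.startswith(mito_prefix) for name in gene_names]
    metrics = []
    for row in counts:
        num_counts = sum(row)
        num_genes = sum(1 for count in row if count != 0)
        mito_counts = sum(count for count, mito in zip(row, is_mito) if mito)
        # Sketch-specific choice: avoid dividing by zero for empty cells.
        mito_fraction = mito_counts / num_counts if num_counts else 0.0
        metrics.append({
            "num_counts": num_counts,
            "num_genes": num_genes,
            "mito_fraction": mito_fraction,
        })
    return metrics


gene_names = ["GAPDH", "MT-CO1", "ACTB"]
counts = [[1, 0, 3],   # cell 1: no mitochondrial counts
          [0, 2, 0]]   # cell 2: all counts are mitochondrial
metrics = qc_metrics(counts, gene_names)

print(metrics[0])  # {'num_counts': 4, 'num_genes': 2, 'mito_fraction': 0.0}
print(metrics[1])  # {'num_counts': 2, 'num_genes': 1, 'mito_fraction': 1.0}
```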
I/O
Save this SingleCell dataset to a file.
Print the fields in an .h5ad file.
Load just obs from an .h5ad file as a polars DataFrame.
Load just var from an .h5ad file as a polars DataFrame.
Load just obsm from an .h5ad file as a dictionary of NumPy arrays or DataFrames.
Load just varm from an .h5ad file as a dictionary of NumPy arrays or DataFrames.
Load just obsp from an .h5ad file as a dictionary of sparse arrays.
Load just varp from an .h5ad file as a dictionary of sparse arrays.
Load just uns from an .h5ad file as a dictionary.
Converts this SingleCell dataset to an AnnData object, the representation used by Scanpy.
Create a SingleCell dataset from a Seurat object that has already been loaded into memory via the ryp Python-R bridge.
Convert this SingleCell dataset to a Seurat object in the R workspace of the ryp Python-R bridge.
Create a SingleCell dataset from a SingleCellExperiment object that has already been loaded into memory via the ryp Python-R bridge.
Convert this SingleCell dataset to a SingleCellExperiment object in the R workspace of the ryp Python-R bridge.
Properties
The count matrix, as a sparse array.
A polars DataFrame of metadata for each cell.
A polars DataFrame of metadata for each gene.
A dictionary of 2D NumPy arrays, where the length of each array's first dimension is the number of cells.
A dictionary of 2D NumPy arrays, where the length of each array's first dimension is the number of genes.
A dictionary of 2D sparse arrays, where the length and width of each array is the number of cells.
A dictionary of 2D sparse arrays, where the length and width of each array is the number of genes.
A dictionary of miscellaneous metadata.
A shortcut to access the first column of obs.
A shortcut to access the first column of var.
The default number of threads used for this SingleCell dataset's operations.
A length-2 tuple where the first element is the number of cells, and the second is the number of genes.
Data access
Get the row of X corresponding to a single cell, based on the cell's name in obs_names.
Get the column of X corresponding to a single gene, based on the gene's name in var_names.
Print a row of obs (the first row, by default) with each column on its own line.
Print a row of var (the first row, by default) with each column on its own line.
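Looking up a cell's counts by name, as described at the start of this section, amounts to two steps: translate the name into a row index via obs_names, then read that row out of the sparse matrix. The following pure-Python sketch shows the idea on hand-written CSR arrays. It is an illustration only; the function names are hypothetical, and brisc itself operates on scipy sparse arrays and polars columns rather than Python lists.

```python
def csr_row(data, indices, indptr, num_genes, row):
    """Expand one row of a CSR matrix into a dense list of counts."""
    dense = [0] * num_genes
    # indptr[row]:indptr[row + 1] delimits this row's non-zero entries.
    for k in range(indptr[row], indptr[row + 1]):
        dense[indices[k]] = data[k]
    return dense


def cell_counts(cell_name, obs_names, data, indices, indptr, num_genes):
    """Return the counts for the cell called cell_name (hypothetical helper)."""
    row = obs_names.index(cell_name)  # raises ValueError for unknown names
    return csr_row(data, indices, indptr, num_genes, row)


# CSR representation of [[1, 0, 3], [0, 2, 0]] (rows = cells, columns = genes).
data = [1, 3, 2]
indices = [0, 2, 1]
indptr = [0, 2, 3]
obs_names = ["cell1", "cell2"]

print(cell_counts("cell1", obs_names, data, indices, indptr, 3))  # [1, 0, 3]
print(cell_counts("cell2", obs_names, data, indices, indptr, 3))  # [0, 2, 0]
```

Row access like this is cheap in CSR format, whereas reading a single gene (a column) is cheaper when the matrix is stored in CSC format.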
Manipulation
Sets a column as the new first column of obs, i.e. the obs_names.
Sets a column as the new first column of var, i.e. the var_names.
Return a new SingleCell dataset with a different default number of threads.
Make obs_names unique.
Make var_names unique.
Equivalent to
Equivalent to
Equivalent to
Equivalent to
Subsets obsm to the specified key(s).
Subsets varm to the specified key(s).
Subsets obsp to the specified key(s).
Subsets varp to the specified key(s).
Subsets uns to the specified key(s).
Equivalent to
Equivalent to
Adds one or more keys to obsm, overwriting existing keys with the same names if present.
Adds one or more keys to varm, overwriting existing keys with the same names if present.
Adds one or more keys to obsp, overwriting existing keys with the same names if present.
Adds one or more keys to varp, overwriting existing keys with the same names if present.
Adds one or more keys to uns, overwriting existing keys with the same names if present.
Create a new SingleCell dataset with X removed, to reduce memory use.
Create a new SingleCell dataset with columns and more_columns removed from obs.
Create a new SingleCell dataset with columns and more_columns removed from var.
Create a new SingleCell dataset with keys and more_keys removed from obsm.
Create a new SingleCell dataset with keys and more_keys removed from varm.
Create a new SingleCell dataset with keys and more_keys removed from obsp.
Create a new SingleCell dataset with keys and more_keys removed from varp.
Create a new SingleCell dataset with keys and more_keys removed from uns.
Create a new SingleCell dataset with column(s) of obs renamed.
Create a new SingleCell dataset with column(s) of var renamed.
Create a new SingleCell dataset with key(s) of obsm renamed.
Create a new SingleCell dataset with key(s) of varm renamed.
Create a new SingleCell dataset with key(s) of obsp renamed.
Create a new SingleCell dataset with key(s) of varp renamed.
Create a new SingleCell dataset with key(s) of uns renamed.
Cast X to the specified data type.
Cast column(s) of obs to the specified data type(s).
Cast column(s) of var to the specified data type(s).
Left-join obs with another DataFrame.
Left-join var with another DataFrame.
Subsample a specific number or fraction of cells.
Subsample a specific number or fraction of genes.
Make a copy of this SingleCell dataset, converting X to a csr_array.
Make a copy of this SingleCell dataset, converting X to a csc_array.
Make a copy of this SingleCell dataset.
Concatenate one or more other SingleCell datasets with this one, cell-wise.
Concatenate one or more other SingleCell datasets with this one, gene-wise.
The opposite of concat_obs(): splits a SingleCell dataset into a dictionary of SingleCell datasets, one per unique value of a column of obs.
The opposite of concat_var(): splits a SingleCell dataset into a dictionary of SingleCell datasets, one per unique value of a column of var.
Apply a function to a SingleCell dataset.
Apply a function to a SingleCell dataset's X.
Apply a function to a SingleCell dataset's obs.
Apply a function to a SingleCell dataset's var.
Apply a function to a SingleCell dataset's obsm.
Apply a function to a specific key in a SingleCell dataset's obsm.
Apply a function to a SingleCell dataset's varm.
Apply a function to a specific key in a SingleCell dataset's varm.
Apply a function to a SingleCell dataset's obsp.
Apply a function to a specific key in a SingleCell dataset's obsp.
Apply a function to a SingleCell dataset's varp.
Apply a function to a specific key in a SingleCell dataset's varp.
Apply a function to a SingleCell dataset's uns.
Apply a function to a specific key in a SingleCell dataset's uns.
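Splitting a dataset by the unique values of an obs column, described above as the opposite of concat_obs(), boils down to grouping row indices by value and then subsetting X and obs with each group. A minimal sketch of the grouping step, using only the standard library, is shown below. The helper split_rows_by is hypothetical and only illustrates the idea; it says nothing about how brisc orders or stores the resulting datasets.

```python
from collections import defaultdict


def split_rows_by(values):
    """Group row indices by the value found in a metadata column.

    Returns a dictionary mapping each unique value to the list of row
    indices (in original order) that carry it.
    """
    groups = defaultdict(list)
    for index, value in enumerate(values):
        groups[value].append(index)
    return dict(groups)


# A toy cell-type column for four cells.
cell_type = ["T", "B", "T", "NK"]
groups = split_rows_by(cell_type)
print(groups)  # {'T': [0, 2], 'B': [1], 'NK': [3]}

# Every cell lands in exactly one group, so concatenating the groups
# recovers all of the original rows (though not in their original order).
restored = sorted(index for rows in groups.values() for index in rows)
print(restored)  # [0, 1, 2, 3]
```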