SingleCell

A single-cell dataset. Has slots for:

  • X: a scipy sparse array of counts per cell and gene

  • obs: a polars DataFrame of cell metadata

  • var: a polars DataFrame of gene metadata

  • obsm: a dictionary of NumPy arrays and polars DataFrames of cell metadata

  • varm: a dictionary of NumPy arrays and polars DataFrames of gene metadata

  • uns: a dictionary of scalars (strings, numbers or Booleans) or NumPy arrays, or nested dictionaries thereof

  • num_threads: the default number of threads to use for operations on the dataset that support multithreading (which can be overridden by individual functions)

as well as obs_names and var_names, aliases for obs[:, 0] and var[:, 0].

class brisc.SingleCell(source=None, /, *, X=None, obs=None, var=None, obsm=None, varm=None, obsp=None, varp=None, uns=None, X_key=None, assay=None, obs_columns=None, var_columns=None, num_threads=-1)

Load a SingleCell dataset from a file, or create one from an in-memory AnnData object or count matrix + metadata.

SingleCell supports reading and writing files from each of the three major single-cell ecosystems:

  • scverse/Scanpy AnnData (.h5ad)

  • Seurat (.rds and .h5Seurat)

  • Bioconductor SingleCellExperiment (.rds)

as well as raw 10x data files (.h5 or .mtx/.mtx.gz).

By default, when an AnnData object, .h5ad file, .h5Seurat file, or .rds file contains both raw and normalized counts, only the raw counts will be loaded. To load normalized counts instead, use the X argument (for AnnData objects) or X_key argument (for files).

Parameters:
  • source: str | Path | 'AnnData' | None

    a filename or AnnData object, or None if specifying X, obs, and var instead. Supported file formats are scverse/Scanpy AnnData (.h5ad), Seurat (.rds and .h5Seurat), Bioconductor SingleCellExperiment (.rds), and raw 10x data files (.h5 or .mtx/.mtx.gz). If source is a 10x .mtx/.mtx.gz filename, barcodes.tsv/barcodes.tsv.gz and features.tsv/features.tsv.gz are assumed to be in the same directory (with the ungzipped versions used preferentially), unless custom paths to these files are specified via the obs and/or var arguments.

  • X: csr_array | csc_array | csr_matrix | csc_matrix | False | None

    If source is None, the data as a sparse array or matrix (with rows = cells, columns = genes). If source is an AnnData object, an optional sparse array or matrix to use as X. By default, X will be loaded from source.layers['UMIs'] or source.raw.X if present and source.X otherwise. If X is None when source is None, or False when source is a filename, do not store any data in X and set it to None. This helps save memory, but the resulting dataset cannot be saved, converted to another format, or used to run analyses that require X.

  • obs: DataFrame | None

    a polars DataFrame of metadata for each cell (row of X), or None if specifying source instead. Or, if source is a 10x .mtx/.mtx.gz filename, an optional filename for cell-level metadata, which is otherwise assumed to be at barcodes.tsv (or barcodes.tsv.gz) in the same directory as the .mtx/.mtx.gz file.

  • var: DataFrame | None

    a polars DataFrame of metadata for each gene (column of X), or None if specifying source instead. Or, if source is a 10x .mtx/.mtx.gz filename, an optional filename for gene-level metadata, which is otherwise assumed to be at features.tsv (or features.tsv.gz) in the same directory as the .mtx/.mtx.gz file.

  • obsm: dict[str, ndarray | DataFrame] | False | None

    an optional dictionary mapping string names to NumPy arrays and polars DataFrames of metadata for each cell, or False to skip loading obsm when reading .h5ad and .h5Seurat files

  • varm: dict[str, ndarray | DataFrame] | False | None

    an optional dictionary mapping string names to NumPy arrays and polars DataFrames of metadata for each gene, or False to skip loading varm when reading .h5ad files

  • obsp: dict[str, csr_array | csc_array | csr_matrix | csc_matrix] | False | None

    an optional dictionary mapping string names to sparse arrays or matrices containing pairwise cell-cell information like nearest-neighbors graphs, or False to skip loading obsp when reading .h5ad and .h5Seurat files

  • varp: dict[str, csr_array | csc_array | csr_matrix | csc_matrix] | False | None

    an optional dictionary mapping string names to sparse arrays or matrices containing pairwise gene-gene information, or False to skip loading varp when reading .h5ad files

  • uns: UnsDict | False | None

    an optional dictionary mapping string names to unstructured metadata - scalars (strings, numbers or Booleans), NumPy arrays, or nested dictionaries thereof - or False to skip loading uns when reading .h5ad or .h5Seurat files

  • X_key: str | None

    if source is an AnnData .h5ad, Seurat .rds or .h5Seurat filename, or SingleCellExperiment .rds filename, the location within source to use as X:

    • If source is an .h5ad filename, the name of the key in the .h5ad file to use as X. If None, defaults to 'layers/UMIs' (i.e. self.layers['UMIs'] in Scanpy) or 'raw/X' (i.e. self.raw.X in Scanpy) if present, otherwise 'X'. Tip: SingleCell.ls(h5ad_file) shows the structure of an .h5ad file without loading it, allowing you to figure out which key to use as X.

    • If source is a Seurat .rds or .h5Seurat filename, the layer within the active assay (or the assay specified by the assay argument, if not None) to use as X. Set to 'data' to load the normalized counts, or 'scale.data' to load the normalized and scaled counts, if available. If None, defaults to 'counts'.

    • If source is a SingleCellExperiment .rds filename, the element within @assays@data to use as X. Set to 'logcounts' to load the normalized counts, if available. If None, defaults to 'counts'.

  • assay: str | None

    if source is a Seurat .rds or .h5Seurat/.h5seurat filename, the name of the assay within the Seurat object to load data from. Defaults to the Seurat object's active.assay attribute (usually 'RNA').

  • obs_columns: str | Iterable[str]

    if source is an .h5ad or .h5Seurat filename, the columns of obs to load. If not specified, load all columns. Specifying only a subset of columns can speed up reading. Not supported for .h5 files, since they only have a single obs column ('barcodes'), nor for Seurat and SingleCellExperiment .rds files, since .rds files do not support partial loading.

  • var_columns: str | Iterable[str]

    if source is an .h5ad, .h5, or .h5Seurat filename, the columns of var to load. If not specified, load all columns. Specifying only a subset of columns can speed up reading. Not supported for Seurat and SingleCellExperiment .rds files, since the .rds file format does not support partial loading.

  • num_threads: int

    the number of threads to use when reading .h5ad and .h5 files, and the default number of threads to use for all subsequent operations on this SingleCell dataset. Also sets the number of threads for this SingleCell dataset’s count matrix, if present. By default (num_threads=-1), use all available cores, as determined by os.cpu_count().
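The num_threads=-1 convention described above can be sketched in a few lines. The helper name resolve_num_threads is hypothetical and not part of brisc, and the validation of other values is an assumption made for illustration:

```python
import os


def resolve_num_threads(num_threads=-1):
    """Translate the num_threads=-1 convention into a concrete count.

    Hypothetical helper for illustration only; not a brisc function.
    """
    if num_threads == -1:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    if num_threads < 1:
        # Assumption: any other non-positive value is treated as an error.
        raise ValueError('num_threads must be -1 or a positive integer')
    return num_threads


all_cores = resolve_num_threads()      # every core reported by os.cpu_count()
four_threads = resolve_num_threads(4)  # an explicit thread count is kept as-is
```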

Examples

Load an .h5ad file:

>>> sc = SingleCell('data.h5ad')

Load an .h5ad file where raw counts are stored in a non-default location, adata.raw.counts (use SingleCell.ls('data.h5ad') to inspect its structure before loading):

>>> sc = SingleCell('data.h5ad', X_key='raw/counts')
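When X_key is left as None, the X_key parameter description gives a fallback order for .h5ad files. The following is a minimal sketch of that order, assuming the file's keys are available as a set of strings. The helper default_x_key is hypothetical, and brisc's internal logic may differ in detail:

```python
def default_x_key(available_keys):
    """Pick the key to load as X from the keys present in an .h5ad file.

    Hypothetical helper mirroring the documented fallback order.
    """
    # Prefer locations that conventionally hold raw counts.
    for key in ('layers/UMIs', 'raw/X'):
        if key in available_keys:
            return key
    # Otherwise fall back to the main matrix.
    return 'X'


raw_and_normalized = {'X', 'raw/X', 'obs', 'var'}
normalized_only = {'X', 'obs', 'var'}
key_with_raw = default_x_key(raw_and_normalized)
key_without_raw = default_x_key(normalized_only)
```

Passing X_key explicitly, as in the example above, bypasses this fallback entirely.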

Load only selected metadata columns from an .h5ad file to reduce loading time and memory usage:

>>> sc = SingleCell('data.h5ad',
...                 obs_columns=['cell_type', 'batch'],
...                 var_columns=['gene_symbol'])

Skip loading the count matrix to minimize memory usage (note: dataset cannot be saved, converted, or used for analyses that require X):

>>> sc = SingleCell('large_data.h5ad', X=False)

Load a Seurat .h5Seurat file:

>>> sc = SingleCell('seurat_obj.h5Seurat')

Load a Seurat .rds file:

>>> sc = SingleCell('seurat_obj.rds')

Load a Bioconductor SingleCellExperiment .rds file and use log-normalized counts:

>>> sc = SingleCell('sce_obj.rds', X_key='logcounts')

Load raw 10x Genomics data from an .h5 file:

>>> sc = SingleCell('matrix.h5')

Load raw 10x Genomics data from an .mtx.gz file, with barcodes and features stored in the same directory as barcodes.tsv.gz and features.tsv.gz in the usual way:

>>> sc = SingleCell('matrix.mtx.gz')
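The lookup of barcodes and features files next to the .mtx/.mtx.gz file, with ungzipped versions preferred, can be sketched as follows. The helper find_sidecar is hypothetical and only illustrates the documented preference; it is not brisc's implementation:

```python
import gzip
import tempfile
from pathlib import Path


def find_sidecar(mtx_path, stem):
    """Return the metadata file (e.g. barcodes.tsv) next to mtx_path.

    Hypothetical helper: prefers the ungzipped file and falls back
    to the .gz version.
    """
    directory = Path(mtx_path).parent
    for candidate in (directory / f'{stem}.tsv',
                      directory / f'{stem}.tsv.gz'):
        if candidate.exists():
            return candidate
    raise FileNotFoundError(
        f'no {stem}.tsv or {stem}.tsv.gz in {directory}')


# Build a throwaway 10x-style directory to exercise the helper.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / 'matrix.mtx.gz').touch()
(tmp_dir / 'barcodes.tsv').write_text('AAAC-1\n')
with gzip.open(tmp_dir / 'barcodes.tsv.gz', 'wt') as f:
    f.write('AAAC-1\n')
with gzip.open(tmp_dir / 'features.tsv.gz', 'wt') as f:
    f.write('ENSG0001\tGENE1\tGene Expression\n')

# Both barcodes files exist, so the ungzipped one wins.
barcodes_file = find_sidecar(tmp_dir / 'matrix.mtx.gz', 'barcodes')
# Only the gzipped features file exists, so it is used.
features_file = find_sidecar(tmp_dir / 'matrix.mtx.gz', 'features')
```

If the files live elsewhere, their paths can instead be passed via the obs and var arguments, as described under Parameters.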

Manually create a SingleCell dataset from an in-memory sparse matrix and metadata:

>>> import polars as pl
>>> from scipy.sparse import csr_array
>>> X = csr_array([[1, 0, 3], [0, 2, 0]])
>>> obs = pl.DataFrame({'cell_id': ['cell1', 'cell2']})
>>> var = pl.DataFrame({'gene_id': ['g1', 'g2', 'g3']})
>>> sc = SingleCell(X=X, obs=obs, var=var)

Notes

To avoid substantial overhead, both ordered and unordered categorical columns of obs and var will be loaded as polars Enums rather than polars Categoricals.

SingleCell does not support dense matrices, which are highly memory-inefficient for single-cell data. Passing a NumPy array as the X argument will give an error: convert it to a sparse array first with csr_array(numpy_array). However, when loading from disk or converting from other formats, dense matrices will be automatically converted to sparse matrices.
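As a small illustration of the conversion mentioned above, the following sketch turns a dense NumPy count matrix into a csr_array using only NumPy and SciPy. The variable names are illustrative:

```python
import numpy as np
from scipy.sparse import csr_array

# A small dense count matrix: 2 cells (rows) x 3 genes (columns).
dense_counts = np.array([[1, 0, 3],
                         [0, 2, 0]])

# Convert to compressed sparse row format, which stores only the non-zeros.
X = csr_array(dense_counts)
```

The resulting X can then be passed as the X argument together with obs and var, as in the manual-construction example under Examples.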

Reading and writing .rds files requires the ryp Python-R bridge. To create a SingleCell dataset from an in-memory Seurat or SingleCellExperiment object in the ryp R workspace, use from_seurat() or from_sce().

Reading and writing loom files is not supported because SingleCell only supports sparse count matrices, while loom only supports dense matrices. Using loom files for SingleCell data is not recommended due to this wastefulness. If necessary, loom files can be loaded with SingleCell(scanpy.read_loom(loom_filename)). scanpy.read_loom() implicitly converts the counts to a sparse matrix by default.

Analysis

qc_metrics

Adds quality-control metrics to obs for each cell: the sum of counts across all genes (num_counts), the number of genes with non-zero expression (num_genes), and the fraction of counts that are mitochondrial (mito_fraction).

qc

Adds a Boolean column to obs indicating which cells passed quality control (QC), or subsets to these cells if subset=True.

skip_qc

Skips QC, but allows the dataset to be used by downstream functions that require QCed data.

find_doublets

Find doublets using cxds (co-expression-based doublet scoring).

get_sample_covariates

Get a DataFrame of sample-level covariates, i.e. the columns of obs that are the same for all cells within each sample.

pseudobulk

Pseudobulk a SingleCell dataset with sample ID and cell type columns, yielding a Pseudobulk dataset.

hvg

Select highly variable genes using the same approach as Seurat.

normalize

Normalize this SingleCell dataset's counts.

pca

Compute principal components (PCs) across cells.

neighbors

Calculate the num_neighbors nearest neighbors of each cell.

shared_neighbors

Calculate the shared nearest neighbor graph of this dataset's cells.

harmonize

Harmonize this SingleCell dataset with other datasets, or harmonize multiple batches of the same dataset, with Harmony2.

cluster

Cluster cells into cell types using Leiden clustering.

label_transfer_from

Transfer cell-type labels from another dataset to this one, using the two datasets' Harmony embeddings from harmonize().

umap

Calculate a two-dimensional embedding of this SingleCell dataset with UMAP (Uniform Manifold Approximation and Projection), suitable for plotting with plot_embedding().

pacmap

Calculate a two-dimensional embedding of this SingleCell dataset with PaCMAP, suitable for plotting with plot_embedding().

localmap

Calculate a two-dimensional embedding of this SingleCell dataset with LocalMAP, suitable for plotting with plot_embedding().

find_markers

Find "marker genes" that distinguish each cell type from all other cell types.

plot_heatmap

Plot a heatmap of the count of each combination of two categorical columns, x and y.

plot_markers

Make a dot plot of a set of marker genes of interest across cell types.

plot_umap

Plot a UMAP embedding created with umap().

plot_pacmap

Plot a PaCMAP embedding created with pacmap().

plot_localmap

Plot a LocalMAP embedding created with localmap().

plot_embedding

Plot the specified 2D embedding.
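To make the three metrics listed under qc_metrics concrete, here is a minimal sketch that computes them directly from a sparse count matrix with NumPy and SciPy. It is not brisc's implementation, and identifying mitochondrial genes by an 'MT-' name prefix is an assumption made for this toy example:

```python
import numpy as np
from scipy.sparse import csr_array

# Toy counts: 3 cells (rows) x 4 genes (columns).
X = csr_array(np.array([[5, 0, 3, 2],
                        [0, 0, 0, 4],
                        [1, 1, 1, 1]]))
gene_names = np.array(['MT-CO1', 'ACTB', 'MT-ND1', 'GAPDH'])

# Assumption: mitochondrial genes are those whose names start with 'MT-'.
is_mito = np.char.startswith(gene_names, 'MT-')

# Sum of counts across all genes, per cell.
num_counts = np.asarray(X.sum(axis=1)).ravel()
# Stored entries per row; equals the number of expressed genes
# as long as X holds no explicitly stored zeros.
num_genes = np.diff(X.indptr)
# Fraction of each cell's counts that come from mitochondrial genes.
mito_counts = np.asarray(X[:, is_mito].sum(axis=1)).ravel()
mito_fraction = mito_counts / num_counts
```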

I/O

save

Save this SingleCell dataset to a file.

ls

Print the fields in an .h5ad file.

read_obs

Load just obs from an .h5ad file as a polars DataFrame.

read_var

Load just var from an .h5ad file as a polars DataFrame.

read_obsm

Load just obsm from an .h5ad file as a dictionary of NumPy arrays or DataFrames.

read_varm

Load just varm from an .h5ad file as a dictionary of NumPy arrays or DataFrames.

read_obsp

Load just obsp from an .h5ad file as a dictionary of sparse arrays.

read_varp

Load just varp from an .h5ad file as a dictionary of sparse arrays.

read_uns

Load just uns from an .h5ad file as a dictionary.

to_scanpy

Converts this SingleCell dataset to an AnnData object, the representation used by Scanpy.

from_seurat

Create a SingleCell dataset from a Seurat object that has already been loaded into memory via the ryp Python-R bridge.

to_seurat

Convert this SingleCell dataset to a Seurat object in the R workspace of the ryp Python-R bridge.

from_sce

Create a SingleCell dataset from a SingleCellExperiment object that has already been loaded into memory via the ryp Python-R bridge.

to_sce

Convert this SingleCell dataset to a SingleCellExperiment object in the R workspace of the ryp Python-R bridge.

Properties

X

The count matrix, as a sparse array.

obs

A Polars DataFrame of metadata for each cell.

var

A Polars DataFrame of metadata for each gene.

obsm

A dictionary of 2D NumPy arrays or polars DataFrames, where the length of each entry's first dimension is the number of cells.

varm

A dictionary of 2D NumPy arrays or polars DataFrames, where the length of each entry's first dimension is the number of genes.

obsp

A dictionary of 2D sparse arrays, where the length and width of each array is the number of cells.

varp

A dictionary of 2D sparse arrays, where the length and width of each array is the number of genes.

uns

A dictionary of miscellaneous metadata.

obs_names

A shortcut to access the first column of obs.

var_names

A shortcut to access the first column of var.

num_threads

The default number of threads used for this SingleCell dataset's operations.

shape

A length-2 tuple where the first element is the number of cells, and the second is the number of genes.

Data access

cell

Get the row of X corresponding to a single cell, based on the cell's name in obs_names.

gene

Get the column of X corresponding to a single gene, based on the gene's name in var_names.

peek_obs

Print a row of obs (the first row, by default) with each column on its own line.

peek_var

Print a row of var (the first row, by default) with each column on its own line.
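Conceptually, cell and gene translate a name into a row or column position of X. The sketch below shows that idea with plain SciPy indexing. The dictionaries cell_index and gene_index are illustrative stand-ins, not brisc internals, and the dense output format is chosen only for readability:

```python
import numpy as np
from scipy.sparse import csr_array

X = csr_array(np.array([[1, 0, 3],
                        [0, 2, 0]]))
obs_names = ['cell1', 'cell2']
var_names = ['g1', 'g2', 'g3']

# Map each name to its position so lookups by name become index lookups.
cell_index = {name: i for i, name in enumerate(obs_names)}
gene_index = {name: j for j, name in enumerate(var_names)}

# Row of X for one cell, densified to a 1D array of per-gene counts.
cell2_counts = X[[cell_index['cell2']], :].toarray().ravel()
# Column of X for one gene, densified to a 1D array of per-cell counts.
g3_counts = X[:, [gene_index['g3']]].toarray().ravel()
```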

Manipulation

set_obs_names

Sets a column as the new first column of obs, i.e. the obs_names.

set_var_names

Sets a column as the new first column of var, i.e. the var_names.

set_num_threads

Return a new SingleCell dataset with a different default number of threads.

make_obs_names_unique

Make obs_names unique.

make_var_names_unique

Make var_names unique.

filter_obs

Equivalent to df.filter() from polars, but applied to both obs/obsm and X.

filter_var

Equivalent to df.filter() from polars, but applied to both var/varm and X.

select_obs

Equivalent to df.select() from polars, but applied to obs.

select_var

Equivalent to df.select() from polars, but applied to var.

select_obsm

Subsets obsm to the specified key(s).

select_varm

Subsets varm to the specified key(s).

select_obsp

Subsets obsp to the specified key(s).

select_varp

Subsets varp to the specified key(s).

select_uns

Subsets uns to the specified key(s).

with_columns_obs

Equivalent to df.with_columns() from polars, but applied to obs.

with_columns_var

Equivalent to df.with_columns() from polars, but applied to var.

with_obsm

Adds one or more keys to obsm, overwriting existing keys with the same names if present.

with_varm

Adds one or more keys to varm, overwriting existing keys with the same names if present.

with_obsp

Adds one or more keys to obsp, overwriting existing keys with the same names if present.

with_varp

Adds one or more keys to varp, overwriting existing keys with the same names if present.

with_uns

Adds one or more keys to uns, overwriting existing keys with the same names if present.

drop_X

Create a new SingleCell dataset with X removed, to reduce memory use.

drop_obs

Create a new SingleCell dataset with columns and more_columns removed from obs.

drop_var

Create a new SingleCell dataset with columns and more_columns removed from var.

drop_obsm

Create a new SingleCell dataset with keys and more_keys removed from obsm.

drop_varm

Create a new SingleCell dataset with keys and more_keys removed from varm.

drop_obsp

Create a new SingleCell dataset with keys and more_keys removed from obsp.

drop_varp

Create a new SingleCell dataset with keys and more_keys removed from varp.

drop_uns

Create a new SingleCell dataset with keys and more_keys removed from uns.

rename_obs

Create a new SingleCell dataset with column(s) of obs renamed.

rename_var

Create a new SingleCell dataset with column(s) of var renamed.

rename_obsm

Create a new SingleCell dataset with key(s) of obsm renamed.

rename_varm

Create a new SingleCell dataset with key(s) of varm renamed.

rename_obsp

Create a new SingleCell dataset with key(s) of obsp renamed.

rename_varp

Create a new SingleCell dataset with key(s) of varp renamed.

rename_uns

Create a new SingleCell dataset with key(s) of uns renamed.

cast_X

Cast X to the specified data type.

cast_obs

Cast column(s) of obs to the specified data type(s).

cast_var

Cast column(s) of var to the specified data type(s).

join_obs

Left-join obs with another DataFrame.

join_var

Left-join var with another DataFrame.

subsample_obs

Subsample a specific number or fraction of cells.

subsample_var

Subsample a specific number or fraction of genes.

tocsr

Make a copy of this SingleCell dataset, converting X to a csr_array.

tocsc

Make a copy of this SingleCell dataset, converting X to a csc_array.

copy

Make a copy of this SingleCell dataset.

concat_obs

Concatenate one or more other SingleCell datasets with this one, cell-wise.

concat_var

Concatenate one or more other SingleCell datasets with this one, gene-wise.

split_by_obs

The opposite of concat_obs(): splits a SingleCell dataset into a dictionary of SingleCell datasets, one per unique value of a column of obs.

split_by_var

The opposite of concat_var(): splits a SingleCell dataset into a dictionary of SingleCell datasets, one per unique value of a column of var.

pipe

Apply a function to a SingleCell dataset.

pipe_X

Apply a function to a SingleCell dataset's X.

pipe_obs

Apply a function to a SingleCell dataset's obs.

pipe_var

Apply a function to a SingleCell dataset's var.

pipe_obsm

Apply a function to a SingleCell dataset's obsm.

pipe_obsm_key

Apply a function to a specific key in a SingleCell dataset's obsm.

pipe_varm

Apply a function to a SingleCell dataset's varm.

pipe_varm_key

Apply a function to a specific key in a SingleCell dataset's varm.

pipe_obsp

Apply a function to a SingleCell dataset's obsp.

pipe_obsp_key

Apply a function to a specific key in a SingleCell dataset's obsp.

pipe_varp

Apply a function to a SingleCell dataset's varp.

pipe_varp_key

Apply a function to a specific key in a SingleCell dataset's varp.

pipe_uns

Apply a function to a SingleCell dataset's uns.

pipe_uns_key

Apply a function to a specific key in a SingleCell dataset's uns.
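The key idea behind filter_obs, that a single filter on cells is applied to the cell metadata and to X together, can be sketched with a Boolean mask. This is an illustration only: a plain dictionary of NumPy arrays stands in for the polars obs DataFrame, and none of the names below are brisc APIs:

```python
import numpy as np
from scipy.sparse import csr_array

X = csr_array(np.array([[1, 0, 3],
                        [0, 2, 0],
                        [4, 0, 0]]))
# Stand-in for obs: one array per metadata column, one entry per cell.
obs = {'cell_id': np.array(['cell1', 'cell2', 'cell3']),
       'cell_type': np.array(['T', 'B', 'T'])}

# One Boolean mask over cells drives the subsetting of every
# cell-aligned piece, so metadata and counts stay in sync.
keep = obs['cell_type'] == 'T'
filtered_obs = {column: values[keep] for column, values in obs.items()}
filtered_X = X[keep, :]
```

filter_var follows the same pattern along the gene axis, with the mask applied to the columns of X instead of the rows.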