Integration and Label Transfer

This tutorial covers reference mapping: integrating an annotated reference dataset with a query dataset, then transferring the reference’s cell-type labels onto the query. It uses the same Parse Biosciences PBMC dataset of ~10 million cells as the other tutorials, treating the PBS-control cells as the annotated reference and the cytokine-treated cells as the query. Because the query cells also carry their own labels, we can check the transferred labels against them at the end.

Loading and quality control

Load the data and run QC as in Basic Workflow, keeping all cells (subset=False) so that QC results are recorded in a passed_QC column rather than by removing cells. Also load the treatment column, which defines the reference and query.

from brisc import SingleCell
import polars as pl

sc = SingleCell(
    'Parse_10M_PBMC_cytokines.h5ad', num_threads=-1,
    obs_columns=['sample', 'donor', 'cell_type', 'treatment', 'cytokine'])\
    .qc(subset=False, allow_float=True)

Reference and query

split_by_obs() splits a dataset into one SingleCell per value of an obs column, which you can then look up by value. Splitting on treatment separates the PBS controls (the reference) from the cytokine-treated cells (the query). Cells that failed QC are dropped during the split, so neither dataset contains them.

sc = sc.split_by_obs('treatment')
sc_ref, sc_query = sc['PBS'], sc['cytokine']
print(sc_ref)
print(sc_query)
SingleCell dataset in CSR format with 603,928 cells (obs), 40,352 genes (var), and 1,164,409,519 non-zero float32 entries (X)
    obs: _index, sample, donor, cell_type, treatment, cytokine, passed_QC
    var: _index, n_cells
SingleCell dataset in CSR format with 8,839,235 cells (obs), 40,352 genes (var), and 17,313,941,540 non-zero float32 entries (X)
    obs: _index, sample, donor, cell_type, treatment, cytokine, passed_QC
    var: _index, n_cells

The reference is far smaller than the query (about 0.6 million cells versus 8.8 million). This is typical of reference mapping, where a small, carefully annotated dataset labels a much larger one.

Integration

Integrating two datasets puts their cells in one batch-corrected coordinate space, so cells of the same type align regardless of which dataset they came from. hvg(), pca(), and harmonize() each accept additional datasets and process them jointly, returning the same datasets as a tuple. So feature selection, PCA, and integration all run on the reference and query together (normalization runs on each dataset on its own).

sc_ref, sc_query = sc_ref.hvg(sc_query, batch_column='donor')
sc_ref = sc_ref.normalize()
sc_query = sc_query.normalize()
sc_ref, sc_query = sc_ref.pca(sc_query)
sc_ref, sc_query = sc_ref.harmonize(sc_query)

harmonize() corrects the shared PCs with Harmony, treating each dataset as a batch, and stores the result in obsm['harmony'].
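
As a rough illustration of what a shared principal-component space means, the sketch below fits a single PCA on the stacked reference and query cells with NumPy and projects both datasets onto the same axes. This is an assumption about the general technique, not brisc's implementation of pca(); the random toy matrices are made up, and no Harmony correction is applied here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy normalized expression: 6 reference and 10 query cells over 5 genes.
X_ref = rng.normal(size=(6, 5))
X_query = rng.normal(size=(10, 5))

# Fit one PCA on the stacked cells so both datasets share the same axes.
X = np.vstack([X_ref, X_query])
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:2]  # top 2 principal axes, shape (2, 5)

# Project each dataset onto the shared axes.
pcs_ref = (X_ref - mean) @ components.T      # shape (6, 2)
pcs_query = (X_query - mean) @ components.T  # shape (10, 2)
print(pcs_ref.shape, pcs_query.shape)
```

Because both projections use the same mean and the same components, coordinates from the two datasets are directly comparable, which is what a subsequent batch-correction step needs as input.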

Label transfer

label_transfer_from() transfers cell-type labels from the reference to the query. For each query cell, it finds the 20 nearest reference cells in the shared Harmony embedding (obsm['harmony']) and assigns the most common reference label; the fraction of those neighbors that agree becomes a confidence score.

sc_query = sc_query.label_transfer_from(
    sc_ref, 'cell_type', cell_type_column='cell_type_transferred')

This adds cell_type_transferred and cell_type_transferred_confidence to obs:

print(sc_query.obs.select('cell_type', 'cell_type_transferred',
                          'cell_type_transferred_confidence').head(10))
shape: (10, 3)
┌────────────┬───────────────────────┬──────────────────────────────────┐
│ cell_type  ┆ cell_type_transferred ┆ cell_type_transferred_confidence │
│ ---        ┆ ---                   ┆ ---                              │
│ enum       ┆ enum                  ┆ f32                              │
╞════════════╪═══════════════════════╪══════════════════════════════════╡
│ CD8 Naive  ┆ CD8 Naive             ┆ 1.0                              │
│ B Naive    ┆ B Naive               ┆ 0.85                             │
│ CD14 Mono  ┆ CD14 Mono             ┆ 1.0                              │
│ CD14 Mono  ┆ CD14 Mono             ┆ 0.9                              │
│ CD4 Naive  ┆ CD4 Naive             ┆ 0.95                             │
│ CD8 Naive  ┆ CD4 Memory            ┆ 0.45                             │
│ NK         ┆ NK                    ┆ 1.0                              │
│ CD4 Memory ┆ CD4 Memory            ┆ 0.95                             │
│ NK         ┆ NK                    ┆ 1.0                              │
│ cDC        ┆ cDC                   ┆ 1.0                              │
└────────────┴───────────────────────┴──────────────────────────────────┘

Most calls are confident. The one low-confidence row (0.45, meaning only 9 of its 20 nearest reference cells agreed on the label) is a CD8 Naive cell labeled CD4 Memory. The confidence score flags uncertain transfers like this one so you can filter on them. Passing next_best=True also records each cell’s runner-up label and its confidence (in next_best_cell_type_transferred and next_best_cell_type_transferred_confidence), which helps when a cell falls between two similar types.
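
The nearest-neighbor vote described above can be sketched in a few lines of NumPy. This is a brute-force illustration under stated assumptions, not brisc's implementation: knn_label_transfer is a hypothetical helper, Euclidean distance is assumed, ties go to whichever label sorts first, and the toy embedding is made up.

```python
import numpy as np

def knn_label_transfer(ref_embedding, ref_labels, query_embedding, k=20):
    """Majority-vote label transfer with brute-force nearest neighbors."""
    ref_labels = np.asarray(ref_labels)
    # Squared Euclidean distance from every query cell to every reference cell.
    diffs = query_embedding[:, None, :] - ref_embedding[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Indices of the k closest reference cells for each query cell.
    neighbors = np.argsort(dists, axis=1)[:, :k]
    labels, confidences = [], []
    for row in neighbors:
        values, counts = np.unique(ref_labels[row], return_counts=True)
        best = counts.argmax()
        labels.append(values[best])
        # Confidence is the fraction of the k neighbors that agree.
        confidences.append(counts[best] / k)
    return np.array(labels), np.array(confidences)

# Toy 2-D embedding: three 'B' cells near the origin, two 'T' cells near (5, 5).
ref = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
ref_labels = ['B', 'B', 'B', 'T', 'T']
query = np.array([[0.05, 0.05], [5.0, 5.1]])

labels, confidence = knn_label_transfer(ref, ref_labels, query, k=3)
print(labels)      # ['B' 'T']
print(confidence)  # first cell: 3 of 3 neighbors agree; second: 2 of 3
```

The second query cell sits next to the two 'T' cells, but its third-nearest neighbor is a 'B' cell, so it is still labeled 'T' with a confidence of 2/3. A brute-force distance matrix like this only works for toy data; at millions of cells an approximate nearest-neighbor index would be needed.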

Validation

Because the query carries ground-truth labels, we can measure how well the transfer recovered them.

match = (pl.col('cell_type').cast(pl.String) ==
         pl.col('cell_type_transferred').cast(pl.String))
print(f'overall accuracy: {sc_query.obs.select(match.mean()).item():.1%}')
print(sc_query.obs.group_by('cell_type').agg(
    n_cells=pl.len(),
    accuracy=match.mean(),
    mean_confidence=pl.col('cell_type_transferred_confidence').mean())
    .sort('accuracy', descending=True))
overall accuracy: 88.1%
shape: (18, 4)
┌───────────────────────┬─────────┬──────────┬─────────────────┐
│ cell_type             ┆ n_cells ┆ accuracy ┆ mean_confidence │
│ ---                   ┆ ---     ┆ ---      ┆ ---             │
│ enum                  ┆ u32     ┆ f64      ┆ f32             │
╞═══════════════════════╪═════════╪══════════╪═════════════════╡
│ CD14 Mono             ┆ 1443470 ┆ 0.993401 ┆ 0.988664        │
│ Plasmablast           ┆ 3908    ┆ 0.973132 ┆ 0.969703        │
│ cDC                   ┆ 102056  ┆ 0.970987 ┆ 0.974583        │
│ NK                    ┆ 475494  ┆ 0.970096 ┆ 0.956886        │
│ pDC                   ┆ 17561   ┆ 0.966346 ┆ 0.97694         │
│ B Naive               ┆ 547712  ┆ 0.965303 ┆ 0.959042        │
│ HSPC                  ┆ 15324   ┆ 0.938593 ┆ 0.967753        │
│ CD16 Mono             ┆ 212214  ┆ 0.935381 ┆ 0.945373        │
│ B Intermediate/Memory ┆ 279140  ┆ 0.907025 ┆ 0.920662        │
│ CD4 Naive             ┆ 1603760 ┆ 0.889186 ┆ 0.849431        │
│ ILC                   ┆ 7780    ┆ 0.879692 ┆ 0.942018        │
│ CD8 Naive             ┆ 573696  ┆ 0.879201 ┆ 0.865165        │
│ NK CD56bright         ┆ 116694  ┆ 0.856462 ┆ 0.918812        │
│ CD8 Memory            ┆ 684115  ┆ 0.846543 ┆ 0.821517        │
│ CD4 Memory            ┆ 2164338 ┆ 0.806086 ┆ 0.818958        │
│ MAIT                  ┆ 289824  ┆ 0.792943 ┆ 0.840404        │
│ Treg                  ┆ 156180  ┆ 0.791087 ┆ 0.864583        │
│ NKT                   ┆ 145969  ┆ 0.433667 ┆ 0.727498        │
└───────────────────────┴─────────┴──────────┴─────────────────┘

Distinct populations transfer almost perfectly, whether abundant (CD14 Mono 99%, NK 97%, B Naive 97%) or rare (Plasmablast 97%, pDC 97%). Closely related types are harder: the T-cell subsets CD4 Memory, MAIT, and Treg land at 79 to 81%, and NKT (43%) is mostly absorbed into the neighboring NK and CD8 populations, which its low mean confidence (0.73) reflects.

Pipeline summary

The full reference-mapping pipeline:

sc = SingleCell('data.h5ad').qc(allow_float=True)
sc = sc.split_by_obs('treatment')
sc_ref, sc_query = sc['PBS'], sc['cytokine']
sc_ref, sc_query = sc_ref.hvg(sc_query, batch_column='donor')
sc_ref = sc_ref.normalize()
sc_query = sc_query.normalize()
sc_ref, sc_query = sc_ref.pca(sc_query)
sc_ref, sc_query = sc_ref.harmonize(sc_query)
sc_query = sc_query.label_transfer_from(
    sc_ref, 'cell_type', cell_type_column='cell_type_transferred')

Step               Method                                What it does
-----------------  ------------------------------------  ------------------------------------------------------
Split              sc.split_by_obs('treatment')          Separate the reference (PBS) and query (cytokine)
Feature selection  sc_ref.hvg(sc_query)                  Highly variable genes across both datasets
Normalization      sc_ref.normalize()                    log1pPF log-normalization, per dataset
PCA                sc_ref.pca(sc_query)                  A shared principal-component space
Integration        sc_ref.harmonize(sc_query)            Harmony-corrected embedding in obsm['harmony']
Label transfer     sc_query.label_transfer_from(sc_ref)  Transfer labels via nearest neighbors in Harmony space