Integrating data using ingest and BBKNN#
The following tutorial describes a simple PCA-based method for integrating data we call ingest and compares it with BBKNN [Polanski19]. BBKNN integrates well with the Scanpy workflow and is accessible through the bbknn function.
The ingest function assumes an annotated reference dataset that captures the biological variability of interest. The rational is to fit a model on the reference data and use it to project new data. For the time being, this model is a PCA combined with a neighbor lookup search tree, for which we use UMAP’s implementation [McInnes18]. Similar PCA-based integrations have been used before, for instance, in [Weinreb18].
As ingest is simple and the procedure clear, the workflow is transparent and fast.
Like BBKNN, ingest leaves the data matrix itself invariant.
Unlike BBKNN, ingest solves the label mapping problem (like scmap) and maintains an embedding that might have desired properties like specific clusters or trajectories.
We refer to this asymmetric dataset integration as ingesting annotations from an annotated reference adata_ref
into an adata
that still lacks this annotation. It is different from learning a joint representation that integrates datasets in a symmetric way as BBKNN, Scanorma, Conos, CCA (e.g. in Seurat) or a conditional VAE (e.g. in scVI, trVAE) would do, but comparable to the initiall MNN implementation in scran. Take a look at tools in the external API or at the ecoystem page to get a start with other tools.
import anndata
import scanpy as sc
import pandas as pd
sc.settings.verbosity = 1 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3), facecolor="white")
-----
anndata 0.10.9
scanpy 1.10.3
-----
Cython 3.0.11
PIL 10.4.0
asttokens NA
charset_normalizer 3.3.2
comm 0.2.2
cycler 0.12.1
cython 3.0.11
cython_runtime NA
dateutil 2.9.0.post0
debugpy 1.8.5
decorator 5.1.1
executing 2.1.0
h5py 3.11.0
igraph 0.11.6
ipykernel 6.29.5
jaraco NA
jedi 0.19.1
joblib 1.4.2
kiwisolver 1.4.7
legacy_api_wrap NA
leidenalg 0.10.2
llvmlite 0.43.0
louvain 0.8.2
matplotlib 3.9.2
more_itertools 10.5.0
mpl_toolkits NA
natsort 8.4.0
numba 0.60.0
numpy 2.0.2
packaging 24.1
pandas 2.2.3
parso 0.8.4
pkg_resources NA
platformdirs 4.3.3
prompt_toolkit 3.0.47
psutil 6.0.0
pure_eval 0.2.3
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.18.0
pyparsing 3.1.4
pytz 2024.2
scipy 1.14.1
session_info 1.0.0
six 1.16.0
sklearn 1.5.2
stack_data 0.6.3
texttable 1.7.0
threadpoolctl 3.5.0
tornado 6.4.1
traitlets 5.14.3
vscode NA
wcwidth 0.2.13
zmq 26.2.0
-----
IPython 8.27.0
jupyter_client 8.6.2
jupyter_core 5.7.2
-----
Python 3.12.7 (main, Oct 1 2024, 11:15:50) [GCC 14.2.1 20240910]
Linux-6.11.3-zen1-1-zen-x86_64-with-glibc2.40
-----
Session information updated at 2024-10-14 15:28
PBMCs#
We consider an annotated reference dataset adata_ref
and a dataset for which you want to query labels and embeddings adata
.
# this is an earlier version of the dataset from the pbmc3k tutorial
adata_ref = sc.datasets.pbmc3k_processed()
adata = sc.datasets.pbmc68k_reduced()
To use sc.tl.ingest
, the datasets need to be defined on the same variables.
var_names = adata_ref.var_names.intersection(adata.var_names)
adata_ref = adata_ref[:, var_names].copy()
adata = adata[:, var_names].copy()
The model and graph (here PCA, neighbors, UMAP) trained on the reference data will explain the biological variation observed within it.
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
The manifold still looks essentially the same as in the clustering tutorial.
Mapping PBMCs using ingest#
Let’s map labels and embeddings from adata_ref
to adata
based on a chosen representation. Here, we use adata_ref.obsm['X_pca']
to map cluster labels and the UMAP coordinates.
sc.tl.ingest(adata, adata_ref, obs="louvain")
adata.uns["louvain_colors"] = adata_ref.uns["louvain_colors"] # fix colors
By comparing the ‘bulk_labels’ annotation with ‘louvain’, we see that the data has been reasonably mapped, only the annotation of dendritic cells seems ambiguous and might have been ambiiguous in adata
already.
adata_concat = anndata.concat([adata_ref, adata], label="batch", keys=["ref", "new"])
adata_concat.obs["louvain"] = (
adata_concat.obs["louvain"]
.astype("category")
.cat.reorder_categories(adata_ref.obs["louvain"].cat.categories)
)
# fix category colors
adata_concat.uns["louvain_colors"] = adata_ref.uns["louvain_colors"]
While there seems to be some batch-effect in the monocytes and dendritic cell clusters, the new data is otherwise mapped relatively homogeneously.
The megakaryoctes are only present in adata_ref
and no cells from adata
map onto them. If interchanging reference data and query data, Megakaryocytes do not appear as a separate cluster anymore. This is an extreme case as the reference data is very small; but one should always question if the reference data contain enough biological variation to meaningfully accomodate query data.
Using BBKNN#
sc.tl.pca(adata_concat)
%%time
sc.external.pp.bbknn(adata_concat, batch_key="batch") # running bbknn 1.3.6
WARNING: consider updating your call to make use of `computation`
CPU times: user 1.56 s, sys: 19 ms, total: 1.58 s
Wall time: 771 ms
sc.tl.umap(adata_concat)
Also BBKNN doesn’t maintain the Megakaryocytes cluster. However, it seems to mix cells more homogeneously.
Pancreas#
The following data has been used in the scGen paper [Lotfollahi et al., 2019], has been used here, was curated here and can be downloaded from here (the BBKNN paper).
It contains data for human pancreas from 4 different studies [Baron et al., 2016, Muraro et al., 2016, Segerstolpe et al., 2016, Wang et al., 2016], which have been used in the seminal papers on single-cell dataset integration [Butler et al., 2018, Haghverdi et al., 2018] and many times ever since.
# note that this collection of batches is already intersected on the genes
adata_all = sc.read(
"data/pancreas.h5ad",
backup_url="https://www.dropbox.com/s/qj1jlm9w10wmt0u/pancreas.h5ad?dl=1",
)
adata_all.shape
(14693, 2448)
Inspect the cell types observed in these studies.
counts = adata_all.obs["celltype"].value_counts()
counts.to_frame()
count | |
---|---|
celltype | |
alpha | 4214 |
beta | 3354 |
ductal | 1804 |
acinar | 1368 |
not applicable | 1154 |
delta | 917 |
gamma | 571 |
endothelial | 289 |
activated_stellate | 284 |
dropped | 178 |
quiescent_stellate | 173 |
mesenchymal | 80 |
macrophage | 55 |
PSC | 54 |
unclassified endocrine | 41 |
co-expression | 39 |
mast | 32 |
epsilon | 28 |
mesenchyme | 27 |
schwann | 13 |
t_cell | 7 |
MHC class II | 5 |
unclear | 4 |
unclassified | 2 |
To simplify visualization, let’s remove the 5 minority classes.
minority_classes = counts.index[-5:].tolist() # get the minority classes
# actually subset
adata_all = adata_all[~adata_all.obs["celltype"].isin(minority_classes)].copy()
# reorder according to abundance
adata_all.obs["celltype"] = adata_all.obs["celltype"].cat.reorder_categories(
counts.index[:-5].tolist()
)
Seeing the batch effect#
sc.pp.pca(adata_all)
sc.pp.neighbors(adata_all)
sc.tl.umap(adata_all)
We observe a batch effect.
BBKNN#
It can be well-resolved using BBKNN [Polanski19].
%%time
sc.external.pp.bbknn(adata_all, batch_key="batch")
WARNING: consider updating your call to make use of `computation`
CPU times: user 1.17 s, sys: 24 ms, total: 1.2 s
Wall time: 1.14 s
sc.tl.umap(adata_all)
If one prefers to work more iteratively starting from one reference dataset, one can use ingest.
Mapping onto a reference batch using ingest#
Choose one reference batch for training the model and setting up the neighborhood graph (here, a PCA) and separate out all other batches.
As before, the model trained on the reference batch will explain the biological variation observed within it.
adata_ref = adata_all[adata_all.obs["batch"] == "0"].copy()
Compute the PCA, neighbors and UMAP on the reference data.
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
The reference batch contains 12 of the 19 cell types across all batches.
Iteratively map labels (such as ‘celltype’) and embeddings (such as ‘X_pca’ and ‘X_umap’) from the reference data onto the query batches.
adatas = [adata_all[adata_all.obs["batch"] == i].copy() for i in ["1", "2", "3"]]
sc.settings.verbosity = 2 # a bit more logging
for iadata, adata in enumerate(adatas, 1):
print(f"... integrating batch {iadata}")
adata.obs["celltype_orig"] = adata.obs["celltype"] # save the original cell type
sc.tl.ingest(adata, adata_ref, obs="celltype")
... integrating batch 1
running ingest
finished (0:00:02)
... integrating batch 2
running ingest
finished (0:00:03)
... integrating batch 3
running ingest
finished (0:00:01)
Each of the query batches now carries annotation that has been contextualized with adata_ref
. By concatenating, we can view it together.
adata_concat = anndata.concat([adata_ref, *adatas], label="batch", join="outer")
adata_concat.obs["celltype"] = (
adata_concat.obs["celltype"]
.astype("category")
.cat.reorder_categories(adata_ref.obs["celltype"].cat.categories)
)
# fix category coloring
adata_concat.uns["celltype_colors"] = adata_ref.uns["celltype_colors"]
Compared to the BBKNN result, this is maintained clusters in a much more pronounced fashion. If one already observed a desired continuous structure (as in the hematopoietic datasets, for instance), ingest
allows to easily maintain this structure.
Evaluating consistency#
Let us subset the data to the query batches.
adata_query = adata_concat[adata_concat.obs["batch"].isin(["1", "2", "3"])].copy()
The following plot is a bit hard to read, hence, move on to confusion matrices below.
Cell types conserved across batches#
Let us first focus on cell types that are conserved with the reference, to simplify reading of the confusion matrix.
# intersected categories
conserved_categories = adata_query.obs["celltype"].cat.categories.intersection(
adata_query.obs["celltype_orig"].cat.categories
)
# intersect categories
obs_query_conserved = adata_query.obs.loc[
adata_query.obs["celltype"].isin(conserved_categories)
& adata_query.obs["celltype_orig"].isin(conserved_categories)
].copy()
# remove unused categories
obs_query_conserved["celltype"] = obs_query_conserved[
"celltype"
].cat.remove_unused_categories()
# remove unused categories and fix category ordering
obs_query_conserved["celltype_orig"] = (
obs_query_conserved["celltype_orig"]
.cat.remove_unused_categories()
.cat.reorder_categories(obs_query_conserved["celltype"].cat.categories)
)
pd.crosstab(obs_query_conserved["celltype"], obs_query_conserved["celltype_orig"])
celltype_orig | alpha | beta | ductal | acinar | delta | gamma | endothelial |
---|---|---|---|---|---|---|---|
celltype | |||||||
alpha | 1814 | 3 | 16 | 1 | 1 | 20 | 0 |
beta | 53 | 805 | 6 | 1 | 11 | 35 | 0 |
ductal | 7 | 5 | 681 | 241 | 0 | 0 | 0 |
acinar | 2 | 3 | 3 | 165 | 0 | 3 | 0 |
delta | 6 | 3 | 2 | 0 | 304 | 71 | 0 |
gamma | 1 | 5 | 0 | 1 | 0 | 187 | 0 |
endothelial | 2 | 0 | 0 | 0 | 0 | 0 | 36 |
Overall, the conserved cell types are also mapped as expected. The main exception are some acinar cells in the original annotation that appear as acinar cells. However, already the reference data is observed to feature a cluster of both acinar and ductal cells, which explains the discrepancy, and indicates a potential inconsistency in the initial annotation.
All cell types#
Let us now move on to look at all cell types.
pd.crosstab(adata_query.obs["celltype"], adata_query.obs["celltype_orig"])
celltype_orig | PSC | acinar | alpha | beta | co-expression | delta | dropped | ductal | endothelial | epsilon | gamma | mast | mesenchymal | mesenchyme | not applicable | unclassified endocrine |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
celltype | ||||||||||||||||
alpha | 0 | 1 | 1814 | 3 | 2 | 1 | 40 | 16 | 0 | 3 | 20 | 7 | 0 | 0 | 312 | 10 |
beta | 1 | 1 | 53 | 805 | 37 | 11 | 41 | 6 | 0 | 0 | 35 | 0 | 0 | 1 | 522 | 24 |
ductal | 0 | 241 | 7 | 5 | 0 | 0 | 37 | 681 | 0 | 0 | 0 | 0 | 2 | 0 | 94 | 0 |
acinar | 0 | 165 | 2 | 3 | 0 | 0 | 23 | 3 | 0 | 0 | 3 | 0 | 0 | 0 | 90 | 0 |
delta | 0 | 0 | 6 | 3 | 0 | 304 | 13 | 2 | 0 | 6 | 71 | 0 | 0 | 0 | 95 | 7 |
gamma | 0 | 1 | 1 | 5 | 0 | 0 | 2 | 0 | 0 | 1 | 187 | 0 | 0 | 0 | 15 | 0 |
endothelial | 1 | 0 | 2 | 0 | 0 | 0 | 6 | 0 | 36 | 0 | 0 | 0 | 0 | 6 | 7 | 0 |
activated_stellate | 48 | 1 | 1 | 3 | 0 | 0 | 11 | 6 | 0 | 0 | 0 | 0 | 78 | 20 | 17 | 0 |
quiescent_stellate | 4 | 0 | 1 | 1 | 0 | 0 | 5 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
macrophage | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
We observe that PSC (pancreatic stellate cells) cells are in fact just inconsistently annotated and correctly mapped on ‘activated_stellate’ cells.
Also, it’s nice to see that ‘mesenchyme’ and ‘mesenchymal’ cells both map onto the same category. However, that category is again ‘activated_stellate’ and likely incorrect.
Visualizing distributions across batches#
Often, batches correspond to experiments that one wants to compare. Scanpy offers to convenient visualization possibilities for this.
a density plot
a partial visualization of a subset of categories/groups in an emnbedding
Density plot#
sc.tl.embedding_density(adata_concat, groupby="batch")
computing density on 'umap'