scanpy.external.pp.scanorama_integrate(adata, key, basis='X_pca', adjusted_basis='X_scanorama', knn=20, sigma=15, approx=True, alpha=0.1, batch_size=5000, **kwargs)

Use Scanorama [Hie19] to integrate different experiments.

Scanorama [Hie19] is an algorithm for integrating single-cell data from multiple experiments stored in an AnnData object. This function should be run after performing PCA but before computing the neighbor graph, as illustrated in the example below.

This uses the implementation of scanorama [Hie19].

adata : AnnData

The annotated data matrix.

key : str

The name of the column in adata.obs that differentiates among experiments/batches. Cells from the same batch must be contiguously stored in adata.
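The contiguity requirement can be checked and, if needed, satisfied by reordering the cells before integration. The sketch below is illustrative only: the helpers is_contiguous and contiguous_order are hypothetical names, not part of Scanpy or Scanorama, and they operate on a plain list of batch labels such as adata.obs[key].tolist().

```python
from itertools import groupby


def is_contiguous(labels):
    """Return True if every label occupies a single unbroken run."""
    runs = [label for label, _ in groupby(labels)]
    return len(runs) == len(set(runs))


def contiguous_order(labels):
    """Return indices that group identical labels together.

    Batches keep their order of first appearance, and cells keep their
    relative order within each batch (sorted() is stable).
    """
    first_seen = {}
    for label in labels:
        first_seen.setdefault(label, len(first_seen))
    return sorted(range(len(labels)), key=lambda i: first_seen[labels[i]])


labels = ["a", "b", "a", "b", "b"]      # interleaved batches
order = contiguous_order(labels)        # [0, 2, 1, 3, 4]
reordered = [labels[i] for i in order]  # ['a', 'a', 'b', 'b', 'b']
```

With an AnnData object, the same index list could be applied as adata = adata[order].copy() before calling the integration function, assuming the labels were taken from adata.obs[key].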

basis : str (default: 'X_pca')

The name of the field in adata.obsm where the PCA table is stored. Defaults to 'X_pca', which is the default output field of Scanpy's PCA.

adjusted_basis : str (default: 'X_scanorama')

The name of the field in adata.obsm where the integrated embeddings will be stored after running this function. Defaults to 'X_scanorama'.

knn : int (default: 20)

Number of nearest neighbors to use for matching.

sigma : float (default: 15)

Smoothing parameter of the Gaussian kernel used for the correction.

approx : bool (default: True)

Use approximate nearest neighbors via the Python annoy package; this greatly speeds up matching.

alpha : float (default: 0.1)

Minimum alignment score cutoff.

batch_size : int (default: 5000)

The batch size used in the alignment vector computation. Useful when integrating very large (>100k samples) datasets. Set it to the largest value that runs within available memory.
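To make the memory trade-off concrete, here is a minimal sketch of chunked processing in general. It is not Scanorama's actual code, and iter_batches is a hypothetical helper: the point is only that work is done on at most batch_size rows at a time, so peak memory scales with batch_size rather than with the total number of cells.

```python
def iter_batches(n_items, batch_size=5000):
    """Yield (start, stop) index pairs covering range(n_items) in chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


# 12,000 cells processed 5,000 at a time: two full chunks and one partial.
chunks = list(iter_batches(12_000, batch_size=5000))
# [(0, 5000), (5000, 10000), (10000, 12000)]
```

A larger batch_size means fewer chunks and less overhead, at the cost of more memory per chunk.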


**kwargs

Any additional arguments will be passed to scanorama.integrate().


Updates adata with the field adata.obsm[adjusted_basis], containing Scanorama embeddings such that different experiments are integrated.


First, load the libraries and an example dataset, then preprocess and run PCA.

>>> import scanpy as sc
>>> import scanpy.external as sce
>>> adata = sc.datasets.pbmc3k()
>>> sc.pp.recipe_zheng17(adata)
>>> sc.pp.pca(adata)

We now arbitrarily assign a batch metadata variable to each cell for the sake of example, but during real usage there would already be a column in adata.obs giving the experiment each cell came from.

>>> adata.obs['batch'] = 1350*['a'] + 1350*['b']

Finally, run Scanorama. Afterwards, there will be a new table in adata.obsm containing the Scanorama embeddings.

>>> sce.pp.scanorama_integrate(adata, 'batch')
>>> 'X_scanorama' in adata.obsm
True