scanpy.external.pp.scanorama_integrate
- scanpy.external.pp.scanorama_integrate(adata, key, *, basis='X_pca', adjusted_basis='X_scanorama', knn=20, sigma=15, approx=True, alpha=0.1, batch_size=5000, **kwargs)
Use Scanorama [Hie et al., 2019] to integrate different experiments.
Scanorama [Hie et al., 2019] is an algorithm for integrating single-cell data from multiple experiments stored in an AnnData object. This function should be run after performing PCA but before computing the neighbor graph, as illustrated in the example below.
This uses the implementation of scanorama [Hie et al., 2019].
- Parameters:
- adata
AnnData
The annotated data matrix.
- key
str
The name of the column in adata.obs that differentiates among experiments/batches. Cells from the same batch must be stored contiguously in adata.
- basis
str (default: 'X_pca')
The name of the field in adata.obsm where the PCA table is stored. Defaults to 'X_pca', which is the default for sc.pp.pca().
- adjusted_basis
str (default: 'X_scanorama')
The name of the field in adata.obsm where the integrated embeddings will be stored after running this function. Defaults to 'X_scanorama'.
- knn
int (default: 20)
Number of nearest neighbors to use for matching.
- sigma
float (default: 15)
Correction smoothing parameter on Gaussian kernel.
- approx
bool (default: True)
Use approximate nearest neighbors with Python annoy; greatly speeds up matching runtime.
- alpha
float (default: 0.1)
Alignment score minimum cutoff.
- batch_size
int (default: 5000)
The batch size used in the alignment vector computation. Useful when integrating very large (>100k samples) datasets. Set to a large value that still runs within available memory.
- kwargs
Any additional arguments will be passed to scanorama.assemble().
- Return type:
- Returns:
Updates adata with the field adata.obsm[adjusted_basis], containing Scanorama embeddings such that different experiments are integrated.
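The key parameter requires that cells from the same batch are stored contiguously. The following is a minimal sketch, in plain Python, of one way to check that requirement and to compute a row order that satisfies it. The helper names is_contiguous and contiguous_order are hypothetical and are not part of scanpy or scanorama.

```python
from itertools import groupby


def is_contiguous(labels):
    """Return True if every distinct label occupies one unbroken run."""
    # groupby collapses consecutive equal labels into a single run.
    runs = [label for label, _ in groupby(labels)]
    # A label that shows up in more than one run is not contiguous.
    return len(runs) == len(set(runs))


def contiguous_order(labels):
    """Return row positions that group equal labels together.

    Groups are ordered by first appearance, and the original relative
    order is preserved within each group (sorted() is stable).
    """
    first_seen = {}
    for position, label in enumerate(labels):
        first_seen.setdefault(label, position)
    return sorted(range(len(labels)), key=lambda i: first_seen[labels[i]])


# Hypothetical, interleaved batch labels (for illustration only).
labels = ['a', 'b', 'a', 'b', 'b']
order = contiguous_order(labels)
reordered = [labels[i] for i in order]

print(is_contiguous(labels))     # False
print(order)                     # [0, 2, 1, 3, 4]
print(is_contiguous(reordered))  # True
```

With a real AnnData object, the labels would presumably come from list(adata.obs[key]), and the computed order could be applied with something like adata = adata[order].copy() before calling this function. Treat that as an assumption to verify against your anndata version.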
Example
First, load libraries and example dataset, and preprocess.
>>> import scanpy as sc
>>> import scanpy.external as sce
>>> adata = sc.datasets.pbmc3k()
>>> sc.pp.recipe_zheng17(adata)
>>> sc.pp.pca(adata)
We now arbitrarily assign a batch metadata variable to each cell for the sake of example, but during real usage there would already be a column in adata.obs giving the experiment each cell came from.
>>> adata.obs['batch'] = 1350*['a'] + 1350*['b']
Finally, run Scanorama. Afterwards, there will be a new table in adata.obsm containing the Scanorama embeddings.
>>> sce.pp.scanorama_integrate(adata, 'batch', verbose=1)
Processing datasets a <=> b
>>> 'X_scanorama' in adata.obsm
True
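Since this function is meant to run before the neighbor graph is computed, a natural next step is to build that graph on the integrated embedding rather than on the original PCA. The sketch below shows one way to do so. The wrapper name neighbors_from_integration is hypothetical, and scanpy is imported lazily inside it so that the definition itself does not require scanpy to be installed.

```python
def neighbors_from_integration(adata, rep="X_scanorama"):
    """Compute the neighbor graph and a UMAP from an integrated embedding.

    `rep` is assumed to be the `adjusted_basis` that was passed to
    scanorama_integrate (default 'X_scanorama').
    """
    if rep not in adata.obsm:
        raise KeyError(
            f"{rep!r} not found in adata.obsm; run scanorama_integrate first."
        )

    import scanpy as sc  # imported lazily; only needed when actually called

    # Build the kNN graph on the Scanorama embedding instead of 'X_pca'.
    sc.pp.neighbors(adata, use_rep=rep)
    # Downstream tools such as UMAP then use the integrated graph.
    sc.tl.umap(adata)
    return adata
```

After the example above, this would be called as neighbors_from_integration(adata). The important part is the use_rep argument to sc.pp.neighbors(), which selects the field in adata.obsm used to compute neighbors.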