scanpy.external.tl.sam

Contents

scanpy.external.tl.sam#

scanpy.external.tl.sam(adata, *, max_iter=10, num_norm_avg=50, k=20, distance='correlation', standardization='StandardScaler', weight_pcs=False, sparse_pca=False, n_pcs=150, n_genes=3000, projection='umap', inplace=True, verbose=True)[source]#

Self-Assembling Manifolds single-cell RNA sequencing analysis tool [Tarashansky19].

SAM iteratively rescales the input gene expression matrix to emphasize genes that are spatially variable along the intrinsic manifold of the data. It outputs the gene weights, nearest neighbor matrix, and a 2D projection.

The AnnData input should contain unstandardized, non-negative values. Preferably, the data should be log-normalized and no genes should be filtered out.

Parameters:
k int (default: 20)

The number of nearest neighbors to identify for each cell.

distance str (default: 'correlation')

The distance metric to use when identifying nearest neighbors. Can be any of the distance metrics supported by pdist().

max_iter int (default: 10)

The maximum number of iterations SAM will run.

projection Literal['umap', 'tsne', 'None'] (default: 'umap')

If ‘tsne’, generates a t-SNE embedding. If ‘umap’, generates a UMAP embedding. If ‘None’, no embedding will be generated.

standardization Literal['Normalizer', 'StandardScaler', 'None'] (default: 'StandardScaler')

If ‘Normalizer’, use sklearn.preprocessing.Normalizer, which normalizes expression data prior to PCA such that each cell has unit L2 norm. If ‘StandardScaler’, use sklearn.preprocessing.StandardScaler, which normalizes expression data prior to PCA such that each gene has zero mean and unit variance. Otherwise, do not normalize the expression data. We recommend using ‘StandardScaler’ for large datasets with many expected cell types and ‘Normalizer’ otherwise. If ‘None’, no transformation is applied.

num_norm_avg int (default: 50)

The top ‘num_norm_avg’ dispersions are averaged to determine the normalization factor when calculating the weights. This prevents genes with large spatial dispersions from skewing the distribution of weights.

weight_pcs bool (default: False)

If True, scale the principal components by their eigenvalues. In datasets with many expected cell types, setting this to False might improve the resolution as these cell types might be encoded by lower- variance principal components.

sparse_pca bool (default: False)

If True, uses an implementation of PCA that accepts sparse inputs. This way, we no longer need a temporary dense copy of the sparse data. However, this implementation is slower and so is only worth using when memory constraints become noticeable.

n_pcs int | None (default: 150)

Determines the number of top principal components selected at each iteration of the SAM algorithm. If None, this number is chosen automatically based on the size of the dataset. If weight_pcs is set to True, this parameter primarily affects the runtime of the SAM algorithm (more PCs = longer runtime).

n_genes int | None (default: 3000)

Determines the number of top SAM-weighted genes to use at each iteration of the SAM algorithm. If None, this number is chosen automatically based on the size of the dataset. This parameter primarily affects the runtime of the SAM algorithm (more genes = longer runtime). For extremely homogeneous datasets, decreasing n_genes may improve clustering resolution.

inplace bool (default: True)

Set fields in adata if True. Otherwise, returns a copy.

verbose bool (default: True)

If True, displays SAM log statements.

Return type:

SAM | tuple[SAM, AnnData]

Returns:

sam_obj if inplace is True or (sam_obj,AnnData) otherwise

adata - AnnData
.var['weights']

SAM weights for each gene.

.var['spatial_dispersions']

Spatial dispersions for each gene (these are used to compute the SAM weights)

.uns['sam']

Dictionary of SAM-specific outputs, such as the parameters used for preprocessing (‘preprocess_args’) and running (‘run_args’) SAM.

.uns['neighbors']

A dictionary with key ‘connectivities’ containing the kNN adjacency matrix output by SAM. If built-in scanpy dimensionality reduction methods are to be used using the SAM-output AnnData, users should recompute the neighbors using .obs['X_pca'] with scanpy.pp.neighbors.

.obsm['X_pca']

The principal components output by SAM.

.obsm['X_umap']

The UMAP projection output by SAM.

.layers['X_disp']

The expression matrix used for nearest-neighbor averaging.

.layers['X_knn_avg']

The nearest-neighbor-averaged expression data used for computing the spatial dispersions of genes.

Example

>>> import scanpy.external as sce
>>> import scanpy as sc

* Running SAM *

Assuming we are given an AnnData object called adata, we can run the SAM algorithm as follows:

>>> sam_obj = sce.tl.sam(adata,inplace=True)

The input AnnData object should contain unstandardized, non-negative expression values. Preferably, the data should be log-normalized and no genes should be filtered out.

Please see the documentation for a description of all available parameters.

For more detailed tutorials, please visit the original Github repository: atarashansky/self-assembling-manifold

* Plotting *

To visualize the output, we can use:

>>> sce.pl.sam(adata,projection='X_umap')

sce.pl.sam accepts all keyword arguments used in the matplotlib.pyplot.scatter function.

* SAMGUI *

SAM comes with the SAMGUI module, a graphical-user interface written with Plotly and ipythonwidgets for interactively exploring and annotating the scRNAseq data and running SAM.

Dependencies can be installed with Anaconda by following the instructions in the self-assembling-manifold Github README: atarashansky/self-assembling-manifold

In a Jupyter notebook, execute the following to launch the interface:

>>> from samalg.gui import SAMGUI
>>> sam_gui = SAMGUI(sam_obj) # sam_obj is your SAM object
>>> sam_gui.SamPlot

This can also be enabled in Jupyer Lab by following the instructions in the self-assembling-manifold README.