scanpy.external.tl.sam#
- scanpy.external.tl.sam(adata, *, max_iter=10, num_norm_avg=50, k=20, distance='correlation', standardization='StandardScaler', weight_pcs=False, sparse_pca=False, n_pcs=150, n_genes=3000, projection='umap', inplace=True, verbose=True)[source]#
Self-Assembling Manifolds single-cell RNA sequencing analysis tool [Tarashansky et al., 2019].
SAM iteratively rescales the input gene expression matrix to emphasize genes that are spatially variable along the intrinsic manifold of the data. It outputs the gene weights, nearest neighbor matrix, and a 2D projection.
The AnnData input should contain unstandardized, non-negative values. Preferably, the data should be log-normalized and no genes should be filtered out.
- Parameters:
- k
int
(default:20
) The number of nearest neighbors to identify for each cell.
- distance
str
(default:'correlation'
) The distance metric to use when identifying nearest neighbors. Can be any of the distance metrics supported by
pdist()
.- max_iter
int
(default:10
) The maximum number of iterations SAM will run.
- projection
Literal
['umap'
,'tsne'
,'None'
] (default:'umap'
) If ‘tsne’, generates a t-SNE embedding. If ‘umap’, generates a UMAP embedding. If ‘None’, no embedding will be generated.
- standardization
Literal
['Normalizer'
,'StandardScaler'
,'None'
] (default:'StandardScaler'
) If ‘Normalizer’, use sklearn.preprocessing.Normalizer, which normalizes expression data prior to PCA such that each cell has unit L2 norm. If ‘StandardScaler’, use sklearn.preprocessing.StandardScaler, which normalizes expression data prior to PCA such that each gene has zero mean and unit variance. Otherwise, do not normalize the expression data. We recommend using ‘StandardScaler’ for large datasets with many expected cell types and ‘Normalizer’ otherwise. If ‘None’, no transformation is applied.
- num_norm_avg
int
(default:50
) The top ‘num_norm_avg’ dispersions are averaged to determine the normalization factor when calculating the weights. This prevents genes with large spatial dispersions from skewing the distribution of weights.
- weight_pcs
bool
(default:False
) If True, scale the principal components by their eigenvalues. In datasets with many expected cell types, setting this to False might improve the resolution as these cell types might be encoded by lower- variance principal components.
- sparse_pca
bool
(default:False
) If True, uses an implementation of PCA that accepts sparse inputs. This way, we no longer need a temporary dense copy of the sparse data. However, this implementation is slower and so is only worth using when memory constraints become noticeable.
- n_pcs
int
|None
(default:150
) Determines the number of top principal components selected at each iteration of the SAM algorithm. If None, this number is chosen automatically based on the size of the dataset. If weight_pcs is set to True, this parameter primarily affects the runtime of the SAM algorithm (more PCs = longer runtime).
- n_genes
int
|None
(default:3000
) Determines the number of top SAM-weighted genes to use at each iteration of the SAM algorithm. If None, this number is chosen automatically based on the size of the dataset. This parameter primarily affects the runtime of the SAM algorithm (more genes = longer runtime). For extremely homogeneous datasets, decreasing
n_genes
may improve clustering resolution.- inplace
bool
(default:True
) Set fields in
adata
if True. Otherwise, returns a copy.- verbose
bool
(default:True
) If True, displays SAM log statements.
- k
- Return type:
- Returns:
sam_obj if inplace is True or (sam_obj,AnnData) otherwise
- adata - AnnData
.var['weights']
SAM weights for each gene.
.var['spatial_dispersions']
Spatial dispersions for each gene (these are used to compute the SAM weights)
.uns['sam']
Dictionary of SAM-specific outputs, such as the parameters used for preprocessing (‘preprocess_args’) and running (‘run_args’) SAM.
.uns['neighbors']
A dictionary with key ‘connectivities’ containing the kNN adjacency matrix output by SAM. If built-in scanpy dimensionality reduction methods are to be used using the SAM-output AnnData, users should recompute the neighbors using
.obs['X_pca']
withscanpy.pp.neighbors
..obsm['X_pca']
The principal components output by SAM.
.obsm['X_umap']
The UMAP projection output by SAM.
.layers['X_disp']
The expression matrix used for nearest-neighbor averaging.
.layers['X_knn_avg']
The nearest-neighbor-averaged expression data used for computing the spatial dispersions of genes.
Example
>>> import scanpy.external as sce >>> import scanpy as sc
* Running SAM *
Assuming we are given an AnnData object called
adata
, we can run the SAM algorithm as follows:>>> sam_obj = sce.tl.sam(adata,inplace=True)
The input AnnData object should contain unstandardized, non-negative expression values. Preferably, the data should be log-normalized and no genes should be filtered out.
Please see the documentation for a description of all available parameters.
For more detailed tutorials, please visit the original Github repository: atarashansky/self-assembling-manifold
* Plotting *
To visualize the output, we can use:
>>> sce.pl.sam(adata,projection='X_umap')
sce.pl.sam
accepts all keyword arguments used in thematplotlib.pyplot.scatter
function.* SAMGUI *
SAM comes with the SAMGUI module, a graphical-user interface written with
Plotly
andipythonwidgets
for interactively exploring and annotating the scRNAseq data and running SAM.Dependencies can be installed with Anaconda by following the instructions in the self-assembling-manifold Github README: atarashansky/self-assembling-manifold
In a Jupyter notebook, execute the following to launch the interface:
>>> from samalg.gui import SAMGUI >>> sam_gui = SAMGUI(sam_obj) # sam_obj is your SAM object >>> sam_gui.SamPlot
This can also be enabled in Jupyer Lab by following the instructions in the self-assembling-manifold README.