scanpy.external.tl.phenograph

scanpy.external.tl.phenograph(data, clustering_algo='louvain', *, k=30, directed=False, prune=False, min_cluster_size=10, jaccard=True, primary_metric='euclidean', n_jobs=-1, q_tol=0.001, louvain_time_limit=2000, nn_method='kdtree', partition_type=None, resolution_parameter=1, n_iterations=-1, use_weights=True, seed=None, copy=False, **kargs)

PhenoGraph clustering [Levine et al., 2015].

PhenoGraph is a clustering method designed for high-dimensional single-cell data. It works by creating a graph (“network”) representing phenotypic similarities between cells and then identifying communities in this graph. It supports both Louvain and Leiden algorithms for community detection.
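
As a rough, illustrative sketch of the idea (not the package's implementation), the graph step amounts to finding each cell's k nearest neighbors and weighting edges by the overlap (Jaccard index) of the two neighborhoods, after which Louvain or Leiden community detection is run on the weighted graph:

>>> import numpy as np
>>> from sklearn.neighbors import NearestNeighbors
>>> X = np.random.rand(500, 20)                 # 500 cells in 20 dimensions
>>> k = 15
>>> _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
>>> neighborhoods = [set(row) for row in idx]   # k-neighborhood of each cell
>>> # Jaccard weight of the edge between cells 0 and 1
>>> w_01 = len(neighborhoods[0] & neighborhoods[1]) / len(neighborhoods[0] | neighborhoods[1])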

Note

More information and bug reports can be found in the PhenoGraph GitHub repository.

Parameters:
data AnnData | ndarray | spmatrix

AnnData object, array of data to cluster, or sparse matrix of a k-nearest-neighbor graph. If an ndarray, an n-by-d array of n cells in d dimensions; if a sparse matrix, an n-by-n adjacency matrix.

clustering_algo Optional[Literal['louvain', 'leiden']] (default: 'louvain')

Choose between the 'louvain' and 'leiden' algorithms for clustering.

k int (default: 30)

Number of nearest neighbors to use in first step of graph construction.

directed bool (default: False)

Whether to use a symmetric (default) or asymmetric ('directed') graph. The graph construction process produces a directed graph, which is symmetrized by one of two methods (see prune below).

prune bool (default: False)

If prune=False, symmetrize by taking the average of the graph and its transpose; if prune=True, symmetrize by taking the product of the graph and its transpose.
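
An illustrative sketch of the two symmetrization options using scipy.sparse (not the library's internal code):

>>> import scipy.sparse as sp
>>> W = sp.random(100, 100, density=0.05, format="csr")  # stand-in for a directed kNN graph
>>> W_avg = (W + W.T) / 2       # prune=False: average of the graph and its transpose
>>> W_prod = W.multiply(W.T)    # prune=True: elementwise product keeps only mutual edges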

min_cluster_size int (default: 10)

Cells that end up in a cluster smaller than min_cluster_size are considered outliers and are assigned the label -1 in the cluster labels.
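
For example, the outliers can be counted or dropped after clustering (a sketch; assumes the labels were written to adata.obs['pheno_louvain'] as in the Example below):

>>> outlier_mask = adata.obs["pheno_louvain"].astype(str) == "-1"
>>> n_outliers = outlier_mask.sum()
>>> adata_clean = adata[~outlier_mask].copy()   # optionally drop the outlier cells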

jaccard bool (default: True)

If True, use Jaccard metric between k-neighborhoods to build graph. If False, use a Gaussian kernel.

primary_metric Literal['euclidean', 'manhattan', 'correlation', 'cosine'] (default: 'euclidean')

Distance metric to define nearest neighbors. Note that performance will be slower for correlation and cosine.

n_jobs int (default: -1)

Nearest-neighbor search and Jaccard coefficients will be computed in parallel using n_jobs workers. If 1 is given, no parallelism is used. If set to -1, all CPUs are used. For n_jobs below -1, n_cpus + 1 + n_jobs workers are used.
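
A quick worked example of the rule for values below -1 (illustrative only):

>>> import os
>>> n_cpus = os.cpu_count()                   # say this is 8
>>> n_jobs = -2
>>> effective_workers = n_cpus + 1 + n_jobs   # 8 + 1 - 2 = 7 workers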

q_tol float (default: 0.001)

Tolerance, i.e. precision, for monitoring modularity optimization.

louvain_time_limit int (default: 2000)

Maximum number of seconds to run modularity optimization. If exceeded the best result so far is returned.

nn_method Literal['kdtree', 'brute'] (default: 'kdtree')

Whether to use brute force or kdtree for nearest neighbor search. For very large high-dimensional data sets, brute force, with parallel computation, performs faster than kdtree.

partition_type type[MutableVertexPartition] | None (default: None)

Defaults to RBConfigurationVertexPartition. For the available options, consult the documentation for find_partition().

resolution_parameter float (default: 1)

A parameter value controlling the coarseness of the clustering in Leiden. Higher values lead to more clusters. Set to None if overriding partition_type to one that does not accept a resolution_parameter.
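
For instance, one could sweep a few resolutions and compare the number of clusters obtained (a sketch; assumes the labels land in adata.obs['pheno_leiden'] as in the Example below):

>>> n_clusters = {}
>>> for res in (0.5, 1.0, 2.0):
...     sce.tl.phenograph(adata, clustering_algo="leiden", k=30, resolution_parameter=res)
...     n_clusters[res] = adata.obs["pheno_leiden"].nunique()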

n_iterations int (default: -1)

Number of iterations to run the Leiden algorithm. If the number of iterations is negative, the Leiden algorithm is run until an iteration in which there was no improvement.

use_weights bool (default: True)

Whether to use edge weights in the Leiden computation.

seed int | None (default: None)

Seed for the random initialization of the Leiden optimization.

copy bool (default: False)

Return a copy or write to adata.

kargs Any

Additional arguments passed to find_partition() and the constructor of the partition_type.

Return type:

tuple[ndarray | None, spmatrix, float | None] | None

Returns:

Depending on copy, returns or updates adata with the following fields:

communities - ndarray (obs, dtype int)

integer array of community assignments for each row in data.

graph - spmatrix (obsp, dtype float)

the graph that was used for clustering.

Q - float (uns, dtype float)

the modularity score for communities on graph.
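
A brief sketch of the two access patterns (assuming an AnnData object adata prepared as in the Example below; the obs key follows the pheno_<clustering_algo> pattern used there, and copy=True is taken to return the raw tuple per the return type above):

>>> # In-place usage: results are written into the AnnData object
>>> sce.tl.phenograph(adata, clustering_algo="louvain", k=30)
>>> labels = adata.obs["pheno_louvain"]        # one community label per cell
>>> # copy=True: get the raw results back as a tuple instead
>>> communities, graph, Q = sce.tl.phenograph(
...     adata, clustering_algo="louvain", k=30, copy=True
... )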

Example

>>> from anndata import AnnData
>>> import scanpy as sc
>>> import scanpy.external as sce
>>> import numpy as np
>>> import pandas as pd

With annotated data as input:

>>> adata = sc.datasets.pbmc3k()
>>> sc.pp.normalize_per_cell(adata)

Then do PCA:

>>> sc.pp.pca(adata, n_comps=100)

Compute phenograph clusters:

Louvain community detection

>>> sce.tl.phenograph(adata, clustering_algo="louvain", k=30)

Leiden community detection

>>> sce.tl.phenograph(adata, clustering_algo="leiden", k=30)

Return only Graph object

>>> sce.tl.phenograph(adata, clustering_algo=None, k=30)

Now show the PhenoGraph clusters on tSNE (for example):

Compute tSNE:

>>> sc.tl.tsne(adata, random_state=7)

Plot phenograph clusters on tSNE:

>>> sc.pl.tsne(
...     adata, color=["pheno_louvain", "pheno_leiden"], s=100,
...     palette=sc.pl.palettes.vega_20_scanpy, legend_fontsize=10
... )

Clusters and cluster centroids for an input NumPy ndarray (see the centroid sketch after the plotting call below)

>>> arr = np.random.rand(1000, 40)
>>> dframe = pd.DataFrame(arr).rename(index=str, columns=str)
>>> adata = AnnData(dframe)
>>> sc.pp.pca(adata, n_comps=20)
>>> sce.tl.phenograph(adata, clustering_algo="leiden", k=50)
>>> sc.tl.tsne(adata, random_state=1)
>>> sc.pl.tsne(
...     adata, color=['pheno_leiden'], s=100,
...     palette=sc.pl.palettes.vega_20_scanpy, legend_fontsize=10
... )
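
To also get the cluster centroids mentioned above, one can average the PCA coordinates per Leiden label (a sketch; assumes the labels are in adata.obs['pheno_leiden'] and the PCA coordinates in adata.obsm['X_pca']):

>>> centroids = (
...     pd.DataFrame(adata.obsm["X_pca"], index=adata.obs_names)
...     .groupby(adata.obs["pheno_leiden"].values)
...     .mean()
... )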