scanpy.external.tl.phenograph

scanpy.external.tl.phenograph(adata, clustering_algo='louvain', k=30, directed=False, prune=False, min_cluster_size=10, jaccard=True, primary_metric='euclidean', n_jobs=-1, q_tol=0.001, louvain_time_limit=2000, nn_method='kdtree', partition_type=None, resolution_parameter=1, n_iterations=-1, use_weights=True, seed=None, copy=False, **kargs)

PhenoGraph clustering [Levine15].

PhenoGraph is a clustering method designed for high-dimensional single-cell data. It works by creating a graph (“network”) representing phenotypic similarities between cells and then identifying communities in this graph. It supports both Louvain and Leiden algorithms for community detection.

Note

More information is available, and bug reports can be filed, at the PhenoGraph project repository.

Parameters:
adata : Union[AnnData, ndarray, spmatrix]

AnnData, array of data to cluster, or sparse matrix of a k-nearest-neighbor graph. If an ndarray, an n-by-d array of n cells in d dimensions. If a sparse matrix, an n-by-n adjacency matrix.

clustering_algo : Optional[Literal['louvain', 'leiden']] (default: 'louvain')

Choose between the 'louvain' and 'leiden' algorithms for community detection. If None, no clustering is run and only the graph is returned.

k : int (default: 30)

Number of nearest neighbors to use in first step of graph construction.

directed : bool (default: False)

Whether to use a symmetric (default) or asymmetric ('directed') graph. The graph construction process produces a directed graph, which is symmetrized by one of two methods (see prune below).

prune : bool (default: False)

If False, symmetrize by taking the average of the graph and its transpose. If True, symmetrize by taking the product of the graph and its transpose.
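The two symmetrization rules can be illustrated on a toy directed graph. This is a minimal sketch, not PhenoGraph's implementation: the `symmetrize` helper and the edge-dictionary representation are hypothetical, and the product is assumed to be elementwise (weight times reverse weight).

```python
def symmetrize(edges, prune=False):
    """Symmetrize a directed weighted graph given as {(i, j): weight}.

    prune=False: average each weight with the reverse-direction weight.
    prune=True: multiply them (assumed elementwise), so an edge that
    exists in only one direction gets weight 0 and is dropped.
    """
    # Consider every edge in both orientations.
    pairs = set(edges) | {(j, i) for (i, j) in edges}
    out = {}
    for i, j in pairs:
        w_ij = edges.get((i, j), 0.0)
        w_ji = edges.get((j, i), 0.0)
        w = w_ij * w_ji if prune else (w_ij + w_ji) / 2
        if w != 0.0:
            out[(i, j)] = w
    return out


# Edge 0<->1 is reciprocated; edge 1->2 is one-way.
directed_edges = {(0, 1): 1.0, (1, 0): 0.5, (1, 2): 0.5}
averaged = symmetrize(directed_edges, prune=False)
pruned = symmetrize(directed_edges, prune=True)
```

In this sketch averaging keeps the one-way edge at half weight, while the product removes it, which is why the option is called pruning.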

min_cluster_size : int (default: 10)

Cells that end up in a cluster smaller than min_cluster_size are considered outliers and are assigned to -1 in the cluster labels.
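The outlier rule can be sketched in a few lines. The `mask_small_clusters` helper below is hypothetical and only shows the thresholding; any relabeling of the remaining clusters that PhenoGraph may perform is not reproduced.

```python
from collections import Counter


def mask_small_clusters(labels, min_cluster_size=10):
    """Replace labels of clusters smaller than min_cluster_size with -1."""
    sizes = Counter(labels)
    return [lab if sizes[lab] >= min_cluster_size else -1 for lab in labels]


# Clusters of size 12, 10 and 3; only the last one is below the threshold.
labels = [0] * 12 + [1] * 10 + [2] * 3
masked = mask_small_clusters(labels, min_cluster_size=10)
```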

jaccard : bool (default: True)

If True, use Jaccard metric between k-neighborhoods to build graph. If False, use a Gaussian kernel.
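As an illustration of the Jaccard option, the weight between two cells can be taken as the size of the intersection of their k-neighborhoods divided by the size of the union. The sketch below uses a hypothetical `jaccard` helper and a hand-written neighbor table; it is not the library's code.

```python
def jaccard(neighbors_a, neighbors_b):
    """Jaccard similarity between two neighbor lists."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy k=3 neighbor lists for three cells (illustrative values only).
knn = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 4]}
w01 = jaccard(knn[0], knn[1])  # shared {2, 3}, union {0, 1, 2, 3}
w02 = jaccard(knn[0], knn[2])  # shared {1}, union {0, 1, 2, 3, 4}
```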

primary_metric : Literal['euclidean', 'manhattan', 'correlation', 'cosine'] (default: 'euclidean')

Distance metric to define nearest neighbors. Note that performance will be slower for correlation and cosine.

n_jobs : int (default: -1)

Nearest neighbors and Jaccard coefficients are computed in parallel using n_jobs workers. If 1, no parallelism is used. If -1, all CPUs are used. For n_jobs below -1, n_cpus + 1 + n_jobs CPUs are used.
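The n_jobs convention described above can be written out explicitly. The `effective_n_jobs` function here is a hypothetical helper for illustration, and rejecting 0 and clamping to at least one worker are assumptions not stated in this page.

```python
import os


def effective_n_jobs(n_jobs, n_cpus=None):
    """Translate an n_jobs setting into a worker count."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs must not be 0")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on.
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs


workers_all = effective_n_jobs(-1, n_cpus=8)
workers_all_but_one = effective_n_jobs(-2, n_cpus=8)
workers_serial = effective_n_jobs(1, n_cpus=8)
```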

q_tol : float (default: 0.001)

Tolerance, i.e. precision, for monitoring modularity optimization.

louvain_time_limit : int (default: 2000)

Maximum number of seconds to run modularity optimization. If exceeded, the best result so far is returned.

nn_method : Literal['kdtree', 'brute'] (default: 'kdtree')

Whether to use brute force or kdtree for nearest neighbor search. For very large high-dimensional data sets, brute force, with parallel computation, performs faster than kdtree.
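For intuition, brute-force search simply compares every point with every other point. The sketch below is a pure-Python illustration with a hypothetical `brute_force_knn` helper and Euclidean distance; it excludes each point from its own neighbor list, which is an assumption, and it makes no claim about how the library performs the search.

```python
import math


def brute_force_knn(points, k):
    """Return, for each point, the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        # Distance to every other point, sorted ascending (ties by index).
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
knn = brute_force_knn(points, k=2)
```

The cost grows with the square of the number of points, but each row is independent, which is why this approach parallelizes well.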

partition_type : Optional[Type[MutableVertexPartition]] (default: None)

Defaults to RBConfigurationVertexPartition. For the available options, consult the documentation for find_partition().

resolution_parameter : float (default: 1)

A parameter value controlling the coarseness of the clustering in Leiden. Higher values lead to more clusters. Set to None if overriding partition_type to one that does not accept a resolution_parameter.

n_iterations : int (default: -1)

Number of iterations to run the Leiden algorithm. If the number of iterations is negative, the Leiden algorithm is run until an iteration in which there was no improvement.

use_weights : bool (default: True)

Use edge weights in the Leiden computation.

seed : Optional[int] (default: None)

Random seed used to initialize the Leiden optimization.

copy : bool (default: False)

If True, return the results instead of writing them to adata.

kargs : Any

Additional arguments passed to find_partition() and the constructor of the partition_type.

Return type:

Tuple[Optional[ndarray], spmatrix, Optional[float]]

Returns:

Depending on copy, returns or updates adata with the following fields:

communities - ndarray (obs, dtype int)

integer array of community assignments for each row in data.

graph - spmatrix (obsp, dtype float)

the graph that was used for clustering.

Q - float (uns, dtype float)

the modularity score for communities on graph.
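To make the Q value concrete, the following sketch computes the standard (Newman) modularity for an unweighted, undirected toy graph. The `modularity` function is a hypothetical helper; the graph PhenoGraph clusters is weighted, so this only illustrates the quantity and does not reproduce the library's computation.

```python
def modularity(edges, communities):
    """Modularity of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    communities: dict mapping node -> community label.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for c in set(communities.values()):
        # Edges with both endpoints inside community c.
        internal = sum(
            1 for u, v in edges if communities[u] == c and communities[v] == c
        )
        # Sum of degrees of the nodes in community c.
        total_degree = sum(d for n, d in degree.items() if communities[n] == c)
        q += internal / m - (total_degree / (2 * m)) ** 2
    return q


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, communities)
```

Splitting the graph at the bridge gives a clearly positive score, whereas placing all nodes in one community gives 0.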

Example

>>> from anndata import AnnData
>>> import scanpy as sc
>>> import scanpy.external as sce
>>> import numpy as np
>>> import pandas as pd

With annotated data as input:

>>> adata = sc.datasets.pbmc3k()
>>> sc.pp.normalize_per_cell(adata)

Then do PCA:

>>> sc.tl.pca(adata, n_comps=100)

Compute phenograph clusters:

Louvain community detection

>>> sce.tl.phenograph(adata, clustering_algo="louvain", k=30)

Leiden community detection

>>> sce.tl.phenograph(adata, clustering_algo="leiden", k=30)

Return only the graph object

>>> sce.tl.phenograph(adata, clustering_algo=None, k=30)

Now to show phenograph on tSNE (for example):

Compute tSNE:

>>> sc.tl.tsne(adata, random_state=7)

Plot phenograph clusters on tSNE:

>>> sc.pl.tsne(
...     adata, color=["pheno_louvain", "pheno_leiden"], s=100,
...     palette=sc.pl.palettes.vega_20_scanpy, legend_fontsize=10
... )

Cluster data that starts as a NumPy ndarray (wrapped in an AnnData object):

>>> df = np.random.rand(1000, 40)
>>> dframe = pd.DataFrame(df)
>>> dframe.index, dframe.columns = (map(str, dframe.index), map(str, dframe.columns))
>>> adata = AnnData(dframe)
>>> sc.tl.pca(adata, n_comps=20)
>>> sce.tl.phenograph(adata, clustering_algo="leiden", k=50)
>>> sc.tl.tsne(adata, random_state=1)
>>> sc.pl.tsne(
...     adata, color=['pheno_leiden'], s=100,
...     palette=sc.pl.palettes.vega_20_scanpy, legend_fontsize=10
... )