scanpy.external.tl.phenograph
- scanpy.external.tl.phenograph(data, clustering_algo='louvain', *, k=30, directed=False, prune=False, min_cluster_size=10, jaccard=True, primary_metric='euclidean', n_jobs=-1, q_tol=0.001, louvain_time_limit=2000, nn_method='kdtree', partition_type=None, resolution_parameter=1, n_iterations=-1, use_weights=True, seed=None, copy=False, **kargs)
PhenoGraph clustering [Levine et al., 2015].
PhenoGraph is a clustering method designed for high-dimensional single-cell data. It works by creating a graph (“network”) representing phenotypic similarities between cells and then identifying communities in this graph. It supports both Louvain and Leiden algorithms for community detection.
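The graph-construction idea described above can be illustrated with a small, dependency-free sketch. It is not PhenoGraph's implementation: the helper names `knn_indices` and `jaccard_graph` are hypothetical, the neighbor search is plain brute force, and neighborhoods are assumed to exclude the cell itself. It shows only how k-nearest neighborhoods can be turned into Jaccard-weighted edges before community detection.

```python
import math


def knn_indices(points, k):
    """Brute-force k nearest neighbors of each point (the point itself excluded)."""
    neighbors = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        order = sorted(others, key=lambda j: math.dist(p, points[j]))
        neighbors.append(order[:k])
    return neighbors


def jaccard_graph(neighbors):
    """Map each directed edge (i, j) to the Jaccard similarity of the two neighborhoods."""
    sets = [set(n) for n in neighbors]
    edges = {}
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            shared = len(sets[i] & sets[j])
            total = len(sets[i] | sets[j])
            edges[(i, j)] = shared / total
    return edges


# Two well-separated groups of three "cells" in two dimensions.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
neighbors = knn_indices(points, k=2)
edges = jaccard_graph(neighbors)
```

In this toy input every edge stays inside one of the two groups, which is the structure a community detection algorithm such as Louvain or Leiden would then pick up.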
Note
More information and bug reports here.
- Parameters:
- data
AnnData | ndarray | spmatrix
AnnData, array of data to cluster, or sparse matrix of a k-nearest-neighbor graph. If ndarray, an n-by-d array of n cells in d dimensions. If sparse matrix, an n-by-n adjacency matrix.
- clustering_algo
Optional[Literal['louvain', 'leiden']] (default: 'louvain')
Choose between the 'louvain' and 'leiden' algorithms for clustering.
- k
int (default: 30)
Number of nearest neighbors to use in the first step of graph construction.
- directed
bool (default: False)
Whether to use a symmetric (default) or asymmetric ('directed') graph. The graph construction process produces a directed graph, which is symmetrized by one of two methods (see prune below).
- prune
bool (default: False)
If prune=False, symmetrize by taking the average of the graph and its transpose. If prune=True, symmetrize by taking the product of the graph and its transpose.
- min_cluster_size
int (default: 10)
Cells that end up in a cluster smaller than min_cluster_size are considered outliers and are assigned to -1 in the cluster labels.
- jaccard
bool (default: True)
If True, use the Jaccard metric between k-neighborhoods to build the graph. If False, use a Gaussian kernel.
- primary_metric
Literal['euclidean', 'manhattan', 'correlation', 'cosine'] (default: 'euclidean')
Distance metric used to define nearest neighbors. Note that performance will be slower for correlation and cosine.
- n_jobs
int (default: -1)
Nearest neighbors and Jaccard coefficients will be computed in parallel using n_jobs. If 1 is given, no parallelism is used. If set to -1, all CPUs are used. For n_jobs below -1, n_cpus + 1 + n_jobs CPUs are used.
- q_tol
float (default: 0.001)
Tolerance, i.e. precision, for monitoring modularity optimization.
- louvain_time_limit
int (default: 2000)
Maximum number of seconds to run modularity optimization. If exceeded, the best result so far is returned.
- nn_method
Literal['kdtree', 'brute'] (default: 'kdtree')
Whether to use brute force or kdtree for nearest neighbor search. For very large high-dimensional data sets, brute force with parallel computation performs faster than kdtree.
- partition_type
type[MutableVertexPartition] | None (default: None)
Defaults to RBConfigurationVertexPartition. For the available options, consult the documentation for find_partition().
- resolution_parameter
float (default: 1)
A parameter value controlling the coarseness of the clustering in Leiden. Higher values lead to more clusters. Set to None if overriding partition_type to one that does not accept a resolution_parameter.
- n_iterations
int (default: -1)
Number of iterations to run the Leiden algorithm. If the number of iterations is negative, the Leiden algorithm is run until an iteration in which there was no improvement.
- use_weights
bool (default: True)
Whether to use the graph's edge weights in the Leiden computation.
- seed
int | None (default: None)
Random seed used to initialize the Leiden optimization.
- copy
bool (default: False)
Return a copy or write to adata.
- kargs
Any
Additional arguments passed to find_partition() and the constructor of the partition_type.
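Several of the parameters above are easier to read with a concrete sketch. The helpers below (`symmetrize`, `effective_n_jobs`, `mark_outliers`) are hypothetical names written for illustration only, not functions from scanpy or PhenoGraph. They assume a directed graph stored as a `{(i, j): weight}` dictionary and assume that the "product" used when `prune=True` is element-wise, so an edge survives only if it exists in both directions.

```python
import os
from collections import Counter


def symmetrize(edges, prune=False):
    """Symmetrize a directed weighted graph given as {(i, j): weight}.

    prune=False: average of the graph and its transpose.
    prune=True: element-wise product (assumption), which drops one-way edges.
    """
    out = {}
    for (i, j), w in edges.items():
        w_rev = edges.get((j, i), 0.0)
        value = w * w_rev if prune else 0.5 * (w + w_rev)
        if value > 0:
            out[(i, j)] = value
            out[(j, i)] = value
    return out


def effective_n_jobs(n_jobs, n_cpus=None):
    """Resolve negative n_jobs with the documented rule n_cpus + 1 + n_jobs."""
    n_cpus = n_cpus or os.cpu_count() or 1
    if n_jobs < 0:
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs


def mark_outliers(labels, min_cluster_size=10):
    """Relabel members of clusters smaller than min_cluster_size as -1."""
    sizes = Counter(labels)
    return [lab if sizes[lab] >= min_cluster_size else -1 for lab in labels]


directed_edges = {(0, 1): 0.4, (1, 0): 0.2, (0, 2): 0.6}
averaged = symmetrize(directed_edges, prune=False)
pruned = symmetrize(directed_edges, prune=True)
relabeled = mark_outliers([0, 0, 0, 1], min_cluster_size=2)
```

With these inputs the one-way edge between nodes 0 and 2 is kept at half weight by averaging and removed by pruning, and the single-member cluster is relabeled as -1.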
- Return type:
- Returns:
Depending on copy, returns or updates adata with the following fields:
Example
>>> from anndata import AnnData
>>> import scanpy as sc
>>> import scanpy.external as sce
>>> import numpy as np
>>> import pandas as pd
With annotated data as input:
>>> adata = sc.datasets.pbmc3k()
>>> sc.pp.normalize_per_cell(adata)
Then do PCA:
>>> sc.pp.pca(adata, n_comps=100)
Compute phenograph clusters:
Louvain community detection
>>> sce.tl.phenograph(adata, clustering_algo="louvain", k=30)
Leiden community detection
>>> sce.tl.phenograph(adata, clustering_algo="leiden", k=30)
Return only the Graph object:
>>> sce.tl.phenograph(adata, clustering_algo=None, k=30)
Now to show phenograph on tSNE (for example):
Compute tSNE:
>>> sc.tl.tsne(adata, random_state=7)
Plot phenograph clusters on tSNE:
>>> sc.pl.tsne(
...     adata, color=["pheno_louvain", "pheno_leiden"], s=100,
...     palette=sc.pl.palettes.vega_20_scanpy, legend_fontsize=10
... )
Cluster and cluster centroids for input NumPy ndarray:
>>> df = np.random.rand(1000, 40)
>>> dframe = pd.DataFrame(df)
>>> dframe.index, dframe.columns = (map(str, dframe.index), map(str, dframe.columns))
>>> adata = AnnData(dframe)
>>> sc.pp.pca(adata, n_comps=20)
>>> sce.tl.phenograph(adata, clustering_algo="leiden", k=50)
>>> sc.tl.tsne(adata, random_state=1)
>>> sc.pl.tsne(
...     adata, color=['pheno_leiden'], s=100,
...     palette=sc.pl.palettes.vega_20_scanpy, legend_fontsize=10
... )
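The last example mentions cluster centroids but does not compute them. One possible way to derive them from cluster labels is sketched below in plain Python. The helper `cluster_centroids` is hypothetical and not part of scanpy or PhenoGraph. It assumes the data are available as rows of numbers alongside one label per row, and it skips outliers labeled -1.

```python
from collections import defaultdict


def cluster_centroids(rows, labels):
    """Return the mean vector per cluster label, skipping outliers labeled -1."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        if lab != -1:
            groups[lab].append(row)
    return {
        lab: [sum(col) / len(members) for col in zip(*members)]
        for lab, members in groups.items()
    }


rows = [[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [50.0, 50.0]]
labels = [0, 0, 1, -1]
centroids = cluster_centroids(rows, labels)
```

With an AnnData object, the rows and labels could plausibly come from the data matrix and the cluster column used in the plots above, for example `pheno_leiden`.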