scanpy.external.tl.phenograph

scanpy.external.tl.phenograph(data, clustering_algo='louvain', *, k=30, directed=False, prune=False, min_cluster_size=10, jaccard=True, primary_metric='euclidean', n_jobs=-1, q_tol=0.001, louvain_time_limit=2000, nn_method='kdtree', partition_type=None, resolution_parameter=1, n_iterations=-1, use_weights=True, seed=None, copy=False, **kargs)

PhenoGraph clustering.

PhenoGraph is a clustering method designed for high-dimensional single-cell data. It works by creating a graph (“network”) representing phenotypic similarities between cells and then identifying communities in this graph. It supports both Louvain and Leiden algorithms for community detection.

Parameters:
data

AnnData, array of data to cluster, or sparse matrix of the k-nearest-neighbor graph. If an ndarray, an n-by-d array of n cells in d dimensions. If a sparse matrix, an n-by-n adjacency matrix.

clustering_algo `Optional`[`Literal`[`'louvain'`, `'leiden'`]] (default: `'louvain'`)

Choose between the `'louvain'` and `'leiden'` algorithms for clustering. Passing `None` returns only the graph, as in the example further down.

k `int` (default: `30`)

Number of nearest neighbors to use in first step of graph construction.

directed `bool` (default: `False`)

Whether to use a symmetric (default) or asymmetric (`'directed'`) graph. The graph construction process produces a directed graph, which is symmetrized by one of two methods (see `prune` below).

prune `bool` (default: `False`)

If `prune=False`, symmetrize by taking the average of the graph and its transpose. If `prune=True`, symmetrize by taking the product of the graph and its transpose.
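The two symmetrization rules can be illustrated on a toy matrix. This is a minimal sketch, not PhenoGraph's own code: it assumes the "product" is element-wise, and the helper name `symmetrize` and the matrix `A` are made up for the illustration.

```python
import numpy as np

# Toy directed graph: A[i, j] > 0 means "j is a neighbor of i".
A = np.array([
    [0.0, 0.8, 0.0],
    [0.4, 0.0, 0.6],
    [0.0, 0.0, 0.0],
])


def symmetrize(graph, prune=False):
    """Symmetrize a directed weight matrix (hypothetical helper)."""
    if prune:
        # Assumed element-wise product: an edge survives only if it
        # exists in both directions.
        return graph * graph.T
    # Average: an edge present in one direction only keeps half its weight.
    return (graph + graph.T) / 2


avg = symmetrize(A, prune=False)   # keeps the one-way edge 1 -> 2
prod = symmetrize(A, prune=True)   # drops the one-way edge 1 -> 2
```

Under these assumptions, `prune=True` yields a sparser graph because one-directional edges are removed.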

min_cluster_size `int` (default: `10`)

Cells that end up in a cluster smaller than `min_cluster_size` are considered outliers and are assigned the label -1.
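The outlier rule can be sketched as a post-processing step on a label array. This is an illustration under stated assumptions, not the library's implementation; `mark_small_clusters` is a hypothetical helper name.

```python
import numpy as np


def mark_small_clusters(labels, min_cluster_size=10):
    """Relabel members of clusters smaller than min_cluster_size as -1."""
    labels = np.asarray(labels)
    ids, counts = np.unique(labels, return_counts=True)
    # Cluster ids whose size falls strictly below the threshold.
    small = ids[counts < min_cluster_size]
    out = labels.copy()
    out[np.isin(labels, small)] = -1
    return out


labels = np.array([0, 0, 0, 1, 1, 2])
cleaned = mark_small_clusters(labels, min_cluster_size=2)
# cleaned is [0, 0, 0, 1, 1, -1]: only the singleton cluster 2 is an outlier.
```

When consuming the real output, cells labeled -1 can be filtered out before computing per-cluster statistics.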

jaccard `bool` (default: `True`)

If `True`, use the Jaccard metric between k-neighborhoods to build the graph. If `False`, use a Gaussian kernel.
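As a rough illustration of the Jaccard option, the weight between two cells can be taken as the overlap of their k-neighborhoods divided by the size of their union. The sketch below uses plain Python sets; `jaccard_weight` and the toy neighbor lists in `knn` are assumptions made for the example, not part of the scanpy or PhenoGraph API.

```python
def jaccard_weight(neighbors_i, neighbors_j):
    """Jaccard similarity |N(i) & N(j)| / |N(i) | N(j)| of two neighbor lists."""
    a, b = set(neighbors_i), set(neighbors_j)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy k-nearest-neighbor lists (k=3), keyed by cell index.
knn = {
    0: [1, 2, 3],
    1: [0, 2, 3],
    2: [0, 1, 4],
}

w01 = jaccard_weight(knn[0], knn[1])  # shared {2, 3}, union of 4 cells -> 0.5
w02 = jaccard_weight(knn[0], knn[2])  # shared {1}, union of 5 cells -> 0.2
```

Cells that share many neighbors get a weight close to 1, while cells with disjoint neighborhoods get 0.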

primary_metric `Literal`[`'euclidean'`, `'manhattan'`, `'correlation'`, `'cosine'`] (default: `'euclidean'`)

Distance metric to define nearest neighbors. Note that performance will be slower for correlation and cosine.

n_jobs `int` (default: `-1`)

Number of parallel jobs used to compute nearest neighbors and Jaccard coefficients. If 1, no parallelism is used. If -1, all CPUs are used. For `n_jobs` below -1, `n_cpus + 1 + n_jobs` CPUs are used.
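The arithmetic for negative `n_jobs` values can be made concrete with a short sketch. The helper `effective_n_jobs` is hypothetical and only mirrors the rule stated above; clamping the result to at least 1 is an added assumption.

```python
import os


def effective_n_jobs(n_jobs, n_cpus=None):
    """Resolve n_jobs to a worker count using the n_cpus + 1 + n_jobs rule."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs must not be 0")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on.
        # Clamping to 1 is an assumption of this sketch.
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs


all_cpus = effective_n_jobs(-1, n_cpus=8)      # 8 + 1 - 1 = 8
all_but_one = effective_n_jobs(-2, n_cpus=8)   # 8 + 1 - 2 = 7
serial = effective_n_jobs(1, n_cpus=8)         # 1, no parallelism
```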

q_tol `float` (default: `0.001`)

Tolerance, i.e. precision, for monitoring modularity optimization.

louvain_time_limit `int` (default: `2000`)

Maximum number of seconds to run modularity optimization. If exceeded, the best result so far is returned.

nn_method `Literal`[`'kdtree'`, `'brute'`] (default: `'kdtree'`)

Whether to use brute force or kdtree for nearest neighbor search. For very large high-dimensional data sets, brute force with parallel computation is faster than kdtree.

partition_type (default: `None`)

Defaults to `RBConfigurationVertexPartition`. For the available options, consult the documentation for `find_partition()`.

resolution_parameter `float` (default: `1`)

A parameter value controlling the coarseness of the clustering in Leiden. Higher values lead to more clusters. Set to `None` if overriding `partition_type` to one that does not accept a `resolution_parameter`.

n_iterations `int` (default: `-1`)

Number of iterations to run the Leiden algorithm. If the number of iterations is negative, the Leiden algorithm is run until an iteration in which there was no improvement.

use_weights `bool` (default: `True`)

Use edge weights in the Leiden computation.

seed (default: `None`)

Seed for the random number generator that initializes the Leiden optimization.

copy `bool` (default: `False`)

Return a copy or write to `adata`.

kargs `Any`

Additional arguments passed to `find_partition()` and the constructor of the `partition_type`.

Return type:

Returns:

Depending on `copy`, returns or updates `adata` with the following fields:

communities - `ndarray` (`obs`, dtype `int`)

integer array of community assignments for each row in data.

graph - `spmatrix` (`obsp`, dtype `float`)

the graph that was used for clustering.

Q - `float` (`uns`, dtype `float`)

the modularity score for communities on graph.

Example

```
>>> from anndata import AnnData
>>> import scanpy as sc
>>> import scanpy.external as sce
>>> import numpy as np
>>> import pandas as pd
```

With annotated data as input:

```
>>> adata = sc.datasets.pbmc3k()
```

Then do PCA:

```
>>> sc.pp.pca(adata, n_comps=100)
```

Compute phenograph clusters:

Louvain community detection

```
>>> sce.tl.phenograph(adata, clustering_algo="louvain", k=30)
```

Leiden community detection

```
>>> sce.tl.phenograph(adata, clustering_algo="leiden", k=30)
```

Return only `Graph` object

```
>>> sce.tl.phenograph(adata, clustering_algo=None, k=30)
```

Now to show phenograph on tSNE (for example):

Compute tSNE:

```
>>> sc.tl.tsne(adata, random_state=7)
```

Plot phenograph clusters on tSNE:

```
>>> sc.pl.tsne(
...     adata, color=["pheno_louvain", "pheno_leiden"], s=100,
...     palette=sc.pl.palettes.vega_20_scanpy, legend_fontsize=10
... )
```

Cluster and compute cluster centroids for a NumPy ndarray input:

```
>>> df = np.random.rand(1000, 40)
>>> dframe = pd.DataFrame(df)
>>> dframe.index, dframe.columns = (map(str, dframe.index), map(str, dframe.columns))