scanpy.tl.rank_genes_groups

scanpy.tl.rank_genes_groups(adata, groupby, *, mask_var=None, use_raw=None, groups='all', reference='rest', n_genes=None, rankby_abs=False, pts=False, key_added=None, copy=False, method=None, corr_method='benjamini-hochberg', tie_correct=False, layer=None, **kwds)

Rank genes for characterizing groups.

Expects logarithmized data.

Array type support
Array type	Supported	Experimentally supported in dask `Array`
`numpy.ndarray`	✅	❌
`scipy.sparse.{csr,csc}_{array,matrix}`	✅	❌

Warning

Comparing individual cells leads to highly inflated significance (misleadingly small p-values), since cells are not independent observations [Squair et al., 2021]. Especially for single-cell data, consider using more appropriate methods instead, such as combining pseudobulking with PyDESeq2.

decoupler.pp.pseudobulk() or scanpy.get.aggregate() can be used to aggregate samples for pseudobulking. scanpy.get.aggregate() is a bit more verbose, but supports Dask arrays for improved performance.
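To make the idea of pseudobulking concrete, here is a minimal NumPy sketch that sums counts over all cells sharing a (sample, group) pair. It is not how decoupler.pp.pseudobulk() or scanpy.get.aggregate() are implemented; the helper name `pseudobulk_sum` and the toy data are hypothetical, and the sketch assumes a small dense matrix of raw counts.

```python
import numpy as np


def pseudobulk_sum(counts, sample_ids, group_ids):
    """Sum counts over all cells that share a (sample, group) pair.

    Hypothetical helper for illustration only, not part of scanpy.
    """
    counts = np.asarray(counts)
    # One string key per cell; cells with equal keys are pooled together.
    keys = np.array([f"{s}|{g}" for s, g in zip(sample_ids, group_ids)])
    labels, inverse = np.unique(keys, return_inverse=True)
    summed = np.zeros((labels.size, counts.shape[1]), dtype=counts.dtype)
    # Unbuffered add: rows of `counts` accumulate into their pseudobulk row.
    np.add.at(summed, inverse, counts)
    return labels, summed


# Toy data: 4 cells x 3 genes, two samples, two cell groups.
counts = np.array([[1, 0, 3], [2, 1, 0], [0, 4, 1], [5, 0, 2]])
sample_ids = ["s1", "s1", "s2", "s2"]
group_ids = ["B", "B", "B", "T"]

labels, profiles = pseudobulk_sum(counts, sample_ids, group_ids)
```

Each row of `profiles` is then one observation per sample and group, which is the unit a bulk differential expression tool would compare.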

Parameters:

adata AnnData: Annotated data matrix.
groupby str: The key of the observations grouping to consider.
mask_var ndarray[tuple[Any, ...], dtype[bool]] | str | None (default: None): Select subset of genes to use in statistical tests.
use_raw bool | None (default: None): Use the raw attribute of adata. By default, raw is used if present.
layer str | None (default: None): Key from adata.layers whose value will be used to perform tests on.
groups Literal['all'] | Iterable[str] (default: 'all'): Subset of groups, e.g. ['g1', 'g2', 'g3'], to which comparison shall be restricted, or 'all' (default), for all groups. Note that if reference='rest' all groups will still be used as the reference, not just those specified in groups.
reference str (default: 'rest'): If 'rest', compare each group to the union of the rest of the groups. If a group identifier, compare with respect to this group.
n_genes int | None (default: None): The number of genes that appear in the returned tables. Defaults to all genes.
method Literal['logreg', 't-test', 'wilcoxon', 't-test_overestim_var'] | None (default: None): The default method is 't-test'. 't-test_overestim_var' overestimates the variance of each group, 'wilcoxon' uses the Wilcoxon rank-sum test, and 'logreg' uses logistic regression. See Ntranos et al. [2019] for why logistic regression is meaningful here.
corr_method Literal['benjamini-hochberg', 'bonferroni'] (default: 'benjamini-hochberg'): p-value correction method. Used only for 't-test', 't-test_overestim_var', and 'wilcoxon'.
tie_correct bool (default: False): Use tie correction for 'wilcoxon' scores. Used only for 'wilcoxon'.
rankby_abs bool (default: False): Rank genes by the absolute value of the score, not by the score. The returned scores are never the absolute values.
pts bool (default: False): Compute the fraction of cells expressing the genes.
key_added str | None (default: None): The key in adata.uns under which the results are saved.
copy bool (default: False): Whether to copy adata or modify it in place.
kwds: Are passed to test methods. Currently this affects only parameters that are passed to sklearn.linear_model.LogisticRegression. For instance, you can pass penalty='l1' to try to come up with a minimal set of genes that are good predictors (sparse solution meaning few non-zero fitted coefficients).
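The two corr_method options can be illustrated with a short NumPy sketch. This shows the textbook Benjamini-Hochberg and Bonferroni adjustments on a handful of made-up p-values; it is an illustration under those assumptions, not scanpy's internal code path, and the function names are hypothetical.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Textbook Benjamini-Hochberg adjustment (illustrative helper)."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    # Scale the i-th smallest p-value by n / i.
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def bonferroni(pvals):
    """Textbook Bonferroni adjustment: multiply by the number of tests."""
    pvals = np.asarray(pvals, dtype=float)
    return np.minimum(pvals * pvals.size, 1.0)


pvals = [0.01, 0.04, 0.03, 0.005]  # made-up values, one per gene
bh = benjamini_hochberg(pvals)
bonf = bonferroni(pvals)
```

Bonferroni is the more conservative of the two: every adjusted value it produces is at least as large as the corresponding Benjamini-Hochberg value.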

Return type:

AnnData | None

Returns:

Returns None if copy=False, else returns an AnnData object. Sets the following fields:

adata.uns['rank_genes_groups' | key_added]['names'] structured numpy.ndarray (dtype object): Structured array to be indexed by group id storing the gene names. Ordered according to scores.
adata.uns['rank_genes_groups' | key_added]['scores'] structured numpy.ndarray (dtype object): Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each gene for each group. Ordered according to scores.
adata.uns['rank_genes_groups' | key_added]['logfoldchanges'] structured numpy.ndarray (dtype object): Structured array to be indexed by group id storing the log2 fold change for each gene for each group. Ordered according to scores. Only provided if method is 't-test'-like. Note: this is an approximation calculated from mean-log values.
adata.uns['rank_genes_groups' | key_added]['pvals'] structured numpy.ndarray (dtype float): p-values.
adata.uns['rank_genes_groups' | key_added]['pvals_adj'] structured numpy.ndarray (dtype float): Corrected p-values.
adata.uns['rank_genes_groups' | key_added]['pts'] pandas.DataFrame (dtype float): Fraction of cells expressing the genes for each group.
adata.uns['rank_genes_groups' | key_added]['pts_rest'] pandas.DataFrame (dtype float): Only if reference is set to 'rest'. Fraction of cells from the union of the rest of each group expressing the genes.
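"Indexed by group id" means that each group is a named field of the structured array. The sketch below mimics that layout with plain NumPy so the access pattern is visible without running the function; the group ids `g1` and `g2`, the gene names, the scores, and the exact dtypes are all made up for illustration.

```python
import numpy as np

groups = ["g1", "g2"]  # hypothetical group ids

# One field per group, one row per rank (row 0 is the top-ranked gene).
names = np.array(
    [("GeneA", "GeneC"), ("GeneB", "GeneA"), ("GeneC", "GeneB")],
    dtype=[(g, "O") for g in groups],
)
scores = np.array(
    [(5.2, 3.1), (1.0, 0.4), (-2.3, -4.0)],
    dtype=[(g, "f8") for g in groups],
)

# The available group ids are the field names of the array.
group_ids = names.dtype.names

# Indexing by group id returns that group's ranked column.
top_g1 = list(names["g1"][:2])

# Pair each gene with its score for a single group.
ranking_g2 = list(zip(names["g2"], scores["g2"].tolist()))
```

The same field-based indexing would apply to the arrays stored under 'scores', 'pvals', and the other structured entries listed above.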

Notes

There are slight inconsistencies depending on whether sparse or dense data are passed.

Examples

>>> import scanpy as sc
>>> adata = sc.datasets.pbmc68k_reduced()
>>> sc.tl.rank_genes_groups(adata, "bulk_labels", method="wilcoxon")
>>> # to visualize the results
>>> sc.pl.rank_genes_groups(adata)
