scanpy.experimental.pp.highly_variable_genes

scanpy.experimental.pp.highly_variable_genes(adata, *, theta=100, clip=None, n_top_genes=None, batch_key=None, chunksize=1000, flavor='pearson_residuals', check_values=True, layer=None, subset=False, inplace=True)

Select highly variable genes using analytic Pearson residuals [Lause et al., 2021].

In Lause et al. [2021], Pearson residuals of a negative binomial offset model are computed (with overdispersion theta shared across genes). By default, overdispersion theta=100 is used and residuals are clipped to sqrt(n_obs). Finally, genes are ranked by residual variance.
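
The steps described above can be sketched in plain NumPy. This is a minimal illustration for a small dense matrix, not the library's implementation: the helper name `pearson_residual_variance` and the toy data are assumptions made for this example, and it assumes every cell and every gene has at least one count.

```python
import numpy as np


def pearson_residual_variance(counts, theta=100.0, clip=None):
    """Residual variance per gene for a dense cells x genes count matrix.

    Hypothetical helper for illustration only.
    """
    counts = np.asarray(counts, dtype=float)
    n_obs = counts.shape[0]
    if clip is None:
        clip = np.sqrt(n_obs)  # default clipping bound described above

    # Expected counts under the offset model: per-cell totals times
    # per-gene totals, divided by the grand total.
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    mu = cell_totals @ gene_totals / counts.sum()

    # Pearson residuals with negative binomial variance mu + mu^2 / theta.
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    residuals = np.clip(residuals, -clip, clip)
    return residuals.var(axis=0)


# Toy data: 200 cells x 5 genes; gene 0 is strongly elevated in 20 cells.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=5.0, size=(200, 5))
counts[:20, 0] += 30

res_var = pearson_residual_variance(counts)
ranking = np.argsort(res_var)[::-1]  # genes ordered by decreasing residual variance
print(ranking[0])  # the gene with the added signal ranks first
```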

Expects raw count input.

Parameters:
adata AnnData

The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.

theta float (default: 100)

The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
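
As a small worked example of the variance relation quoted above (the function name `nb_variance` is made up for this illustration):

```python
import numpy as np


def nb_variance(mean, theta):
    """Negative binomial variance: var = mean + mean^2 / theta."""
    return mean + mean**2 / theta


mean = 10.0
var_default = nb_variance(mean, 100)     # 11.0: mild overdispersion
var_strong = nb_variance(mean, 1)        # 110.0: strong overdispersion
var_poisson = nb_variance(mean, np.inf)  # 10.0: variance equals the mean (Poisson)
```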

clip float | None (default: None)

Determines if and how residuals are clipped:

  • If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).

  • If any scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
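
The two clipping modes can be illustrated with `np.clip`. The helper `clip_residuals` below is a hypothetical stand-in for this example, not a function exposed by the library.

```python
import numpy as np


def clip_residuals(residuals, n_obs, clip=None):
    """Clip residuals to [-clip, clip]; clip=None means sqrt(n_obs)."""
    if clip is None:
        clip = np.sqrt(n_obs)
    return np.clip(residuals, -clip, clip)


residuals = np.array([-50.0, -3.0, 0.5, 12.0, 40.0])

default_clipped = clip_residuals(residuals, n_obs=100)           # bound is sqrt(100) = 10
scalar_clipped = clip_residuals(residuals, n_obs=100, clip=5)    # bound is 5
unclipped = clip_residuals(residuals, n_obs=100, clip=np.inf)    # values unchanged
```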

n_top_genes int | None (default: None)

Number of highly-variable genes to keep. Mandatory if flavor='seurat_v3' or flavor='pearson_residuals'.

batch_key str | None (default: None)

If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. Genes are first sorted by the number of batches in which they are an HVG. If flavor='pearson_residuals', ties are broken by the median rank (across batches) based on within-batch residual variance.
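
The merging rule can be sketched as follows. This is a simplified illustration that assumes the per-batch residual variances are already available as an array; the function name `merge_batch_rankings` is hypothetical, and the library's handling of genes outside each batch's top set may differ.

```python
import numpy as np


def merge_batch_rankings(residual_variances, n_top_genes):
    """Merge per-batch rankings; input has shape (n_batches, n_genes)."""
    rv = np.asarray(residual_variances, dtype=float)
    n_batches_total, n_genes = rv.shape

    # Within-batch rank of each gene (0 = highest residual variance).
    order = np.argsort(-rv, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(n_batches_total)[:, None]
    ranks[rows, order] = np.arange(n_genes)

    # Number of batches in which each gene is among the top n_top_genes.
    n_batches = (ranks < n_top_genes).sum(axis=0)
    median_rank = np.median(ranks, axis=0)

    # Sort by batch count (descending), then by median rank (ascending).
    # np.lexsort treats the LAST key as the primary key.
    final_order = np.lexsort((median_rank, -n_batches))
    selected = np.zeros(n_genes, dtype=bool)
    selected[final_order[:n_top_genes]] = True
    return selected, n_batches, median_rank


# Two batches, four genes (made-up residual variances).
rv = [[5.0, 4.0, 0.5, 1.0],
      [0.2, 6.0, 3.0, 1.0]]
selected, n_batches, median_rank = merge_batch_rankings(rv, n_top_genes=2)
# Gene 1 is in the top 2 of both batches; genes 0 and 2 are each in one,
# and gene 0 wins the tie because its median rank (1.5) beats gene 2's (2.0).
```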

chunksize int (default: 1000)

If flavor='pearson_residuals', this determines how many genes are processed at once while computing the residual variance. Choosing a smaller value will reduce the required memory.
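
To show why a smaller chunk size lowers peak memory, here is a rough sketch of chunked processing on a dense matrix: only a block of n_obs × chunksize residuals exists at any time. The function `residual_variance_chunked` is an assumption for this example and does not reproduce the library's internals.

```python
import numpy as np


def residual_variance_chunked(counts, theta=100.0, chunksize=1000):
    """Residual variance per gene, computed over blocks of genes."""
    counts = np.asarray(counts, dtype=float)
    n_obs, n_vars = counts.shape
    clip = np.sqrt(n_obs)

    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    total = counts.sum()

    out = np.empty(n_vars)
    for start in range(0, n_vars, chunksize):
        stop = start + chunksize
        # Only this block of expected counts and residuals is held in memory.
        mu = cell_totals @ gene_totals[:, start:stop] / total
        res = (counts[:, start:stop] - mu) / np.sqrt(mu + mu**2 / theta)
        out[start:stop] = np.clip(res, -clip, clip).var(axis=0)
    return out


rng = np.random.default_rng(1)
toy_counts = rng.poisson(lam=5.0, size=(100, 10))

small_chunks = residual_variance_chunked(toy_counts, chunksize=3)
one_chunk = residual_variance_chunked(toy_counts, chunksize=1000)
```

Both calls give the same result; the chunk size only changes how much intermediate data is held at once.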

flavor Literal['pearson_residuals'] (default: 'pearson_residuals')

Choose the flavor for identifying highly variable genes. In this experimental version, only 'pearson_residuals' is functional.

check_values bool (default: True)

If True, checks whether the counts in the selected layer are integers, as expected by this function, and issues a warning if non-integers are found. Otherwise, proceeds without checking. Setting this to False can speed up code for large datasets.
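
One way such a check could look is sketched below. The helper `warn_if_not_counts` and its warning text are hypothetical; the library's actual check and message may differ.

```python
import warnings

import numpy as np


def warn_if_not_counts(matrix):
    """Return True if all values are non-negative integers, else warn."""
    values = np.asarray(matrix)
    looks_like_counts = bool(np.all(values >= 0) and np.all(values % 1 == 0))
    if not looks_like_counts:
        warnings.warn(
            "Expected raw count data (non-negative integers).", UserWarning
        )
    return looks_like_counts


raw_ok = warn_if_not_counts(np.array([[0, 1], [2, 3]]))  # True, no warning

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    normalized_ok = warn_if_not_counts(np.array([[0.5, 1.0]]))  # False, warns
```

The scan touches every value, which is why skipping it can save time on large datasets.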

layer str | None (default: None)

Layer to use as input instead of X. If None, X is used.

subset bool (default: False)

If True, subset the data to highly-variable genes after finding them. Otherwise merely indicate highly variable genes in adata.var (see below).

inplace bool (default: True)

If True, update adata with results. Otherwise, return results. See below for details of what is returned.

Return type:

DataFrame | None

Returns:

If inplace=True, adata.var is updated with the following fields. Otherwise, returns the same fields as DataFrame.

highly_variable bool

boolean indicator of highly-variable genes.

means float

means per gene.

variances float

variance per gene.

residual_variances float

For flavor='pearson_residuals', residual variance per gene. Averaged in the case of multiple batches.

highly_variable_rank float

For flavor='pearson_residuals', rank of the gene according to residual variance; median rank in the case of multiple batches.

highly_variable_nbatches int

If batch_key given, denotes in how many batches genes are detected as HVG.

highly_variable_intersection bool

If batch_key given, denotes the genes that are highly variable in all batches.

Notes

Experimental version of sc.pp.highly_variable_genes()