scanpy.experimental.pp.recipe_pearson_residuals(adata, *, theta=100, clip=None, n_top_genes=1000, batch_key=None, chunksize=1000, n_comps=50, random_state=0, kwargs_pca={}, check_values=True, inplace=True)

Full pipeline for HVG selection and normalization by analytic Pearson residuals [Lause et al., 2021].

Applies gene selection based on Pearson residuals. On the resulting subset, Pearson residual normalization and PCA are performed.

Expects raw count input.

adata AnnData

The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.

theta float (default: 100)

The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
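To make the role of theta concrete, the following minimal sketch evaluates the variance formula quoted above for a few values. The helper name nb_variance is hypothetical and is not part of scanpy.

```python
import math


def nb_variance(mean: float, theta: float) -> float:
    """Negative binomial variance for a given mean and overdispersion theta.

    Hypothetical helper for illustration only: var = mean + mean^2 / theta.
    """
    # The quadratic term shrinks as theta grows and vanishes for theta = inf,
    # which leaves the Poisson variance (var = mean).
    return mean + mean**2 / theta


print(nb_variance(10.0, 10.0))      # strong overdispersion
print(nb_variance(10.0, 100.0))     # the default theta=100
print(nb_variance(10.0, math.inf))  # Poisson limit
```

For a mean of 10, the variance drops from 20 (theta=10) to 11 (theta=100) to 10 in the Poisson limit.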

clip float | None (default: None)

Determines if and how residuals are clipped:

  • If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).

  • If any scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
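The clipping rule can be sketched in NumPy as follows. This is an illustrative re-implementation of analytic Pearson residuals as described by Lause et al. (2021), not scanpy's own code; the function name pearson_residuals and the dense-array input are assumptions, and the sketch assumes every cell and every gene has at least one count.

```python
import numpy as np


def pearson_residuals(counts, theta=100.0, clip=None):
    """Illustrative analytic Pearson residuals with clipping (not scanpy's code)."""
    counts = np.asarray(counts, dtype=float)
    n_obs = counts.shape[0]

    # clip=None falls back to sqrt(n_obs), mirroring the documented default.
    if clip is None:
        clip = np.sqrt(n_obs)
    if clip < 0:
        raise ValueError("clip must be non-negative")

    # Expected counts under the null model: (cell total * gene total) / grand total.
    cell_sums = counts.sum(axis=1, keepdims=True)
    gene_sums = counts.sum(axis=0, keepdims=True)
    mu = cell_sums * gene_sums / counts.sum()

    # Residual = (observed - expected) / negative binomial standard deviation.
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    return np.clip(residuals, -clip, clip)


rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(100, 20)).astype(float)
counts[0, 0] = 1000  # one extreme outlier count

clipped = pearson_residuals(counts)                # interval [-10, 10] for 100 cells
unclipped = pearson_residuals(counts, clip=np.inf)  # no clipping
print(clipped.max(), unclipped.max() > clipped.max())
```

With 100 cells the default bound is sqrt(100) = 10, so the outlier's residual is capped at 10, while clip=np.inf leaves it untouched.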

n_top_genes int (default: 1000)

Number of highly-variable genes to keep. This recipe always selects genes by Pearson residuals, so a value is required.

batch_key str | None (default: None)

If specified, highly-variable genes are selected within each batch separately and then merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. Genes are first sorted by the number of batches in which they are an HVG. Ties are broken by the median rank (across batches) based on within-batch residual variance.
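The merge order described above can be sketched in plain Python. The batch names, gene names, and ranks below are made-up illustration data, and the sketch only mirrors the documented ordering rule (more batches first, then lower median rank); it is not scanpy's implementation.

```python
from statistics import median

# Hypothetical input: for each batch, the genes selected as HVG in that batch,
# mapped to their within-batch rank by residual variance (0 = most variable).
hvg_ranks_per_batch = {
    "batch1": {"geneB": 0, "geneA": 1, "geneC": 2, "geneD": 3},
    "batch2": {"geneB": 0, "geneC": 1, "geneA": 2},
    "batch3": {"geneA": 0, "geneB": 1},
}

genes = sorted({gene for ranks in hvg_ranks_per_batch.values() for gene in ranks})


def sort_key(gene):
    ranks = [r[gene] for r in hvg_ranks_per_batch.values() if gene in r]
    # Primary key: number of batches in which the gene is an HVG (descending).
    # Tie-breaker: median within-batch rank (ascending).
    return (-len(ranks), median(ranks))


ordered = sorted(genes, key=sort_key)
n_top_genes = 2
selected = ordered[:n_top_genes]
print(ordered)
print(selected)
```

Here geneA and geneB are both HVGs in all three batches, and geneB wins the tie because its median rank (0) is lower than geneA's (1).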

chunksize int (default: 1000)

Determines how many genes are processed at once while computing the residual variance. Choosing a smaller value reduces the required memory.
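The memory trade-off can be illustrated with a chunked computation of per-gene residual variance. This is a sketch under assumptions (dense input, default clipping to sqrt(n_obs), a hypothetical function name), not scanpy's implementation: only one n_obs × chunksize block of residuals exists in memory at a time.

```python
import numpy as np


def residual_variance_chunked(counts, theta=100.0, chunksize=1000):
    """Per-gene variance of clipped Pearson residuals, computed chunk by chunk."""
    counts = np.asarray(counts, dtype=float)
    n_obs, n_vars = counts.shape
    clip = np.sqrt(n_obs)

    # Totals are cheap to hold in memory; the residual matrix is not.
    cell_sums = counts.sum(axis=1, keepdims=True)
    gene_sums = counts.sum(axis=0, keepdims=True)
    total = counts.sum()

    variances = np.empty(n_vars)
    for start in range(0, n_vars, chunksize):
        stop = min(start + chunksize, n_vars)
        # Residuals for this block of genes only.
        mu = cell_sums * gene_sums[:, start:stop] / total
        residuals = (counts[:, start:stop] - mu) / np.sqrt(mu + mu**2 / theta)
        residuals = np.clip(residuals, -clip, clip)
        variances[start:stop] = residuals.var(axis=0)
    return variances


rng = np.random.default_rng(0)
counts = rng.poisson(3, size=(50, 30))

small_chunks = residual_variance_chunked(counts, chunksize=7)
one_chunk = residual_variance_chunked(counts, chunksize=1000)
print(np.allclose(small_chunks, one_chunk))
```

The chunk size changes only peak memory use, not the result, because each gene's residuals depend on the precomputed totals rather than on the other genes in its chunk.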

n_comps int | None (default: 50)

Number of principal components to compute in the PCA step.

random_state float | None (default: 0)

Random seed for setting the initial states for the optimization in the PCA step.

kwargs_pca dict (default: {})

Dictionary of further keyword arguments passed on to scanpy.pp.pca().

check_values bool (default: True)

If True, checks whether the input counts are integers, as expected by this function, and issues a warning if non-integers are found. Otherwise, proceeds without checking. Setting this to False can speed up code for large datasets.

inplace bool (default: True)

If True, update adata with results. Otherwise, return results. See below for details of what is returned.

Return type:

tuple[AnnData, DataFrame] | None


If inplace=False, separately returns the gene selection results (as DataFrame) and Pearson residual-based PCA results (as AnnData). If inplace=True, updates adata with the following fields for gene selection results:


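.var['highly_variable'] bool

Boolean indicator of highly-variable genes.

.var['means'] float

Means per gene.

.var['variances'] float

Variances per gene.

.var['residual_variances'] float

Pearson residual variance per gene. Averaged in the case of multiple batches.

.var['highly_variable_rank'] float

Rank of the gene according to residual variance; median rank in the case of multiple batches.

.var['highly_variable_nbatches'] int

If batch_key is given, this denotes in how many batches genes are detected as HVG.

.var['highly_variable_intersection'] bool

If batch_key is given, this denotes the genes that are highly variable in all batches.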

The following fields contain Pearson residual-based PCA results and normalization settings:


.uns['pearson_residuals_normalization']['pearson_residuals_df']

The subset of highly variable genes, normalized by Pearson residuals.

.uns['pearson_residuals_normalization']['theta']

The used value of the overdispersion parameter theta.

.uns['pearson_residuals_normalization']['clip']

The used value of the clipping parameter.

.obsm['X_pca']

PCA representation of data after gene selection and Pearson residual normalization.

.varm['PCs']

The principal components containing the loadings. When inplace=True this will contain empty rows for the genes not selected during HVG selection.

.uns['pca']['variance_ratio']

Ratio of explained variance.

.uns['pca']['variance']

Explained variance, equivalent to the eigenvalues of the covariance matrix.