scanpy.experimental.pp.normalize_pearson_residuals_pca(adata, *, theta=100, clip=None, n_comps=50, random_state=0, kwargs_pca=mappingproxy({}), mask_var=_empty, use_highly_variable=None, check_values=True, inplace=True)

Applies analytic Pearson residual normalization and PCA, based on Lause et al. [2021].

The residuals are based on a negative binomial offset model with overdispersion theta shared across genes. By default, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], the overdispersion is theta=100, and PCA is run with 50 components.
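The residual computation described above can be sketched in plain NumPy. This is a minimal illustration, not scanpy's implementation: it assumes a dense cells × genes count matrix and takes the expected counts of the offset model to be the product of per-cell and per-gene totals divided by the grand total, as in Lause et al. [2021]. The helper name `pearson_residuals` and the toy data are hypothetical.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0, clip=None):
    """Analytic Pearson residuals for a dense cells x genes count matrix (sketch)."""
    counts = np.asarray(counts, dtype=float)
    n_obs = counts.shape[0]
    # Default clipping threshold: sqrt(n_obs).
    if clip is None:
        clip = np.sqrt(n_obs)
    # Expected counts (assumption): per-cell totals times per-gene totals,
    # divided by the grand total.
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    mu = cell_totals @ gene_totals / counts.sum()
    # Negative binomial variance: mu + mu^2 / theta.
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    return np.clip(residuals, -clip, clip)

# Toy data: 100 cells, 20 genes (every gene is expected to have nonzero counts).
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=2.0, size=(100, 20))
res = pearson_residuals(toy_counts)
```

Genes with zero total counts would produce a division by zero in this sketch, so it assumes such genes have been filtered out beforehand.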

Operates on the subset of highly variable genes in adata.var['highly_variable'] by default. Expects raw count input.

adata AnnData

The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.

theta float (default: 100)

The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
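To make the role of theta concrete, the variance formula above can be evaluated for a fixed mean. The function name `nb_variance` is hypothetical and used only for illustration.

```python
import numpy as np

def nb_variance(mean, theta):
    """Negative binomial variance: mean + mean^2 / theta."""
    return mean + mean**2 / theta

mean = 10.0
var_theta_10 = nb_variance(mean, 10.0)    # strong overdispersion
var_theta_100 = nb_variance(mean, 100.0)  # the default theta
var_poisson = nb_variance(mean, np.inf)   # Poisson limit: variance equals the mean
```

As theta grows, the quadratic term vanishes and the variance approaches the Poisson value.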

clip float | None (default: None)

Determines if and how residuals are clipped:

  • If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).

  • If any scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
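The two clipping modes can be illustrated with a small NumPy sketch. The helper `clip_residuals` is hypothetical and only mirrors the rule stated above; it is not part of scanpy.

```python
import numpy as np

def clip_residuals(residuals, clip=None):
    """Clip residuals to [-clip, clip]; default threshold is sqrt(n_obs)."""
    residuals = np.asarray(residuals, dtype=float)
    if clip is None:
        n_obs = residuals.shape[0]  # number of cells (rows)
        clip = np.sqrt(n_obs)
    return np.clip(residuals, -clip, clip)

# Four cells, two genes, so the default threshold is sqrt(4) = 2.
residuals = np.array([[-5.0, 0.5],
                      [3.0, -1.0],
                      [1.5, 10.0],
                      [0.0, -2.5]])

default_clipped = clip_residuals(residuals)        # clipped to [-2, 2]
custom_clipped = clip_residuals(residuals, 1.0)    # clipped to [-1, 1]
unclipped = clip_residuals(residuals, np.inf)      # no clipping
```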

n_comps int | None (default: 50)

Number of principal components to compute in the PCA step.

random_state float (default: 0)

Random seed for setting the initial states for the optimization in the PCA step.

kwargs_pca Mapping[str, Any] (default: mappingproxy({}))

Dictionary of further keyword arguments passed on to scanpy.pp.pca().

mask_var ndarray | str | None | Empty (default: _empty)

To run only on a certain set of genes given by a boolean array or a string referring to an array in var. By default, uses .var['highly_variable'] if available, else everything.

use_highly_variable bool | None (default: None)

Whether to use highly variable genes only, stored in .var['highly_variable']. By default uses them if they have been determined beforehand.

Deprecated since version 1.10.0: Use mask_var instead.

check_values bool (default: True)

If True, checks whether the counts in the selected layer are integers, as this function expects, and issues a warning if non-integers are found. Otherwise, proceeds without checking. Setting this to False can speed up the code for large datasets.
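A check of this kind might look like the following sketch. The helper names are hypothetical, and the actual check scanpy performs may differ (for example, in how it handles sparse matrices or negative values).

```python
import warnings
import numpy as np

def has_only_integer_values(x):
    """Return True if every entry of x is integer-valued (dense input assumed)."""
    x = np.asarray(x)
    return bool(np.all(np.mod(x, 1) == 0))

def warn_if_not_counts(x):
    """Emit a warning when x does not look like raw counts."""
    if not has_only_integer_values(x):
        warnings.warn("Expected raw count data, but found non-integer values.")

raw_counts = np.array([[0, 3], [1, 7]])
log_normalized = np.log1p(raw_counts)

raw_ok = has_only_integer_values(raw_counts)
normalized_ok = has_only_integer_values(log_normalized)
```

The check touches every entry of the matrix, which is why skipping it can save time on large datasets.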

inplace bool (default: True)

If True, update adata with results. Otherwise, return results. See below for details of what is returned.

Return type:

AnnData | None


If inplace=False, returns the Pearson residual-based PCA results (as AnnData object). If inplace=True, updates adata with the following fields:


The subset of highly variable genes, normalized by Pearson residuals.


The used value of the overdispersion parameter theta.


The used value of the clipping parameter.


PCA representation of data after gene selection (if applicable) and Pearson residual normalization.


The principal components containing the loadings. When inplace=True and use_highly_variable=True, this will contain empty rows for the genes not selected.


Ratio of explained variance.


Explained variance, equivalent to the eigenvalues of the covariance matrix.
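The relationship between explained variance and the eigenvalues of the covariance matrix can be verified numerically. The sketch below uses plain NumPy on random data rather than scanpy's PCA, and keeps all components, which is the only case in which the variance ratios sum to one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# PCA via SVD of the column-centered data.
X_centered = X - X.mean(axis=0)
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)

# Explained variance per component, using the unbiased (n - 1) normalization.
explained_variance = singular_values**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Eigenvalues of the sample covariance matrix, sorted in descending order.
covariance_eigenvalues = np.sort(
    np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))
)[::-1]
```

With fewer components than variables, the ratios are still computed against the total variance and therefore sum to less than one.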