scanpy.experimental.pp.normalize_pearson_residuals_pca

scanpy.experimental.pp.normalize_pearson_residuals_pca(adata, *, theta=100, clip=None, n_comps=50, random_state=0, kwargs_pca=mappingproxy({}), mask_var=_empty, use_highly_variable=None, check_values=True, inplace=True)

Applies analytic Pearson residual normalization and PCA, based on Lause et al. [2021].

The residuals are based on a negative binomial offset model with overdispersion theta shared across genes. By default, residuals are clipped to sqrt(n_obs), overdispersion theta=100 is used, and PCA is run with 50 components.

Operates on the subset of highly variable genes in adata.var['highly_variable'] by default. Expects raw count input.
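The normalization step described above can be sketched in plain NumPy. This is an illustrative sketch of the analytic Pearson residual formula on a dense matrix, not scanpy's implementation; the function name `pearson_residuals_sketch` and the toy data are assumptions made for the example. It assumes every gene and every cell has a nonzero total count.

```python
import numpy as np


def pearson_residuals_sketch(counts, theta=100.0, clip=None):
    """Analytic Pearson residuals for a dense cells x genes count matrix.

    Hypothetical helper for illustration only.
    """
    counts = np.asarray(counts, dtype=np.float64)
    n_obs = counts.shape[0]
    # Default clipping threshold: sqrt(n_obs)
    if clip is None:
        clip = np.sqrt(n_obs)
    cell_totals = counts.sum(axis=1, keepdims=True)  # shape (n_obs, 1)
    gene_totals = counts.sum(axis=0, keepdims=True)  # shape (1, n_vars)
    # Expected count under the offset model: outer product of the margins
    mu = cell_totals @ gene_totals / counts.sum()
    # Negative binomial variance: mu + mu^2 / theta
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    return np.clip(residuals, -clip, clip)


rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(100, 20))  # synthetic raw counts
toy_residuals = pearson_residuals_sketch(toy_counts)
print(toy_residuals.shape)
```

With 100 cells, the default bound is sqrt(100) = 10, so no residual in the result exceeds 10 in absolute value.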

Parameters:
adata AnnData

The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.

theta float (default: 100)

The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
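To make the role of theta concrete, the variance relation can be evaluated directly. The helper `nb_variance` below is a hypothetical name used only for this illustration.

```python
import math


def nb_variance(mean, theta):
    """Negative binomial variance: mean + mean^2 / theta."""
    return mean + mean**2 / theta


# Larger theta means less overdispersion; an infinite theta leaves
# only the Poisson term (variance equal to the mean).
for theta in (1.0, 100.0, math.inf):
    print(theta, nb_variance(4.0, theta))
```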

clip float | None (default: None)

Determines if and how residuals are clipped:

  • If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).

  • If any scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
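The two cases above amount to choosing a symmetric bound and clamping each residual to it. A minimal sketch, in which `resolve_clip` and `clip_value` are hypothetical helpers rather than scanpy functions, and the rejection of negative bounds is a sanity check added for the example:

```python
import math


def resolve_clip(clip, n_obs):
    """Return the clipping bound implied by `clip` for `n_obs` cells."""
    if clip is None:
        return math.sqrt(n_obs)  # default bound: sqrt(n_obs)
    if clip < 0:
        raise ValueError("clip must be non-negative or None")
    return float(clip)


def clip_value(residual, bound):
    """Clamp a single residual to the interval [-bound, bound]."""
    return max(-bound, min(bound, residual))


bound = resolve_clip(None, n_obs=10_000)
print(bound, clip_value(250.0, bound), clip_value(-250.0, math.inf))
```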

n_comps int | None (default: 50)

Number of principal components to compute in the PCA step.

random_state float (default: 0)

Random seed for setting the initial states for the optimization in the PCA step.

kwargs_pca Mapping[str, Any] (default: mappingproxy({}))

Dictionary of further keyword arguments passed on to scanpy.pp.pca().

mask_var ndarray | str | None | Empty (default: _empty)

Restricts the computation to a set of genes, given as a boolean array or as a string referring to an array in var. By default, uses .var['highly_variable'] if available, else all genes.

use_highly_variable bool | None (default: None)

Whether to use highly variable genes only, stored in .var['highly_variable']. By default uses them if they have been determined beforehand.

Deprecated since version 1.10.0: Use mask_var instead.

check_values bool (default: True)

If True, checks whether the counts in the selected layer are integers, as this function expects, and issues a warning if non-integers are found. Otherwise, proceeds without checking. Setting this to False can speed up the code for large datasets.

inplace bool (default: True)

If True, update adata with results. Otherwise, return results. See below for details of what is returned.

Return type:

AnnData | None

Returns:

If inplace=False, returns the Pearson residual-based PCA results (as an AnnData object). If inplace=True, updates adata with the following fields:

.uns['pearson_residuals_normalization']['pearson_residuals_df']

The subset of highly variable genes, normalized by Pearson residuals.

.uns['pearson_residuals_normalization']['theta']

The value of the overdispersion parameter theta that was used.

.uns['pearson_residuals_normalization']['clip']

The used value of the clipping parameter.

.obsm['X_pca']

PCA representation of data after gene selection (if applicable) and Pearson residual normalization.

.varm['PCs']

The principal components containing the loadings. When inplace=True and use_highly_variable=True, this will contain empty rows for the genes not selected.

.uns['pca']['variance_ratio']

Ratio of explained variance.

.uns['pca']['variance']

Explained variance, equivalent to the eigenvalues of the covariance matrix.