scanpy.experimental.pp.recipe_pearson_residuals

scanpy.experimental.pp.recipe_pearson_residuals(adata, *, theta=100, clip=None, n_top_genes=1000, batch_key=None, chunksize=1000, n_comps=50, random_state=0, kwargs_pca={}, check_values=True, inplace=True)

Full pipeline for HVG selection and normalization by analytic Pearson residuals ([Lause21]).

Applies gene selection based on Pearson residuals. On the resulting subset, Pearson residual normalization and PCA are performed.

Expects raw count input.

Parameters:
adata AnnData

The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.

theta float (default: 100)

The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
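The variance relation above can be checked numerically. This is a small sketch; the helper name nb_variance is hypothetical and not part of scanpy:

```python
import numpy as np


def nb_variance(mean, theta):
    """Negative binomial variance: var = mean + mean^2 / theta."""
    mean = np.asarray(mean, dtype=float)
    return mean + mean**2 / theta


means = np.array([0.1, 1.0, 10.0])

# Default theta: the quadratic term adds overdispersion on top of the mean.
var_default = nb_variance(means, theta=100)

# Infinite theta: the quadratic term vanishes, leaving the Poisson variance.
var_poisson = nb_variance(means, theta=np.inf)
```

The quadratic term matters mostly for highly expressed genes: at mean 10 it adds 1.0 to the variance, while at mean 0.1 it adds only 0.0001.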

clip float | None (default: None)

Determines if and how residuals are clipped:

  • If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).

  • If a scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
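The clipping rule can be illustrated with a small NumPy sketch. The function below is hypothetical and is not scanpy's implementation. It assumes the usual analytic form, in which the expected count for a cell and gene is the product of the cell total and the gene total divided by the grand total:

```python
import numpy as np


def pearson_residuals_sketch(counts, theta=100.0, clip=None):
    """Illustrative Pearson residuals with clipping (not scanpy's code)."""
    counts = np.asarray(counts, dtype=float)
    n_obs = counts.shape[0]

    # Expected counts under the null model (assumed form, see lead-in).
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    mu = cell_totals @ gene_totals / counts.sum()

    # Residual = (observed - expected) / negative binomial standard deviation.
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)

    # clip=None falls back to sqrt(n_obs); clip=np.inf leaves values untouched.
    if clip is None:
        clip = np.sqrt(n_obs)
    return np.clip(residuals, -clip, clip)


rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 20))
res_default = pearson_residuals_sketch(counts)            # clipped at sqrt(50)
res_tight = pearson_residuals_sketch(counts, clip=1.0)    # clipped at [-1, 1]
res_free = pearson_residuals_sketch(counts, clip=np.inf)  # no clipping
```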

n_top_genes int (default: 1000)

Number of highly variable genes to keep.

batch_key str | None (default: None)

If specified, highly variable genes are selected within each batch separately and then merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. Genes are first sorted by the number of batches in which they are an HVG. Ties are broken by the median rank (across batches) based on within-batch residual variance.
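The merge rule can be sketched as follows. The function name and the toy residual variances are hypothetical. The sketch also assumes that the median rank is taken over all batches, which this page does not specify:

```python
import numpy as np


def merge_hvgs_across_batches(residual_variances, n_top_genes):
    """Illustrative merge of per-batch HVG rankings (not scanpy's code).

    residual_variances has shape (n_batches, n_genes).
    """
    rv = np.asarray(residual_variances, dtype=float)
    n_batches, n_genes = rv.shape

    # Within-batch rank: 0 is the gene with the highest residual variance.
    order = np.argsort(-rv, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(n_batches)[:, None]
    ranks[rows, order] = np.arange(n_genes)[None, :]

    # A gene is an HVG in a batch if it is among that batch's top genes.
    hvg_count = (ranks < n_top_genes).sum(axis=0)

    # Assumption: the median is taken over all batches.
    median_rank = np.median(ranks, axis=0)

    # Primary key: more batches first. Tie-breaker: lower median rank first.
    # np.lexsort treats the last key as the primary key.
    merged_order = np.lexsort((median_rank, -hvg_count))

    selected = np.zeros(n_genes, dtype=bool)
    selected[merged_order[:n_top_genes]] = True
    return selected, hvg_count, median_rank


# Two batches, five genes (toy values).
toy_variances = np.array([
    [9.0, 8.0, 1.0, 2.0, 0.5],
    [9.0, 2.0, 8.0, 0.5, 1.0],
])
selected, hvg_count, median_rank = merge_hvgs_across_batches(
    toy_variances, n_top_genes=2
)
```

In this toy input, genes 1 and 2 are each an HVG in exactly one batch, so the median rank (1.5 versus 2.0) decides which of them is kept.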

chunksize int (default: 1000)

Determines how many genes are processed at once while computing the residual variance. Choosing a smaller value will reduce the required memory.
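The memory trade-off can be illustrated with a chunked loop over genes. This is a hypothetical sketch, not scanpy's implementation. It assumes the analytic expected counts (cell total times gene total over grand total) and the default clipping at sqrt(n_obs):

```python
import numpy as np


def residual_variance_chunked(counts, theta=100.0, chunksize=1000):
    """Illustrative per-gene residual variance, computed chunk by chunk."""
    counts = np.asarray(counts, dtype=float)
    n_obs, n_vars = counts.shape
    clip = np.sqrt(n_obs)

    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    total = counts.sum()

    variances = np.empty(n_vars)
    for start in range(0, n_vars, chunksize):
        stop = min(start + chunksize, n_vars)
        # Only an (n_obs x chunk) block of residuals exists at any time.
        mu = cell_totals @ gene_totals[:, start:stop] / total
        res = (counts[:, start:stop] - mu) / np.sqrt(mu + mu**2 / theta)
        res = np.clip(res, -clip, clip)
        variances[start:stop] = res.var(axis=0)
    return variances


rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(60, 25))
var_small_chunks = residual_variance_chunked(counts, chunksize=7)
var_one_chunk = residual_variance_chunked(counts, chunksize=1000)
```

Both calls give the same variances; the chunk size changes only the peak size of the intermediate residual block.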

n_comps int | None (default: 50)

Number of principal components to compute in the PCA step.

random_state float | None (default: 0)

Random seed for setting the initial states for the optimization in the PCA step.

kwargs_pca dict (default: {})

Dictionary of further keyword arguments passed on to scanpy.pp.pca().

check_values bool (default: True)

If True, checks whether the counts in the selected layer are integers, as expected by this function, and issues a warning if non-integers are found. Otherwise, proceeds without checking. Setting this to False can speed up code for large datasets.
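A check of this kind might look like the sketch below. The helper name and warning text are hypothetical, and scanpy's actual check may differ in detail:

```python
import warnings

import numpy as np


def warn_if_not_integer_counts(counts):
    """Return True if all values are integer-valued; warn otherwise."""
    counts = np.asarray(counts)
    # A value is integer-valued if its remainder modulo 1 is zero.
    all_integer = bool(np.all(np.mod(counts, 1) == 0))
    if not all_integer:
        warnings.warn(
            "Expected raw integer counts, but found non-integer values.",
            UserWarning,
            stacklevel=2,
        )
    return all_integer


# Integer-valued floats pass the check, so raw counts stored as float32 are fine.
ok = warn_if_not_integer_counts(np.array([[0.0, 2.0], [1.0, 0.0]]))
```

The check scans every entry of the matrix, which is why skipping it can save time on large datasets.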

inplace bool (default: True)

If True, update adata with results. Otherwise, return results. See below for details of what is returned.

Return type:

tuple[AnnData, DataFrame] | None

Returns:

If inplace=False, separately returns the gene selection results (as DataFrame) and Pearson residual-based PCA results (as AnnData). If inplace=True, updates adata with the following fields for gene selection results:

.var['highly_variable'] bool

boolean indicator of highly-variable genes.

.var['means'] float

means per gene.

.var['variances'] float

variances per gene.

.var['residual_variances'] float

Pearson residual variance per gene. Averaged in the case of multiple batches.

.var['highly_variable_rank'] float

Rank of the gene according to residual variance, median rank in the case of multiple batches.

.var['highly_variable_nbatches'] int

If batch_key is given, this denotes in how many batches genes are detected as HVG.

.var['highly_variable_intersection'] bool

If batch_key is given, this denotes the genes that are highly variable in all batches.

The following fields contain Pearson residual-based PCA results and normalization settings:

.uns['pearson_residuals_normalization']['pearson_residuals_df']

The subset of highly variable genes, normalized by Pearson residuals.

.uns['pearson_residuals_normalization']['theta']

The value of the overdispersion parameter theta that was used.

.uns['pearson_residuals_normalization']['clip']

The value of the clipping parameter that was used.

.obsm['X_pca']

PCA representation of data after gene selection and Pearson residual normalization.

.varm['PCs']

The principal components containing the loadings. When inplace=True, this will contain empty rows for the genes not selected during HVG selection.
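The layout of such a loadings matrix can be sketched as follows. All names are hypothetical, and the sketch assumes that the empty rows are filled with zeros:

```python
import numpy as np

n_vars, n_comps = 6, 2

# Hypothetical HVG mask over all genes and loadings for the selected genes only.
highly_variable = np.array([True, False, True, False, False, True])
rng = np.random.default_rng(0)
pcs_selected = rng.normal(size=(int(highly_variable.sum()), n_comps))

# Place the loadings into a matrix with one row per gene in the full dataset.
# Rows of unselected genes stay at zero (assumption, see lead-in).
pcs_full = np.zeros((n_vars, n_comps))
pcs_full[highly_variable] = pcs_selected
```

Keeping one row per gene lets the matrix align with the full set of genes, even though only the selected genes entered the PCA.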

.uns['pca']['variance_ratio']

Ratio of explained variance.

.uns['pca']['variance']

Explained variance, equivalent to the eigenvalues of the covariance matrix.
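The relation between explained variance, covariance eigenvalues, and the variance ratio can be checked on toy data. This is a sketch using plain NumPy rather than scanpy's PCA. It assumes the covariance is normalized by n_obs - 1:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 observations, 5 features with decreasing spread.
X = rng.normal(size=(100, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X_centered = X - X.mean(axis=0)
n_comps = 3

# Explained variance from the singular values of the centered data.
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)
variance = singular_values[:n_comps] ** 2 / (X.shape[0] - 1)

# The same numbers, obtained as the largest covariance eigenvalues.
covariance = np.cov(X_centered, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(covariance))[::-1]

# Ratio of explained variance: each component's share of the total variance.
total_variance = X_centered.var(axis=0, ddof=1).sum()
variance_ratio = variance / total_variance
```

Because only the first n_comps components are kept, the ratios sum to less than 1 unless the discarded components carry no variance.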