scanpy.experimental.pp.recipe_pearson_residuals
- scanpy.experimental.pp.recipe_pearson_residuals(adata, *, theta=100, clip=None, n_top_genes=1000, batch_key=None, chunksize=1000, n_comps=50, random_state=0, kwargs_pca={}, check_values=True, inplace=True)
Full pipeline for HVG selection and normalization by analytic Pearson residuals ([Lause21]).
Applies gene selection based on Pearson residuals. On the resulting subset, Pearson residual normalization and PCA are performed.
Expects raw count input.
- Parameters:
- adata : AnnData
The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- theta : float (default: 100)
The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
- clip : Optional[float] (default: None)
Determines if and how residuals are clipped:
If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).
If any scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
- n_top_genes : int (default: 1000)
Number of highly variable genes to keep. Mandatory if flavor='seurat_v3' or flavor='pearson_residuals'.
- batch_key : Optional[str] (default: None)
If specified, highly variable genes are selected within each batch separately and then merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. Genes are first sorted by the number of batches in which they are an HVG. If flavor='pearson_residuals', ties are broken by the median rank (across batches) based on within-batch residual variance.
- chunksize : int (default: 1000)
If flavor='pearson_residuals', this determines how many genes are processed at once while computing the residual variance. Choosing a smaller value reduces the required memory.
- n_comps : Optional[int] (default: 50)
Number of principal components to compute in the PCA step.
- random_state : Optional[float] (default: 0)
Random seed for setting the initial states for the optimization in the PCA step.
- kwargs_pca : dict (default: {})
Dictionary of further keyword arguments passed on to scanpy.pp.pca().
- check_values : bool (default: True)
If True, checks whether the counts in the selected layer are integers, as this function expects, and issues a warning if non-integers are found. Otherwise, proceeds without checking. Setting this to False can speed up code for large datasets.
- inplace : bool (default: True)
If True, update adata with results. Otherwise, return results. See below for details of what is returned.
- Return type:
- Returns:
If inplace=False, separately returns the gene selection results (as DataFrame) and Pearson residual-based PCA results (as AnnData). If inplace=True, updates adata with the following fields for gene selection results:
.var['highly_variable'] : bool
Boolean indicator of highly variable genes.
.var['means'] : float
Means per gene.
.var['variances'] : float
Variances per gene.
.var['residual_variances'] : float
Pearson residual variance per gene. Averaged in the case of multiple batches.
.var['highly_variable_rank'] : float
Rank of the gene according to residual variance, median rank in the case of multiple batches.
.var['highly_variable_nbatches'] : int
If batch_key is given, this denotes in how many batches genes are detected as HVG.
.var['highly_variable_intersection'] : bool
If batch_key is given, this denotes the genes that are highly variable in all batches.
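The batch-related fields can be pictured with the following simplified sketch of the merge rule described for batch_key: genes are ordered by the number of batches in which they are an HVG, with ties broken by median rank. The function name `merge_batch_rankings` and its input layout are hypothetical, and the median here is taken over the ranks in all batches, which may differ in detail from what scanpy does internally.

```python
import numpy as np


def merge_batch_rankings(residual_var_by_batch, n_top_genes):
    """Hypothetical helper. Input: (n_batches, n_genes) within-batch residual variances."""
    rv = np.asarray(residual_var_by_batch, dtype=float)
    n_batches, n_genes = rv.shape

    # Rank genes within each batch; rank 0 is the largest residual variance.
    order = np.argsort(-rv, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(n_batches)[:, None]
    ranks[rows, order] = np.arange(n_genes)[None, :]

    # A gene is an HVG in a batch if it is among that batch's top n_top_genes.
    nbatches = (ranks < n_top_genes).sum(axis=0)
    median_rank = np.median(ranks, axis=0)

    # Primary key: number of batches (descending); tie-breaker: median rank
    # (ascending). np.lexsort treats the LAST key as the primary one.
    merged_order = np.lexsort((median_rank, -nbatches))
    highly_variable = np.zeros(n_genes, dtype=bool)
    highly_variable[merged_order[:n_top_genes]] = True

    # Genes that are highly variable in every batch.
    intersection = nbatches == n_batches
    return highly_variable, nbatches, median_rank, intersection
```

The four returned arrays loosely mirror `highly_variable`, `highly_variable_nbatches`, `highly_variable_rank` and `highly_variable_intersection`.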
The following fields contain Pearson residual-based PCA results and normalization settings:
.uns['pearson_residuals_normalization']['pearson_residuals_df']
The subset of highly variable genes, normalized by Pearson residuals.
.uns['pearson_residuals_normalization']['theta']
The value of the overdispersion parameter theta that was used.
.uns['pearson_residuals_normalization']['clip']
The value of the clipping parameter that was used.
.obsm['X_pca']
PCA representation of data after gene selection and Pearson residual normalization.
.varm['PCs']
The principal components containing the loadings. When inplace=True, this will contain empty rows for the genes not selected during HVG selection.
.uns['pca']['variance_ratio']
Ratio of explained variance.
.uns['pca']['variance']
Explained variance, equivalent to the eigenvalues of the covariance matrix.
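How the PCA fields relate to each other can be sketched with plain NumPy. This is a minimal illustration, not the code path of scanpy.pp.pca(): the function name `pca_on_selected` is hypothetical, a full SVD stands in for whatever solver is actually used, and the "empty rows" of the loadings matrix are assumed to be rows of zeros.

```python
import numpy as np


def pca_on_selected(residuals, highly_variable, n_comps):
    """Hypothetical helper: PCA on the selected genes of a cells x genes matrix."""
    residuals = np.asarray(residuals, dtype=float)
    highly_variable = np.asarray(highly_variable, dtype=bool)

    # Restrict to the selected genes, center each gene, and take a thin SVD.
    sub = residuals[:, highly_variable]
    centered = sub - sub.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)

    # Cell embedding, analogous to .obsm['X_pca'].
    x_pca = u[:, :n_comps] * s[:n_comps]

    # Explained variance: eigenvalues of the sample covariance matrix.
    all_variances = s**2 / (sub.shape[0] - 1)
    variance = all_variances[:n_comps]
    variance_ratio = variance / all_variances.sum()

    # Loadings for all genes; rows of genes that were not selected stay zero.
    pcs = np.zeros((residuals.shape[1], n_comps))
    pcs[highly_variable] = vt[:n_comps].T
    return x_pca, pcs, variance, variance_ratio
```

The zero rows keep the loadings aligned with the full gene axis, which is why a matrix of shape n_vars × n_comps can be stored even though PCA only ran on the selected subset.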