scanpy.experimental.pp.normalize_pearson_residuals_pca
- scanpy.experimental.pp.normalize_pearson_residuals_pca(adata, *, theta=100, clip=None, n_comps=50, random_state=0, kwargs_pca=mappingproxy({}), mask_var=_empty, use_highly_variable=None, check_values=True, inplace=True)
Applies analytic Pearson residual normalization and PCA, based on Lause et al. [2021].
The residuals are based on a negative binomial offset model with overdispersion theta shared across genes. By default, residuals are clipped to sqrt(n_obs), overdispersion theta=100 is used, and PCA is run with 50 components.
Operates on the subset of highly variable genes in adata.var['highly_variable'] by default. Expects raw count input.
- Parameters:
- adata
AnnData
The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- theta
float (default: 100)
The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
- clip
float | None (default: None)
Determines if and how residuals are clipped:
  - If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).
  - If a scalar c is given, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
- n_comps
int | None (default: 50)
Number of principal components to compute in the PCA step.
- random_state
float (default: 0)
Random seed for setting the initial states for the optimization in the PCA step.
- kwargs_pca
Mapping[str, Any] (default: mappingproxy({}))
Dictionary of further keyword arguments passed on to scanpy.pp.pca().
- mask_var
ndarray | str | None | Empty (default: _empty)
Restricts the computation to a set of genes, given either as a boolean array or as a string referring to an array in var. By default, uses .var['highly_variable'] if available, else all genes.
- use_highly_variable
bool | None (default: None)
Whether to use highly variable genes only, stored in .var['highly_variable']. By default, uses them if they have been determined beforehand.
Deprecated since version 1.10.0: Use mask_var instead.
- check_values
bool (default: True)
If True, checks whether counts in the selected layer are integers, as expected by this function, and issues a warning if non-integers are found. Otherwise, proceeds without checking. Setting this to False can speed up code for large datasets.
- inplace
bool (default: True)
If True, update adata with results. Otherwise, return results. See below for details of what is returned.
- Return type:
AnnData | None
- Returns:
If inplace=False, returns the Pearson residual-based PCA results (as an AnnData object). If inplace=True, updates adata with the following fields:
.uns['pearson_residuals_normalization']['pearson_residuals_df']
The subset of highly variable genes, normalized by Pearson residuals.
.uns['pearson_residuals_normalization']['theta']
The value of the overdispersion parameter theta that was used.
.uns['pearson_residuals_normalization']['clip']
The value of the clipping parameter that was used.
.obsm['X_pca']
PCA representation of data after gene selection (if applicable) and Pearson residual normalization.
.varm['PCs']
The principal components containing the loadings. When inplace=True and use_highly_variable=True, this will contain empty rows for the genes not selected.
.uns['pca']['variance_ratio']
Ratio of explained variance.
.uns['pca']['variance']
Explained variance, equivalent to the eigenvalues of the covariance matrix.
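The shape of .varm['PCs'] when only a subset of genes is used can be pictured with a small sketch. It assumes that "empty rows" means zero-filled rows; the helper expand_loadings and the example values are hypothetical and only illustrate the layout, not how scanpy builds the array.

```python
def expand_loadings(loadings_selected, mask_var):
    """Scatter per-gene loadings for selected genes into a full-size table.

    Hypothetical helper: rows for genes where mask_var is False are
    filled with zeros (assumed meaning of "empty rows").
    """
    if sum(mask_var) != len(loadings_selected):
        raise ValueError("mask_var must select exactly one gene per loading row")
    n_comps = len(loadings_selected[0])
    rows = iter(loadings_selected)
    return [
        list(next(rows)) if keep else [0.0] * n_comps
        for keep in mask_var
    ]


# 5 genes in total, 2 of them selected; 2 components per gene.
mask_var = [True, False, True, False, False]
loadings_selected = [[0.8, -0.1], [0.6, 0.2]]

full_loadings = expand_loadings(loadings_selected, mask_var)
```

The result has one row per gene in the original data, so its first dimension matches n_vars regardless of how many genes entered the PCA.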