scanpy.pp.pca
- scanpy.pp.pca(data, n_comps=None, *, layer=None, zero_center=True, svd_solver=None, random_state=0, return_info=False, mask_var=_empty, use_highly_variable=None, dtype='float32', chunked=False, chunk_size=None, copy=False)
Principal component analysis [Pedregosa et al., 2011].
Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn [Pedregosa et al., 2011].
Changed in version 1.5.0: In previous versions, computing a PCA on a sparse matrix would make a dense copy of the array for mean centering. As of scanpy 1.5.0, mean centering is implicit. While results are extremely similar, they are not exactly the same. If you would like to reproduce the old results, pass a dense array.
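The idea behind implicit mean centering can be illustrated with a small NumPy sketch. It is not scanpy's actual code: the helper names centered_matvec and centered_rmatvec are hypothetical, and a dense array stands in for a sparse matrix to keep the example dependency-free. The point is that an iterative SVD solver only needs matrix-vector products, and those can be computed for the centered matrix without ever building it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy count-like data: mostly zeros, as in a sparse expression matrix.
X = rng.poisson(0.3, size=(200, 30)).astype(np.float64)
mu = X.mean(axis=0)  # per-variable means


def centered_matvec(v):
    # (X - 1 mu^T) v == X v - (mu . v) 1, without forming X - mu
    return X @ v - (mu @ v)


def centered_rmatvec(u):
    # (X - 1 mu^T)^T u == X^T u - mu * sum(u)
    return X.T @ u - mu * u.sum()


v = rng.normal(size=X.shape[1])
u = rng.normal(size=X.shape[0])

# Compare against explicit (dense) centering.
explicit = X - mu
assert np.allclose(centered_matvec(v), explicit @ v)
assert np.allclose(centered_rmatvec(u), explicit.T @ u)
```

Both routes give the same products up to floating-point rounding, which is consistent with the note that results are extremely similar but not bit-for-bit identical.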
- Parameters:
- data
AnnData | ndarray | spmatrix The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- n_comps
int | None (default: None) Number of principal components to compute. Defaults to 50, or to the minimum dimension size of the selected representation minus 1 when that is smaller.
- layer
str | None (default: None) If provided, which element of layers to use for PCA.
- zero_center
bool | None (default: True) If True, compute standard PCA from the covariance matrix. If False, omit zero-centering of variables (uses scikit-learn TruncatedSVD or dask-ml TruncatedSVD), which allows sparse input to be handled efficiently. Passing None decides automatically based on the sparseness of the data.
- svd_solver
str | None (default: None) SVD solver to use:
None See the chunked and zero_center descriptions to determine which class will be used. The default depends on that class and on the type of X: 'arpack' if scikit-learn PCA is used, 'randomized' if scikit-learn TruncatedSVD is used, 'auto' if dask-ml PCA or IncrementalPCA is used, and 'tsqr' if dask-ml TruncatedSVD is used.
'arpack' for the ARPACK wrapper in SciPy (svds()). Not available with dask arrays.
'randomized' for the randomized algorithm due to Halko (2009). For dask arrays, this will use svd_compressed().
'auto' chooses automatically depending on the size of the problem.
'lobpcg' An alternative SciPy solver. Not available with dask arrays.
'tsqr' Only available with dask arrays. “tsqr” algorithm from Benson et al. (2013).
Changed in version 1.9.3: Default value changed from 'arpack' to None.
Changed in version 1.4.5: Default value changed from 'auto' to 'arpack'.
Efficient computation of the principal components of a sparse matrix currently only works with the 'arpack' or 'lobpcg' solvers.
If X is a dask array, the dask-ml classes PCA, IncrementalPCA, or TruncatedSVD will be used. Otherwise their scikit-learn counterparts PCA, IncrementalPCA, or TruncatedSVD will be used.
- random_state
int | RandomState | None (default: 0) Change to use different initial states for the optimization.
- return_info
bool (default: False) Only relevant when not passing an AnnData: see “Returns”.
- mask_var
ndarray[Any, dtype[bool]] | str | None | Empty (default: _empty) To run only on a certain set of genes given by a boolean array or a string referring to an array in var. By default, uses .var['highly_variable'] if available, else everything.
- use_highly_variable
bool | None (default: None) Whether to use highly variable genes only, stored in .var['highly_variable']. By default uses them if they have been determined beforehand.
Deprecated since version 1.10.0: Use mask_var instead.
- layer
Layer of adata to use as expression values.
- dtype
Union[dtype[Any], None, type[Any], _SupportsDType[dtype[Any]], str, tuple[Any, int], tuple[Any, Union[SupportsIndex, Sequence[SupportsIndex]]], list[Any], _DTypeDict, tuple[Any, Any]] (default: 'float32') NumPy data type string to which to convert the result.
- chunked
bool (default: False) If True, perform an incremental PCA on segments of chunk_size. The incremental PCA automatically zero-centers and ignores the settings of random_state and svd_solver. Uses sklearn IncrementalPCA or dask-ml IncrementalPCA. If False, perform a full PCA and use sklearn PCA or dask-ml PCA.
- chunk_size
int | None (default: None) Number of observations to include in each chunk. Required if chunked=True was passed.
- copy
bool (default: False) If an AnnData is passed, determines whether a copy is returned. Is ignored otherwise.
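The n_comps default described above can be sketched in a few lines of plain Python. The helper default_n_comps is hypothetical and is not part of scanpy. The exact behaviour when the smallest dimension equals 50 is an assumption: here it falls back to one less than that dimension.

```python
def default_n_comps(n_obs: int, n_vars: int, default: int = 50) -> int:
    """Hypothetical helper mirroring the documented n_comps default."""
    min_dim = min(n_obs, n_vars)
    # Use `default` when the data is large enough, otherwise min_dim - 1.
    return default if default < min_dim else min_dim - 1


print(default_n_comps(3000, 2000))  # 50
print(default_n_comps(30, 2000))    # 29
```

Passing n_comps explicitly avoids relying on this fallback when working with very small matrices.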
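How chunked, zero_center, and the array type combine to select a decomposition class and a default svd_solver can be summarized as a small decision function. This is a sketch built only from the parameter descriptions above, not scanpy's implementation. The function describe_pca_backend and its is_dask and is_sparse flags are hypothetical. The handling of zero_center=None (no centering for sparse input) is an assumption, and no default solver is documented for scikit-learn IncrementalPCA, so None is returned there.

```python
from typing import Optional, Tuple


def describe_pca_backend(
    *,
    chunked: bool = False,
    zero_center: Optional[bool] = True,
    is_dask: bool = False,
    is_sparse: bool = False,
) -> Tuple[str, Optional[str]]:
    """Return (class description, documented default svd_solver)."""
    library = "dask-ml" if is_dask else "scikit-learn"
    if chunked:
        # Incremental PCA always zero-centers and ignores svd_solver.
        return f"{library} IncrementalPCA", "auto" if is_dask else None
    if zero_center is None:
        # Assumption: the automatic choice skips centering for sparse input.
        zero_center = not is_sparse
    if zero_center:
        return f"{library} PCA", "auto" if is_dask else "arpack"
    return f"{library} TruncatedSVD", "tsqr" if is_dask else "randomized"


print(describe_pca_backend())                    # ('scikit-learn PCA', 'arpack')
print(describe_pca_backend(zero_center=False))   # ('scikit-learn TruncatedSVD', 'randomized')
print(describe_pca_backend(is_dask=True))        # ('dask-ml PCA', 'auto')
```

Treat the table this encodes as a reading aid for the documentation; the authoritative behaviour is whatever the installed scanpy version does.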
- Return type:
- Returns:
If data is array-like and return_info=False was passed, this function returns the PCA representation of data as an array of the same type as the input array.
Otherwise, it returns None if copy=False, else an updated AnnData object. Sets the following fields:
.obsm['X_pca'] spmatrix | ndarray (shape (adata.n_obs, n_comps)) PCA representation of data.
.varm['PCs'] ndarray (shape (adata.n_vars, n_comps)) The principal components containing the loadings.
.uns['pca']['variance_ratio'] ndarray (shape (n_comps,)) Ratio of explained variance.
.uns['pca']['variance'] ndarray (shape (n_comps,)) Explained variance, equivalent to the eigenvalues of the covariance matrix.
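The relationship between the four returned quantities can be checked with a short NumPy sketch. It performs a zero-centered PCA through a full SVD on toy data and does not call scanpy. The variable names X_pca, PCs, variance, and variance_ratio are chosen only to match the field names above, and the sample covariance (divisor n_obs - 1) is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data with correlated variables: 100 observations, 8 variables.
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 8))
n_comps = 3

Xc = X - X.mean(axis=0)  # zero-center each variable
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

X_pca = U[:, :n_comps] * S[:n_comps]            # shape (n_obs, n_comps)
PCs = Vt[:n_comps].T                            # shape (n_vars, n_comps)
variance = S[:n_comps] ** 2 / (X.shape[0] - 1)  # explained variance
variance_ratio = variance / Xc.var(axis=0, ddof=1).sum()

# The coordinates are the centered data projected onto the loadings.
assert np.allclose(X_pca, Xc @ PCs)

# Explained variances equal the largest eigenvalues of the covariance matrix.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
assert np.allclose(variance, eigvals[:n_comps])
```

Signs of individual components may differ from those produced by a library implementation, since each principal component is only defined up to a sign flip.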