scanpy.pp.pca
- scanpy.pp.pca(data, n_comps=None, *, layer=None, obsm=None, zero_center=True, svd_solver=None, chunked=False, chunk_size=None, rng=None, return_info=False, mask_var='highly_variable', use_highly_variable=None, dtype='float32', key_added=None (sc.settings.preset=<Preset.ScanpyV1: 'scanpy-v1'> – changes in 2.0), copy=False)
Principal component analysis [Pedregosa et al., 2011].
Computes PCA coordinates, loadings and variance decomposition. Uses the following implementations (and defaults for svd_solver):
- chunked=False, zero_center=True: sklearn PCA ('arpack')
- chunked=False, zero_center=False: sklearn TruncatedSVD ('randomized'); dask-ml TruncatedSVD [2] ('tsqr')
- chunked=True (zero_center ignored): sklearn IncrementalPCA ('auto'); dask-ml IncrementalPCA [3] ('auto')
Array type support (columns: array type, supported, … experimentally in dask Array; array type names not preserved):
- ✅, ✅
- ✅, ✅
- ✅, ❌
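The difference between the zero_center=True (PCA) and zero_center=False (truncated SVD) rows can be illustrated with plain NumPy. This is a minimal sketch on synthetic data, not scanpy's or scikit-learn's implementation; the helper top_component and the toy matrix are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 observations, 3 variables. Most of the variance lies along
# variable 0, but variable 1 carries a large constant offset.
X = rng.standard_normal((200, 3)) * np.array([5.0, 1.0, 0.5])
X[:, 1] += 50.0


def top_component(M):
    """Hypothetical helper: first right singular vector of M (unit length)."""
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[0]


# Zero-centered decomposition (PCA): direction of maximum variance.
pca_dir = top_component(X - X.mean(axis=0))
# Uncentered decomposition (truncated SVD): dominated by the mean offset.
svd_dir = top_component(X)

# Index of the variable with the largest absolute weight in each direction.
pca_axis = int(np.argmax(np.abs(pca_dir)))
svd_axis = int(np.argmax(np.abs(svd_dir)))
print(pca_axis, svd_axis)
```

On data with a non-zero mean, the two decompositions can therefore point in quite different directions, which is why the choice of zero_center matters.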
- Parameters:
- data
AnnData | ndarray | csr_array | csc_array | csr_matrix | csc_matrix The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- n_comps
int | None (default: None) Number of principal components to compute. Defaults to 50, or to the minimum dimension size of the selected representation minus 1 when fewer components are available.
- layer
str | None (default: None) If provided, which element of layers to use for PCA instead of X.
- obsm
str | None (default: None) If provided, which element of obsm to use for PCA instead of X.
- zero_center
bool (default: True) If True, compute (or approximate) PCA from covariance matrix. If False, perform a truncated SVD instead of PCA.
Our default PCA algorithms (see svd_solver) support implicit zero-centering, and therefore operate efficiently on sparse data.
- svd_solver
Literal['auto', 'full', 'tsqr', 'randomized'] | Literal['tsqr', 'randomized'] | Literal['auto', 'full', 'randomized'] | Literal['arpack', 'covariance_eigh'] | Literal['arpack', 'covariance_eigh'] | Literal['arpack', 'randomized'] | Literal['covariance_eigh'] | None (default: None) SVD solver to use. See table above to see which solver class is used based on chunked and zero_center, as well as the default solver for each class when svd_solver=None.
Efficient computation of the principal components of a sparse matrix currently only works with the 'arpack' or 'covariance_eigh' solver.
None: Choose automatically based on solver class (see table above).
'arpack': ARPACK wrapper in SciPy (svds()). Not available for dask arrays.
'covariance_eigh': Classic eigendecomposition of the covariance matrix, suited for tall-and-skinny matrices. With dask, array must be CSR or dense and chunked as (N, adata.shape[1]).
'randomized': Randomized algorithm from Halko et al. [2009]. For dask arrays, this will use svd_compressed().
'auto': Choose automatically depending on the size of the problem: will use 'full' for small shapes and 'randomized' for large shapes.
'tsqr': “tall-and-skinny QR” algorithm from Benson et al. [2013]. Only available for dense dask arrays.
Changed in version 1.9.3: Default value changed from 'arpack' to None.
Changed in version 1.4.5: Default value changed from 'auto' to 'arpack'.
- chunked
bool (default: False) If True, perform an incremental PCA on segments of chunk_size. Automatically zero centers and ignores settings of zero_center, rng and svd_solver. If False, perform a full PCA/truncated SVD (see svd_solver and zero_center). See table above for which solver class is used.
- chunk_size
int | None (default: None) Number of observations to include in each chunk. Required if chunked=True was passed.
- rng
int | integer | Sequence[int] | SeedSequence | Generator | BitGenerator | None (default: None) Random number generation to control stochasticity.
If a SeedLike value, it’s used to seed a new random number generator; if a numpy.random.Generator, rng’s state will be directly advanced; if None, a non-reproducible random number generator is used. See numpy.random.default_rng() for more details.
The default value matches legacy scanpy behavior and will change to None in scanpy 2.0.
- return_info
bool (default: False) Only relevant when not passing an AnnData: see “Returns”.
- mask_var
ndarray[tuple[Any, ...], dtype[bool]] | str | None | Default (default: 'highly_variable') To run only on a certain set of genes given by a boolean array or a string referring to an array in var. By default, uses .var['highly_variable'] if available, else everything.
- use_highly_variable
bool | None (default: None) Whether to use highly variable genes only, stored in .var['highly_variable']. By default uses them if they have been determined beforehand.
Deprecated since version 1.10.0: Use mask_var instead.
- layer
Layer of adata to use as expression values.
- dtype
numpy.typing.DTypeLike (default: 'float32') Numpy data type string to which to convert the result.
- key_added
str | None | Default (default: None (sc.settings.preset=<Preset.ScanpyV1: 'scanpy-v1'> – changes in 2.0)) If not specified, the embedding is stored as obsm['X_pca'], the loadings as varm['PCs'], and the parameters in uns['pca']. If specified, the embedding is stored as obsm[key_added], the loadings as varm[key_added], and the parameters in uns[key_added].
- copy
bool (default: False) If an AnnData is passed, determines whether a copy is returned. Is ignored otherwise.
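The three cases described for rng follow the semantics of numpy.random.default_rng(). The sketch below shows them with NumPy alone; make_generator is a hypothetical helper written for this illustration and is not part of scanpy.

```python
import numpy as np


def make_generator(rng=None):
    """Hypothetical helper: turn an `rng` argument into a Generator.

    Seed-like values create a new Generator, an existing Generator is
    passed through unchanged, and None gives a non-reproducible one.
    """
    return np.random.default_rng(rng)


# Seed-like value: a fresh generator each time, so the draws repeat.
a = make_generator(0).standard_normal(3)
b = make_generator(0).standard_normal(3)

# Existing Generator: the same object is reused, so its state advances
# and consecutive draws differ.
shared = np.random.default_rng(0)
c = make_generator(shared).standard_normal(3)
d = make_generator(shared).standard_normal(3)

same_object = make_generator(shared) is shared
print(np.array_equal(a, b), np.array_equal(c, d), same_object)
```

Passing an integer seed is therefore the simple route to repeatable results, while passing a shared Generator lets several stochastic steps draw from one stream.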
- Return type:
AnnData | ndarray | csr_array | csc_array | csr_matrix | csc_matrix | None
- Returns:
If data is array-like and return_info=False was passed, this function returns the PCA representation of data as an array of the same type as the input array.
Otherwise, it returns None if copy=False, else an updated AnnData object. Sets the following fields:
.obsm['X_pca' | key_added] csr_matrix | csc_matrix | ndarray (shape (adata.n_obs, n_comps)) PCA representation of data.
.varm['PCs' | key_added] ndarray (shape (adata.n_vars, n_comps)) The principal components containing the loadings when `obsm=None`.
.uns['pca' | key_added]['components'] ndarray (shape (adata.obsm[obsm].shape[1], n_comps)) The principal components containing the loadings when `obsm="…"`.
.uns['pca' | key_added]['variance_ratio'] ndarray (shape (n_comps,)) Ratio of explained variance.
.uns['pca' | key_added]['variance'] ndarray (shape (n_comps,)) Explained variance, equivalent to the eigenvalues of the covariance matrix.
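The shapes of the returned fields, and the statement that the explained variance equals the eigenvalues of the covariance matrix, can be checked with a small NumPy sketch. It assumes a dense matrix, a full SVD, and the n_obs - 1 normalisation that np.cov uses by default; the variable names mirror the fields above but the code is an illustration, not scanpy's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars, n_comps = 100, 5, 3
X = rng.standard_normal((n_obs, n_vars)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

# Zero-center, then decompose.
Xc = X - X.mean(axis=0)
u, s, vt = np.linalg.svd(Xc, full_matrices=False)

X_pca = u[:, :n_comps] * s[:n_comps]   # embedding, shape (n_obs, n_comps)
PCs = vt[:n_comps].T                   # loadings, shape (n_vars, n_comps)

# Explained variance per component (assumed n_obs - 1 normalisation).
all_variance = s**2 / (n_obs - 1)
variance = all_variance[:n_comps]
variance_ratio = variance / all_variance.sum()

# Largest eigenvalues of the sample covariance matrix, in descending order.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1][:n_comps]

print(X_pca.shape, PCs.shape, np.allclose(variance, eigvals))
```

The embedding is also the projection of the centered data onto the loadings, that is, Xc @ PCs.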