scanpy.pp.pca

scanpy.pp.pca(data, n_comps=None, *, layer=None, obsm=None, zero_center=True, svd_solver=None, chunked=False, chunk_size=None, random_state=0, return_info=False, mask_var=_empty, use_highly_variable=None, dtype='float32', key_added=None, copy=False)

Principal component analysis [Pedregosa et al., 2011].

Computes PCA coordinates, loadings and variance decomposition. Uses the following implementations (and defaults for svd_solver):

	`ndarray`, `spmatrix`, or `sparray`	`dask.array.Array`
`chunked=False`, `zero_center=True`	sklearn `PCA` (`'arpack'`)	dense: dask-ml `PCA`[1] (`'auto'`); sparse or `svd_solver='covariance_eigh'`: custom implementation (`'covariance_eigh'`)
`chunked=False`, `zero_center=False`	sklearn `TruncatedSVD` (`'randomized'`)	dask-ml `TruncatedSVD`[2] (`'tsqr'`)
`chunked=True` (`zero_center` ignored)	sklearn `IncrementalPCA` (`'auto'`)	dask-ml `IncrementalPCA`[3] (`'auto'`)
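
The table above can be read as a small decision procedure. The following is a minimal sketch of that lookup for the `svd_solver=None` case; `describe_pca_backend` and `SolverChoice` are hypothetical names used for illustration and are not part of scanpy.

```python
from typing import NamedTuple


class SolverChoice(NamedTuple):
    implementation: str
    default_solver: str


def describe_pca_backend(
    *, chunked: bool, zero_center: bool, is_dask: bool, is_sparse: bool = False
) -> SolverChoice:
    """Hypothetical helper mirroring the table above (svd_solver=None only)."""
    lib = "dask-ml" if is_dask else "sklearn"
    if chunked:
        # zero_center is ignored when chunked=True
        return SolverChoice(f"{lib} IncrementalPCA", "auto")
    if not zero_center:
        return SolverChoice(f"{lib} TruncatedSVD", "tsqr" if is_dask else "randomized")
    if not is_dask:
        return SolverChoice("sklearn PCA", "arpack")
    if is_sparse:
        # sparse dask arrays go through the custom implementation
        return SolverChoice("custom implementation", "covariance_eigh")
    return SolverChoice("dask-ml PCA", "auto")


in_memory = describe_pca_backend(chunked=False, zero_center=True, is_dask=False)
dask_sparse = describe_pca_backend(
    chunked=False, zero_center=True, is_dask=True, is_sparse=True
)
```

Here `in_memory` corresponds to the first cell of the table and `dask_sparse` to the custom `'covariance_eigh'` path.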

Array type support
Array type	supported	… experimentally in dask `Array`
`numpy.ndarray`	✅	✅
`scipy.sparse.csr_array` / `csr_matrix`	✅	✅
`scipy.sparse.csc_array` / `csc_matrix`	✅	❌

Parameters:

data

The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.

n_comps int | None (default: None)

Number of principal components to compute. Defaults to 50, or to one less than the smallest dimension of the selected representation when that representation is too small for 50 components.
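
As a rough illustration of this default, the sketch below computes the fallback value from the shape of the data. `default_n_comps` is a hypothetical helper, and the handling of the exact boundary (a smallest dimension of exactly 50) is an assumption rather than something stated above.

```python
def default_n_comps(n_obs: int, n_vars: int, preferred: int = 50) -> int:
    """Hypothetical helper: pick `preferred` components when the data allow it."""
    min_dim = min(n_obs, n_vars)
    # Assumption: fall back to min_dim - 1 whenever min_dim <= preferred.
    return preferred if min_dim > preferred else min_dim - 1


large = default_n_comps(n_obs=2700, n_vars=2000)
small = default_n_comps(n_obs=30, n_vars=2000)
```

With 2700 × 2000 data the default stays at 50, while a 30 × 2000 matrix falls back to 29 components.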

layer str | None (default: None)

If provided, which element of layers to use for PCA instead of X.

obsm str | None (default: None)

If provided, which element of obsm to use for PCA instead of X.

zero_center bool (default: True)

If True, compute (or approximate) PCA from the covariance matrix. If False, perform a truncated SVD instead of PCA.

Our default PCA algorithms (see svd_solver) support implicit zero-centering and can therefore operate efficiently on sparse data.
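
To make the difference concrete, here is a small NumPy-only sketch (it does not call scanpy) comparing an SVD of centered data with an SVD of the raw data. The synthetic matrix and its non-zero mean are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(200, 10))  # columns deliberately have non-zero mean

# Analogue of zero_center=True: subtract column means, then decompose.
Xc = X - X.mean(axis=0)
_, s_centered, _ = np.linalg.svd(Xc, full_matrices=False)
explained_variance = s_centered**2 / (X.shape[0] - 1)

# These values agree with the eigenvalues of the covariance matrix.
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]

# Analogue of zero_center=False: decompose the raw matrix (truncated SVD).
_, s_raw, vt_raw = np.linalg.svd(X, full_matrices=False)
mean_direction = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
alignment = abs(vt_raw[0] @ mean_direction)
```

Without centering, the first singular vector mostly captures the mean of the data (`alignment` is close to 1) rather than a direction of variance, which is why the two settings can give noticeably different first components.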

svd_solver (default: None)

SVD solver to use. See the table above for which solver class is used based on chunked and zero_center, as well as the default solver for each class when svd_solver=None.

Efficient computation of the principal components of a sparse matrix currently only works with the 'arpack' or 'covariance_eigh' solver.

None: Choose automatically based on solver class (see table above).
'arpack': ARPACK wrapper in SciPy (svds()). Not available for dask arrays.
'covariance_eigh': Classic eigendecomposition of the covariance matrix, suited for tall-and-skinny matrices. With dask, array must be CSR or dense and chunked as (N, adata.shape[1]).
'randomized': Randomized algorithm from Halko et al. [2009]. For dask arrays, this will use svd_compressed().
'auto': Choose automatically depending on the size of the problem: Will use 'full' for small shapes and 'randomized' for large shapes.
'tsqr': “tall-and-skinny QR” algorithm from Benson et al. [2013]. Only available for dense dask arrays.
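
The idea behind 'covariance_eigh' with implicit zero-centering can be sketched in plain NumPy. This is an illustrative reimplementation under simplifying assumptions (dense input, no chunking), not scanpy's actual code; `pca_covariance_eigh` is a hypothetical name.

```python
import numpy as np


def pca_covariance_eigh(X, n_comps):
    """Sketch: PCA via eigendecomposition of the covariance matrix."""
    n_obs = X.shape[0]
    mean = np.asarray(X.mean(axis=0)).ravel()
    gram = np.asarray(X.T @ X)  # n_vars x n_vars, cheap when X is tall and skinny
    # Covariance of the centered data, without ever forming X - mean.
    cov = (gram - n_obs * np.outer(mean, mean)) / (n_obs - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_comps]
    variance = eigvals[order]
    components = eigvecs[:, order].T  # shape (n_comps, n_vars)
    # Centering is applied to the small projected result instead of to X.
    X_pca = np.asarray(X @ components.T) - mean @ components.T
    return X_pca, components, variance


rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(500, 20)).astype(np.float64)
X_pca, components, variance = pca_covariance_eigh(X, n_comps=5)
```

Because only `X.T @ X` and the column means are needed, a sparse `X` never has to be densified by centering, which is what makes this route attractive for tall-and-skinny sparse matrices.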

Changed in version 1.9.3: Default value changed from 'arpack' to None.

Changed in version 1.4.5: Default value changed from 'auto' to 'arpack'.

chunked bool (default: False)

If True, perform an incremental PCA on segments of chunk_size. This automatically zero-centers the data and ignores the settings of zero_center, random_state and svd_solver. If False, perform a full PCA/truncated SVD (see svd_solver and zero_center). See the table above for which solver class is used.

chunk_size int | None (default: None)

Number of observations to include in each chunk. Required if chunked=True was passed.

random_state int | RandomState | None (default: 0)

Change to use different initial states for the optimization.

return_info bool (default: False)

Only relevant when not passing an AnnData: see “Returns”.

mask_var ndarray[tuple[Any, ...], dtype[bool]] | str | None | Empty (default: _empty)

Run only on a subset of genes, given by a boolean array or by a string referring to an array in var. By default, uses .var['highly_variable'] if available, else all genes.

use_highly_variable bool | None (default: None)

Whether to use highly variable genes only, stored in .var['highly_variable']. By default uses them if they have been determined beforehand.

Deprecated since version 1.10.0: Use mask_var instead.

dtype numpy.typing.DTypeLike (default: 'float32')

Numpy data type string to which to convert the result.

key_added str | None (default: None)

If not specified, the embedding is stored as obsm['X_pca'], the loadings as varm['PCs'], and the parameters in uns['pca']. If specified, the embedding is stored as obsm[key_added], the loadings as varm[key_added], and the parameters in uns[key_added].

copy bool (default: False)

If an AnnData is passed, determines whether a copy is returned. Is ignored otherwise.

Return type:

Returns:

If data is array-like and return_info=False was passed, this function returns the PCA representation of data as an array of the same type as the input array.

Otherwise, it returns None if copy=False, else an updated AnnData object. Sets the following fields:

.obsm['X_pca' | key_added] : csr_matrix | csc_matrix | ndarray (shape (adata.n_obs, n_comps)): PCA representation of data.
.varm['PCs' | key_added] : ndarray (shape (adata.n_vars, n_comps)): The principal components containing the loadings when `obsm=None`.
.uns['pca' | key_added]['components'] : ndarray (shape (adata.obsm[obsm].shape[1], n_comps)): The principal components containing the loadings when `obsm="…"`.
.uns['pca' | key_added]['variance_ratio'] : ndarray (shape (n_comps,)): Ratio of explained variance.
.uns['pca' | key_added]['variance'] : ndarray (shape (n_comps,)): Explained variance, equivalent to the eigenvalues of the covariance matrix.
