scanpy.pp.pca
- scanpy.pp.pca(data, n_comps=None, *, layer=None, zero_center=True, svd_solver=None, random_state=0, return_info=False, mask_var=_empty, use_highly_variable=None, dtype='float32', chunked=False, chunk_size=None, key_added=None, copy=False)
Principal component analysis [Pedregosa et al., 2011].
Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn [Pedregosa et al., 2011].
Changed in version 1.5.0: In previous versions, computing a PCA on a sparse matrix would make a dense copy of the array for mean centering. As of scanpy 1.5.0, mean centering is implicit. While results are extremely similar, they are not exactly the same. If you would like to reproduce the old results, pass a dense array.
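The implicit mean centering mentioned in this note rests on a simple identity: the covariance of the centered data can be computed from the uncentered matrix and its column means, so a sparse matrix never has to be turned into a dense one. The following is a minimal NumPy sketch of that identity under stated assumptions (a dense array stands in for the sparse matrix, and all variable names are illustrative); it is not scanpy's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Mostly-zero matrix standing in for a sparse count matrix (cells x genes).
X = rng.poisson(0.3, size=(200, 30)).astype(np.float64)
n = X.shape[0]
mu = X.mean(axis=0)

# Explicit centering: subtracting the mean would destroy sparsity.
Xc = X - mu
cov_explicit = Xc.T @ Xc / (n - 1)

# Implicit centering: only the original X and the mean vector are used.
cov_implicit = (X.T @ X - n * np.outer(mu, mu)) / (n - 1)

# The two covariance matrices agree up to floating-point rounding.
max_abs_diff = np.abs(cov_explicit - cov_implicit).max()
```

Both routes give the same covariance matrix up to rounding, which is consistent with the note that results are extremely similar but not exactly the same.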
- Parameters:
- data
AnnData | ndarray | csr_matrix | csc_matrix
The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- n_comps
int | None (default: None)
Number of principal components to compute. Defaults to 50, or to the minimum dimension size of the selected representation minus 1, whichever is smaller.
- layer
str | None (default: None)
If provided, which element of layers to use for PCA.
- zero_center
bool | None (default: True)
If True, compute standard PCA from the covariance matrix. If False, omit zero-centering of variables (uses scikit-learn TruncatedSVD or dask-ml TruncatedSVD), which allows sparse input to be handled efficiently. Passing None decides automatically based on the sparseness of the data.
- svd_solver
Union[Literal['auto', 'full', 'tsqr', 'randomized'], Literal['tsqr', 'randomized'], Literal['auto', 'full', 'randomized'], Literal['arpack', 'covariance_eigh'], Literal['arpack', 'randomized'], Literal['covariance_eigh'], None] (default: None)
SVD solver to use:
None
See the chunked and zero_center descriptions to determine which class will be used. Depending on the class and the type of X, different default values will be set. For sparse dask arrays, 'covariance_eigh' is used. If scikit-learn PCA is used, the default is 'arpack'; if scikit-learn TruncatedSVD is used, 'randomized'; if dask-ml PCA or IncrementalPCA is used, 'auto'; if dask-ml TruncatedSVD is used, 'tsqr'.
'arpack'
For the ARPACK wrapper in SciPy (svds()). Not available with dask arrays.
'covariance_eigh'
Classic eigendecomposition of the covariance matrix, suited for tall-and-skinny matrices. With dask, the array must be CSR and chunked as (N, adata.shape[1]).
'randomized'
For the randomized algorithm due to Halko (2009). For dask arrays, this will use svd_compressed().
'auto'
Chooses automatically depending on the size of the problem.
'tsqr'
Only available with dense dask arrays. The “tsqr” algorithm from Benson et al. (2013).
Changed in version 1.9.3: Default value changed from 'arpack' to None.
Changed in version 1.4.5: Default value changed from 'auto' to 'arpack'.
Efficient computation of the principal components of a sparse matrix currently only works with the 'arpack' or 'covariance_eigh' solver.
If X is a sparse dask array, a custom 'covariance_eigh' solver will be used. If X is a dense dask array, the dask-ml classes PCA, IncrementalPCA, or TruncatedSVD will be used. Otherwise their scikit-learn counterparts PCA, IncrementalPCA, or TruncatedSVD will be used.
- random_state
int | RandomState | None (default: 0)
Change to use different initial states for the optimization.
- return_info
bool (default: False)
Only relevant when not passing an AnnData: see “Returns”.
- mask_var
ndarray[Any, dtype[bool]] | str | None | Empty (default: _empty)
To run only on a certain set of genes given by a boolean array or a string referring to an array in var. By default, uses .var['highly_variable'] if available, else everything.
- use_highly_variable
bool | None (default: None)
Whether to use highly variable genes only, stored in .var['highly_variable']. By default uses them if they have been determined beforehand.
Deprecated since version 1.10.0: Use mask_var instead.
- layer
Layer of adata to use as expression values.
- dtype
Union[dtype[Any], None, type[Any], _SupportsDType[dtype[Any]], str, tuple[Any, int], tuple[Any, SupportsIndex | Sequence[SupportsIndex]], list[Any], _DTypeDict, tuple[Any, Any]] (default: 'float32')
NumPy data type string to which to convert the result.
- chunked
bool (default: False)
If True, perform an incremental PCA on segments of chunk_size. The incremental PCA automatically zero centers and ignores the settings of random_state and svd_solver. Uses sklearn IncrementalPCA or dask-ml IncrementalPCA. If False, perform a full PCA and use sklearn PCA or dask-ml PCA.
- chunk_size
int | None (default: None)
Number of observations to include in each chunk. Required if chunked=True was passed.
- key_added
str | None (default: None)
If not specified, the embedding is stored as obsm['X_pca'], the loadings as varm['PCs'], and the parameters in uns['pca']. If specified, the embedding is stored as obsm[key_added], the loadings as varm[key_added], and the parameters in uns[key_added].
- copy
bool (default: False)
If an AnnData is passed, determines whether a copy is returned. Is ignored otherwise.
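The defaults listed for svd_solver=None can be summarized as a small lookup. The sketch below only restates the rules given in the svd_solver description; `default_svd_solver` and its arguments are hypothetical names invented for this illustration and are not part of scanpy.

```python
def default_svd_solver(estimator, backend="scikit-learn", sparse_dask=False):
    """Hypothetical helper: default solver when svd_solver=None is passed."""
    if sparse_dask:
        # Sparse dask arrays always use the custom covariance solver.
        return "covariance_eigh"
    table = {
        ("scikit-learn", "PCA"): "arpack",
        ("scikit-learn", "TruncatedSVD"): "randomized",
        ("dask-ml", "PCA"): "auto",
        ("dask-ml", "IncrementalPCA"): "auto",
        ("dask-ml", "TruncatedSVD"): "tsqr",
    }
    return table[(backend, estimator)]


solver_for_sklearn_pca = default_svd_solver("PCA")
solver_for_dask_tsvd = default_svd_solver("TruncatedSVD", backend="dask-ml")
```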
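As a rough illustration of the n_comps default and the zero_center switch from the parameter list, the sketch below uses a plain NumPy SVD rather than the scikit-learn or dask-ml classes that scanpy delegates to. `default_n_comps` and `pca_scores` are hypothetical helpers written for this example; the value 50 comes from the n_comps description, and reading the fallback as "smallest dimension minus 1" is an assumption.

```python
import numpy as np


def default_n_comps(n_obs, n_vars, preferred=50):
    """Hypothetical helper: 50, capped at (smallest dimension - 1)."""
    return min(preferred, min(n_obs, n_vars) - 1)


def pca_scores(X, n_comps, zero_center=True):
    """Hypothetical helper: component scores via a dense SVD."""
    X = np.asarray(X, dtype=np.float64)
    if zero_center:
        X = X - X.mean(axis=0)  # standard PCA on centered variables
    # Without centering, this is a truncated SVD of the raw matrix.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_comps] * S[:n_comps]


rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(40, 12))  # 40 observations, 12 variables

k = default_n_comps(*X.shape)  # limited by the 12 variables
centered = pca_scores(X, k, zero_center=True)
uncentered = pca_scores(X, k, zero_center=False)
```

With a nonzero mean, the first uncentered component mostly captures the mean offset, so the two settings generally produce different embeddings.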
- Return type:
AnnData | ndarray | csr_matrix | csc_matrix | None
- Returns:
If data is array-like and return_info=False was passed, this function returns the PCA representation of data as an array of the same type as the input array.
Otherwise, it returns None if copy=False, else an updated AnnData object. Sets the following fields:
.obsm['X_pca' | key_added]
csr_matrix | csc_matrix | ndarray (shape (adata.n_obs, n_comps))
PCA representation of data.
.varm['PCs' | key_added]
ndarray (shape (adata.n_vars, n_comps))
The principal components containing the loadings.
.uns['pca' | key_added]['variance_ratio']
ndarray (shape (n_comps,))
Ratio of explained variance.
.uns['pca' | key_added]['variance']
ndarray (shape (n_comps,))
Explained variance, equivalent to the eigenvalues of the covariance matrix.
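The relationship between the stored fields can be checked by hand. The NumPy sketch below assumes the scikit-learn conventions (sample covariance with an n - 1 denominator, ratio taken against the total variance of all variables). It mirrors the shapes listed above but does not call scanpy, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars, n_comps = 100, 8, 3
# Give each variable a different scale so the components are well separated.
X = rng.normal(size=(n_obs, n_vars)) * np.arange(1, n_vars + 1)

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

X_pca = (U * S)[:, :n_comps]               # analogous to .obsm['X_pca']
PCs = Vt[:n_comps].T                       # analogous to .varm['PCs']
variance = S[:n_comps] ** 2 / (n_obs - 1)  # analogous to ['variance']
variance_ratio = variance / Xc.var(axis=0, ddof=1).sum()

# Largest eigenvalues of the sample covariance matrix, in descending order.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1][:n_comps]
```

Here `variance` matches the leading eigenvalues of the covariance matrix, and each column of `X_pca` has exactly that sample variance.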