scanpy.pp.pca

scanpy.pp.pca(data, n_comps=None, *, layer=None, zero_center=True, svd_solver=None, chunked=False, chunk_size=None, random_state=0, return_info=False, mask_var=_empty, use_highly_variable=None, dtype='float32', key_added=None, copy=False)
Principal component analysis [Pedregosa et al., 2011].
Computes PCA coordinates, loadings and variance decomposition. Uses the following implementations (and defaults for svd_solver):

- chunked=False, zero_center=True: sklearn PCA (default 'arpack')
- chunked=False, zero_center=False: sklearn TruncatedSVD (default 'randomized'); dask-ml TruncatedSVD [2] (default 'tsqr')
- chunked=True (zero_center ignored): sklearn IncrementalPCA (default 'auto'); dask-ml IncrementalPCA [3] (default 'auto')
- Parameters:
  - data : AnnData | ndarray | csr_matrix | csc_matrix
    The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
  - n_comps : int | None (default: None)
    Number of principal components to compute. Defaults to 50, or to the minimum dimension size of the selected representation minus 1, whichever is smaller.
  - layer : str | None (default: None)
    If provided, which element of layers to use for PCA.
  - zero_center : bool (default: True)
    If True, compute (or approximate) PCA from the covariance matrix. If False, perform a truncated SVD instead of PCA. Our default PCA algorithms (see svd_solver) support implicit zero-centering, and can therefore operate efficiently on sparse data.
  - svd_solver : Literal['auto', 'full', 'tsqr', 'randomized', 'arpack', 'covariance_eigh'] | None (default: None)
    SVD solver to use. Which values are accepted depends on the solver class: see the table above for which class is used based on chunked and zero_center, as well as the default solver for each class when svd_solver=None. Efficient computation of the principal components of a sparse matrix currently only works with the 'arpack' or 'covariance_eigh' solver.
    - None: Choose automatically based on solver class (see table above).
    - 'arpack': ARPACK wrapper in SciPy (svds()). Not available for dask arrays.
    - 'covariance_eigh': Classic eigendecomposition of the covariance matrix, suited for tall-and-skinny matrices. With dask, the array must be CSR or dense and chunked as (N, adata.shape[1]).
    - 'randomized': Randomized algorithm from Halko et al. [2009]. For dask arrays, this will use svd_compressed().
    - 'auto': Choose automatically depending on the size of the problem: will use 'full' for small shapes and 'randomized' for large shapes.
    - 'tsqr': "tall-and-skinny QR" algorithm from Benson et al. [2013]. Only available for dense dask arrays.
    Changed in version 1.9.3: Default value changed from 'arpack' to None.
    Changed in version 1.4.5: Default value changed from 'auto' to 'arpack'.
  - chunked : bool (default: False)
    If True, perform an incremental PCA on segments of chunk_size. Automatically zero-centers and ignores the settings of zero_center, random_state and svd_solver. If False, perform a full PCA/truncated SVD (see svd_solver and zero_center). See the table above for which solver class is used.
  - chunk_size : int | None (default: None)
    Number of observations to include in each chunk. Required if chunked=True was passed.
  - random_state : int | RandomState | None (default: 0)
    Change to use different initial states for the optimization.
  - return_info : bool (default: False)
    Only relevant when not passing an AnnData: see "Returns".
  - mask_var : ndarray[bool] | str | None | Empty (default: _empty)
    To run only on a certain set of genes given by a boolean array or a string referring to an array in var. By default, uses .var['highly_variable'] if available, else everything.
  - use_highly_variable : bool | None (default: None)
    Whether to use highly variable genes only, stored in .var['highly_variable']. By default, uses them if they have been determined beforehand.
    Deprecated since version 1.10.0: Use mask_var instead.
  - dtype : dtype-like (default: 'float32')
    Numpy data type string to which to convert the result.
  - key_added : str | None (default: None)
    If not specified, the embedding is stored as obsm['X_pca'], the loadings as varm['PCs'], and the parameters in uns['pca']. If specified, the embedding is stored as obsm[key_added], the loadings as varm[key_added], and the parameters in uns[key_added].
  - copy : bool (default: False)
    If an AnnData is passed, determines whether a copy is returned. Is ignored otherwise.
- Return type:
  AnnData | ndarray | csr_matrix | csc_matrix | None
- Returns:
  If data is array-like and return_info=False was passed, this function returns the PCA representation of data as an array of the same type as the input array. Otherwise, it returns None if copy=False, else an updated AnnData object. Sets the following fields:
  - .obsm['X_pca' | key_added] : csr_matrix | csc_matrix | ndarray (shape (adata.n_obs, n_comps))
    PCA representation of data.
  - .varm['PCs' | key_added] : ndarray (shape (adata.n_vars, n_comps))
    The principal components containing the loadings.
  - .uns['pca' | key_added]['variance_ratio'] : ndarray (shape (n_comps,))
    Ratio of explained variance.
  - .uns['pca' | key_added]['variance'] : ndarray (shape (n_comps,))
    Explained variance, equivalent to the eigenvalues of the covariance matrix.