scanpy.get.aggregate(adata, by, func, *, axis=None, mask=None, dof=1, layer=None, obsm=None, varm=None)

Aggregate data matrix based on some categorical grouping.

This function is useful for pseudobulking as well as plotting.

The aggregation to perform is specified by func, which can be a single metric or a list of metrics. Each metric is computed within every group and stored as a new layer in the output AnnData object.

If none of layer, obsm, or varm is passed, X is used as the aggregation data.
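To make the "one layer per metric" idea concrete, here is a minimal plain-Python sketch of group-wise aggregation. It is not scanpy's implementation: `aggregate_rows` is a hypothetical helper, the data are made up, and the result is a plain dict rather than an AnnData object. It only illustrates what `mean`, `sum`, and `count_nonzero` mean when computed per group and per column.

```python
from collections import defaultdict


def aggregate_rows(rows, labels, funcs):
    """Group rows by label and compute each metric per column.

    Hypothetical helper for illustration only (not part of scanpy).
    Returns (groups, layers), where layers[metric][i][j] is the metric
    for group i and column j.
    """
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    groups = sorted(grouped)

    def column_metric(values, metric):
        if metric == "sum":
            return sum(values)
        if metric == "mean":
            return sum(values) / len(values)
        if metric == "count_nonzero":
            return sum(1 for v in values if v != 0)
        raise ValueError(f"unsupported metric: {metric}")

    layers = {}
    for metric in funcs:
        # zip(*rows) transposes the rows of one group into columns.
        layers[metric] = [
            [column_metric(col, metric) for col in zip(*grouped[g])]
            for g in groups
        ]
    return groups, layers


# Three observations, two variables, two groups (toy data).
rows = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
labels = ["B", "B", "T"]
groups, layers = aggregate_rows(rows, labels, ["mean", "count_nonzero"])
```

Each requested metric yields one groups-by-variables table, which mirrors how each metric ends up in its own layer of the aggregated object.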

adata AnnData

AnnData to be aggregated.

by str | Collection[str]

Key of the column, or keys of the columns, to group by.

func Union[Literal['count_nonzero', 'mean', 'sum', 'var'], Iterable[Literal['count_nonzero', 'mean', 'sum', 'var']]]

How to aggregate.

axis Optional[Literal['obs', 0, 'var', 1]] (default: None)

Axis on which to find the group-by column.

mask ndarray[Any, dtype[bool]] | str | None (default: None)

Boolean mask (or key to column containing mask) to apply along the axis.

dof int (default: 1)

Degrees of freedom for variance.

layer str | None (default: None)

If not None, key in adata.layers of the data to aggregate.

obsm str | None (default: None)

If not None, key in adata.obsm of the data to aggregate.

varm str | None (default: None)

If not None, key in adata.varm of the data to aggregate.

Return type:

AnnData

Returns:

Aggregated AnnData.
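The effect of dof on the 'var' metric can be sketched in plain Python. This assumes dof acts as a delta degrees of freedom, that is, the sum of squared deviations is divided by n - dof, in the same way as ddof in NumPy. `group_variance` is a hypothetical helper, and raising an error when a group has too few observations is a choice made for this sketch, not documented scanpy behavior.

```python
def group_variance(values, dof=1):
    """Variance of one group with divisor n - dof (illustrative only)."""
    n = len(values)
    if n - dof <= 0:
        # Assumption of this sketch: refuse groups that are too small.
        raise ValueError("need more observations than dof")
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - dof)


values = [1.0, 2.0, 3.0, 4.0]
sample_var = group_variance(values, dof=1)      # divisor n - 1
population_var = group_variance(values, dof=0)  # divisor n
```

With the default dof=1 the result is the unbiased sample variance. Under the same assumption, dof=0 gives the population variance, which is smaller for the same data.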


Calculating mean expression and number of nonzero entries per cluster:

>>> import scanpy as sc, pandas as pd
>>> pbmc = sc.datasets.pbmc3k_processed().raw.to_adata()
>>> pbmc.shape
(2638, 13714)
>>> aggregated = sc.get.aggregate(pbmc, by="louvain", func=["mean", "count_nonzero"])
>>> aggregated
AnnData object with n_obs × n_vars = 8 × 13714
    obs: 'louvain'
    var: 'n_cells'
    layers: 'mean', 'count_nonzero'

We can group over multiple columns:

>>> pbmc.obs["percent_mito_binned"] = pd.cut(pbmc.obs["percent_mito"], bins=5)
>>> sc.get.aggregate(pbmc, by=["louvain", "percent_mito_binned"], func=["mean", "count_nonzero"])
AnnData object with n_obs × n_vars = 40 × 13714
    obs: 'louvain', 'percent_mito_binned'
    var: 'n_cells'
    layers: 'mean', 'count_nonzero'

Note that this filters out any combination of groups that wasn’t present in the original data.
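The difference between observed combinations and the full cross product of categories can be sketched without scanpy. The labels below are made up for illustration and the variable names are hypothetical. The sketch shows only the bookkeeping and makes no claim about how scanpy implements it.

```python
from collections import Counter
from itertools import product

# Toy per-cell annotations (hypothetical values).
louvain = ["B", "B", "T", "T", "NK"]
mito_bin = ["low", "high", "low", "low", "high"]

# Combinations that actually occur, with the number of cells in each.
observed = Counter(zip(louvain, mito_bin))

# Every combination the two groupings could form in principle.
all_pairs = set(product(set(louvain), set(mito_bin)))

# Combinations with no cells: these would have no row in the result.
missing = all_pairs - set(observed)
```

Here four of the six possible combinations occur, so a grouped result restricted to observed combinations would have four rows rather than six.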