scanpy.tl.marker_gene_overlap

scanpy.tl.marker_gene_overlap#

scanpy.tl.marker_gene_overlap(adata, reference_markers, *, key='rank_genes_groups', method='overlap_count', normalize=None, top_n_markers=None, adj_pval_threshold=None, key_added='marker_gene_overlap', inplace=False)[source]#

Calculate an overlap score between data-derived marker genes and provided markers

Marker gene overlap scores can be quoted as overlap counts, overlap coefficients, or jaccard indices. The method returns a pandas dataframe which can be used to annotate clusters based on marker gene overlaps.

This function was written by Malte Luecken.

Parameters:
adata AnnData

The annotated data matrix.

reference_markers dict[str, set] | dict[str, list]

A marker gene dictionary object. Keys should be strings with the cell identity name and values are sets or lists of strings which match format of adata.var_name.

key str (default: 'rank_genes_groups')

The key in adata.uns where the rank_genes_groups output is stored. By default this is 'rank_genes_groups'.

method Literal['overlap_count', 'overlap_coef', 'jaccard'] (default: 'overlap_count')

(default: overlap_count) Method to calculate marker gene overlap. 'overlap_count' uses the intersection of the gene set, 'overlap_coef' uses the overlap coefficient, and 'jaccard' uses the Jaccard index.

normalize Optional[Literal['reference', 'data']] (default: None)

Normalization option for the marker gene overlap output. This parameter can only be set when method is set to 'overlap_count'. 'reference' normalizes the data by the total number of marker genes given in the reference annotation per group. 'data' normalizes the data by the total number of marker genes used for each cluster.

top_n_markers int | None (default: None)

The number of top data-derived marker genes to use. By default the top 100 marker genes are used. If adj_pval_threshold is set along with top_n_markers, then adj_pval_threshold is ignored.

adj_pval_threshold float | None (default: None)

A significance threshold on the adjusted p-values to select marker genes. This can only be used when adjusted p-values are calculated by sc.tl.rank_genes_groups(). If adj_pval_threshold is set along with top_n_markers, then adj_pval_threshold is ignored.

key_added str (default: 'marker_gene_overlap')

Name of the .uns field that will contain the marker overlap scores.

inplace bool (default: False)

Return a marker gene dataframe or store it inplace in adata.uns.

Returns:

Returns pandas.DataFrame if inplace=False, else returns an AnnData object where it sets the following field:

adata.uns[key_added]pandas.DataFrame (dtype float)

Marker gene overlap scores. Default for key_added is 'marker_gene_overlap'.

Examples

>>> import scanpy as sc
>>> adata = sc.datasets.pbmc68k_reduced()
>>> sc.pp.pca(adata, svd_solver='arpack')
>>> sc.pp.neighbors(adata)
>>> sc.tl.leiden(adata)
>>> sc.tl.rank_genes_groups(adata, groupby='leiden')
>>> marker_genes = {
...     'CD4 T cells': {'IL7R'},
...     'CD14+ Monocytes': {'CD14', 'LYZ'},
...     'B cells': {'MS4A1'},
...     'CD8 T cells': {'CD8A'},
...     'NK cells': {'GNLY', 'NKG7'},
...     'FCGR3A+ Monocytes': {'FCGR3A', 'MS4A7'},
...     'Dendritic Cells': {'FCER1A', 'CST3'},
...     'Megakaryocytes': {'PPBP'}
... }
>>> marker_matches = sc.tl.marker_gene_overlap(adata, marker_genes)