scanpy.tl.marker_gene_overlap
- scanpy.tl.marker_gene_overlap(adata, reference_markers, *, key='rank_genes_groups', method='overlap_count', normalize=None, top_n_markers=None, adj_pval_threshold=None, key_added='marker_gene_overlap', inplace=False)
Calculate an overlap score between data-deriven marker genes and provided markers
Marker gene overlap scores can be quoted as overlap counts, overlap coefficients, or jaccard indices. The method returns a pandas dataframe which can be used to annotate clusters based on marker gene overlaps.
This function was written by Malte Luecken.
- Parameters:
- adata :
AnnData
The annotated data matrix.
- reference_markers :
Union
[Dict
[str
,set
],Dict
[str
,list
]] A marker gene dictionary object. Keys should be strings with the cell identity name and values are sets or lists of strings which match format of
adata.var_name
.- key :
str
(default:'rank_genes_groups'
) The key in
adata.uns
where the rank_genes_groups output is stored. By default this is'rank_genes_groups'
.- method :
Literal
['overlap_count'
,'overlap_coef'
,'jaccard'
] (default:'overlap_count'
) (default:
overlap_count
) Method to calculate marker gene overlap.'overlap_count'
uses the intersection of the gene set,'overlap_coef'
uses the overlap coefficient, and'jaccard'
uses the Jaccard index.- normalize :
Optional
[Literal
['reference'
,'data'
]] (default:None
) Normalization option for the marker gene overlap output. This parameter can only be set when
method
is set to'overlap_count'
.'reference'
normalizes the data by the total number of marker genes given in the reference annotation per group.'data'
normalizes the data by the total number of marker genes used for each cluster.- top_n_markers :
Optional
[int
] (default:None
) The number of top data-derived marker genes to use. By default the top 100 marker genes are used. If
adj_pval_threshold
is set along withtop_n_markers
, thenadj_pval_threshold
is ignored.- adj_pval_threshold :
Optional
[float
] (default:None
) A significance threshold on the adjusted p-values to select marker genes. This can only be used when adjusted p-values are calculated by
sc.tl.rank_genes_groups()
. Ifadj_pval_threshold
is set along withtop_n_markers
, thenadj_pval_threshold
is ignored.- key_added :
str
(default:'marker_gene_overlap'
) Name of the
.uns
field that will contain the marker overlap scores.- inplace :
bool
(default:False
) Return a marker gene dataframe or store it inplace in
adata.uns
.
- adata :
- Returns:
: A pandas dataframe with the marker gene overlap scores if
inplace=False
. Forinplace=True
adata.uns
is updated with an additional field specified by thekey_added
parameter (default = ‘marker_gene_overlap’).
Examples
>>> import scanpy as sc >>> adata = sc.datasets.pbmc68k_reduced() >>> sc.pp.pca(adata, svd_solver='arpack') >>> sc.pp.neighbors(adata) >>> sc.tl.louvain(adata) >>> sc.tl.rank_genes_groups(adata, groupby='louvain') >>> marker_genes = { ... 'CD4 T cells': {'IL7R'}, ... 'CD14+ Monocytes': {'CD14', 'LYZ'}, ... 'B cells': {'MS4A1'}, ... 'CD8 T cells': {'CD8A'}, ... 'NK cells': {'GNLY', 'NKG7'}, ... 'FCGR3A+ Monocytes': {'FCGR3A', 'MS4A7'}, ... 'Dendritic Cells': {'FCER1A', 'CST3'}, ... 'Megakaryocytes': {'PPBP'} ... } >>> marker_matches = sc.tl.marker_gene_overlap(adata, marker_genes)