scanpy.tl.marker_gene_overlap
- scanpy.tl.marker_gene_overlap(adata, reference_markers, *, key='rank_genes_groups', method='overlap_count', normalize=None, top_n_markers=None, adj_pval_threshold=None, key_added='marker_gene_overlap', inplace=False)
Calculate an overlap score between data-deriven marker genes and provided markers
Marker gene overlap scores can be quoted as overlap counts, overlap coefficients, or jaccard indices. The method returns a pandas dataframe which can be used to annotate clusters based on marker gene overlaps.
This function was written by Malte Luecken.
- Parameters:
- adata :
AnnData The annotated data matrix.
- reference_markers :
Union[Dict[str,set],Dict[str,list]] A marker gene dictionary object. Keys should be strings with the cell identity name and values are sets or lists of strings which match format of
adata.var_name.- key :
str(default:'rank_genes_groups') The key in
adata.unswhere the rank_genes_groups output is stored. By default this is'rank_genes_groups'.- method :
Literal['overlap_count','overlap_coef','jaccard'] (default:'overlap_count') (default:
overlap_count) Method to calculate marker gene overlap.'overlap_count'uses the intersection of the gene set,'overlap_coef'uses the overlap coefficient, and'jaccard'uses the Jaccard index.- normalize :
Optional[Literal['reference','data']] (default:None) Normalization option for the marker gene overlap output. This parameter can only be set when
methodis set to'overlap_count'.'reference'normalizes the data by the total number of marker genes given in the reference annotation per group.'data'normalizes the data by the total number of marker genes used for each cluster.- top_n_markers :
Optional[int] (default:None) The number of top data-derived marker genes to use. By default the top 100 marker genes are used. If
adj_pval_thresholdis set along withtop_n_markers, thenadj_pval_thresholdis ignored.- adj_pval_threshold :
Optional[float] (default:None) A significance threshold on the adjusted p-values to select marker genes. This can only be used when adjusted p-values are calculated by
sc.tl.rank_genes_groups(). Ifadj_pval_thresholdis set along withtop_n_markers, thenadj_pval_thresholdis ignored.- key_added :
str(default:'marker_gene_overlap') Name of the
.unsfield that will contain the marker overlap scores.- inplace :
bool(default:False) Return a marker gene dataframe or store it inplace in
adata.uns.
- adata :
- Returns:
: A pandas dataframe with the marker gene overlap scores if
inplace=False. Forinplace=Trueadata.unsis updated with an additional field specified by thekey_addedparameter (default = ‘marker_gene_overlap’).
Examples
>>> import scanpy as sc >>> adata = sc.datasets.pbmc68k_reduced() >>> sc.pp.pca(adata, svd_solver='arpack') >>> sc.pp.neighbors(adata) >>> sc.tl.louvain(adata) >>> sc.tl.rank_genes_groups(adata, groupby='louvain') >>> marker_genes = { ... 'CD4 T cells': {'IL7R'}, ... 'CD14+ Monocytes': {'CD14', 'LYZ'}, ... 'B cells': {'MS4A1'}, ... 'CD8 T cells': {'CD8A'}, ... 'NK cells': {'GNLY', 'NKG7'}, ... 'FCGR3A+ Monocytes': {'FCGR3A', 'MS4A7'}, ... 'Dendritic Cells': {'FCER1A', 'CST3'}, ... 'Megakaryocytes': {'PPBP'} ... } >>> marker_matches = sc.tl.marker_gene_overlap(adata, marker_genes)