scanpy.pp.filter_genes_dispersion
- scanpy.pp.filter_genes_dispersion(data, *, flavor='seurat', min_disp=None, max_disp=None, min_mean=None, max_mean=None, n_bins=20, n_top_genes=None, log=True, subset=True, copy=False)
Extract highly variable genes [Satija et al., 2015, Zheng et al., 2017].
Deprecated since version 1.3.6: Use highly_variable_genes() instead. The new function is equivalent to the present function, except that
- the new function always expects logarithmized data
- subset=False in the new function: it suffices to merely annotate the genes, and tools like pp.pca will detect the annotation
- you can now call sc.pl.highly_variable_genes(adata)
- copy is replaced by inplace
If trying out parameters, pass the data matrix instead of AnnData.
Depending on flavor, this reproduces the R implementations of Seurat [Satija et al., 2015] and Cell Ranger [Zheng et al., 2017].
The normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions for genes falling into a given bin for mean expression of genes. This means that for each bin of mean expression, highly variable genes are selected.
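The per-bin scaling described above can be sketched in plain NumPy. This is a minimal illustration, not scanpy's implementation: the helper name `normalized_dispersion`, the equal-width binning, the sample standard deviation (`ddof=1`), and the handling of zero spread are all assumptions made for the example. The rule that a single-gene bin gets the value 1 follows the `n_bins` description on this page.

```python
import numpy as np


def normalized_dispersion(means, dispersions, n_bins=20):
    """Z-score each gene's dispersion within its mean-expression bin (sketch)."""
    means = np.asarray(means, dtype=float)
    dispersions = np.asarray(dispersions, dtype=float)

    # Assumption: equal-width bins spanning the observed range of the means.
    edges = np.linspace(means.min(), means.max(), n_bins + 1)
    # Digitizing against interior edges only gives indices 0 .. n_bins - 1.
    bin_idx = np.digitize(means, edges[1:-1])

    normed = np.empty_like(dispersions)
    for b in np.unique(bin_idx):
        in_bin = bin_idx == b
        if in_bin.sum() == 1:
            # A single-gene bin has no spread; set the value to 1 as documented.
            normed[in_bin] = 1.0
            continue
        mu = dispersions[in_bin].mean()
        sd = dispersions[in_bin].std(ddof=1)
        if sd == 0:
            # Assumption of this sketch: identical dispersions map to 0.
            normed[in_bin] = 0.0
        else:
            normed[in_bin] = (dispersions[in_bin] - mu) / sd
    return normed


# Toy input: six genes, three bins, one of which holds a single gene.
toy_means = [0.0, 1.0, 2.0, 5.0, 9.0, 10.0]
toy_disp = [1.0, 2.0, 3.0, 7.0, 4.0, 6.0]
toy_norm = normalized_dispersion(toy_means, toy_disp, n_bins=3)
```

Because each gene is compared only with genes of similar mean expression, a gene can rank as highly variable in a low-expression bin even if its raw dispersion is smaller than that of genes in other bins.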
Use flavor='cell_ranger' with care and in the same way as in recipe_zheng17().
- Parameters:
- data
AnnData | spmatrix | ndarray
The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- flavor
Literal['seurat', 'cell_ranger'] (default: 'seurat')
Choose the flavor for computing normalized dispersion. If choosing 'seurat', this expects non-logarithmized data; the logarithm of mean and dispersion is taken internally when log is at its default value True. For 'cell_ranger', this is usually called for logarithmized data; in this case you should set log to False. In their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.
- min_mean
float | None (default: None)
- max_mean
float | None (default: None)
- min_disp
float | None (default: None)
- max_disp
float | None (default: None)
If n_top_genes is not None, these cutoffs for the means and the normalized dispersions are ignored.
- n_bins
int (default: 20)
Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1. You'll be informed about this if you set settings.verbosity = 4.
- n_top_genes
int | None (default: None)
Number of highly variable genes to keep.
- log
bool (default: True)
Use the logarithm of the mean-to-variance ratio.
- subset
bool (default: True)
If True, keep highly variable genes only. Otherwise, write a boolean array marking the highly variable genes while keeping all genes.
- copy
bool (default: False)
If an AnnData is passed, determines whether a copy is returned.
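The interaction between the cutoffs and n_top_genes can be sketched as follows. This is a hedged NumPy illustration of the rule stated above (cutoffs are ignored once n_top_genes is given), not the library's code: `select_genes` is a hypothetical helper, its unbounded defaults are placeholders rather than scanpy's internal defaults, and the threshold values in the usage lines are made up.

```python
import numpy as np


def select_genes(means, disp_norm, *, n_top_genes=None,
                 min_mean=-np.inf, max_mean=np.inf,
                 min_disp=-np.inf, max_disp=np.inf):
    """Return a boolean mask of selected genes (illustrative sketch)."""
    means = np.asarray(means, dtype=float)
    disp_norm = np.asarray(disp_norm, dtype=float)

    if n_top_genes is not None:
        # Rank genes by normalized dispersion; all cutoffs are ignored here.
        order = np.argsort(disp_norm)[::-1]
        mask = np.zeros(disp_norm.shape, dtype=bool)
        mask[order[:n_top_genes]] = True
        return mask

    # Otherwise keep genes whose mean and normalized dispersion lie
    # strictly inside the given cutoffs.
    return ((means > min_mean) & (means < max_mean)
            & (disp_norm > min_disp) & (disp_norm < max_disp))


# Made-up values for four genes.
toy_means = [0.01, 0.5, 1.0, 4.0]
toy_disp_norm = [2.0, 0.1, 1.5, 3.0]

by_cutoffs = select_genes(toy_means, toy_disp_norm,
                          min_mean=0.0125, max_mean=3.0, min_disp=0.5)
by_rank = select_genes(toy_means, toy_disp_norm, n_top_genes=2,
                       min_mean=0.0125, max_mean=3.0, min_disp=0.5)
```

In this toy case the cutoff route keeps only the third gene, while the ranking route keeps the two genes with the largest normalized dispersion even though both fall outside the mean cutoffs.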
- Return type:
- Returns:
If an AnnData adata is passed, returns or updates adata depending on copy. It filters the adata and adds the annotations
- means
adata.var
Means per gene. Logarithmized when log is True.
- dispersions
adata.var
Dispersions per gene. Logarithmized when log is True.
- dispersions_norm
adata.var
Normalized dispersions per gene. Logarithmized when log is True.
If a data matrix X is passed, the annotation is returned as np.recarray with the same information stored in fields: gene_subset, means, dispersions, dispersion_norm.
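When a plain matrix is passed, the returned record array can be used to subset the matrix by hand. The sketch below does not call scanpy: it builds a small stand-in record array with only the gene_subset and means fields, using made-up values, to show how a boolean field of a np.recarray indexes the gene axis.

```python
import numpy as np

# Toy matrix: 3 cells x 4 genes (made-up values).
X = np.array([[1.0, 0.0, 3.0, 2.0],
              [0.0, 0.0, 4.0, 1.0],
              [2.0, 1.0, 5.0, 0.0]])

# Hand-built stand-in for the returned annotation; a real result would
# carry the remaining fields listed above as well.
result = np.rec.fromarrays(
    [np.array([True, False, True, False]), X.mean(axis=0)],
    names=["gene_subset", "means"],
)

# Fields of a recarray are available as attributes; the boolean field
# selects the gene columns to keep.
X_hvg = X[:, result.gene_subset]
```

Keeping the selection as a separate mask makes it easy to try several parameter settings on the same matrix before committing to a filtered AnnData.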