scanpy.experimental.pp.highly_variable_genes
- scanpy.experimental.pp.highly_variable_genes(adata, *, theta=100, clip=None, n_top_genes=None, batch_key=None, chunksize=1000, flavor='pearson_residuals', check_values=True, layer=None, subset=False, inplace=True)
Select highly variable genes using analytic Pearson residuals [Lause et al., 2021].
In Lause et al. [2021], Pearson residuals of a negative binomial offset model are computed (with overdispersion theta shared across genes). By default, overdispersion theta=100 is used and residuals are clipped to sqrt(n_obs). Finally, genes are ranked by residual variance.
Expects raw count input.
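The computation described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not scanpy's implementation: the helper name pearson_residual_variance is hypothetical, the expected counts are taken as the product of cell and gene totals divided by the grand total, the use of the sample variance (ddof=1) is an assumption, and every gene is assumed to have at least one count.

```python
import numpy as np


def pearson_residual_variance(counts, theta=100.0, clip=None):
    """Per-gene variance of clipped Pearson residuals (illustrative sketch).

    counts: array of raw counts with shape (n_obs, n_vars).
    """
    counts = np.asarray(counts, dtype=float)
    n_obs = counts.shape[0]
    if clip is None:
        # Default described in the text: clip to [-sqrt(n_obs), sqrt(n_obs)].
        clip = np.sqrt(n_obs)

    # Expected counts under the offset model (assumed form):
    # mu_ij = (total of cell i) * (total of gene j) / (grand total).
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    mu = cell_totals * gene_totals / counts.sum()

    # Negative binomial variance: var = mean + mean^2 / theta.
    # With theta=np.inf the second term vanishes (Poisson model).
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    residuals = np.clip(residuals, -clip, clip)
    return residuals.var(axis=0, ddof=1)


# Synthetic counts: 500 cells, 20 genes, with gene 3 elevated in 50 cells.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(500, 20))
counts[:50, 3] += 40

res_var = pearson_residual_variance(counts)
# Genes ordered from highest to lowest residual variance.
ranking = np.argsort(res_var)[::-1]
```

Here the gene with the injected subpopulation signal receives the largest residual variance and therefore heads the ranking; keeping the first n_top_genes entries of the ranking corresponds to the selection step.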
- Parameters:
- adata
AnnData
The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- theta
float (default: 100)
The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
- clip
float | None (default: None)
Determines if and how residuals are clipped:
If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).
If a scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
- n_top_genes
int | None (default: None)
Number of highly-variable genes to keep. Mandatory if flavor='seurat_v3' or flavor='pearson_residuals'.
- batch_key
str | None (default: None)
If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. Genes are first sorted by the number of batches in which they are an HVG. If flavor='pearson_residuals', ties are broken by the median rank (across batches) based on within-batch residual variance.
- chunksize
int (default: 1000)
If flavor='pearson_residuals', this determines how many genes are processed at once while computing the residual variance. Choosing a smaller value will reduce the required memory.
- flavor
Literal['pearson_residuals'] (default: 'pearson_residuals')
Choose the flavor for identifying highly variable genes. In this experimental version, only 'pearson_residuals' is functional.
- check_values
bool (default: True)
If True, checks whether counts in the selected layer are integers, as expected by this function, and issues a warning if non-integers are found. Otherwise, proceeds without checking. Setting this to False can speed up code for large datasets.
- layer
str | None (default: None)
Layer to use as input instead of X. If None, X is used.
- subset
bool (default: False)
If True, subset the data to highly-variable genes after finding them. Otherwise merely indicate highly variable genes in adata.var (see below).
- inplace
bool (default: True)
If True, update adata with results. Otherwise, return results. See below for details of what is returned.
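The merging rule described for batch_key can be sketched as follows. This is an illustration only: the function name merge_batch_rankings is hypothetical, the input is assumed to be a precomputed array of within-batch residual variances, and the median rank is taken over each gene's rank in every batch, which may differ from how the library treats genes outside a batch's top set.

```python
import numpy as np


def merge_batch_rankings(residual_variances, n_top_genes):
    """Combine per-batch rankings into one selection (illustrative sketch).

    residual_variances: array of shape (n_batches, n_genes).
    Returns (selected, n_batches_hvg, median_rank).
    """
    rv = np.asarray(residual_variances, dtype=float)

    # Within-batch rank of each gene; rank 0 is the highest residual variance.
    ranks = np.argsort(np.argsort(-rv, axis=1), axis=1)

    # A gene counts as an HVG in a batch if it is in that batch's top set.
    is_hvg = ranks < n_top_genes
    n_batches_hvg = is_hvg.sum(axis=0)

    # Tie-breaker: median rank across batches (assumption: all batches count).
    median_rank = np.median(ranks, axis=0)

    # np.lexsort uses the LAST key as the primary key:
    # more batches first, then smaller median rank.
    order = np.lexsort((median_rank, -n_batches_hvg))

    selected = np.zeros(rv.shape[1], dtype=bool)
    selected[order[:n_top_genes]] = True
    return selected, n_batches_hvg, median_rank


# Three batches, four genes, keep the top two genes.
rv = [
    [5.0, 4.0, 1.0, 0.0],
    [5.0, 1.0, 4.0, 0.0],
    [4.0, 5.0, 1.0, 0.0],
]
selected, n_batches_hvg, median_rank = merge_batch_rankings(rv, n_top_genes=2)
```

In this toy input, gene 0 is in the top two of all three batches and gene 1 in two of them, so those two genes are selected.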
- Return type:
DataFrame | None
- Returns:
If inplace=True, adata.var is updated with the following fields. Otherwise, returns the same fields as a DataFrame.
- highly_variable
bool
boolean indicator of highly-variable genes.
- means
float
means per gene.
- variances
float
variance per gene.
- residual_variances
float
For flavor='pearson_residuals', residual variance per gene. Averaged in the case of multiple batches.
- highly_variable_rank
float
For flavor='pearson_residuals', rank of the gene according to residual variance; median rank in the case of multiple batches.
- highly_variable_nbatches
int
If batch_key is given, denotes in how many batches genes are detected as HVG.
- highly_variable_intersection
bool
If batch_key is given, denotes the genes that are highly variable in all batches.
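As a usage sketch, the fields above can be obtained as a table without modifying adata by passing inplace=False. The wrapper name hvg_table and the default of 2000 genes are illustrative choices, not part of the library; scanpy is imported inside the function so the definition itself does not require it.

```python
def hvg_table(adata, n_top_genes=2000, batch_key=None):
    """Return the selected genes as a DataFrame, best-ranked first (sketch).

    adata is expected to hold raw counts, as the function requires.
    """
    # Imported lazily; requires scanpy to be installed when the function runs.
    import scanpy as sc

    # With inplace=False the per-gene fields are returned instead of
    # being written to adata.var.
    df = sc.experimental.pp.highly_variable_genes(
        adata,
        n_top_genes=n_top_genes,
        batch_key=batch_key,
        inplace=False,
    )

    # Keep only the selected genes and order them by their rank.
    return df[df["highly_variable"]].sort_values("highly_variable_rank")
```

Calling hvg_table(adata, n_top_genes=2000) on an AnnData object with raw counts would then give one row per selected gene, with the means, variances and residual_variances columns described above.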
Notes
Experimental version of sc.pp.highly_variable_genes().