scanpy.experimental.pp.highly_variable_genes
- scanpy.experimental.pp.highly_variable_genes(adata, *, theta=100, clip=None, n_top_genes=None, batch_key=None, chunksize=1000, flavor='pearson_residuals', check_values=True, layer=None, subset=False, inplace=True)
- Select highly variable genes using analytic Pearson residuals [Lause et al., 2021].
- In Lause et al. [2021], Pearson residuals of a negative binomial offset model are computed (with overdispersion theta shared across genes). By default, overdispersion theta=100 is used and residuals are clipped to sqrt(n_obs). Finally, genes are ranked by residual variance.
- Expects raw count input.
- Parameters:
- adata AnnData
- The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
- theta float (default: 100)
- The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
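To make the role of theta concrete, the following minimal sketch evaluates the variance formula quoted above for a few values of theta. The helper name nb_variance is hypothetical and is not part of scanpy.

```python
import math


def nb_variance(mean: float, theta: float) -> float:
    """Variance under the negative binomial model: mean + mean^2 / theta."""
    # For theta = inf the second term is zero, leaving the Poisson variance.
    return mean + mean**2 / theta


strong = nb_variance(10.0, 10.0)        # small theta: strong overdispersion
default = nb_variance(10.0, 100.0)      # the default theta=100
poisson = nb_variance(10.0, math.inf)   # Poisson limit: variance equals mean
```

For a fixed mean, the modelled variance shrinks toward the mean as theta grows, which is why larger theta values mean less overdispersion.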
- clip float | None (default: None)
- Determines if and how residuals are clipped:
- If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).
- If any scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
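The clipping rules above can be illustrated with a short NumPy sketch. The function clip_residuals and the toy residual values are hypothetical; the sketch only mirrors the documented behavior and is not scanpy's own code.

```python
import numpy as np


def clip_residuals(residuals, n_obs, clip=None):
    """Clip residuals to [-clip, clip]; default bound is sqrt(n_obs)."""
    if clip is None:
        clip = np.sqrt(n_obs)  # documented default behavior
    if clip < 0:
        # Assumption of this sketch: a negative bound is treated as an error.
        raise ValueError("clip must be non-negative or None")
    return np.clip(residuals, -clip, clip)


toy = np.array([-30.0, -2.0, 0.5, 12.0, 50.0])  # made-up residuals

default_clip = clip_residuals(toy, n_obs=100)            # bound = sqrt(100) = 10
custom_clip = clip_residuals(toy, n_obs=100, clip=5.0)   # bound = 5
no_clip = clip_residuals(toy, n_obs=100, clip=np.inf)    # values unchanged
```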
 
- n_top_genes int | None (default: None)
- Number of highly-variable genes to keep. Mandatory if flavor='seurat_v3' or flavor='pearson_residuals'.
- batch_key str | None (default: None)
- If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. Genes are first sorted by the number of batches in which they are an HVG. If flavor='pearson_residuals', ties are broken by the median rank (across batches) based on within-batch residual variance.
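The merging rule described for batch_key can be sketched as follows: rank genes within each batch, count in how many batches each gene is selected, then sort by that count with the median rank as tie-breaker. This is a simplified illustration under stated assumptions. The function name merge_batch_rankings and the toy variances are hypothetical, and scanpy's implementation may differ in details such as how ranks outside the per-batch top set are handled.

```python
import numpy as np


def merge_batch_rankings(residual_vars, n_top_genes):
    """residual_vars: array of shape (n_batches, n_genes)."""
    residual_vars = np.asarray(residual_vars, dtype=float)
    n_batches, n_genes = residual_vars.shape

    # Within each batch, rank 0 is the gene with the largest residual variance.
    order = np.argsort(-residual_vars, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(n_batches)[:, None]
    ranks[rows, order] = np.arange(n_genes)

    is_hvg = ranks < n_top_genes            # per-batch selection
    nbatches = is_hvg.sum(axis=0)           # in how many batches a gene is an HVG
    median_rank = np.median(ranks, axis=0)  # tie-breaker across batches

    # lexsort uses the LAST key as primary: most batches first, then median rank.
    final_order = np.lexsort((median_rank, -nbatches))
    selected = np.zeros(n_genes, dtype=bool)
    selected[final_order[:n_top_genes]] = True
    return selected, nbatches, median_rank


# Made-up residual variances for 3 batches and 4 genes.
toy_vars = [
    [5.0, 4.0, 1.0, 0.5],
    [4.0, 5.0, 1.0, 0.5],
    [5.0, 1.0, 4.0, 0.5],
]
selected, nbatches, median_rank = merge_batch_rankings(toy_vars, n_top_genes=2)
```

In this toy input the third gene is a top gene in only one batch, so it loses to the two genes that are selected in more batches.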
- chunksize int (default: 1000)
- If flavor='pearson_residuals', this determines how many genes are processed at once while computing the residual variance. Choosing a smaller value will reduce the required memory.
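The memory saving comes from never holding the full residual matrix at once. A rough sketch of the idea, assuming a dense count array and a hypothetical helper name, is to loop over blocks of genes and keep only the per-gene variances:

```python
import numpy as np


def residual_variance_chunked(counts, theta=100.0, chunksize=1000):
    """Per-gene variance of clipped Pearson residuals, computed in gene chunks."""
    counts = np.asarray(counts, dtype=float)
    n_obs, n_vars = counts.shape
    clip = np.sqrt(n_obs)

    row_sums = counts.sum(axis=1, keepdims=True)  # per-cell totals
    col_sums = counts.sum(axis=0, keepdims=True)  # per-gene totals
    total = counts.sum()

    out = np.empty(n_vars)
    for start in range(0, n_vars, chunksize):
        stop = min(start + chunksize, n_vars)
        # Only an (n_obs, chunk) block of expected counts and residuals is alive here.
        mu = row_sums @ col_sums[:, start:stop] / total
        res = (counts[:, start:stop] - mu) / np.sqrt(mu + mu**2 / theta)
        out[start:stop] = np.clip(res, -clip, clip).var(axis=0)
    return out


rng = np.random.default_rng(0)
toy_counts = rng.poisson(3.0, size=(50, 10))  # synthetic counts for illustration

small_chunks = residual_variance_chunked(toy_counts, chunksize=3)
one_chunk = residual_variance_chunked(toy_counts, chunksize=10_000)
```

Both calls give the same variances; only the peak size of the intermediate arrays changes.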
- flavor Literal['pearson_residuals'] (default: 'pearson_residuals')
- Choose the flavor for identifying highly variable genes. In this experimental version, only 'pearson_residuals' is functional.
- check_values bool (default: True)
- If True, checks whether counts in the selected layer are integers, as expected by this function, and issues a warning if non-integers are found. Otherwise, proceeds without checking. Setting this to False can speed up code for large datasets.
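A check of this kind might look like the sketch below. The function warn_if_not_counts is hypothetical, and the non-negativity test is an assumption of this sketch rather than something stated on this page. Because the check scans the whole matrix, skipping it can save time on large inputs.

```python
import warnings

import numpy as np


def warn_if_not_counts(values) -> bool:
    """Return True if all values look like raw counts; warn otherwise."""
    values = np.asarray(values)
    # Assumption: raw counts are non-negative and have no fractional part.
    looks_like_counts = bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))
    if not looks_like_counts:
        warnings.warn(
            "Input does not look like raw counts (non-integer or negative values).",
            UserWarning,
        )
    return looks_like_counts


raw = np.array([[0, 2], [3, 1]])
raw_ok = warn_if_not_counts(raw)  # integer counts pass silently
```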
- layer str | None (default: None)
- Layer to use as input instead of X. If None, X is used.
- subset bool (default: False)
- If True, subset the data to highly-variable genes after finding them. Otherwise merely indicate highly variable genes in adata.var (see below).
- inplace bool (default: True)
- If True, update adata with results. Otherwise, return results. See below for details of what is returned.
 
- Return type:
- Returns:
- If inplace=True, adata.var is updated with the following fields. Otherwise, returns the same fields as a DataFrame.
- highly_variable bool
- Boolean indicator of highly-variable genes.
- means float
- Means per gene.
- variances float
- Variance per gene.
- residual_variances float
- For flavor='pearson_residuals', residual variance per gene. Averaged in the case of multiple batches.
- highly_variable_rank float
- For flavor='pearson_residuals', rank of the gene according to residual variance; median rank in the case of multiple batches.
- highly_variable_nbatches int
- If batch_key is given, denotes in how many batches genes are detected as HVG.
- highly_variable_intersection bool
- If batch_key is given, denotes the genes that are highly variable in all batches.
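As an illustration of how these fields might be consumed, the sketch below builds a small table with the same column names and lists the selected genes in rank order. All values and gene names are invented, and the NaN ranks for unselected genes are an assumption of this toy table, not a statement about scanpy's output.

```python
import pandas as pd

# Hand-made stand-in for adata.var (or the returned DataFrame).
var = pd.DataFrame(
    {
        "highly_variable": [True, False, True, False],
        "residual_variances": [4.2, 0.9, 7.5, 1.1],
        "highly_variable_rank": [1.0, float("nan"), 0.0, float("nan")],
    },
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
)

# Keep the flagged genes and order them from best rank to worst.
hvg = var[var["highly_variable"]].sort_values("highly_variable_rank")
top_genes = hvg.index.tolist()
```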
 
- Notes
- Experimental version of sc.pp.highly_variable_genes()
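The procedure summarized at the top of this page (compute Pearson residuals, clip them, rank genes by residual variance) can be sketched in plain NumPy. This is an illustration under stated assumptions, not scanpy's implementation: the function name is hypothetical, the input is assumed to be a dense array of raw counts, and the expected counts use the offset model mu = (cell total × gene total) / grand total.

```python
import numpy as np


def pearson_residual_variance(counts, theta=100.0, clip=None):
    """Variance per gene of clipped analytic Pearson residuals."""
    counts = np.asarray(counts, dtype=float)
    n_obs = counts.shape[0]
    if clip is None:
        clip = np.sqrt(n_obs)

    # Expected counts under the offset model.
    row_sums = counts.sum(axis=1, keepdims=True)
    col_sums = counts.sum(axis=0, keepdims=True)
    mu = row_sums @ col_sums / counts.sum()

    # Negative binomial variance: mu + mu^2 / theta.
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    residuals = np.clip(residuals, -clip, clip)
    return residuals.var(axis=0)


# Synthetic data: 200 cells, 20 genes, with gene 3 expressed differently
# in the two halves of the cells.
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(200, 20))
counts[:100, 3] = rng.poisson(20.0, size=100)
counts[100:, 3] = rng.poisson(1.0, size=100)

res_var = pearson_residual_variance(counts)
ranking = np.argsort(-res_var)  # genes ordered from most to least variable
```

Genes whose counts are well explained by sequencing depth alone end up with residual variance near one, while the gene with structured variation stands out at the top of the ranking.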