scrublet(adata, adata_sim=None, sim_doublet_ratio=2.0, expected_doublet_rate=0.05, stdev_doublet_rate=0.02, synthetic_doublet_umi_subsampling=1.0, knn_dist_metric='euclidean', normalize_variance=True, log_transform=False, mean_center=True, n_prin_comps=30, use_approx_neighbors=True, get_doublet_neighbor_parents=False, n_neighbors=None, threshold=None, verbose=True, copy=False, random_state=0)
Predict doublets using Scrublet [Wolock19].
Predict cell doublets using a nearest-neighbor classifier of observed transcriptomes and simulated doublets. Works best if the input is a raw (unnormalized) counts matrix from a single sample or a collection of similar samples from the same experiment. This function is a wrapper around functions that pre-process using Scanpy and directly call functions of Scrublet(). You may also undertake your own preprocessing, simulate doublets with scanpy.external.pp.scrublet_simulate_doublets(), and run the core scrublet function scanpy.external.pp.scrublet.scrublet().
More information and bug reports here.
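The nearest-neighbor idea described above can be illustrated with a toy sketch. This is not Scrublet's actual scoring formula: it only computes, for each observed point, the fraction of its k nearest neighbors that are simulated doublets. The function name `knn_doublet_fraction`, the 2-D toy data, and the use of midpoints as stand-ins for summed counts are all assumptions made for illustration.

```python
import math
import random


def knn_doublet_fraction(observed, simulated, k):
    """For each observed point, return the fraction of its k nearest
    neighbors (among observed + simulated points) that are simulated."""
    # Label every point: False = observed, True = simulated doublet.
    pool = [(p, False) for p in observed] + [(p, True) for p in simulated]
    scores = []
    for i, point in enumerate(observed):
        # Distances to every other point; index i is the point itself.
        dists = [
            (math.dist(point, other), is_sim)
            for j, (other, is_sim) in enumerate(pool)
            if j != i
        ]
        dists.sort(key=lambda pair: pair[0])
        nearest = dists[:k]
        scores.append(sum(is_sim for _, is_sim in nearest) / k)
    return scores


# Toy 2-D "embedding": two singlet clusters plus one observed point between them.
rng = random.Random(0)
cluster_a = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(20)]
cluster_b = [(rng.gauss(4, 0.1), rng.gauss(4, 0.1)) for _ in range(20)]
observed = cluster_a + cluster_b + [(2.0, 2.0)]

# Toy simulated doublets: midpoints of one cell from each cluster
# (a stand-in for adding the counts of two observed transcriptomes).
simulated = [
    ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    for a, b in zip(cluster_a[:10], cluster_b[:10])
]

scores = knn_doublet_fraction(observed, simulated, k=5)
```

In this toy setup the point lying between the clusters is surrounded by simulated doublets and scores high, while points inside either cluster score low.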
- adata :
The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes. Expected to be un-normalised where adata_sim is not supplied, in which case doublets will be simulated and pre-processing applied to both objects. If adata_sim is supplied, this should be the observed transcriptomes processed consistently (filtering, transform, normalisation, HVG selection) with adata_sim.
- adata_sim :
(Advanced use case) Optional AnnData object generated by sc.external.pp.scrublet_simulate_doublets(), with the same number of vars as adata. This should have been built from adata_obs after filtering genes and cells and selecting highly-variable genes.
- sim_doublet_ratio :
Number of doublets to simulate relative to the number of observed transcriptomes.
- expected_doublet_rate :
Where adata_sim is not supplied, the estimated doublet rate for the experiment.
- stdev_doublet_rate :
Where adata_sim is not supplied, the uncertainty in the expected doublet rate.
- synthetic_doublet_umi_subsampling :
Where adata_sim is not supplied, the rate for sampling UMIs when creating synthetic doublets. If 1.0, each doublet is created by simply adding the UMI counts from two randomly sampled observed transcriptomes. For values less than 1, the UMI counts are added and then randomly sampled at the specified rate.
- knn_dist_metric :
Distance metric used when finding nearest neighbors. For a list of valid values, see the documentation for annoy (if use_approx_neighbors is True) or sklearn.neighbors.NearestNeighbors (if use_approx_neighbors is False).
- normalize_variance :
If True, normalize the data such that each gene has a variance of 1. sklearn.decomposition.TruncatedSVD will be used for dimensionality reduction, unless mean_center is True.
- log_transform :
Whether to use scanpy.pp.log1p() to log-transform the data prior to PCA.
- mean_center :
If True, center the data such that each gene has a mean of 0. sklearn.decomposition.PCA will be used for dimensionality reduction.
- n_prin_comps :
Number of principal components used to embed the transcriptomes prior to k-nearest-neighbor graph construction.
- use_approx_neighbors :
Use approximate nearest neighbor method (annoy) for the KNN classifier.
- get_doublet_neighbor_parents :
If True, return (in .uns) the parent transcriptomes that generated the doublet neighbors of each observed transcriptome. This information can be used to infer the cell states that generated a given doublet state.
- n_neighbors :
Number of neighbors used to construct the KNN graph of observed transcriptomes and simulated doublets. If None, this is automatically set to np.round(0.5 * np.sqrt(n_obs)).
- threshold :
Doublet score threshold for calling a transcriptome a doublet. If None, this is set automatically by looking for the minimum between the two modes of the doublet_scores_sim_ histogram. It is best practice to check the threshold visually using the doublet_scores_sim_ histogram and/or based on co-localization of predicted doublets in a 2-D embedding.
- verbose :
If True, print progress updates.
- copy :
If True, return a copy of the input adata with Scrublet results added. Otherwise, Scrublet results are added in place.
- random_state :
Initial state for doublet simulation and nearest neighbors.
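A few of the parameters above describe simple arithmetic that can be sketched in plain Python. The helpers below are hypothetical and are not Scrublet's or Scanpy's code: `default_n_neighbors` mirrors the documented default for n_neighbors, `n_simulated` applies sim_doublet_ratio, and `simulate_doublet` adds two UMI count vectors and, for rates below 1, keeps each UMI independently with the given probability. That independent (binomial) thinning is an assumption about what "randomly sampled at the specified rate" means.

```python
import math
import random


def default_n_neighbors(n_obs):
    # Documented default: round(0.5 * sqrt(n_obs)).
    return int(round(0.5 * math.sqrt(n_obs)))


def n_simulated(n_obs, sim_doublet_ratio=2.0):
    # Number of doublets to simulate relative to the observed transcriptomes.
    return int(n_obs * sim_doublet_ratio)


def simulate_doublet(counts_a, counts_b, subsampling=1.0, rng=random):
    """Add two per-gene UMI count vectors, then optionally thin the result.

    Assumption: thinning keeps each UMI independently with probability
    `subsampling`.
    """
    summed = [a + b for a, b in zip(counts_a, counts_b)]
    if subsampling >= 1.0:
        return summed
    return [sum(rng.random() < subsampling for _ in range(n)) for n in summed]


cell_a = [1, 0, 3]
cell_b = [2, 5, 0]
full = simulate_doublet(cell_a, cell_b)
thinned = simulate_doublet(cell_a, cell_b, subsampling=0.5, rng=random.Random(0))
```

For 10,000 observed cells the documented default works out to 50 neighbors, and the default sim_doublet_ratio of 2.0 gives 20,000 simulated doublets.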
- adata :
- Return type
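If copy=True, it returns a copy of adata with the fields below added; otherwise it adds the fields to adata in place. Those fields:
- .obs['doublet_score'] :
Doublet scores for each observed transcriptome.
- .obs['predicted_doublet'] :
Boolean indicating predicted doublet status.
- .uns['scrublet']['doublet_scores_sim'] :
Doublet scores for each simulated doublet transcriptome.
- .uns['scrublet']['doublet_parents'] :
Pairs of .obs_names used to generate each simulated doublet transcriptome.
- .uns['scrublet']['parameters'] :
Dictionary of Scrublet parameters.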
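A typical call might look like the sketch below. The wrapper `flag_doublets` is a hypothetical helper, not part of Scanpy. It assumes that `adata` holds raw counts, that scanpy and scrublet are installed, and that the predicted status is stored in `adata.obs["predicted_doublet"]`. It also assumes a threshold was found, so every cell has a boolean call.

```python
def flag_doublets(adata, expected_doublet_rate=0.05, threshold=None):
    """Run Scrublet in place on raw counts and return only predicted singlets."""
    # Imported lazily so the sketch can be defined without scanpy installed.
    import scanpy as sc

    sc.external.pp.scrublet(
        adata,
        expected_doublet_rate=expected_doublet_rate,
        threshold=threshold,  # None lets Scrublet pick the threshold itself
        random_state=0,
    )
    # Assumed result column: one boolean doublet call per observed cell.
    is_doublet = adata.obs["predicted_doublet"].to_numpy(dtype=bool)
    return adata[~is_doublet].copy()
```

Because the automatic threshold is only a heuristic, it is worth inspecting the simulated doublet score histogram before discarding cells with a helper like this.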