scanpy.pp.scrublet
- scanpy.pp.scrublet(adata, adata_sim=None, *, batch_key=None, sim_doublet_ratio=2.0, expected_doublet_rate=0.05, stdev_doublet_rate=0.02, synthetic_doublet_umi_subsampling=1.0, knn_dist_metric='euclidean', normalize_variance=True, log_transform=False, mean_center=True, n_prin_comps=30, use_approx_neighbors=None, get_doublet_neighbor_parents=False, n_neighbors=None, threshold=None, verbose=True, copy=False, random_state=0)
Predict doublets using Scrublet [Wolock et al., 2019].
Predict cell doublets using a nearest-neighbor classifier of observed transcriptomes and simulated doublets. Works best if the input is a raw (unnormalized) counts matrix from a single sample or a collection of similar samples from the same experiment. This function is a wrapper around functions that pre-process using Scanpy and directly call functions of Scrublet(). You may also undertake your own preprocessing, simulate doublets with scrublet_simulate_doublets(), and run the core Scrublet function scrublet() with adata_sim set.
- Parameters:
- adata
AnnData
The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes. Expected to be un-normalised where adata_sim is not supplied, in which case doublets will be simulated and pre-processing applied to both objects. If adata_sim is supplied, this should be the observed transcriptomes processed consistently (filtering, transform, normalisation, highly-variable gene selection) with adata_sim.
- adata_sim
AnnData | None (default: None)
(Advanced use case) Optional AnnData object generated by scrublet_simulate_doublets(), with the same number of vars as adata. This should have been built from adata_obs after filtering genes and cells and selecting highly-variable genes.
- batch_key
str | None (default: None)
Optional obs column name discriminating between batches.
- sim_doublet_ratio
float (default: 2.0)
Number of doublets to simulate relative to the number of observed transcriptomes.
- expected_doublet_rate
float (default: 0.05)
Where adata_sim is not supplied, the estimated doublet rate for the experiment.
- stdev_doublet_rate
float (default: 0.02)
Where adata_sim is not supplied, uncertainty in the expected doublet rate.
- synthetic_doublet_umi_subsampling
float (default: 1.0)
Where adata_sim is not supplied, rate for sampling UMIs when creating synthetic doublets. If 1.0, each doublet is created by simply adding the UMI counts from two randomly sampled observed transcriptomes. For values less than 1, the UMI counts are added and then randomly sampled at the specified rate.
- knn_dist_metric
Union[Literal['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'], Literal['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'], Callable[[ndarray, ndarray], float]] (default: 'euclidean')
Distance metric used when finding nearest neighbors. For the list of valid values, see the documentation for annoy (if use_approx_neighbors is True) or sklearn.neighbors.NearestNeighbors (if use_approx_neighbors is False).
- normalize_variance
bool (default: True)
If True, normalize the data such that each gene has a variance of 1. sklearn.decomposition.TruncatedSVD will be used for dimensionality reduction, unless mean_center is True.
- log_transform
bool (default: False)
Whether to use log1p() to log-transform the data prior to PCA.
- mean_center
bool (default: True)
If True, center the data such that each gene has a mean of 0. sklearn.decomposition.PCA will be used for dimensionality reduction.
- n_prin_comps
int (default: 30)
Number of principal components used to embed the transcriptomes prior to k-nearest-neighbor graph construction.
- use_approx_neighbors
bool | None (default: None)
Use approximate nearest neighbor method (annoy) for the KNN classifier.
- get_doublet_neighbor_parents
bool (default: False)
If True, return (in .uns) the parent transcriptomes that generated the doublet neighbors of each observed transcriptome. This information can be used to infer the cell states that generated a given doublet state.
- n_neighbors
int | None (default: None)
Number of neighbors used to construct the KNN graph of observed transcriptomes and simulated doublets. If None, this is automatically set to np.round(0.5 * np.sqrt(n_obs)).
- threshold
float | None (default: None)
Doublet score threshold for calling a transcriptome a doublet. If None, this is set automatically by looking for the minimum between the two modes of the doublet_scores_sim_ histogram. It is best practice to check the threshold visually using the doublet_scores_sim_ histogram and/or based on co-localization of predicted doublets in a 2-D embedding.
- verbose
bool (default: True)
If True, log progress updates.
- copy
bool (default: False)
If True, return a copy of the input adata with Scrublet results added. Otherwise, Scrublet results are added in place.
- random_state
int | RandomState | None (default: 0)
Initial state for doublet simulation and nearest neighbors.
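The simulation parameters above can be illustrated with a minimal sketch. It is not Scrublet's actual implementation: the helper names `simulate_doublets` and `default_n_neighbors` are hypothetical, counts are plain Python lists rather than an AnnData matrix, and parent cells are drawn with replacement as a simplifying assumption. It only shows how sim_doublet_ratio, synthetic_doublet_umi_subsampling and the n_neighbors default described above fit together.

```python
import math
import random


def default_n_neighbors(n_obs):
    """Documented default when n_neighbors is None: round(0.5 * sqrt(n_obs))."""
    return round(0.5 * math.sqrt(n_obs))


def simulate_doublets(counts, sim_doublet_ratio=2.0, umi_subsampling=1.0, seed=0):
    """Build synthetic doublets from a list of per-cell UMI count vectors.

    Illustrative only: returns the doublet count vectors and the index
    pairs of the parent cells used for each one.
    """
    rng = random.Random(seed)
    n_obs = len(counts)
    # Number of doublets is set relative to the number of observed cells.
    n_sim = int(n_obs * sim_doublet_ratio)
    doublets, parents = [], []
    for _ in range(n_sim):
        # Pick two observed transcriptomes at random (with replacement).
        i, j = rng.randrange(n_obs), rng.randrange(n_obs)
        # With a rate of 1.0 a doublet is simply the sum of the two parents.
        summed = [a + b for a, b in zip(counts[i], counts[j])]
        if umi_subsampling < 1.0:
            # Otherwise keep each summed UMI independently at the given rate.
            summed = [
                sum(rng.random() < umi_subsampling for _ in range(c))
                for c in summed
            ]
        doublets.append(summed)
        parents.append((i, j))
    return doublets, parents
```

For instance, `simulate_doublets(counts)` on three cells yields six doublets with the default ratio of 2.0, and `default_n_neighbors(10000)` evaluates to 50.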
- Return type:
- Returns:
If copy=True, returns a copy of adata with the fields below added; otherwise the fields are added to adata in place. The fields are:
.obs['doublet_score']
Doublet scores for each observed transcriptome
.obs['predicted_doublet']
Boolean indicating predicted doublet status
.uns['scrublet']['doublet_scores_sim']
Doublet scores for each simulated doublet transcriptome
.uns['scrublet']['doublet_parents']
Pairs of .obs_names used to generate each simulated doublet transcriptome
.uns['scrublet']['parameters']
Dictionary of Scrublet parameters
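A typical in-place call might look like the following sketch. The helper name `flag_doublets` is hypothetical, and the sketch assumes that `adata.X` holds raw (unnormalized) counts and that the installed scanpy version provides `scanpy.pp.scrublet` with the signature documented here.

```python
def flag_doublets(adata, batch_key=None, threshold=None):
    """Run Scrublet in place on raw counts and return the per-cell calls."""
    # Imported inside the function so this helper can be defined without
    # scanpy installed; it is needed only when the helper is called.
    import scanpy as sc

    # Assumption: adata.X contains raw, unnormalized counts.
    sc.pp.scrublet(
        adata,
        batch_key=batch_key,  # obs column separating batches, if any
        threshold=threshold,  # None lets Scrublet choose the cutoff
        random_state=0,
    )
    # With copy=False (the default) the results are written into adata.
    return adata.obs["predicted_doublet"]
```

After the call, the scores are available in `adata.obs['doublet_score']`, and the automatically chosen threshold can be checked visually with `scrublet_score_distribution()`.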
See also
scrublet_simulate_doublets()
Run Scrublet’s doublet simulation separately for advanced usage.
scrublet_score_distribution()
Plot histogram of doublet scores for observed transcriptomes and simulated doublets.