scanpy.external.pp.bbknn
- scanpy.external.pp.bbknn(adata, *, batch_key='batch', use_rep='X_pca', approx=True, use_annoy=True, metric='euclidean', copy=False, neighbors_within_batch=3, n_pcs=50, trim=None, annoy_n_trees=10, pynndescent_n_neighbors=30, pynndescent_random_state=0, use_faiss=True, set_op_mix_ratio=1.0, local_connectivity=1, **kwargs)
Batch balanced kNN [Polański et al., 2019].
Batch balanced kNN alters the kNN procedure to identify each cell’s top neighbours in each batch separately instead of the entire cell pool with no accounting for batch. The nearest neighbours for each batch are then merged to create a final list of neighbours for the cell. Aligns batches in a quick and lightweight manner.
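The per-batch search described above can be sketched with brute-force NumPy. This is only an illustration of the idea, not the code that bbknn runs: the function name `batch_balanced_knn` is hypothetical, distances are plain Euclidean, and the approximate neighbour search and the later connectivity computation are left out.

```python
import numpy as np


def batch_balanced_knn(X, batches, neighbors_within_batch=3):
    """Brute-force sketch: find each cell's nearest neighbours in every batch separately.

    Returns (indices, distances), each of shape
    (n_cells, neighbors_within_batch * n_batches).
    Assumes every batch holds at least `neighbors_within_batch` cells.
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    idx_blocks, dist_blocks = [], []
    for batch in np.unique(batches):
        members = np.flatnonzero(batches == batch)  # cells belonging to this batch
        # Euclidean distances from every cell to the cells of this batch
        diff = X[:, None, :] - X[members][None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        # Keep the closest cells of this batch; in this sketch a cell is
        # its own nearest neighbour within its own batch (distance 0)
        order = np.argsort(dist, axis=1)[:, :neighbors_within_batch]
        idx_blocks.append(members[order])
        dist_blocks.append(np.take_along_axis(dist, order, axis=1))
    # Merge the per-batch lists into one neighbour list per cell
    return np.hstack(idx_blocks), np.hstack(dist_blocks)
```

Because every batch contributes the same number of neighbours to each cell, no single batch can dominate a cell's neighbourhood, which is what pulls the batches together in the resulting graph.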
For use in the scanpy workflow as an alternative to neighbors().
Note
This is just a wrapper of bbknn.bbknn(): up-to-date docstring, more information and bug reports there.
- Parameters:
- adata
AnnData
Needs the PCA computed and stored in adata.obsm["X_pca"].
- batch_key
str (default: 'batch')
adata.obs column name discriminating between your batches.
- use_rep
str (default: 'X_pca')
The dimensionality reduction in .obsm to use for neighbour detection. Defaults to PCA.
- approx
bool (default: True)
If True, use approximate neighbour finding (annoy or PyNNDescent). This results in a quicker run time for large datasets while also potentially increasing the degree of batch correction.
- use_annoy
bool (default: True)
Only used when approx=True. If True, will use annoy for neighbour finding. If False, will use PyNNDescent instead.
- metric
str | Callable | DistanceMetric (default: 'euclidean')
What distance metric to use. The options depend on the choice of neighbour algorithm.
“euclidean”, the default, is always available.
Annoy supports “angular”, “manhattan” and “hamming”.
PyNNDescent supports metrics listed in pynndescent.distances.named_distances and custom functions, including compiled Numba code.
>>> import pynndescent
>>> pynndescent.distances.named_distances.keys()
dict_keys(['euclidean', 'l2', 'sqeuclidean', 'manhattan', 'taxicab', 'l1', 'chebyshev', 'linfinity', 'linfty', 'linf', 'minkowski', 'seuclidean', 'standardised_euclidean', 'wminkowski', ...])
KDTree supports members of sklearn.neighbors.KDTree’s valid_metrics list, or parameterised DistanceMetric objects:
>>> import sklearn.neighbors
>>> sklearn.neighbors.KDTree.valid_metrics
['euclidean', 'l2', 'minkowski', 'p', 'manhattan', 'cityblock', 'l1', 'chebyshev', 'infinity']
Note
Check the relevant documentation for up-to-date lists.
- copy
bool (default: False)
If True, return a copy instead of writing to the supplied adata.
- neighbors_within_batch
int (default: 3)
How many top neighbours to report for each batch; total number of neighbours in the initial k-nearest-neighbours computation will be this number times the number of batches. This then serves as the basis for the construction of a symmetrical matrix of connectivities.
- n_pcs
int (default: 50)
How many dimensions (in case of PCA, principal components) to use in the analysis.
- trim
int | None (default: None)
Trim the neighbours of each cell to this many top connectivities. May help with population independence and improve the tidiness of clustering. The lower the value, the more independent the individual populations, at the cost of more conserved batch effect. If None, sets the parameter value automatically to 10 times neighbors_within_batch times the number of batches. Set to 0 to skip.
- annoy_n_trees
int (default: 10)
Only used with annoy neighbour identification. The number of trees to construct in the annoy forest. More trees give higher precision when querying, at the cost of increased run time and resource intensity.
- pynndescent_n_neighbors
int (default: 30)
Only used with PyNNDescent neighbour identification. The number of neighbours to include in the approximate neighbour graph. More neighbours give higher precision when querying, at the cost of increased run time and resource intensity.
- pynndescent_random_state
int (default: 0)
Only used with PyNNDescent neighbour identification. The RNG seed to use when creating the graph.
- use_faiss
bool (default: True)
If approx=False and the metric is “euclidean”, use the faiss package to compute nearest neighbours if installed. This improves performance at a minor cost to numerical precision as faiss operates on float32.
- set_op_mix_ratio
float (default: 1.0)
UMAP connectivity computation parameter, float between 0 and 1, controlling the blend between a connectivity matrix formed exclusively from mutual nearest neighbour pairs (0) and a union of all observed neighbour relationships with the mutual pairs emphasised (1).
- local_connectivity
int (default: 1)
UMAP connectivity computation parameter, how many nearest neighbors of each cell are assumed to be fully connected (and given a connectivity value of 1).
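The effect of set_op_mix_ratio can be illustrated on a small dense matrix. The sketch below assumes the blend follows UMAP's fuzzy-set formulation (a weighted mix of the fuzzy union and the fuzzy intersection of the directed connectivities); `blend_connectivities` is a hypothetical name, and the real computation runs on sparse matrices.

```python
import numpy as np


def blend_connectivities(A, set_op_mix_ratio=1.0):
    """Blend the fuzzy union and fuzzy intersection of a directed connectivity matrix.

    A[i, j] is the membership strength of the edge i -> j, in [0, 1].
    """
    A = np.asarray(A, dtype=float)
    product = A * A.T          # fuzzy intersection: non-zero only for mutual pairs
    union = A + A.T - product  # fuzzy union: every observed neighbour relationship
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * product
```

With a ratio of 0 only mutual neighbour pairs keep a non-zero connectivity; with a ratio of 1 a one-directional relationship is kept as well, and the result is symmetric in both cases.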
- Return type:
- Returns:
The adata with the batch-corrected graph.
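A minimal usage sketch, assuming scanpy and bbknn are installed and that adata.obs contains a 'batch' column. The helper names `expected_neighbor_counts` and `integrate_batches` are hypothetical and only serve the illustration. The first helper reproduces the neighbour arithmetic stated for neighbors_within_batch and trim, and the second calls bbknn in place of neighbors() before computing a UMAP.

```python
def expected_neighbor_counts(n_batches, neighbors_within_batch=3, trim=None):
    """Return (initial_k, effective_trim) following the parameter descriptions."""
    # Each batch contributes `neighbors_within_batch` neighbours per cell
    initial_k = neighbors_within_batch * n_batches
    if trim is None:
        # Documented default: 10 times neighbors_within_batch times the number of batches
        trim = 10 * initial_k
    return initial_k, trim


def integrate_batches(adata, batch_key="batch"):
    """Run PCA, then BBKNN instead of sc.pp.neighbors, then UMAP on the corrected graph."""
    # Imported lazily so the arithmetic helper above works without scanpy installed
    import scanpy as sc

    sc.pp.pca(adata)  # stores the embedding in adata.obsm["X_pca"]
    sc.external.pp.bbknn(adata, batch_key=batch_key)  # modifies adata in place (copy=False)
    sc.tl.umap(adata)  # uses the neighbour graph that bbknn stored
    return adata
```

For example, with four batches and the default settings each cell starts with 12 neighbours and the automatic trim value is 120.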