Scanpy – Single-Cell Analysis in Python¶

Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference and differential expression testing. The Python-based implementation efficiently deals with datasets of more than one million cells.
Report issues and see the code on GitHub. If Scanpy is useful for your research, consider citing Genome Biology (2018).
Note
Also see the release notes of anndata.
Version 1.4 February 5, 2019¶
Major updates:
- one can now import scanpy as sc instead of import scanpy.api as sc, see here (1.3.7)
- a new plotting gallery for visualizing marker genes, see here (1.3.6) thanks to F Ramirez
- tutorials are integrated on ReadTheDocs, see simple clustering and simple trajectory inference (1.3.6)
- a fully distributed preprocessing backend (1.3.3) thanks to T White and the Laserson Lab
- changed default compression to None in write_h5ad() to speed up read and write, disk space use is usually less critical (anndata 0.6.16)
- performance gains in write_h5ad() due to better handling of strings and categories (anndata 0.6.19) thanks to S Rybakov
Two new possibilities for interactive exploration of analysis results:
- CZI’s cellxgene directly reads .h5ad files thanks to the cellxgene developers
- the UCSC Single Cell Browser requires exporting via cellbrowser() (1.3.6) thanks to M Haeussler
Further updates:
- highly_variable_genes() supersedes filter_genes_dispersion(), it gives the same results but, by default, expects logarithmized data and doesn’t subset (1.3.6) thanks to S Rybakov
- combat() reimplements Combat for batch effect correction [Johnson07] [Leek12], heavily based on the Python implementation of [Pedersen12], but with performance improvements, see here (1.3.7) thanks to M Lange
- leiden() wraps the recent graph clustering package by [Traag18] (1.3.4) thanks to K Polanski
- bbknn() wraps the recent batch correction package [Park18] (1.3.4) thanks to K Polanski
- phenograph() wraps the graph clustering package Phenograph [Levine15] (1.3.7) thanks to A Mousa
- calculate_qc_metrics() calculates a number of quality control metrics, similar to calculateQCMetrics from Scater [McCarthy17] (1.3.4) thanks to I Virshup
- read_10x_h5() throws more stringent errors and doesn’t require specifying default genomes anymore, see here and here (1.3.8) thanks to I Virshup
- read_10x_h5() and read_10x_mtx() read Cell Ranger 3.0 outputs, see here (1.3.3) thanks to Q Gong
Version 1.3 September 3, 2018¶
RNA velocity in single cells [Manno18]:
- Scanpy and AnnData support loom’s layers so that computations for single-cell RNA velocity [Manno18] become feasible thanks to S Rybakov and V Bergen
- the package scvelo perfectly harmonizes with Scanpy and is able to process loom files with splicing information produced by Velocyto [Manno18]; it runs a lot faster than the count matrix analysis of Velocyto and provides several conceptual developments (preprint to come)
Plotting of marker genes and quality control, see this section and scroll down; a few examples are
- dotplot() for visualizing genes across conditions and clusters, see here thanks to F Ramirez
- heatmap() for pretty heatmaps, see here thanks to F Ramirez
- violin() now produces very compact overview figures with many panels, see here thanks to F Ramirez
- highest_expr_genes() for quality control, see here; plot genes with highest mean fraction of cells, similar to plotQC of Scater [McCarthy17] thanks to F Ramirez
There is a section on imputation:
- magic() for imputation using data diffusion [vanDijk18] thanks to S Gigante
- dca() for imputation and latent space construction using an autoencoder [Eraslan18]
Version 1.2 June 8, 2018¶
- paga() improved, see theislab/paga; the default model changed, restore the previous default model by passing model='v1.0'
Version 1.1 May 31, 2018¶
- set_figure_params() by default passes vector_friendly=True and allows you to produce reasonably sized pdfs by rasterizing large scatter plots
- draw_graph() now defaults to the ForceAtlas2 layout [Jacomy14] [Chippada18], which is often more visually appealing and whose computation is much faster thanks to S Wollock
- scatter() also plots along variables axis thanks to MD Luecken
- pca() and log1p() support chunk processing thanks to S Rybakov
- regress_out() is back to multiprocessing thanks to F Ramirez
- read() reads compressed text files thanks to G Eraslan
- mitochondrial_genes() for querying mito genes thanks to FG Brundu
- mnn_correct() for batch correction [Haghverdi18] [Kang18]
- phate() for low-dimensional embedding [Moon17] thanks to S Gigante
- sandbag(), cyclone() for scoring genes [Scialdone15] [Fechtner18]
Version 1.0 March 28, 2018¶
Scanpy is much faster and more memory efficient. Preprocess, cluster and visualize 1.3M cells in 6 h, 130K cells in 14 min and 68K cells in 3 min.
The API gained a preprocessing function neighbors() and a class Neighbors() to which all basic graph computations are delegated.
Upgrading to 1.0 isn’t fully backwards compatible because of the following changes:
- the graph-based tools louvain(), dpt(), draw_graph(), umap(), diffmap(), paga() now require prior computation of the graph: sc.pp.neighbors(adata, n_neighbors=5); sc.tl.louvain(adata) instead of previously sc.tl.louvain(adata, n_neighbors=5)
- install numba via conda install numba, which replaces cython
- the default connectivity measure changed (dpt will look different using default settings); setting method='gauss' in sc.pp.neighbors uses gauss kernel connectivities and reproduces the previous behavior, see, for instance, this example
- namings of returned annotation have changed for less bloated AnnData objects, which means that some of the unstructured annotation of old AnnData files is not recognized anymore
- replace occurrences of group_by with groupby (consistency with pandas)
- it is worth checking out the notebook examples to see changes, e.g., here
- upgrading scikit-learn from 0.18 to 0.19 changed the implementation of PCA, some results might therefore look slightly different
Further changes are:
- UMAP [McInnes18] can serve as a first visualization of the data just as tSNE; in contrast to tSNE, UMAP directly embeds the single-cell graph and is faster; UMAP is now also used for measuring connectivities and computing neighbors, see neighbors()
- graph abstraction: AGA is renamed to PAGA: paga(); it now only measures connectivities between partitions of the single-cell graph; pseudotime and clustering need to be computed separately via louvain() and dpt(); the connectivity measure has been improved
- logistic regression for finding marker genes: rank_genes_groups() with parameter method='logreg'
- louvain() now provides a better implementation for reclustering via restrict_to
- scanpy no longer modifies rcParams upon import, call settings.set_figure_params to set the ‘scanpy style’
- default cache directory is ./cache/, set settings.cachedir to change this; nested directories in this are now avoided
- show edges in scatter plots based on graph visualization draw_graph() and umap() by passing edges=True
- downsample_counts() for downsampling counts thanks to MD Luecken
- default ‘louvain_groups’ are now called ‘louvain’
- ‘X_diffmap’ now contains the zero component, plotting remains unchanged
Version 0.4.4 February 26, 2018¶
- embed cells using umap() [McInnes18]: examples
- score sets of genes, e.g. for cell cycle, using score_genes() [Satija15]: notebook
Version 0.4.3 February 9, 2018¶
- clustermap(): heatmap from hierarchical clustering, based on seaborn.clustermap() [Waskom16]
- only return matplotlib.Axis in plotting functions of sc.pl when show=False, otherwise None
Version 0.4 December 23, 2017¶
- export to SPRING [Weinreb17] for interactive visualization of data: tutorial, docs
Version 0.3.2 November 29, 2017¶
- finding marker genes via rank_genes_groups_violin() improved: example
Version 0.3 November 16, 2017¶
- AnnData can be concatenate()d.
- AnnData is available as a separate package
- results of PAGA are simplified
Version 0.2.9 October 25, 2017¶
Initial release of partition-based graph abstraction (PAGA).
Version 0.2.1 July 24, 2017¶
Scanpy now includes preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing and simulation of gene regulatory networks. The implementation efficiently deals with datasets of more than one million cells.
Version 0.1 May 1, 2017¶
Scanpy computationally outperforms the Cell Ranger R kit and allows reproducing most of Seurat’s guided clustering tutorial.