
Scanpy – Single-Cell Analysis in Python

Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference, and differential expression testing. The Python-based implementation efficiently deals with datasets of more than one million cells.

Discuss usage on Discourse. See the code and discuss development on GitHub. If Scanpy is useful for your research, consider citing Genome Biology (2018).


Also see the release notes of anndata.

Post v1.4 July 20, 2019

New functionality:

  • scanpy.get adds helper functions for extracting data in convenient formats 1.4.4 PR 619 thanks to I Virshup

  • combat() supports additional covariates which may include adjustment variables or biological condition PR 618 1.4.2 thanks to G Eraslan

  • highly_variable_genes() has a batch_key option which performs HVG selection in each batch separately to avoid selecting genes that vary strongly across batches PR 622 1.4.2 thanks to G Eraslan

  • Scanpy has a command line interface again. Invoking it with scanpy somecommand [args] calls scanpy-somecommand [args], except for builtin commands (currently scanpy settings) PR 604 thanks to P Angerer

  • ebi_expression_atlas() allows convenient download of EBI expression atlas 1.4.1 thanks to I Virshup

  • marker_gene_overlap() computes overlaps of marker genes 1.4.1 thanks to M Luecken

  • filter_rank_genes_groups() filters out genes based on fold change and fraction of cells expressing genes 1.4.1 thanks to F Ramirez

  • normalize_total() replaces normalize_per_cell(), is more efficient and provides a parameter to only normalize using a fraction of expressed genes 1.4.1 thanks to S Rybakov

  • downsample_counts() has been sped up, and the default value of its replace parameter changed to False PR 474 1.4.1 thanks to I Virshup

  • embedding_density() allows plots of cell densities on embeddings PR 543 1.4.1 thanks to M Luecken

  • palantir() interfaces Palantir [Setty18] PR 493 1.4.1 thanks to A Mousa
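
The command line dispatch described above (scanpy somecommand [args] calling scanpy-somecommand [args], except for builtin commands) can be sketched roughly as follows. This is a minimal illustration of the lookup pattern, not Scanpy's actual implementation: the names resolve_command and BUILTIN_COMMANDS, the error messages, and the injectable which parameter are all assumptions made for this example.

```python
import shutil

# Commands handled by the main program itself. The release note names only
# "settings"; this set is an assumption for the sketch.
BUILTIN_COMMANDS = {"settings"}


def resolve_command(argv, which=shutil.which):
    """Decide how `scanpy <command> [args]` would be dispatched.

    `argv` is the argument list without the program name,
    e.g. ["somecommand", "--flag"]. `which` is injectable so the PATH
    lookup can be replaced in tests.
    Returns ("builtin", name, args) or ("external", executable, args).
    """
    if not argv:
        raise SystemExit("usage: scanpy <command> [args]")
    command, args = argv[0], list(argv[1:])
    if command in BUILTIN_COMMANDS:
        # Builtin commands are not looked up on PATH.
        return ("builtin", command, args)
    # Any other command maps to an executable named scanpy-<command>.
    executable = which(f"scanpy-{command}")
    if executable is None:
        raise SystemExit(f"scanpy: unknown command {command!r}")
    return ("external", executable, args)
```

A caller would typically pass sys.argv[1:] and, for the "external" case, hand the returned executable and arguments to subprocess.run. The design keeps the main program small, since new subcommands only need to be installed as scanpy-<name> executables on PATH.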

Bug fixes:

  • Stopped deprecation warnings from AnnData 0.6.22 1.4.4 thanks to I Virshup

  • rank_genes_groups() t-test implementation no longer returns NaN when variance is 0 and now uses scipy’s implementation PR 621 1.4.2 thanks to I Virshup

  • umap() with init_pos='paga' detects correct dtype 1.4.2 thanks to A Wolf

  • neighbors() correctly infers n_neighbors again from params, which was temporarily broken in v1.4.2 1.4.3 thanks to I Virshup

  • louvain() and leiden() auto-generate key_added=louvain_R upon passing restrict_to, which was temporarily changed in v1.4.1 1.4.2 thanks to A Wolf

Code design:

  • neighbors() and umap() got rid of UMAP legacy code and introduced UMAP as a dependency PR 576 1.4.2 thanks to S Rybakov

  • calculate_qc_metrics() is single threaded by default for datasets under 300,000 cells – allowing cached compilation PR 615 1.4.3 thanks to I Virshup

  • normalize_total() gains param exclude_highly_expressed, and fraction is renamed to max_fraction with better docs thanks to A Wolf

  • .layers support of scatter plots 1.4.1 thanks to F Ramirez

  • fix double-logarithmization in compute of log fold change in rank_genes_groups() 1.4.1 thanks to A Muñoz-Rojas

  • fix return sections of docs 1.4.1 thanks to P Angerer
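
To illustrate what the exclude_highly_expressed and max_fraction parameters of normalize_total() are about, here is a small pure-Python sketch. It assumes the following semantics, which should be checked against the normalize_total() documentation: a gene counts as highly expressed if it exceeds max_fraction of a cell's total counts in at least one cell, such genes are left out when computing each cell's size factor, and the target sum defaults to the median size over cells. The function name normalize_total_sketch and the list-of-lists input are hypothetical and chosen only for this example.

```python
from statistics import median


def normalize_total_sketch(counts, target_sum=None,
                           exclude_highly_expressed=False, max_fraction=0.05):
    """Scale each cell (row) so its counted genes sum to `target_sum`.

    `counts` is a list of rows, one per cell, each a list of per-gene counts.
    """
    n_genes = len(counts[0])
    totals = [sum(row) for row in counts]

    # Mark genes that dominate at least one cell (assumed rule, see lead-in).
    keep = [True] * n_genes
    if exclude_highly_expressed:
        for row, total in zip(counts, totals):
            for gene, value in enumerate(row):
                if total > 0 and value > max_fraction * total:
                    keep[gene] = False

    # Size of each cell, computed over the kept genes only.
    sizes = [sum(v for v, k in zip(row, keep) if k) for row in counts]
    if target_sum is None:
        target_sum = median([s for s in sizes if s > 0])

    # All genes are rescaled, including the excluded ones; cells whose
    # size is zero are returned unscaled.
    return [
        [v * target_sum / s if s > 0 else float(v) for v in row]
        for row, s in zip(counts, sizes)
    ]
```

With exclusion enabled, a single dominant gene no longer determines a cell's scaling factor, so the remaining genes stay comparable across cells.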

Version 1.4 February 5, 2019

Major updates:

Two new possibilities for interactive exploration of analysis results:

Further updates:

Version 1.3 September 3, 2018

RNA velocity in single cells [Manno18]:

  • Scanpy and AnnData support loom’s layers so that computations for single-cell RNA velocity [Manno18] become feasible thanks to S Rybakov and V Bergen

  • the package scvelo perfectly harmonizes with Scanpy and is able to process loom files with splicing information produced by Velocyto [Manno18]; it runs a lot faster than the count matrix analysis of Velocyto and provides several conceptual developments (preprint to come)

Plotting of marker genes and quality control (see this section and scroll down). A few examples are:

  • dotplot() for visualizing genes across conditions and clusters, see here thanks to F Ramirez

  • heatmap() for pretty heatmaps, see PR 175 thanks to F Ramirez

  • violin() produces very compact overview figures with many panels, see here thanks to F Ramirez

  • highest_expr_genes() for quality control, see PR 169; plot genes with highest mean fraction of cells, similar to plotQC of Scater [McCarthy17] thanks to F Ramirez

There is a section on imputation:

Version 1.2 June 8, 2018

  • paga() improved, see theislab/paga; the default model changed, restore the previous default model by passing model='v1.0'

Version 1.1 May 31, 2018

Version 1.0 March 28, 2018

Scanpy is much faster and more memory efficient. Preprocess, cluster and visualize 1.3M cells in 6 h, 130K cells in 14 min and 68K cells in 3 min.

The API gained a preprocessing function neighbors() and a class Neighbors() to which all basic graph computations are delegated.

Upgrading to 1.0 isn’t fully backwards compatible in the following changes:

  • the graph-based tools louvain(), dpt(), draw_graph(), umap(), diffmap() and paga() require prior computation of the graph via sc.pp.neighbors(adata, n_neighbors=5), instead of passing n_neighbors=5 to the tool itself as before

  • install numba via conda install numba, which replaces cython

  • the default connectivity measure changed (dpt will look different using default settings); setting method='gauss' in sc.pp.neighbors uses gauss kernel connectivities and reproduces the previous behavior, see, for instance, this example

  • namings of returned annotation have changed for less bloated AnnData objects, which means that some of the unstructured annotation of old AnnData files is not recognized anymore

  • replace occurrences of group_by with groupby (consistency with pandas)

  • it is worth checking out the notebook examples to see changes, e.g., here

  • upgrading scikit-learn from 0.18 to 0.19 changed the implementation of PCA, some results might therefore look slightly different

Further changes are:

  • UMAP [McInnes18] can serve as a first visualization of the data just as tSNE; in contrast to tSNE, UMAP directly embeds the single-cell graph and is faster; UMAP is also used for measuring connectivities and computing neighbors, see neighbors()

  • graph abstraction: AGA is renamed to PAGA: paga(); it now only measures connectivities between partitions of the single-cell graph; clustering and pseudotime need to be computed separately via louvain() and dpt(); the connectivity measure has been improved

  • logistic regression for finding marker genes rank_genes_groups() with parameter method='logreg'

  • louvain() provides a better implementation for reclustering via restrict_to

  • scanpy no longer modifies rcParams upon import; call settings.set_figure_params to set the ‘scanpy style’

  • default cache directory is ./cache/, set settings.cachedir to change this; nested directories in this are avoided

  • show edges in scatter plots based on graph visualization draw_graph() and umap() by passing edges=True

  • downsample_counts() for downsampling counts thanks to MD Luecken

  • default ‘louvain_groups’ are called ‘louvain’

  • ‘X_diffmap’ contains the zero component, plotting remains unchanged
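
The idea behind downsampling counts, as provided by downsample_counts(), can be sketched for a single cell in plain Python. This is a conceptual illustration under stated assumptions, not the library's implementation: the function name downsample_cell, the target_total and seed parameters, and the molecule-expansion approach are all hypothetical. The replace flag mirrors the parameter of the same name mentioned in the Post v1.4 notes.

```python
import random


def downsample_cell(counts, target_total, replace=False, seed=0):
    """Downsample one cell's per-gene counts to `target_total` counts.

    Cells that already have `target_total` counts or fewer are returned
    unchanged.
    """
    rng = random.Random(seed)
    if target_total >= sum(counts):
        return list(counts)

    # Expand to one entry per observed count, labelled by gene index.
    molecules = [gene for gene, c in enumerate(counts) for _ in range(c)]

    if replace:
        # With replacement: a gene may end up above its original count.
        picked = rng.choices(molecules, k=target_total)
    else:
        # Without replacement: no gene can exceed its original count.
        picked = rng.sample(molecules, k=target_total)

    downsampled = [0] * len(counts)
    for gene in picked:
        downsampled[gene] += 1
    return downsampled
```

Sampling without replacement treats the observed counts as a fixed pool of molecules, so the result is always a subset of what was actually measured. Expanding every count into a list entry is only practical for small inputs and is used here for clarity.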

Version 0.4.4 February 26, 2018

Version 0.4.3 February 9, 2018

  • clustermap(): heatmap from hierarchical clustering, based on seaborn.clustermap() [Waskom16]

  • plotting functions only return matplotlib.Axis when show=False, otherwise None

Version 0.4.2 January 7, 2018

  • amendments in PAGA and its plotting functions

Version 0.4 December 23, 2017

Version 0.3.2 November 29, 2017

  • finding marker genes via rank_genes_groups_violin() improved: example

Version 0.3 November 16, 2017

Version 0.2.9 October 25, 2017

Initial release of partition-based graph abstraction (PAGA).

Version 0.2.1 July 24, 2017

Scanpy includes preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing and simulation of gene regulatory networks. The implementation efficiently deals with datasets of more than one million cells.

Version 0.1 May 1, 2017

Scanpy computationally outperforms the Cell Ranger R kit and allows reproducing most of Seurat’s guided clustering tutorial.