Indexing, Search & Compression

A key computer science challenge of this project is to enable sequence and metadata search across thousands of whole genomes using high-performance distributed search indexes. A further challenge specific to cancer genomes is searching for particular patterns of variants.

Research Statement

  1. A compressed, distributed reference genome index with fast search capability
  2. A stream mapper on a distributed index for real-time variant calling
  3. An index on short read data collections that enables similarity search capability
  4. Indexed search techniques for unaligned reads

  • Collaborators

    • Jared Simpson, Collaborator
    • Bonnie Berger, Collaborator


This core enables efficient compression and fast searching of cancer genome sequences. It will immediately benefit the Collaboratory by allowing more sequence data to be stored in the same physical capacity. It will also benefit the wider genomics community, where next-generation sequencing (NGS) data is growing far faster than raw storage capacity is advancing.

Its novelty lies in integrating computational methods for handling large-scale sequence data through:

  1. parallelization
  2. streaming and on-line computing
  3. I/O efficiency (in the context of data structures)
  4. compression


  • DeeZ

    Reference-based compression by local assembly


    Ultra-Sensitive Detection of Single Nucleotide Variants and Indels in Circulating Tumour DNA

  • mrsFAST

    Compact, SNP-aware mapper for high performance sequencing applications

  • Cypripi

    Exact genotyping of CYP2D6 using high-throughput sequencing data

  • CTPsingle

    Clonality inference from low coverage single-sample tumors

Latest Publications & Presentations


Numanagić I, Bonfield JK, Hach F, Voges J, Ostermann J, Alberti C, Mattavelli M, Sahinalp SC (2016)

Comparison of high-throughput sequencing data compression tools.

Nature Methods, 2016


Compression - State of the art

Presenter: Cenk Sahinalp

Date: February 2016

Meeting: MPEG-114; San Diego, USA


DeeZ: reference based compression by local assembly

Presenter: Ibrahim Numanagić

Date: April 2015

Meeting: Biological Data Sciences - Cold Spring Harbor Lab; Cold Spring Harbor, USA