String Graph Construction Using Incremental Hashing.

Bioinformatics (Oxford, England)

PubMedID: 25183486

Ben-Bassat I, Chor B. String Graph Construction Using Incremental Hashing. Bioinformatics. 2014;.
New sequencing technologies generate larger amount of short reads data at decreasing cost. De novo sequence assembly is the problem of combining these reads back to the original genome sequence, without relying on a reference genome. This presents algorithmic and computational challenges, especially for long and repetitive genome sequences. Most existing approaches to the assembly problem operate in the framework of de Bruijn graphs. Yet, a number of recent works employ the paradigm of string graph, using a variety of methods for storing and processing suffixes and prefixes, like suffix arrays, the Burrows-Wheeler transform, or the FM index. Our work is motivated by a search for new approaches to constructing the string graph, using alternative yet simple data structures and algorithmic concepts.

We introduce a novel, hash based method for constructing the string graph. We employ incremental hashing, and specifically a modification of the Karp-Rabin fingerprint, and Bloom filters. Using these probabilistic methods might create false positive and false negative edges during the algorithm's execution, but these are all detected and corrected. The advantages of the proposed approach over existing methods are its simplicity, and in the incorporation of established probabilistic techniques in the context of de novo genome sequencing. Our preliminary implementation is favorably comparable to the first string graph construction of Simpson and Durbin (2010) (but not to subsequent improvements). Further research and optimizations will hopefully enable the algorithm to be incorporated, with noticeable performance improvement, in state of the art string graph based assemblers.

A beta version of all source code used in this work can be downloaded from