A strong string section

Algorithms on Strings, Trees and Sequences
June 12, 1998

Sequences of symbols, or strings, are ubiquitous. This page is covered in text, strings of English letters. Word processors were used to write the articles on this page, and they will have had all sorts of features for doing useful things with strings. With small strings, cheapskate algorithms running on fast computers can get away with being badly designed - and many are. But there are many scientific and commercial uses of extremely long strings, and where many sets of strings must be processed together. In these areas, one needs a far more thorough approach. A simple algorithm might never finish, and a sophisticated algorithm might make subtle mistakes.

Dan Gusfield has written a serious book on a wide range of algorithms for strings. It would be tempting to say it is a specialist book, most suitable for postgraduates, except that string algorithms are needed almost everywhere. In particular, biological strings from genetics are enormous and they are rarely exact. A whole range of efficient, exact and inexact string processing approaches is required.

The twisted ladder-like molecule of DNA is now very familiar. The rungs of the ladder are matched pairs of bases, adenine (A), cytosine (C), guanine (G) and thymine (T). The DNA genetic code of an organism is its sequence of A, C, G or T bases - and when the DNA molecule splits down the middle, each "half rung" of the ladder is used as a template to build a copy of the sequence as a new full DNA ladder. A biologist does experiments to determine strings of bases: for example, TTAGGG occurs in every human chromosome. Typically these strings have come from many fragments of cut-up DNA and will be in an unknown order. One then has to do sequence assembly, a best match of overlaps to determine their most likely overall sequence.

Some DNA extracted from a fossil might be from a dinosaur, or it might be contamination from human or yeast DNA. String databases will be used, with collections of sequences from other researchers, to find out what the identity is. Or perhaps the biological question is concerned with the development of a new cancer treatment, or to do some DNA fingerprinting in a forensic test. The Human Genome Project guarantees there will be many people interested in this book.

Strings are very often represented by trees. Gusfield gives a considerable and fascinating range of applications of suffix tree algorithms: suffix trees are one of the standard tools that should be on-hand for all practitioners. Here's an example you can try, straight from the book. In chemistry it is important to be able to identify circular molecules. If a molecule is represented as a string, the string could start anywhere around the circle, and so would not be unique. Of all the possible strings, then, find the lexically earliest string, because this will be unique for a given circular molecule. Suffix trees provide an elegant solution, and they find it in linear time, which is faster than doing a general lexical sort.

Coincidentally, trees are also fundamental in biology. We might want to compare the evolutionary distance of DNA sequences by comparing phylogenetic trees. Or we might want to find the smallest tree that explains some taxonomic data.

Another coincidence is that DNA can itself compute. Rather than get computers to understand DNA, you can get DNA to do biological computation. In 1994 Leonard Adleman showed how to solve a routing problem with eight cities using recombinant DNA. Though fundamentally exciting, unfortunately the approach requires tonnes of DNA to work when applied to more realistically sized problems.

When you are processing enormous strings, the performance of the algorithms is crucial. An obvious algorithm might run out of memory (or even DNA) or take practically forever. Therefore a very central concern is the analysis of the time and space requirements to do string processing. Gusfield provides some very good complexity arguments and proofs of correctness, but he falls short of the modern standards of rigorous software engineering. Algorithms are sketched in pseudo-code and have only a hand-waving relation to his proofs.

In his defence, though, Gusfield argues that the vagueness in the algorithms encourages one to think about the principles, and that is no bad thing. Also, this approach ensures the book's algorithms will not go out of date, as they would have done if they had been written out in the language du jour. However, for the idle, many of the algorithms, written in C, should soon be available on the World Wide Web.

The limitation, and advantage, of this book is that it has been written from within a bio-informatics perspective. Programmers and mathematicians working in a bio-informatics laboratory will find it a very useful reference, but I do not think it will be so useful for students learning the subject. Or at least, a computational biology student needs to have lots of other courses to provide background and wider, top-down understanding.

Although there is very interesting discussion of bio-sequence database searching and use, some readers will regret there is nothing on database implementation, especially given some of the biologically motivated issues such as trying to do confidential searches. But that is detail. There is more than enough stuff on strings, dynamic programming and trees for serious programmers. It is very difficult to cover such an interdisciplinary area in a way that will satisfy all readers, though the book's glossary certainly helps. Gusfield has done very well, having steered towards a more mathematical approach - which is essential if we are going to program properly, or even to be better equipped to recognise DNA-computer hype.

The exercises at the end of each chapter, over 400 of them, are a distinctive part of the book. It is easy to avoid exercises, but these are well designed and full of interesting biological and computational details. There are nearly 500 references, including a few popular articles, and a good index.

The readers of this book will be serious programmers, but of course anybody working in biocomputing will find the book of immense practical, scientific and commercial importance. Since there is virtually no area of programming where you can escape from strings, you should get the book, whether you want to do some string processing, fundamental computing research, or want to impress a biotech firm.

Harold Thimbleby is professor of computing research, Middlesex University.

Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology

Author - Dan Gusfield
ISBN - 0 521 58519 8
Publisher - Cambridge University Press
Price - £40.00
Pages - 534

You've reached your article limit.

Register to continue

Get a month's unlimited access to THE content online. Just register and complete your career summary.

Registration is free and only takes a moment. Once registered you can read a total of 3 articles each month, plus:

  • Sign up for the editor's highlights
  • Receive World University Rankings news first
  • Get job alerts, shortlist jobs and save job searches
  • Participate in reader discussions and post comments
Register

Have your say

Log in or register to post comments