Database indexing for faster approximate matching of
DNA and protein strings
Ela Hunt, Dept. of Computing Science, University of Glasgow,
Glasgow G12 8QQ, UK
We are developing new database technologies for indexed access to DNA and
protein sequences. Indexing is beneficial for large data collections, and
the gain from indexing increases with the collection size. So far, long strings
which cannot be split into shorter words, like DNA and proteins, have eluded
indexing, and significant research effort centres on creating new technologies
which can provide indexed access to those data types.
The suffix tree is a candidate index structure for approximate matching,
but, so far, all attempts to build suffix trees on disk in excess of RAM
size have failed. We recently demonstrated that trees larger than RAM are
possible, and carried out tree construction tests for 200-300 MB of DNA and
protein. Further, we adapted a known approximate matching algorithm to the
biological context of similarity matching, and we measured the indexing gain
achieved for our large disk-resident indexes. It turns out that DNA indexing
woul be beneficial, but that there is still too little protein data to warrant
protein indexing using suffix trees. We are currently preparing large approximate
matching tests, with half of human genome first, and then with all publicly
available DNA data (20 GB).
|
|