Blue Gene 2002

 IBM and NeSC workshop on Protein Science

    National e-Science Centre, Edinburgh,  March 15-16 2002

   
   
Database indexing for faster approximate matching of DNA and protein strings
Ela Hunt, Dept. of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK  

We are developing new database technologies for indexed access to DNA and protein sequences. Indexing is beneficial for large data collections, and the gain from indexing increases with the collection size. So far, long strings which cannot be split into shorter words, like DNA and proteins, have eluded indexing, and significant research effort centres on creating new technologies which can provide indexed access to those data types.
The suffix tree is a candidate index structure for approximate matching, but, so far, all attempts to build suffix trees on disk in excess of RAM size have failed. We recently demonstrated that trees larger than RAM are possible, and carried out tree construction tests for 200-300 MB of DNA and protein. Further, we adapted a known approximate matching algorithm to the biological context of similarity matching, and we measured the indexing gain achieved for our large disk-resident indexes. It turns out that DNA indexing woul be beneficial, but that there is still too little protein data to warrant protein indexing using suffix trees. We are currently preparing large approximate matching tests, with half of human genome first, and then with all publicly available DNA data (20 GB).


SPONSORS
National e-Science Centre (NeSC)
The University of Edinburgh
SYMPOSIUM INFORMATION
IBM logo BlueGene logo NeSC logo