Homepage of Jörg Rüdenauer


How to implement a SequenceDB

A quick browse through the Javadoc of the Biojava API shows that a class performing BLAST searches should implement the interface SeqSimilaritySearcher from the package org.biojava.bio.search. Unfortunately, its central method requires a SequenceDB (admittedly, you need that SequenceDB for parsing the search results, too, but that's another issue). I wanted to run the search on protein files in FASTA format, such as the nr database. How do you get a SequenceDB for that?

In the current CVS of Biojava, you'll find a package called org.biojava.bio.program.indexdb, which could contain handy classes to solve the problem - perhaps even quite fast, at least it's called 'index' :-). Unfortunately, there's no documentation for that package, and I assume that it could well be in development, i.e. unstable.

So I settled for an easier solution using the SeqIOTools class from the package org.biojava.bio.seq.io. This has the major drawback of being exceptionally slow for random access, since it always walks through the file sequentially to find a certain sequence. So I propose that, if you want to use a SequenceDB seriously, you try to use the indexdb package (or find a third solution). But for our purpose of just performing a BLAST search, it suffices - and the method of building the SequenceDB can very easily be changed in my solution.
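To give a rough idea of what such a 'third solution' might look like, here is a minimal sketch in plain Java that records the byte offset of every FASTA header in one pass and then seeks directly to a record. It is not the indexdb package and not part of my solution; the class FastaOffsetIndex and its method fetch are hypothetical names, and the sketch assumes an ASCII file whose record name is the first whitespace-delimited token after '>'.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Biojava class: maps record names to byte offsets.
class FastaOffsetIndex {
    private final Map<String, Long> offsets = new HashMap<String, Long>();
    private final String path;

    // One sequential pass over the file to remember where each header starts.
    FastaOffsetIndex(String path) throws IOException {
        this.path = path;
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long pos = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    // Assumption: the name is the first token after '>'.
                    String name = line.substring(1).split("\\s+", 2)[0];
                    offsets.put(name, pos);
                }
                pos = raf.getFilePointer();
            }
        }
    }

    // Jumps straight to the record and returns its residues, or null if unknown.
    String fetch(String name) throws IOException {
        Long start = offsets.get(name);
        if (start == null) {
            return null;
        }
        StringBuilder residues = new StringBuilder();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(start);
            raf.readLine(); // skip the header line itself
            String line;
            while ((line = raf.readLine()) != null && !line.startsWith(">")) {
                residues.append(line.trim());
            }
        }
        return residues.toString();
    }
}
```

The index costs one full scan up front, after which each lookup reads only the requested record. RandomAccessFile.readLine() is unbuffered, so a serious version would want buffered reading for the initial pass.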

The FastaProt2SequenceDBWrapper class

Writing a wrapper that makes a FASTA file look like a SequenceDB is fairly simple if you just use SeqIOTools, i.e., if you need neither fast random access nor the ability to change the file (though the latter could probably be added easily). First, you inherit from the AbstractSequenceDB class of the package org.biojava.bio.seq.db, which helps you with a few standard routines. So we have

public class FastaProt2SequenceDBWrapper extends AbstractSequenceDB {

Unfortunately, the method sequenceIterator() declares no exceptions, so it cannot report a missing file to the caller. On the other hand, I certainly didn't want to load the whole file into memory. So I decided to check for the existence of the file in the constructor. Of course, the file could still be deleted between the construction of an instance and the actual access to the file, but I don't see a way to handle that nicely...

public FastaProt2SequenceDBWrapper(String name) throws FileNotFoundException {
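The body of the constructor is not shown here. As a minimal sketch of the kind of check described above, assuming it is a plain 'does a readable file exist' test, it could look like the following standalone class. FastaFileHandle and its directory parameter are hypothetical and only stand in for the wrapper and for the data path supplied by the SearchFactory.

```java
import java.io.File;
import java.io.FileNotFoundException;

// Hypothetical stand-in for the wrapper: fails early if the file is missing.
class FastaFileHandle {
    private final File file;

    FastaFileHandle(String directory, String name) throws FileNotFoundException {
        File candidate = new File(directory, name);
        // Check once at construction time, since the iterator method
        // has no way to report the problem later.
        if (!candidate.isFile() || !candidate.canRead()) {
            throw new FileNotFoundException(
                "No readable FASTA file at " + candidate.getPath());
        }
        this.file = candidate;
    }

    File getFile() {
        return file;
    }
}
```

This only narrows the window for errors. As noted, the file can still disappear after the check has passed.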

I've decided to use the name of a sequence as the ID, which works. The IDs are stored in a set, which is created lazily.

public Set ids() {
    // Lazy creation of the set - only when it's needed, since the whole file
    // must be searched.
    if (this.IDs == null) {
        this.IDs = new HashSet();
        SequenceIterator it = this.sequenceIterator();
        while (it.hasNext()) {
            try {
                this.IDs.add(SearchFactory.createID(it.nextSequence().getName()));
            } catch (org.biojava.bio.BioException e) {
                // Something went wrong reading the file. But perhaps it was only
                // one sequence, so let's just ignore it and try to move on.
                // e.printStackTrace();
            }
        }
    }
    // return the stored set
    return this.IDs;
}

You'll notice a call to a 'SearchFactory' there. We'll get to that later. The 'main' function of the SequenceDB is also implemented easily by means of the SequenceIterator:

public Sequence getSequence(String id)
        throws org.biojava.bio.seq.db.IllegalIDException,
               org.biojava.bio.BioException {
    if (this.IDs != null) {
        // check the Set first, could avoid the search
        if (!this.IDs.contains(id)) {
            throw new org.biojava.bio.seq.db.IllegalIDException(
                "No Sequence with ID " + id + " in DB " + this.name);
        }
    }
    // search via Iterator
    SequenceIterator it = this.sequenceIterator();
    while (it.hasNext()) {
        try {
            Sequence s = it.nextSequence();
            if (SearchFactory.createID(s.getName()).equals(id)) {
                return s;
            }
        } catch (org.biojava.bio.BioException e) {
            e.printStackTrace();
            throw e;
        }
    }
    throw new org.biojava.bio.seq.db.IllegalIDException(
        "No Sequence with ID " + id + " in DB " + this.name);
}

You see that for every query, the file is searched sequentially until the iterator hits the sequence. A small improvement would be to create the ID set here if it doesn't exist yet, but that's not necessary for the BLAST search.
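The improvement mentioned above could be sketched roughly as follows. To keep the example self-contained it does not use Biojava: a list of names plays the role of the FASTA file, and CachingLookup, find and recordsRead are hypothetical names chosen for illustration. The idea is that the first lookup walks the whole file anyway and collects every ID on the way, so later lookups can reject unknown IDs without touching the file.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in: a list of names plays the role of the FASTA file.
class CachingLookup {
    private final List<String> records;
    private Set<String> ids;      // stays null until one full pass is done
    private int recordsRead = 0;  // counts simulated reads, for illustration

    CachingLookup(List<String> records) {
        this.records = records;
    }

    String find(String id) {
        // Known set and unknown ID: no need to search at all.
        if (ids != null && !ids.contains(id)) {
            return null;
        }
        boolean collecting = (ids == null);
        Set<String> seen = collecting ? new HashSet<String>() : null;
        String hit = null;
        for (String name : records) {
            recordsRead++;
            if (collecting) {
                seen.add(name);
            }
            if (hit == null && name.equals(id)) {
                hit = name;
                if (!collecting) {
                    break; // set already built, so stop at the first hit
                }
            }
        }
        if (collecting) {
            ids = seen; // publish the set only after a complete pass
        }
        return hit;
    }

    int recordsRead() {
        return recordsRead;
    }
}
```

The price is that the very first lookup always reads to the end of the file, even after it has found its sequence. For a single BLAST run that may well cost more than it saves.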
The iterator itself is easily obtained from the SeqIOTools class, as mentioned above.

public SequenceIterator sequenceIterator() {
    try {
        return SeqIOTools.readFastaProtein(new BufferedReader(
            new FileReader(SearchFactory.getBlastDBPath() + this.name)));
    } catch (FileNotFoundException e) {
        // e.printStackTrace();
        return null;
    }
}

Of course, a new instance of FileReader must be created with each call of the method, so that multiple calls always start at the beginning of the file.
All that remains are the modifying methods of SequenceDB, but they need no real implementation. E.g. addSequence, which simply refuses the change:

public void addSequence(Sequence seq)
        throws org.biojava.bio.BioException,
               org.biojava.utils.ChangeVetoException {
    throw new org.biojava.utils.ChangeVetoException(
        "Not yet supported by FastaProt2SequenceDBWrapper!");
}

The SearchFactory class

For the sake of a clean design, I've added a 'SearchFactory' class, which encapsulates several decisions in static methods. E.g., there's a method to create a SequenceDB out of a FASTA protein file; it just uses the wrapper class described above. You can easily switch to another approach here (like the indexdb package), and you could add similar methods for other formats.

There are other methods that return the path to the BLAST program file and the path to the data files. Currently, these are stored as constants in the source file, but switching to e.g. environment variables would present no difficulties. There is of course a method that returns the object performing the BLAST searches (we'll get to that in the next section), and finally there's a method creating IDs from a sequence name (it was already used above). For that, I just cut off the gi part, which seems to work.

public static String createID(String seqName) {
    int first = seqName.indexOf("|");
    int second = seqName.indexOf("|", first + 1);
    return seqName.substring(second + 1, seqName.length());
}
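To make the effect of createID concrete, here is a small self-contained demo that copies the method into a hypothetical class CreateIdDemo and applies it to a few sequence names. The names are made up for illustration and are not taken from a real database.

```java
// Hypothetical demo class; createID is the same logic as shown above.
class CreateIdDemo {
    static String createID(String seqName) {
        int first = seqName.indexOf("|");
        int second = seqName.indexOf("|", first + 1);
        // Everything after the second '|' is kept as the ID.
        return seqName.substring(second + 1);
    }

    public static void main(String[] args) {
        System.out.println(createID("gi|12345|ref|NP_000001.1|")); // ref|NP_000001.1|
        System.out.println(createID("plainName"));                 // plainName
        System.out.println(createID("a|b"));                       // a|b
    }
}
```

Names with fewer than two '|' characters pass through unchanged: indexOf returns -1 for the second separator, so the substring starts at position 0.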

Next: How to do the search
Back: The general idea
How to parse the results


URL of this page: http://www.joerg-ruedenauer.de/Software/blast/blast1.html
Author of this page: Jörg Rüdenauer
Last modified: 14.07.2002