2010-02-03

Breaking a promise

You might think from the above that I have training as a biologist or in medicine. No, I'm carrying thirty years of IT, and it's hard, believe me. I long ago decided never again to maintain a software project myself, only to submit patches until the end of time … ah, what a dream.

So, naïve as usual, I joined OSDD.

The connect2decode project at OSDD had already started. It is about building a machine-readable knowledge base covering all 3951 proteins that the genome of a specific lab strain (H37Rv) of evil Mycobacterium tuberculosis can produce. This strain is relatively well researched: lots of papers have been written about experiments with it, and it was the first mtb strain to be sequenced. The problem is this: to identify possible drug targets, one wants the big picture of this bacterium, so lots of microarray experiments are planned. But to make sense of the huge information flow from these experiments, one needs computer help. For such reasons an ontology was developed: a categorization of all the possible functions proteins can have, the processes they take part in, and the compartments they can be localized to. Annotations in this ontology are fed into a knowledge base, and the whole thing, when finished, can answer good questions, find patterns, and such. It's an application of artificial intelligence research.

I just noticed that the UniProt data on H37Rv contains 250 proteins marked as high-confidence drug targets by targetTB. Is that not enough? Oh well.

When I first read about OSDD, a press release celebrated the first 700 volunteers for the project, most of whom had started annotating. First, I did some preliminary work on the Wiki; see my user page. Then a plea in the Google group asked for more people to join in, so I did. When I looked around, only about 70-80 people remained in the annotation project. What happened? Apparently, the rest had bailed out. When I did my first round of work, I saw that many hadn't even started. When it came to finally inputting the data a second time (after peer review on a spreadsheet), I slowly began to understand what made the project a senseless grinding machine. To understand this yourself, bear with some more explanation from me.

Reading papers alone doesn't get you a complete annotation: papers cover less than ten per cent of mtb proteins. To the rescue comes structural biology software that can tell you, for more than 90 per cent of proteins, what they probably do, above all enzyme function (for the rest it fails utterly, but that's not the point). The computed function can be labelled with an ontology annotation, so, voilà, 90 per cent of proteins can have annotations. No one cares that they aren't worth as much as real experiments, as long as they're labelled as computed. Such annotations are computed (on heavy clusters) and collected by the UniProt database. Now, the plan went like this (I'm guessing): we have so many volunteers who read papers and write annotations from them; if they don't find a paper, well then, they just copy the electronic annotations from UniProt. Every protein gets an annotation this way. Fine, if the copying doesn't take much effort. But it's only a few papers and 99 per cent copying. And copying not once but twice. And, hell, that's where I bailed out: you could be finished already if you wrote a little filter for the freely available UniProt data. You could even stay up to date this way, and apply the same to other mtb strains.
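To give an idea of how little such a filter would take: here is a minimal sketch in Python, assuming you've downloaded a tab-separated UniProt export with an entry column and a Gene Ontology column. The accessions, gene names, and column headers below are illustrative placeholders, not real data, and real exports may name their columns differently.

```python
import csv
import io
import re

# Hypothetical sample of a tab-separated UniProt export for H37Rv.
# Accessions, gene names, and GO terms here are illustrative only.
SAMPLE = """Entry\tGene names\tGene ontology (GO)
P9WQP3\tinhA Rv1484\tfatty acid synthase activity [GO:0004312]; plasma membrane [GO:0005886]
P9WIE5\tkatG Rv1908c\tcatalase activity [GO:0004096]
P9WXYZ\thypo Rv0001\t
"""

GO_ID = re.compile(r"GO:\d{7}")

def go_annotations(tsv_text):
    """Map each entry accession to the list of GO IDs found in its
    ontology column, skipping entries with no GO annotation at all."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        ids = GO_ID.findall(row["Gene ontology (GO)"])
        if ids:
            result[row["Entry"]] = ids
    return result

annos = go_annotations(SAMPLE)
print(annos)
```

Point a script like this at a fresh export and you get every computed annotation in one pass, repeatably, for any strain UniProt covers.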

So what did I do?

It's software maintenance again …
