2010-02-24

did something happen?

Silly me. I thought my discovery of the IEA annotations would change something. But nothing happened. I'm asking myself: will anything be added to the IEAs at all? Should I go ahead and host a curated annotation database? No, just wait and see first. Now that Sir V knows you're in IT, you may get a chance to help later when the DB is deployed and used.

2010-02-06

Duh

That's what I get for not researching thoroughly: the software isn't necessary, because UniProt provides all annotations in GAF format. As this is plain text with one record per line, a simple grep does the trick. But it's even more embarrassing for the project leads, as the data is provided by UniProt specifically for people who are part of an annotation project.
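To make the "simple grep" point concrete, here is a minimal sketch in Python of what such a one-line filter amounts to. It assumes a GAF 2.x file (tab-separated, column 2 = protein accession, column 5 = GO ID, column 7 = evidence code); the sample records below are illustrative, not taken from the real UniProt data.

```python
# Sketch: pulling electronic (IEA) annotations out of a GAF file.
# Column numbers follow the GAF 2.x layout: col 2 = DB object ID,
# col 5 = GO ID, col 7 = evidence code. Sample records are made up.

sample_gaf = """\
!gaf-version: 2.0
UniProtKB\tP9WQP3\tkatG\t\tGO:0004096\tGO_REF:0000002\tIEA\t\tF\t\t\tprotein\ttaxon:83332\t20100101\tInterPro\t\t
UniProtKB\tP9WIE5\tahpC\t\tGO:0008379\tPMID:12345678\tEXP\t\tF\t\t\tprotein\ttaxon:83332\t20100101\tUniProt\t\t
"""

def iea_annotations(gaf_text):
    """Yield (accession, go_id) pairs whose evidence code is IEA."""
    for line in gaf_text.splitlines():
        if line.startswith("!"):        # skip header/comment lines
            continue
        cols = line.split("\t")
        if len(cols) > 6 and cols[6] == "IEA":
            yield cols[1], cols[4]

print(list(iea_annotations(sample_gaf)))
# -> [('P9WQP3', 'GO:0004096')]
```

On the real file, `grep -P '\tIEA\t'` over the GAF would do essentially the same job.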

So, having additionally provided them with a list of unannotated genes, let's see how it turns out.

2010-02-03

Breaking a promise

You would perhaps think from the above that I have training as a biologist or in medicine. No, I'm carrying thirty years of IT, and it's hard, believe me. I long ago decided never to maintain a software project myself again, only to submit patches until the end of time … ah, what a dream.

So, naïve as usual, I join OSDD.

The connect2decode project @osdd had already started. It is about building a machine-readable knowledge base covering all 3,951 proteins that can be produced by the genome of a specific lab strain (H37Rv) of the evil Mycobacterium tuberculosis. This strain is relatively well researched: lots of papers have been written about experiments with it, and it was the first mtb strain to be sequenced. The problem is this: to identify possible drug targets, one wants the big picture on this bacterium, and lots of microarray experiments are planned. But to make sense of the huge information flow from these experiments, one needs computer help. For such reasons, an ontology was developed: a categorization of all possible functions proteins can have, processes they take part in, and compartments they can be localized to. This is fed into a knowledge base which, when finished, can answer good questions, find patterns, and so on. It's an application of artificial intelligence research.

I just noticed that the UniProt data on H37Rv contains 250 proteins marked as high-confidence drug targets by targetTB. Is this not enough? Oh well.

When I first read about OSDD, a press release celebrated the first 700 volunteers for the project, most of whom had started annotating. First, I did some preliminary work on the Wiki; see my user page. Then a plea in the Google group asked for more people to join in, so I did. When I looked around, only about 70-80 people remained in the annotation project. What happened? Apparently, they had all bailed out. When I started my first work, I saw that many had not even begun. When it came to finally inputting the data a second time (after peer review on a spreadsheet), I slowly began to understand what made the project a senseless grinding machine. To understand this yourself, bear with some more explanations from me.

To get a complete annotation, reading papers doesn't suffice: papers cover less than ten per cent of mtb proteins. To the rescue comes structural biology software that can tell you, for more than 90 per cent of proteins, what they probably do, above all enzyme function (for the rest it fails utterly, but that's not the point). The computed function can be labelled with an ontology annotation, so, voilà, 90 per cent of proteins can have annotations. No one cares that they aren't worth as much as real experiments, as long as they're labelled as computed. Such annotations are computed (using heavy clusters) and collected by the UniProt database. Now, the plan went like this (I'm guessing): we have so many volunteers who read papers and write annotations from them; if they don't find a paper, well then, they just copy the electronic annotations from UniProt. Every protein gets an annotation this way. Yes, that's fine if copying doesn't take much effort. But it's only a few papers and 99 per cent copying. And copying not once but twice. And, hell, that's where I bailed out: you could be finished already if you wrote a little filter for the freely available UniProt data. You could even stay up to date this way and apply the same to other mtb strains.

So what did I do?

It's software maintenance again …

So, I changed my blog from LiveJournal to Google's Blogger. Honestly, its reachability, as measured by searching for it, was poor anyway.

So, let's see if that changes.

The other reason for starting afresh was that I became engaged with The Effort Against Tuberculosis[tm] (though IMHO it's not only M. tuberculosis that is a nuisance, but also M. bovis, M. paratuberculosis, M. ulcerans, M. leprae, and maybe others) in that I joined the OSDD (Open Source Drug Discovery) collaboration in December.

The OSDD, as far as my feeble understanding goes, is supported by Indian institutions, which makes sense, as they have a lot of TB cases and no inexpensive treatment. I wouldn't want to go through the (Western) gold-standard treatment either, as it involves three heavy medications and half a year of side effects. So a better drug would be quite an advance for us Westerners, too. No need to go into why companies haven't picked that up; they sell their stuff fine, so why should they? All this means some effort is necessary NOW.