2010-12-31
New Year - new version
2010-11-11
MTB-GOA v6
2010-09-10
MTB-GOA v5
2010-09-03
MTB-GOA v3
2010-08-30
Website up, MTB-GOA v2 too
2010-08-28
M.tb. sulfur metabolism at reactome
2010-08-12
2010-08-03
900 papers
The numbers: 6,200 annotations of 899 papers about 2,306 proteins (58% of the genome)
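A quick sanity check of these figures, as a minimal sketch. It assumes the 3951-protein count for H37Rv quoted in an older post on this blog is the right denominator for "the genome"; the variable names are made up for illustration.

```python
# Assumption: 3951 is the H37Rv protein count quoted elsewhere on this blog.
GENOME_PROTEINS = 3951

annotations = 6200
papers = 899
proteins = 2306

# Share of the genome's proteins that have at least one annotation.
coverage = proteins / GENOME_PROTEINS
# Average number of annotations extracted per paper.
per_paper = annotations / papers

print(f"coverage: {coverage:.1%}")
print(f"annotations per paper: {per_paper:.1f}")
```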
So, it is done. Factum est. All papers up to and including 2008 about Mycobacterium tuberculosis proteins (modulo some microarray-only ones and two dozen without an informative abstract) are annotated in GAF format. To achieve the last step, the correct format, I learned Ruby and patched the BioRuby package on the fly, which was fun---so I guess I'll stay with Ruby (and BioRuby) for some time.
Instead of doing The Right Thing[tm] now, which is quality control of the annotations, I'm thinking about fixing the GO parsing in bioruby. Guess what it will be.
2010-06-16
Just had to comment on an article from the 2009 Hindustan Times with an interesting story: an Indian expat wants to reestablish himself in his birth country. First thing he does, he writes a report and tries to capitalize on all the science assets he finds---without syncing with any political peers. Consequently, he is frowned upon and loses his job. Last I saw, his blog was spewing one-sided information at his previous employer-to-be. Now, what does this tell the reader? First, before returning "home", check political compatibility; if it doesn't match, find peers with power before embarking on touchy subjects. It also shows that a standard scientific education does not create political sensibility in a person.
600 papers, 3700 annotations on 1600 genes!
2010-06-10
reaction
Not totally unexpectedly, something has boiled over, and in Nature, no less. This online article gives voice to some people who were not amused by some premature announcements (we wrote about that too). And no wonder the publishing houses grab any opportunity to discredit efforts that might make them obsolete if successful. On the other hand, it could be an opportunity to win more moral ground by pointing at those that didn't participate. So, let's look at the interesting bits.
Nature writes of Brahmachari's highly publicized announcement on 11 April that the project has comprehensively mapped, compiled and verified the genome of M.tb., but gives no reference to this alleged announcement. They would be hard pressed to find one. It was all Indian mass media quoting Brahmachari, and I give him the benefit of the doubt that he was misquoted. I mean, not even the author of the Nature comment appears to have a grasp of what the first phase of TBGO was about.
As we read on, however, clouds accumulate. Brahmachari was misquoted another time? It depends on how successful Raghava was with his neural nets. Probably he had some result, and that was overhyped by Brahmachari. I can now see that.
Next, a description of the TBGO sub-project contains a link to some completely different thing on TBrowse, and Jayaraman doesn't notice it. Talk about a quality paper. But Gokhale, the CSIR director, has no clue either. He apparently wasn't told that "the data" is still not available, the effort of the few dozen remaining participants is stalling, and no one does any real work except that bloody fool from Berlin who single-handedly annotates all 700-800 papers on the subject, and doesn't see a single rupee of the cake!
The rest of the article is opinions.
In my opinion, the real issue is not that some unlucky people frantically try to make the best of a seemingly hopeless situation (you mean bloody students?) and a whopping $32 million. I would never stand in the way of people with such visions. I would rather finish annotating (July), then write my part of the paper, and quietly switch to pathway work at reactome.org. I can guess, however, that there will be more on the subject, if not from the frightened houses.
2010-05-29
milestones
2010-04-14
I'll grab the opportunity to publish data about my private annotation effort which is about halfway through. I have looked at 350 papers of which 240 are now annotated, yielding 810 annotations of 350 gene products. Not even ten percent of the genome.
2010-03-16
a big sigh
Having made my big sigh, I'm now (the?) one who makes GO annotations from the literature on mycobacterial (as in tuberculosis) protein function. I intend to do a complete job, even if I'm not the only one, because I know my level of quality, and comparing two versions will give a better one. This will take two to three months. After that, let's hope someone has written the software we need...
2010-02-24
did something happen?
2010-02-06
Dduh
So, after additionally having provided them with a list of unannotated genes, let's see how it comes out.
2010-02-03
Breaking a promise
So, naïve as usual, I join OSDD.
The connect2decode project @osdd had already started -- this is about building a machine-readable knowledge base about all 3951 proteins that can be produced by the genome of a specific lab strain (H37Rv) of evil Mycobacterium tuberculosis. This strain is relatively well researched---lots of papers have been written about experiments with it, and it was the first mtb strain to be sequenced. The problem is the following: to identify possible drug targets, one wants to get the big picture on this bacterium, so lots of experiments using microarrays are planned. But to make sense of the huge information flow from these experiments, one needs computer help. For such reasons, an ontology was developed---a categorization of all possible functions proteins can have, processes they take part in, and compartments they can be localized to. This is fed to a knowledge base, and the whole thing, when finished, can answer good questions, or find patterns, or such. It's an application of artificial intelligence research.
I just see the UniProt data on H37Rv contains 250 proteins marked as high-confidence drug targets by targetTB. Is this not enough? Oh well.
When I first read about OSDD, a press release celebrated the first 700 volunteers for the project, most of whom had started annotating. First, I did some preliminary work on the Wiki, see my userpage. Then, a plea in the Google group asked for more people to join in. So I did. When I looked around, there remained only about 70-80 people in the annotation project. What happened? Apparently, they all had bailed out. When I started the first work, I saw many not even starting. When it came to finally inputting the data a second time (after peer-review on a spreadsheet) I slowly began to understand what made the project a senseless grinding machine. To understand this yourself, bear with some more explanations from me.
To get a complete annotation, reading papers doesn't suffice. Papers cover less than ten per cent of mtb proteins. To the rescue comes structural biology software that can tell you for >90 per cent of proteins what they probably do, above all enzyme function---for the rest, it fails utterly, but that's not the point. The computed function can be labeled with an ontology annotation, so, voilà, 90 per cent of proteins can have annotations. No one cares that they aren't worth as much as real experiments, as long as they're labelled as computed. Such annotations are computed (using heavy clusters) and collected by the UniProt database. Now, the plan went like this (I'm guessing): we have so many volunteers who read papers and write annotations from them---if they don't find a paper, well then, just copy the electronic annotations from UniProt. Every protein will have an annotation this way. Yes, that's fine if copying doesn't take much. But it's only a few papers and 99 per cent copying. And copying not only once but twice. And, hell, that's where I bailed out: you could be finished already if you wrote a little filter for the freely available UniProt data. You could even stay up-to-date this way and apply the same to other mtb strains.
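Such a "little filter" could look roughly like the sketch below. It assumes the UniProt annotations arrive as a GAF file (tab-separated, comment lines starting with "!", evidence code in column 7, taxon in column 13), that electronic annotations carry the evidence code IEA, and that H37Rv is NCBI taxon 83332 (to my knowledge, but check before relying on it). The function names and the sample rows are made up for illustration and are not real UniProt records.

```python
# Zero-based positions of the GAF columns used here.
OBJECT_ID_COL = 1   # column 2: protein accession
GO_ID_COL = 4       # column 5: GO term
EVIDENCE_COL = 6    # column 7: evidence code
TAXON_COL = 12      # column 13: taxon, e.g. "taxon:83332"


def electronic_annotations(lines, taxon="taxon:83332"):
    """Yield (protein_id, go_id) for electronically inferred (IEA) rows.

    The default taxon is an assumption: 83332 is taken to be H37Rv.
    """
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= TAXON_COL:
            continue  # malformed row, too few columns
        if cols[EVIDENCE_COL] != "IEA":
            continue  # keep only computed annotations
        if taxon not in cols[TAXON_COL].split("|"):
            continue  # keep only the strain of interest
        yield cols[OBJECT_ID_COL], cols[GO_ID_COL]


def make_row(protein, go_id, evidence, taxon):
    """Build a made-up 17-column GAF row, for demonstration only."""
    cols = ["UniProtKB", protein, "", "", go_id, "", evidence, "", "F",
            "", "", "protein", taxon, "20100101", "UniProt", "", ""]
    return "\t".join(cols) + "\n"


# Hypothetical sample data, not real records.
sample = [
    "!gaf-version: 2.0\n",
    make_row("X00001", "GO:0003824", "IEA", "taxon:83332"),
    make_row("X00002", "GO:0005524", "IDA", "taxon:83332"),
    make_row("X00003", "GO:0003824", "IEA", "taxon:99999"),
]

result = list(electronic_annotations(sample))
print(result)
```

Passing a different `taxon` value would apply the same filter to another strain, and rerunning it on a fresh download keeps the copy up to date.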
So what did I do?
It's software maintenance again …
So, let's see if that changes.
The other reason for starting afresh was that I became engaged with The Effort Against Tuberculosis[tm] (but IMHO it's not only M. tuberculosis that is a nuisance but M. bovis, M. paratuberculosis, M. ulcerans, and M. leprae and maybe others as well) in that I joined the OSDD (Open Source Drug Discovery) collaboration in December.
The OSDD, as far as my feeble understanding goes, is supported by Indian institutions, which makes sense as they have a lot of TB cases and no inexpensive treatment. I wouldn't want to go through the (western) gold standard treatment, either, as it involves three heavy medications and half a year of side effects. So, a better drug would be real progress for us westerners, too. No need to go into why companies haven't picked that up: they sell their stuff fine, so why should they? This all means, some effort is necessary NOW.