2010-12-31

New Year - new version

As a first result of analysing the GO slim produced from our annotations, we have moved the genes annotated with the GOID "Actinobacteria-type cell wall biogenesis" to more specific GOIDs. Also, some GOIDs were obsolete and needed reannotation. Finally, add a few new papers and random changes, and you have the new release. Happy and healthy New Year!

2010-11-11

MTB-GOA v6

I took the time to review the very first papers I annotated, and made some additional annotations. I had correctly figured that you only develop a sense for annotation over the months you put into it. Only two were blatantly wrong. So, expect some more additions from such sharp looks back. Meanwhile, some human pathways from me will be visible with the reactome December release, and a reviewer for the M.tb. stuff has been found too, as I hear.

2010-09-10

MTB-GOA v5

There were still formal problems with MTB-GOA, so versions 4 and 5 came in close succession. This allowed adding some features to the software too and so, generally, whetted my coding appetite. No promises!

2010-09-03

MTB-GOA v3

The third (charming) version has 87 duplicates removed and, more importantly, is the product of a Makefile. This means the pipeline is up on my side and can now be fed further. Expect the next release not too soon.
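
For the curious, a duplicate-removal step could look roughly like the sketch below. It is only an illustration: the name `dedup_gaf` is made up, and I am assuming that two lines count as duplicates when GAF columns 2 (object ID), 5 (GO ID), 6 (reference) and 7 (evidence code) all agree. The criterion actually used for the release may differ.

```ruby
# Hypothetical helper: drop GAF lines that repeat an earlier annotation.
# Assumption: two lines are duplicates when columns 2 (object ID),
# 5 (GO ID), 6 (reference) and 7 (evidence code) all match.
def dedup_gaf(lines)
  seen = {}
  lines.select do |line|
    next true if line.start_with?("!")   # keep header/comment lines
    cols = line.chomp.split("\t")
    key = cols.values_at(1, 4, 5, 6)     # zero-based indices of columns 2, 5, 6, 7
    next false if seen[key]              # already emitted: skip
    seen[key] = true                     # first occurrence: keep
  end
end
```

Fed with `File.readlines` on a GAF file, this returns the lines in their original order, with only the first copy of each annotation kept.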

2010-08-30

Website up, MTB-GOA v2 too

We now have a web, um, page for what this blog is about, the MTBBASE, where you will find links to most of the things mentioned here, presented much more nicely than I can do it here---apologies for not tweaking the blogger look, but you can certainly see where the time goes. Version 2 of the annotations (hopefully) addresses most of the feedback I got from the UniProt-GOA group at EBI, notably on evidence codes and the usage of column 8. The same feedback will make further releases necessary, also because guidelines are in flux.

2010-08-28

M.tb. sulfur metabolism at reactome

Not the whole sulfur, but assimilation and cysteine biosynthesis pathways are now at reactome, and a picture too. So, what's next? On the human side, I'll tackle MoCo biosynthesis. On the bacillus, maybe look at which processes have most functional annotations and start from there?

2010-08-12

Wow, quality control got me quite involved in programming again, and prototyping with ruby is really fast. What takes time at this stage of learning a new language is the unexpected language details (say, quirks) and, of course, OO design. So, most quality rules I know of are now applied. I'm waiting for marks from the EBI GOA group, and for further rules to apply. This means, yes, both the Indians and the EBI have the data now, as planned. Meanwhile, stalled work on M.tb pathways in reactome format continues.

2010-08-03

900 papers

The numbers: 6,200 annotations of 899 papers about 2,306 proteins (58% of the genome)
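
The percentage can be checked in two lines. This is just a sketch: `coverage_percent` is a made-up name, and the total of 3951 is the H37Rv protein count quoted in the connect2decode description elsewhere on this blog.

```ruby
# Share of the genome with at least one annotation, in whole per cent.
# Assumption: 3951 is the H37Rv protein count quoted elsewhere on this blog.
GENOME_PROTEINS = 3951

def coverage_percent(annotated, total = GENOME_PROTEINS)
  (100.0 * annotated / total).round
end

coverage_percent(2306)   # => 58
```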

So, it is done. Factum est. All papers up to and including 2008 about Mycobacterium tuberculosis proteins (modulo some microarray-only ones and two dozen without an informative abstract) are annotated in GAF format. To achieve the last step, the correct format, I learned ruby and patched the bioruby package on the fly, which was fun---so I guess I'll stay with ruby (and bioruby) for some time.

Instead of doing The Right Thing[tm] now, which is quality control of the annotations, I'm thinking about fixing the GO parsing in bioruby. Guess what it will be.

2010-06-16

just had to comment on an article from the 2009 Hindustan Times with an interesting story: an Indian expat wants to reestablish himself in his birth country. First thing he does, he writes a report and tries to capitalize all science assets he finds---without syncing with any political peers. Consequently, he is frowned upon and loses his job. Last I saw was his blog spewing one-sided information at his previous employer-to-be. Now, what does this tell the reader? First, before returning "home", check political compatibility; if not matching, first find peers with power before embarking on touchy subjects. It also proves having a standard scientific education does not create political sensibility in a person.

600 papers, 3700 annotations on 1600 genes!

2010-06-10

reaction

not totally unexpectedly, something has boiled over, and in Nature at that, no less. This online article gives a voice to some people who were not amused by some too-early announcements (we wrote about that too). And no wonder the publishing houses grab any opportunity to discredit efforts that might make them obsolete if successful. On the other hand, it could be an opportunity to win more moral ground by pointing at those that didn't participate. So, let's look at the interesting bits.

Nature writes of "Brahmachari's highly publicized announcement on 11 April that the project has comprehensively mapped, compiled and verified the genome of M.tb." but they don't give any reference to Brahmachari's alleged announcement. They would be hard pressed. It was all Indian mass media quoting Brahmachari, and I give him the benefit of the doubt that he was misquoted. I mean, not even the author of the Nature comment appears to have a grasp of what the first phase of TBGO was about.

As we read on, however, clouds accumulate. Brahmachari was misquoted another time? It depends on how successful Raghava was with his neural nets. Probably he had some result, and that was overhyped by Brahmachari. I can now see that.

Next, a description of the TBGO sub project contains a link to some completely different thing on TBrowse, and Jayaraman doesn't notice it. Talk about a quality paper. But Gokhale, the CSIR director, has no clue either. He apparently wasn't told that "the data" is still not available, the effort of the few dozen remaining participants is stalling, and no one does any real work except that bloody fool from Berlin who single-handedly annotates all 7-800 papers on the subject, and doesn't see a single rupee of the cake!

The rest of the article is opinions.

In my opinion, the real issue is not that some unlucky people frantically try to make the best of a seemingly hopeless situation (you mean bloody students?) and a whopping $32 million. I would never stand in the path of people with such visions. I would rather finish annotating (July), then write my part of the paper, and quietly switch to pathway work at reactome.org. I can guess, however, that there will be more on the subject, if not from the frightened houses.


2010-05-29

milestones

today quite an important milestone passed: my private GO annotations for M. tuberculosis reached 500 annotated papers. It is important because that was the rough number I projected first from my starting impression. However, 50 more are in the queue, and the final sweep still not done. So it could easily become 700. As to the other numbers, we have altogether 1950 annotations on 650 gene products, that is 15 per cent of the genome.

2010-04-14

The "genome map" is released! What? Ah, the ontology annotations were announced. The most informative take is at livemint and -- hold it, it says CSIR to unveil gene map (at www.osdd.net) -- so, no, not much really happened except that we know now that to-be-published annotations can be edited Wikipedia-style later. I'm quite sure, however, that most genes already have some annotation now in the putative official version.

I'll grab the opportunity to publish data about my private annotation effort which is about halfway through. I have looked at 350 papers of which 240 are now annotated, yielding 810 annotations of 350 gene products. Not even ten percent of the genome.

2010-03-16

a big sigh

There's a saying by Bismarck: "The 1st generation creates value, the 2nd manages it, the 3rd studies art history, the 4th degenerates." I would like to state a version regarding computer science: "The 1st generation creates algorithms, the 2nd writes and patches app software, the 3rd merely knows how to download and install safely, the 4th blindly believes in computer power." OTOH, there have been people who started at four and made it to position one, all in one person, so don't take the saying as a prediction of the future, although sometimes it looks like it.

Having made my big sigh, I'm now (the?) one who makes GO annotations from the literature on mycobacterial (as in tuberculosis) protein function. I intend to do a complete job, even if I'm not the only one, because I know my level of quality, and comparing two versions will give a better one. This will take two to three months. After that, let's hope someone has written the software we need...

2010-02-24

did something happen?

Silly me. I thought my discovery of the IEA annotations would change something. But nothing happened. I'm asking myself: will there be anything added to the IEAs, at all? Should I go ahead and host a curated annotation database? No, just wait and see first. Now that Sir V knows you're IT, you maybe get a chance to help later when deploying and using the DB.

2010-02-06

Dduh

That's what I get for not researching thoroughly: the software isn't necessary because UniProt provides all annotations in GAF format. As this is text and one record per line, a simple grep does the trick. But it's even more embarrassing for the project leads, as the data is provided by UniProt just for those people that are part of an annotation project.
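
Since I'm playing with ruby anyway, here is roughly what that grep amounts to when done per column rather than per line. Treat it as a sketch under assumptions: the GAF is tab-separated with the taxon in column 13, `taxon:83332` is the ID I take to stand for H37Rv, and the function and file names are made up.

```ruby
# Keep only GAF records for one organism.
# Assumptions: tab-separated GAF with the taxon in column 13 (index 12),
# and taxon:83332 standing for M. tuberculosis H37Rv.
TAXON = "taxon:83332"

def records_for_taxon(lines, taxon = TAXON)
  lines.reject { |l| l.start_with?("!") }   # skip header/comment lines
       .select { |l| l.chomp.split("\t", -1)[12].to_s.split("|").include?(taxon) }
end

# Usage (file name is hypothetical):
# puts records_for_taxon(File.foreach("goa_uniprot.gaf"))
```

Matching on the column instead of the whole line avoids hits where the same string happens to turn up in another field.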

So, after additionally having provided them with a list of unannotated genes, let's see how it comes out.

2010-02-03

Breaking a promise

You would perhaps think from the above that I have training as a biologist or in medicine. No, I'm carrying thirty years of IT, and it's hard, believe me. Long ago I made the decision never again to maintain a software project myself, only submitting patches until the end of time … ah what a dream.

So, naïve as usual, I join OSDD.

The connect2decode project @osdd had already started -- this is about building a machine-readable knowledge base about all 3951 proteins that can be produced by the genome of a specific lab strain (H37Rv) of evil Mycobacterium tuberculosis. This strain is relatively well researched---lots of papers have been written about experiments with it, and it was the first mtb strain to be sequenced. The problem is the following: to identify possible drug targets, one wants to get the big picture on this bacterium, so lots of experiments using microarrays are planned. But to make sense of the huge information flow from these experiments, one needs computer help. For such reasons, an ontology was developed---a categorization of all possible functions proteins can have, processes they take part in, and compartments they can be localized to. This is fed to a knowledge base, and it all, when finished, can answer good questions, or find patterns, or such. It's an application of artificial intelligence research.

I just see the UniProt data on H37Rv contains 250 proteins marked as high-confidence drug targets by targetTB. Is this not enough? Oh well.

When I first read about OSDD, a press release celebrated the first 700 volunteers for the project, most of whom had started annotating. First, I did some preliminary work on the Wiki, see my userpage. Then, a plea in the Google group asked for more people to join in. So I did. When I looked around, there remained only about 70-80 people in the annotation project. What happened? Apparently, the rest had all bailed out. When I started the first work, I saw many not even starting. When it came to finally inputting the data a second time (after peer-review on a spreadsheet) I slowly began to understand what made the project a senseless grinding machine. To understand this yourself, bear with some more explanations from me.

To get a complete annotation, reading of papers doesn't suffice. Papers cover less than ten per cent of mtb proteins. To the rescue comes structural biology software that can tell you for >90 per cent of proteins what they probably do, above all enzyme function---for the rest, it fails utterly, but that's not the point. The computed function can be labeled with an ontology annotation, so, voilà, 90 per cent of proteins can have annotations. No one cares that they aren't worth as much as real experiments, as long as they're labelled as computed. Such annotations are computed (using heavy clusters) and collected by the UniProt database. Now, the plan went like this (I'm guessing): we have so many volunteers who read papers and write annotations from them---if they don't find a paper, well then, just copy the electronic annotations from UniProt. Every protein will have an annotation this way. Yes, that's fine if copying doesn't take much. But it's only a few papers and 99 per cent copying. And copying not only once but twice. And, hell, that's where I bailed out, you could be finished already if you write a little filter for the freely available UniProt data. You could even stay up-to-date this way and apply the same to other mtb strains.
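
To make the "little filter" idea concrete, here is a sketch of how I picture it, not anything the project actually runs. The name `fill_gaps` is made up, and I assume records are already split into GAF columns, with the protein ID in column 2 and the evidence code in column 7, where "IEA" marks the electronically inferred annotations.

```ruby
# Hypothetical sketch of the "little filter" idea.
# curated, uniprot: arrays of GAF records, each already split into columns.
# Assumption: column 2 (index 1) is the protein ID, column 7 (index 6)
# the evidence code; "IEA" marks electronically inferred annotations.
def fill_gaps(curated, uniprot)
  covered = {}
  curated.each { |rec| covered[rec[1]] = true }
  # Take electronic annotations only for proteins without a curated one.
  electronic = uniprot.select { |rec| rec[6] == "IEA" && !covered[rec[1]] }
  curated + electronic
end
```

Run again on each new UniProt release, the same few lines would keep the gap-filling annotations current, and nothing in them is specific to one strain.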

So what did I do?

It's software maintenance again …
So, I changed my blog from livejournal to Google's blogger. Honestly, reachability as measured by searching for it was poor, anyway.

So, let's see if that changes.

The other reason for starting afresh was that I became engaged with The Effort Against Tuberculosis[tm] (but it's IMHO not really only M. tuberculosis that is a nuisance but M. bovis, M. paratuberculosis, M. ulcerans, and M. leprae and maybe others as well) in that I joined the OSDD (Open Source Drug Discovery) collaboration in December.

The OSDD, as far as my feeble understanding goes, is supported by Indian institutions, which makes sense as they have a lot of TB cases and no inexpensive treatment either. I wouldn't want to go through the (western) gold standard treatment, either, as it involves three heavy medications and half a year of side effects. So, a better drug would be real progress for us westerners, too. No need to go into why companies haven't picked that up: they sell their stuff fine, so why should they? This all means some effort is necessary NOW.