Programmed batch download

A forum for discussing Power User related features of FlyBase such as using Chado, GFF, FASTA files, etc...

Post by Richard » Tue Mar 11, 2008 7:49 am

Hi there.

Previous versions of FlyBase (up to v4.3) provided functionality for programmed batch downloads. It was possible, for example, to build URLs like
http://flybase.bio.indiana.edu/cgi-bin/ ... =text/tsv& (now we have to use http://chervil.bio.indiana.edu:7092/cgi ... =text/tsv&)

Are there any plans to implement similar functionality in the post-4.3 releases of FlyBase?
Richard
 
Posts: 4
Joined: Tue Mar 11, 2008 7:39 am

Re: Programmed batch download

Post by Josh Goodman » Thu Mar 13, 2008 4:12 pm

Hi Richard,

We do have one now that is similar to what you describe, but it is undocumented and hasn't been fully vetted yet because our resources have been focused on integrating the 10 new genomes. Now that this is mostly done, getting this service back up is high on our TODO list. It would help us if you and other power users let us know what you are looking for in this type of service. Are you happy with the old method, or would you prefer a REST service like the NCBI Eutils package? Options currently under evaluation include putting our existing old-style service into production (i.e. testing and documenting it) and/or deploying BioMart, which has a nice API.

Whatever the final outcome is, we hope to have something ready by mid to late summer. Thanks for taking the time to contact us about this.

Josh
Josh Goodman
Site Admin
 
Posts: 64
Joined: Mon Nov 26, 2007 2:39 pm

Re: Programmed batch download

Post by Richard » Tue Mar 25, 2008 11:14 am

Dear Josh.

Thanks for your prompt reply. I would say that anything with at least the same functionality, available as soon as possible, would suit our needs best. In the longer term, some API/BioMart/SQL interface would definitely be best.

Looking forward to the update!

Regards
Richard

Re: Programmed batch download

Post by Josh Goodman » Tue Mar 25, 2008 1:36 pm

Hi Richard,

Thanks for your feedback, we really do appreciate it. One point I forgot to mention earlier is that we now offer direct access to our Chado database.

The example batch query you gave can be done using a SQL query like this:
Code:
select f.uniquename as FBID,f.name as SYMBOL,f.timelastmodified as DATE,
        (select s.synonym_sgml
                from feature_synonym fs, synonym s, cvterm cvt
                where fs.is_current=true and cvt.name='fullname'
                      and f.feature_id=fs.feature_id
                      and fs.synonym_id=s.synonym_id
                      and s.type_id=cvt.cvterm_id limit 1) as FULLNAME
        from feature f
        where f.uniquename in ('FBgn0000014','FBgn0000015');
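
For a longer list of IDs, the IN list in the query above can be generated rather than typed by hand. The following is only a minimal Python sketch, not an official FlyBase client: `build_batch_query` is a hypothetical helper, and the `%s` placeholder style assumes a PostgreSQL driver such as psycopg2.

```python
def build_batch_query(fbids):
    """Return (sql, params) for the batch lookup, one placeholder per ID."""
    # Drop duplicate IDs while keeping the caller's order.
    ids = list(dict.fromkeys(fbids))
    if not ids:
        raise ValueError("need at least one FlyBase ID")
    # One "%s" per ID; the driver fills these in, so IDs never enter the SQL text.
    placeholders = ", ".join(["%s"] * len(ids))
    sql = (
        "select f.uniquename as FBID, f.name as SYMBOL, "
        "f.timelastmodified as DATE, "
        "(select s.synonym_sgml "
        "from feature_synonym fs, synonym s, cvterm cvt "
        "where fs.is_current=true and cvt.name='fullname' "
        "and f.feature_id=fs.feature_id "
        "and fs.synonym_id=s.synonym_id "
        "and s.type_id=cvt.cvterm_id limit 1) as FULLNAME "
        "from feature f "
        f"where f.uniquename in ({placeholders})"
    )
    return sql, ids


sql, params = build_batch_query(["FBgn0000014", "FBgn0000015"])
```

The returned pair would then be handed to the driver, for example `cursor.execute(sql, params)`, which keeps the IDs separate from the SQL text.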


Another possible option for this specific example would be to mine our fb_synonym_fb_2008_03.tsv.gz file that is produced for each release. You don't get the last modified date with this approach.
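
Mining the synonym file could look roughly like the sketch below. The column positions (`id_col`, `fullname_col`), the `#` comment convention, and the UTF-8 encoding are all assumptions that should be checked against the file you actually download; `load_fullnames` is a hypothetical helper.

```python
import csv
import gzip


def load_fullnames(path, wanted, id_col=0, fullname_col=2):
    """Map FlyBase ID -> full name for the IDs listed in `wanted`.

    id_col and fullname_col are assumed column positions, not documented ones.
    """
    wanted = set(wanted)
    found = {}
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip blank lines, comment/header lines, and rows that are too short.
            if not row or row[0].startswith("#"):
                continue
            if len(row) <= max(id_col, fullname_col):
                continue
            if row[id_col] in wanted:
                found[row[id_col]] = row[fullname_col]
    return found
```

A call such as `load_fullnames("fb_synonym_fb_2008_03.tsv.gz", ["FBgn0000014", "FBgn0000015"])` would return a dictionary with one entry per ID found in the file.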

Chado is a bit overwhelming, so if you want to find other pieces of information, please do let us know and we can try to help you formulate the necessary SQL.

Cheers,
Josh

Re: Programmed batch download

Post by Richard » Tue Apr 22, 2008 6:40 am

Dear Josh.

Chado seems as if it might do the trick! If you can save me time by helping to formulate the needed SQL, why not?
What we need is the following.
Given a list of CG numbers, for each CG number extract: the FlyBase ID, full name, biological process (with GO IDs/URLs), cellular component (with GO IDs/URLs), molecular function (with GO IDs/URLs), and protein domains (with InterPro IDs/URLs).
(Previously we referenced these fields with "ID","NAM","FNC","CEL","ENZ","PDOM")

It is not the most complex query, but if you can help me with it I can at least bring the software back to where it was with FlyBase v4.3, and then hopefully learn enough to help myself with the rest.

Regards
Richard

Re: Programmed batch download

Post by Josh Goodman » Tue Apr 22, 2008 9:46 pm

To make this slightly easier to understand I'll break this into separate queries.

CG Symbol -> FBgn#
The first query resolves the annotation symbols to FlyBase IDs. The CG symbols are all stored as dbxrefs in Chado. To ensure that you use the current symbol, you need to restrict the query to dbxrefs that belong to the 'FlyBase Annotation IDs' db and for which feature_dbxref.is_current=true.

Code:
select dbx.accession as CG_SYMBOL, f.uniquename as FBID
          from feature f, feature_dbxref fdbx, dbxref dbx, db
          where dbx.accession='CG4832' and fdbx.is_current=true and db.name='FlyBase Annotation IDs' and
                    f.feature_id=fdbx.feature_id and fdbx.dbxref_id=dbx.dbxref_id and dbx.db_id=db.db_id;


Once you have the current FBgn#, it is best to use it for all the subsequent queries; otherwise you will end up with some hairy SQL every time.
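
When resolving a whole list of CG numbers, it helps to check that each one maps to exactly one FBgn# before running the follow-up queries. A minimal Python sketch, assuming `rows` holds the (CG_SYMBOL, FBID) tuples returned by the query above; `index_cg_rows` is a hypothetical helper and the IDs in the usage lines are placeholders, not real records.

```python
def index_cg_rows(cg_numbers, rows):
    """Group (cg_symbol, fbid) result rows by CG number.

    Returns (resolved, problems): `resolved` maps each CG number to its single
    FBgn ID; `problems` maps CG numbers with zero or several hits to those hits.
    """
    hits = {cg: [] for cg in cg_numbers}
    for cg_symbol, fbid in rows:
        # Ignore rows for symbols we did not ask about and repeated hits.
        if cg_symbol in hits and fbid not in hits[cg_symbol]:
            hits[cg_symbol].append(fbid)
    resolved = {cg: ids[0] for cg, ids in hits.items() if len(ids) == 1}
    problems = {cg: ids for cg, ids in hits.items() if len(ids) != 1}
    return resolved, problems


# Placeholder data: CG_A resolves cleanly, CG_B has no hit.
resolved, problems = index_cg_rows(["CG_A", "CG_B"], [("CG_A", "FBgn_demo1")])
```

Anything that lands in `problems` (no current ID, or more than one) is worth inspecting by hand before it goes into the later queries.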

Gene Full Name
The full gene name is stored in the synonym table which is linked to the feature table via feature_synonym.

Code:
select distinct(s.synonym_sgml), f.uniquename
         from feature f, feature_synonym fs, synonym s, cvterm type
         where f.uniquename='FBgn0013765' and fs.is_current=true and type.name='fullname' and
                   f.feature_id=fs.feature_id and fs.synonym_id=s.synonym_id and s.type_id=type.cvterm_id;


A single full name entry can be returned multiple times by that query because it is attributed to many publications via feature_synonym, so we use DISTINCT to return it only once.
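
The effect of DISTINCT can be seen on a toy example. The sketch below uses Python's sqlite3 module with a deliberately cut-down, made-up schema (no cvterm table, is_current stored as an integer, invented rows); it is not the real Chado schema, only an illustration of why the join repeats a name once per publication.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table feature (feature_id integer primary key, uniquename text);
    create table synonym (synonym_id integer primary key, synonym_sgml text);
    create table feature_synonym (
        feature_id integer, synonym_id integer, pub_id integer, is_current integer
    );
    insert into feature values (1, 'FBgn_demo');
    insert into synonym values (10, 'demo full name');
    -- the same synonym attributed to three different publications
    insert into feature_synonym values (1, 10, 100, 1);
    insert into feature_synonym values (1, 10, 101, 1);
    insert into feature_synonym values (1, 10, 102, 1);
""")

# Shared FROM/WHERE part; only the select list differs between the two queries.
JOIN = """
    from feature f, feature_synonym fs, synonym s
    where f.feature_id = fs.feature_id and fs.synonym_id = s.synonym_id
      and fs.is_current = 1
"""
plain = conn.execute("select s.synonym_sgml, f.uniquename" + JOIN).fetchall()
unique = conn.execute("select distinct s.synonym_sgml, f.uniquename" + JOIN).fetchall()
print(len(plain), len(unique))  # prints: 3 1
```

Without DISTINCT the name comes back once per feature_synonym row; with it, the three identical rows collapse into one.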

GO IDs
To get the GO IDs for a gene we have to traverse to the cvterm and dbxref tables via feature_cvterm. This query will return all 3 types of GO terms.

Code:
select distinct(db.name || ':' || dbx.accession) as GOID, cvt.name as term, fcv.is_not, cv.name
          from feature f, feature_cvterm fcv, cvterm cvt, dbxref dbx, db, cv
          where f.uniquename='FBgn0013765' and db.name='GO' and f.feature_id=fcv.feature_id and fcv.cvterm_id=cvt.cvterm_id and
                    cvt.cv_id=cv.cv_id and cvt.dbxref_id=dbx.dbxref_id and dbx.db_id=db.db_id;
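
Since the query returns all three kinds of GO terms in one result set, client code will usually want to split them by the last column (cv.name). A minimal Python sketch, assuming the rows arrive as (GOID, term, is_not, cv name) tuples in the column order of the query above; `group_go_rows` is a hypothetical helper, and marking negated annotations with a "NOT" prefix is just one possible convention.

```python
from collections import defaultdict


def group_go_rows(rows):
    """Group (goid, term, is_not, cv_name) rows by the cv name column."""
    grouped = defaultdict(list)
    for goid, term, is_not, cv_name in rows:
        # is_not flags a negated annotation; keep that visible in the label.
        label = f"NOT {term}" if is_not else term
        grouped[cv_name].append((goid, label))
    return dict(grouped)
```

The keys of the returned dictionary are whatever values appear in the cv.name column, so the three GO aspects can be looked up separately once you have checked how they are spelled in the database.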


InterPro domains
InterPro domains are stored as dbxrefs, so the query is similar to the annotation symbol query above.

Code:
select f.uniquename as FBID, dbx.accession, dbx.description as domain
          from feature f, feature_dbxref fdbx, dbxref dbx, db
          where f.uniquename='FBgn0013765' and fdbx.is_current=true and upper(db.name)='INTERPRO' and
                f.feature_id=fdbx.feature_id and fdbx.dbxref_id=dbx.dbxref_id and dbx.db_id=db.db_id;


This should get you on your way but feel free to let us know if you still have questions.

Cheers,
Josh

Re: Programmed batch download

Post by Richard » Tue May 06, 2008 11:29 am

Dear Josh.

I've updated our software and it works much faster than before :D .

Thanks for your help - working out the SQL would've taken me a long time!

Richard

