Recently, we have noted an increase in users accessing FlyBase via scripts.
When done carelessly, this can overwhelm our server capacity, cause service delays, or even crash our website.
Careless scripting is a leading cause of FlyBase service degradation and outages!
On occasion, we have had to block calls from certain IP addresses
to help ensure that FlyBase remains available for other users.
This is also one reason we have started using a CAPTCHA gatekeeper.
That said, a well-written script doesn't have to give FlyBase any problems!
We assume that most script-users are unaware that they are causing any difficulties.
While using a Python (or similar) script to grab a batch of FlyBase pages will (slowly) get the job done,
there are often faster ways to find what you need that do not stress FlyBase’s servers.
Here are some guidelines on how to be a good citizen when downloading data from our site:
-
Download bulk data files instead of entire FlyBase pages
Find out if the data you are looking for is in one of our bulk data files.
Downloading bulk data files from FlyBase has a low impact on our resources.
The FlyBase FTP site has bulk files containing genome data,
precomputed data files with cross-referenced records for stocks, alleles, clones, transposons, etc.,
as well as Gene Ontology and disease associations for genes and other FlyBase data classes.
A full PostgreSQL dump of all FlyBase data,
using the Chado schema, is also available at the FTP site.
All of these files are updated at each FlyBase release
(see schedule), usually six times each year.
Direct links to many of these FTP resources can be found on our
Archived Data page.
If you have an ongoing need for the most up-to-date version of a certain slice of FlyBase data,
your solution may be a script that you use after each FlyBase release to download a bulk data file.
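Such a script can be very small. The sketch below downloads one bulk file per release; the FTP host and path layout shown here are assumptions for illustration, so check the FlyBase FTP site or the Archived Data page for the actual file locations and names.

```python
# Sketch only: FTP_BASE and the path layout are assumptions -- verify the
# real file locations on the FlyBase FTP site before using this.
import urllib.request

FTP_BASE = "https://ftp.flybase.net/releases"  # assumed base URL


def bulk_file_url(release: str, filename: str) -> str:
    """Build the URL of a per-release bulk file (path layout assumed)."""
    return f"{FTP_BASE}/{release}/precomputed_files/{filename}"


def download_bulk_file(release: str, filename: str, dest: str) -> None:
    """One request per release cycle -- negligible load on FlyBase."""
    urllib.request.urlretrieve(bulk_file_url(release, filename), dest)
```

Running `download_bulk_file` once after each release keeps your local copy current while placing only a single request on FlyBase's servers every couple of months.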
-
Try QueryBuilder
The FlyBase QueryBuilder Tool is a full-featured web interface
to the FlyBase database that requires no SQL. In QueryBuilder you can design queries
as simple or as complex as you need, and filter or intersect data lists in powerful ways.
There are pre-built queries for common requests, instructions for creating queries, tutorials, and examples.
-
Use an API
At FlyBase we have created API (Application Programming Interface) endpoints for many types of data.
If you need to download data associated with a large list of objects (genes, GO terms, etc.),
using an API endpoint within a script is an efficient programmatic way to download data for your list,
while still being considerate of FlyBase resources.
Most of our APIs return data in the JSON format.
API endpoints can also be used directly in a web browser, to retrieve data for a single instance of a gene,
ontology term or ribbon, protein domain, sequence region, etc.
PLEASE NOTE: we ask that scripts calling our API endpoints limit themselves
to no more than 3 API calls per second.
Even this recommended rate is much higher than what we ask of (and can tolerate from) a web scraper (see below).
For more information on FlyBase APIs, including a list of endpoint URLs, please visit the
API Overview at GitHub.
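A simple way to respect the 3-calls-per-second limit is to track the time of the last request and sleep out the remainder of the interval before the next one. In the sketch below, the endpoint path in `default_fetch` is hypothetical; consult the API Overview at GitHub for real endpoint URLs.

```python
# Rate-limited API loop. The endpoint URL in default_fetch is an
# assumption -- see the FlyBase API Overview at GitHub for real paths.
import json
import time
import urllib.request

API_BASE = "https://api.flybase.org"  # assumed base URL
MIN_INTERVAL = 1.0 / 3  # never more than 3 calls per second


def default_fetch(object_id: str) -> dict:
    """Fetch JSON for one object (endpoint path is hypothetical)."""
    with urllib.request.urlopen(f"{API_BASE}/api/v1.0/gene/{object_id}") as resp:
        return json.load(resp)


def fetch_all(object_ids, fetch=default_fetch):
    """Call the API once per ID, sleeping so calls stay under the limit."""
    results = {}
    last_call = 0.0
    for object_id in object_ids:
        # Sleep out whatever remains of the minimum interval.
        wait = MIN_INTERVAL - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()
        results[object_id] = fetch(object_id)
    return results
```

With this pattern, even a list of thousands of IDs finishes in minutes while staying within the requested rate.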
-
Ask us for help!
We know our data better than anyone. Often, one of our developers can quickly design an SQL query
that pulls exactly the data you need directly from our database. If you think
we can help with your data need this way, please contact us and ask.
In most cases your email will be answered within a day.
If bulk files, QueryBuilder, or APIs don't meet your needs, you may decide that you must use a script
to “scrape” data from our site.
The most important guideline we have for script authors is to please slow down.
Full page HTML requests (such as FlyBase reports) require a lot more server resources
than API calls or bulk file downloads.
Scripts that call several report pages each second can quickly overload our site.
Please limit the rate at which your script makes these requests to not more than ONE page every 5 seconds.
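If you must scrape, pausing between page requests takes only a couple of lines. The sketch below is a minimal example of that pattern, not a complete scraper:

```python
# Polite scraper sketch: space full-page requests 5 seconds apart,
# per FlyBase's guideline for script authors.
import time
import urllib.request

PAGE_DELAY = 5.0  # at most one page every 5 seconds


def fetch_page(url: str) -> bytes:
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def polite_scrape(urls, fetch=fetch_page, delay=PAGE_DELAY):
    """Fetch pages one at a time, pausing between requests."""
    pages = {}
    for n, url in enumerate(urls):
        if n:  # no pause before the first request
            time.sleep(delay)
        pages[url] = fetch(url)
    return pages
```

The same pattern (and the same courtesy) applies to any database website you scrape, not just FlyBase.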
FlyBase maintains a page at GitHub with information on various ways to get large-scale data from our site; please visit it.
Most of these pointers (or some version of them) should apply to other database sites as well.
In particular, it's a good idea to look for APIs or bulk data files on a site before you write a script.
Search before you scrape!