Using AI to get FlyBase data

FlyBase News

Like many research communities, the FlyBase community has begun using AI tools. Researchers are asking ChatGPT, Claude, Gemini, Perplexity, and other systems about Drosophila genes, alleles, and phenotypes. At the same time, AI companies are sending automated crawlers to read FlyBase pages.

We want to explain what we are seeing, why AI traffic needs management, why answers from general-purpose AI tools may be incomplete or out of date, and what we are doing to improve how AI works with FlyBase.

AI is already a substantial and steady part of our traffic. During a representative week this June, requests from recognizable AI bots and agents made up about 11% of all traffic to FlyBase. That is roughly one in every nine requests, or about a quarter of a million AI requests every day. The proportion stayed steady throughout the week. This is also a conservative estimate, since it only includes AI traffic that openly identifies itself. The largest sources were Meta, OpenAI, Anthropic, ByteDance, and Perplexity.

The traffic falls into two main categories. Most of it, about 96%, is automated harvesting. These are crawlers reading FlyBase in bulk to train AI models or build the indexes used by AI search tools. A smaller share, about 4%, is live look-up traffic. In those cases, an AI assistant is trying to fetch FlyBase at the moment someone asks a question, usually to ground its answer in our data.

Why we have to limit AI traffic

FlyBase is a publicly funded resource, and we run on finite infrastructure. AI crawlers do not only visit lightweight pages. They often go straight to some of our most expensive pages and tools, including gene and allele reports, the wiki, search result lists, legacy CGI tools, the JBrowse genome browser, and our API.

These are dynamic, database-backed pages. They take real computing time to generate. Serving more than a quarter of a million machine requests per day, on top of normal use by researchers, is not sustainable at full speed. Scaling up our infrastructure primarily to serve model-training crawlers would also not be a good use of public funds intended to support the research community.

So we manage this traffic. We rate-limit major AI crawlers so that no single crawler can consume the site. We also stop traffic that exceeds those limits, block known abusive scrapers, and keep bots away from our heaviest tools. By “stop,” we mean that a request does not get through, whether because of a block, a CAPTCHA, or a security challenge. For an automated agent, those all have the same practical result: it cannot access the page.
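For readers unfamiliar with rate limiting, the general idea can be sketched with a token bucket kept per crawler. This is an illustration only, not a description of how FlyBase actually enforces its limits. The class names, the rate and burst values, and the agent names are all hypothetical.

```python
class TokenBucket:
    """Allows short bursts but caps the sustained request rate."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec      # tokens refilled per second
        self.capacity = burst         # maximum tokens the bucket can hold
        self.tokens = float(burst)    # start with a full bucket
        self.last = None              # time of the previous request

    def allow(self, now):
        # Refill according to the time elapsed since the last request.
        if self.last is not None:
            elapsed = max(0.0, now - self.last)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend one token per request; with no tokens left, the request is stopped.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class CrawlerLimiter:
    """One bucket per crawler, plus a blocklist (hypothetical policy)."""

    def __init__(self, rate_per_sec, burst, blocked_agents=()):
        self.rate = rate_per_sec
        self.burst = burst
        self.blocked = set(blocked_agents)
        self.buckets = {}

    def decide(self, agent, now):
        # Known abusive scrapers are stopped outright.
        if agent in self.blocked:
            return "stop"
        # Each crawler gets its own bucket, so one crawler cannot use another's share.
        bucket = self.buckets.setdefault(agent, TokenBucket(self.rate, self.burst))
        return "serve" if bucket.allow(now) else "stop"


if __name__ == "__main__":
    limiter = CrawlerLimiter(rate_per_sec=1.0, burst=2, blocked_agents=["BadScraper"])
    for t in (0.0, 0.0, 0.0, 1.0):
        print(t, limiter.decide("ExampleBot", now=t))
```

In this sketch a crawler that paces its requests is served every time, while one that sends bursts faster than the refill rate sees some of its requests stopped. That mirrors the difference between well-paced and aggressive crawlers.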

How strongly we push back depends on how a crawler behaves. A well-paced crawler such as OpenAI’s is served almost every time. A more aggressive crawler such as Perplexity’s is stopped about two-thirds of the time. Well-behaved AI requests within reasonable limits still get served. But the limits are real, and they are necessary to keep FlyBase fast and available for the people doing the science.

Why an AI assistant's answer about FlyBase may be wrong

This is the most important point for users.

First, a model's knowledge is frozen in time. A chatbot is trained on a snapshot of the web. FlyBase changes with every release. Gene symbols are updated, alleles and annotations are revised, and new data is added. If a model answers from memory, its answer may be months or years out of date. It may not tell you that.

Second, live look-ups do not always get through. When an AI assistant tries to read flybase.org in real time, that request hits the same protections described above. In our measurements, live AI look-ups were the AI traffic most likely to be stopped: about one in ten overall, and roughly one in three for Claude's live look-ups specifically. These requests were usually stopped by a security challenge that the automated fetcher could not complete.

When that happens, the assistant usually does not say, “I could not reach FlyBase.” Instead, it may answer from its older internal knowledge.

The practical result is simple: information about FlyBase that comes through a general-purpose chatbot's web interface cannot be guaranteed to be correct or current. Please treat it as a lead to verify, not as a citation.

If you want to work with FlyBase data locally, especially with tools such as Claude Code, Codex, or other local AI-assisted workflows, please use the bulk data files available from the Downloads menu at the top of flybase.org. These files are designed for this kind of use and are much easier on FlyBase infrastructure than repeatedly scraping pages from the live website.
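As a rough illustration of the local approach, the sketch below reads a downloaded bulk file once and answers look-ups from memory, with no further requests to the live site. It assumes a tab-separated file, optionally gzip-compressed, in which header and comment lines start with "#". Check the actual file you download, since its layout and column order may differ. The file name and column position in the usage note are hypothetical.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def read_rows(path):
    """Yield each data row as a list of fields.

    Assumption: fields are tab-separated, and blank lines or lines
    starting with '#' are headers or comments to skip.
    """
    with open_text(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line or line.startswith("#"):
                continue
            yield line.split("\t")


def index_by_column(path, key_col):
    """Group rows by the value in column key_col (0-based)."""
    index = {}
    for row in read_rows(path):
        if key_col < len(row):
            index.setdefault(row[key_col], []).append(row)
    return index
```

For example, index_by_column("example_download.tsv.gz", 0) would build a dictionary keyed on the first column of a hypothetical downloaded file. Every later look-up is then a local dictionary access, not a page fetch.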

What we are exploring

Rate-limiting general chatbots is not a complete solution. We recognize that AI can be genuinely useful, and we are actively exploring better ways to use it with FlyBase.

No single approach has been finalized yet, but one of the directions we are pursuing is to build a FlyBase-native "chat" assistant. This would be hosted by FlyBase and would have direct access to our databases, allowing it to answer using current data rather than scraped pages or stale model memory.

We are also evaluating other options, such as an MCP server that would let AI tools query FlyBase through a supported, structured interface instead of scraping pages.

Whichever path we take, the goal is the same: AI answers that are accurate, up to date, and delivered without overwhelming the public site. This work is ongoing, and we will share more as it develops.

Our advice for now

Use AI to orient yourself, but verify anything that matters. Before relying on a gene symbol, allele, phenotype, or reference from a chatbot, please check it against the record on flybase.org.

When FlyBase's own AI tools become available, those will be the methods we can properly support. Until then, treat general AI tools as starting points only.