Hi Randall,
Chado does require a lot of joins, as shown in our field mapping tables on the GMOD wiki.
Before I get into specifics of hardware, let me give you some background information about FlyBase that influences how we operate and actually use Chado. FlyBase is made up of 3 sites: one at Cambridge University in the UK, one at Harvard University, and one at Indiana University (IU). The first two sites are primarily tasked with curation and data management, while IU is primarily responsible for the website and other public services of FlyBase. The data flow starts with curators at Cambridge and Harvard inputting data into the master Chado database at Harvard. Roughly every 5 weeks Harvard freezes the database and sends a dump to IU, from which we produce each release. Thus, the database servers at Harvard are geared for both reading and writing, whereas IU is strictly a read-only environment.
Another point I'd like to make is that while Chado is very good at storing and managing genome data, one of its weaknesses can be query performance. This is a general relational database issue rather than a strictly Chado one. We get around it by creating a denormalized search database and by pregenerating all the HTML reports for the website, which gives us the performance and scalability that our website requires. Thus, our Chado interaction is limited to a one-time dump of all the data we need (in ChadoXML format using XORT), after which we work from our highly optimized sources. This type of setup will obviously not work as well if you want live edits to be reflected immediately on the website and in the search database.
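To make the denormalization idea concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is not the FlyBase pipeline: the three source tables are heavily simplified stand-ins for Chado's feature, organism, and cvterm tables, and the `gene_search` table, the function names, and the sample row are all hypothetical.

```python
import sqlite3


def build_search_table(conn):
    """Flatten the normalized tables into one indexed, search-friendly table."""
    conn.executescript("""
        DROP TABLE IF EXISTS gene_search;
        -- Do the joins once, at build time, instead of on every search.
        CREATE TABLE gene_search AS
        SELECT f.uniquename                AS id,
               f.name                      AS symbol,
               o.genus || ' ' || o.species AS organism,
               c.name                      AS feature_type
        FROM feature f
        JOIN organism o ON o.organism_id = f.organism_id
        JOIN cvterm   c ON c.cvterm_id   = f.type_id;
        CREATE INDEX gene_search_symbol_idx ON gene_search (symbol);
    """)


def demo():
    """Load one made-up row, build the flat table, and search it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Simplified, assumed schema; real Chado tables have more columns.
        CREATE TABLE organism (organism_id INTEGER PRIMARY KEY,
                               genus TEXT, species TEXT);
        CREATE TABLE cvterm   (cvterm_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE feature  (feature_id INTEGER PRIMARY KEY,
                               organism_id INTEGER, type_id INTEGER,
                               uniquename TEXT, name TEXT);
        INSERT INTO organism VALUES (1, 'Drosophila', 'melanogaster');
        INSERT INTO cvterm   VALUES (10, 'gene');
        INSERT INTO feature  VALUES (100, 1, 10, 'GENE0001', 'geneA');
    """)
    build_search_table(conn)
    # A search now touches a single table and needs no joins.
    return conn.execute(
        "SELECT id, symbol, organism, feature_type "
        "FROM gene_search WHERE symbol = ?",
        ("geneA",),
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

The trade-off is the one described above: the flat table is only as fresh as the last rebuild, which suits a periodic release cycle but not live editing.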
Having said all that, here is what I can tell you about the hardware requirements for Chado at IU.
Disk requirements
The current release of FlyBase (FB2008_10) with 12 Drosophila genomes takes up ~40 GB of disk space once it is imported and indexed in PostgreSQL. I also generally figure another 10-20 GB of required disk space for temporary indices during loading and vacuuming. The recommendation here is to get the fastest, highest-capacity disks you can given your budget. If you are looking at a system with 6 or fewer disks, I would opt for a RAID 1+0 or 0+1 setup over RAID 5 for performance reasons.
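As a rough back-of-the-envelope check of the disk figures, the sketch below adds the database size to the scratch space and compares it with the usable capacity of the two RAID layouts. The 100 GB disk size and the function names are assumptions for illustration only; the capacity formulas are the standard ones (RAID 1+0 keeps half the raw space, RAID 5 gives up one disk's worth to parity).

```python
def required_disk_gb(database_gb, scratch_gb):
    """Database size plus temporary space for loading and vacuuming."""
    return database_gb + scratch_gb


def usable_capacity_gb(level, disks, disk_gb):
    """Usable capacity of an array of identical disks."""
    if level == "raid10":
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID 1+0 needs an even number of disks, at least 4")
        return (disks // 2) * disk_gb   # every disk is mirrored
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_gb    # one disk's worth goes to parity
    raise ValueError("unknown RAID level: " + level)


if __name__ == "__main__":
    need = required_disk_gb(40, 20)     # ~40 GB database + up to 20 GB scratch
    for level in ("raid10", "raid5"):
        have = usable_capacity_gb(level, disks=6, disk_gb=100)  # assumed disks
        print(level, "usable:", have, "GB; needed:", need, "GB")
```

With capacity to spare in either layout, the choice comes down to performance rather than space, which is why RAID 1+0 is the suggestion here.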
Memory
The more memory you can dedicate to PostgreSQL the better. Our servers typically have 4-6 GB of memory on a machine that does nothing but serve Chado, and they only handle one query at a time. To make use of that memory we raise the work_mem setting so that queries don't result in lots of hits to the disk. Keep in mind that work_mem is not a server-wide total: each sort or hash step in each running query can use up to that amount. If you want to put Drupal on top of Chado, you will need to lower it to a level you think is reasonable given your expected query load.
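One way to pick a starting value is to divide the memory you are willing to give to query work by the number of queries you expect to run at once. The sketch below does only that arithmetic; the function name, the `ops_per_query` allowance (for queries with more than one sort or hash step), and the example numbers are assumptions, not measured FlyBase settings.

```python
def work_mem_budget_mb(ram_for_queries_mb, concurrent_queries, ops_per_query=1):
    """Rough per-operation work_mem budget, in whole megabytes.

    ram_for_queries_mb: memory set aside for sorts and hashes, after
        leaving room for the OS, shared buffers, and other services.
    concurrent_queries: how many queries you expect to run at once.
    ops_per_query: assumed number of memory-hungry steps per query.
    """
    if concurrent_queries < 1 or ops_per_query < 1:
        raise ValueError("counts must be at least 1")
    return ram_for_queries_mb // (concurrent_queries * ops_per_query)


if __name__ == "__main__":
    # Dedicated server running one query at a time.
    print(work_mem_budget_mb(2048, concurrent_queries=1, ops_per_query=2))
    # Same machine behind a web front end with 20 simultaneous queries.
    print(work_mem_budget_mb(2048, concurrent_queries=20, ops_per_query=2))
```

Treat the result as a ceiling to test against real queries rather than a value to set blindly.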
CPU
Get as many single- or multi-core CPUs as you can afford. Most of our servers are older dual-CPU systems in the 2.5-3 GHz range, and they handle our existing load without any issues.
The Harvard group will be posting their hardware setup in a separate post.
Let us know if you have any other questions.
Josh
P.S. I'd highly suggest coming to the Jan 2009 GMOD meeting if you are just getting started with Chado. A few other FlyBase folks and I will be there.