Ah, good questions. If only I had answers.
As for the data set, it is likely to be far larger than our memory capacity (perhaps as much as a few TB in total, though hopefully we will only need to work on some small subset of that at a time).
As for the software, well... it's not entirely clear at this point. Most likely we will be writing most of the analysis tools ourselves (obviously borrowing heavily from existing libraries). Essentially we will be mirroring a fairly large database locally, and then writing scripts that parse that data, looking at particular fields and attempting to find patterns.
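To give a rough idea of the kind of script I mean, here is a minimal sketch in Python. It assumes the mirrored data has been dumped to CSV, and the column name "status" and the helper names are made up purely for illustration; the real format and fields are still undecided. The point is only that rows are read one at a time, so the data never has to fit in memory.

```python
import csv
import io
from collections import Counter


def stream_csv(handle):
    """Yield one row (as a dict) at a time from an open CSV file handle."""
    yield from csv.DictReader(handle)


def count_field_values(rows, field):
    """Tally how often each value of `field` occurs, holding only the tally in memory."""
    counts = Counter()
    for row in rows:
        counts[row[field]] += 1
    return counts


# Tiny in-memory stand-in for a table dump; the column names are hypothetical.
sample = io.StringIO("id,status\n1,ok\n2,error\n3,ok\n")
print(count_field_values(stream_csv(sample), "status").most_common())
```

In practice the StringIO would be replaced by an open file (or a database cursor), and the tally by whatever pattern-finding we end up needing. If the work really is a single sequential pass like this, I would guess disk throughput matters more than RAM, but that is exactly the sort of thing I'm hoping you can confirm.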
It sounds like you guys have a good deal of experience in optimizing this stuff, which is wonderful. I've fallen so far out of the loop, though, that even very basic advice would be nice. Say, for example: what class of motherboard should I buy for the various setups?