Rusty on hardware - building data-crunching PC - Tilted Forum Project Discussion Community

hiredgun · 03-25-2008, 06:37 AM

Five or six years ago I used to build PCs (budget gaming or general home use) but I have been out of the game so long that it feels a bit intimidating trying to get back into it. I am not so lazy that I will simply ask you all to come up with full builds, but I could certainly use a few pointers to help me back on my feet.

What I need to do is build a new PC whose sole purpose will be to process very large amounts of data. So graphics is a non-issue; a very cheap optical drive will suffice; and the box will definitely be running Linux (probably Debian or openSUSE) as the primary OS.

My instinct tells me that since intense data analysis is computationally intensive, I want to jump for a quad core or even dual quad core. Is this correct? (I am potentially worried that a Linux distro might not take full advantage of the most modern processors... so am I better off with a dual core?)

My guess is also that I should get a decent, but not necessarily gaming-strength amount of RAM - say 2GB.

For the hard drive, I would eventually like a hardware RAID array to insure against the loss of data, but for the moment a single ordinary high-speed drive will do. 7200rpm is what I should be looking for in terms of speed, right? Or are faster drives widely available now at decent prices?

For anyone who can't take the time to give specific hardware recommendations, I would also truly appreciate being pointed to any particular threads or articles that give a concise overview of the current state of hardware... what are the common features of today's motherboards, what chipsets are now in wide use, what kinds of RAM are available now... etc. If such a resource exists, that is.

Nimetic · 03-27-2008, 03:20 AM

Well, there are a lot of multi-processor Linux servers out there - even ones using the old style of one 'core' per chip. On that basis I'd think Linux will be fine with a multi-core CPU. I like Solaris myself - but Linux has been around for a couple of decades now.

Because I don't know what you intend to run, I can't say what the bottleneck of the system might be. In my world (database land) it's rare that the disk systems can keep more than a few CPUs fed. But that's partly because of the way the systems are set up, and how the queries are written (often badly).

So getting back to your system. Can you define
- how big your program is
- how big your data set is (will it fit in memory... might it fit in CPU cache)

These are the factors that drive your decision, i.e. whether to spend on faster memory or more cores, or fewer cores with more cache, etc.

Redjake · 03-27-2008, 02:48 PM

The main factor you need to consider is not hardware - it's what your software applications will utilize. No reason to get a quad-core CPU if your data-cruncher won't even use it.

Reminds me of the days when dual-CPU PIIIs came out and everyone jumped on them, only to realize Quake III was about the only game that would ever even touch the second CPU.

Make sure your application will utilize the hardware, then we can spec it out.

Now is the right time to purchase - you can get an insane amount of power for cheap!!!!

hiredgun · 03-27-2008, 07:16 PM

Ah, good questions. If only I had answers.

As far as the data set, it is likely to be far larger than our memory capacity (perhaps as much as a few TBs... hopefully only some small subset of that).

As for the software, well... it's not entirely clear at this point. Most likely we will be writing most of the analysis tools ourselves (obviously borrowing heavily from existing libraries). Essentially we will be mirroring a fairly large database locally, and then writing scripts that parse that data looking at particular fields and attempting to find patterns.

It sounds like you guys have a good deal of experience in optimizing this stuff, which is wonderful. I've fallen so far out of the loop though that even very basic advice would be nice. Say, for example... what class of motherboard would I buy for various setups?

Martian · 03-28-2008, 05:36 PM

Are there any budget concerns here, or is it simply a case of 'build the machine that will parse this data in the shortest possible time?'

If you're mirroring off-site data to an on-site network drive, network setup is going to be a crucial factor. Local storage should probably be optimized for speed rather than capacity. Setting up a striped array of Western Digital Raptors will give you a nice fat pipe with low seek times, which should be helpful. Feed the system through a gigabit ethernet connection to try to minimize bottlenecking there.

Linux is your OS of choice if you're writing your own software, which I imagine you've already concluded. I don't know that Debian or Suse will be the best distro though; you're probably going to be better off with a minimalist distro like Slackware. I like Debian (and am writing this on a box running Ubuntu), but from the sounds of it you don't really need the package management or all the extras, so why not go with a distro that doesn't use all that stuff? Less is more here.

Linux is fully capable of multithreading, but bear in mind that your software will need to be optimised to take advantage of it. From a logical standpoint, two or four cores on one processor is the same as two or four processors with one core each. I've never done any coding in this vein, so I don't know what you'll need to do to take advantage of it, but I'd imagine that there's a library and documentation available on the interwebs. I'd research that; once you have an idea of how difficult it will be to optimise your code to take full advantage of multiple cores, you'll be able to get a better handle on how beneficial a dual or quad core system will be. If one ignores cost as a factor, however, the Intel processors are far superior to the AMD equivalents in the area of raw data processing.
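As one illustration of what such a library can look like, here is a minimal sketch using Python's standard multiprocessing module. The language choice, the comma-separated record layout, and the names FIELD, TARGET, count_matches, split_chunks and parallel_count are all assumptions made up for the example, not anything decided in this thread.

```python
from multiprocessing import Pool

FIELD = 1          # hypothetical: index of the column to inspect
TARGET = "error"   # hypothetical: value we are looking for


def count_matches(lines):
    """Count the lines in one chunk whose FIELD column equals TARGET."""
    hits = 0
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) > FIELD and fields[FIELD] == TARGET:
            hits += 1
    return hits


def split_chunks(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_count(lines, workers=4, chunk_size=1000):
    """Count matches across all lines using a pool of worker processes."""
    chunks = split_chunks(lines, chunk_size)
    with Pool(processes=workers) as pool:
        # Each chunk is handed to a separate process, so several cores
        # can work at once; the partial counts are summed at the end.
        return sum(pool.map(count_matches, chunks))


if __name__ == "__main__":
    sample = ["1,error,disk\n", "2,ok,net\n", "3,error,cpu\n"] * 1000
    print(parallel_count(sample, workers=4, chunk_size=500))
```

Whether this helps in practice depends on the work per record: if the job is mostly waiting on the disk, extra worker processes will add little.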

As for a motherboard, my initial thought is something built around an X38, such as the Asus P5E3. I'm going to have to do some research into architecture before I can be completely sure on that, though.

Nimetic · 03-29-2008, 12:03 AM

>>>Essentially we will be mirroring a fairly large database locally,

Ok. How often do you have to sync? This could be a painful process. Gigabit ethernet is fast - but I'm not sure if it's fast enough. It's worth calculating the transfer time.
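A rough sketch of that calculation, in Python. The 80% efficiency figure is an assumption standing in for protocol overhead and everything else that keeps a link below its nominal rate, and the function name transfer_hours is made up for the example; measure your own network before trusting the numbers.

```python
def transfer_hours(size_tb, link_gbps=1.0, efficiency=0.8):
    """Estimate the hours needed to copy `size_tb` terabytes over a link.

    `efficiency` is an assumed fraction of the nominal link rate that is
    actually achieved; it is a guess, not a measurement.
    """
    size_bits = size_tb * 1e12 * 8            # decimal terabytes to bits
    usable_bps = link_gbps * 1e9 * efficiency  # usable bits per second
    return size_bits / usable_bps / 3600


if __name__ == "__main__":
    for tb in (0.5, 1, 2, 3):
        print(f"{tb} TB: about {transfer_hours(tb):.1f} hours")
```

Under these assumptions a full copy of 1 TB takes a little under three hours, so a few TB is an overnight job unless only the changes are synced.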

>>>and then writing scripts that parse that data looking at particular fields and attempting to find patterns.

On this note... here are the questions. These are just informal weekend thoughts, ok. I'm on my day off (and am just running on minimum coffee).

- How much data do you expect to hold in memory at one time?

- Can you realistically work on a small chunk of data at a time,
independent of the rest? (This has implications for your programming too.)

- Will you be doing large hashes/sorts or other calculations that will spill to disk,
causing contention between reading and writing activity?
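To make the "small chunk at a time" question concrete, here is a minimal Python sketch that tallies one column of a CSV stream while keeping only one batch of rows in memory (plus the running totals). The CSV layout, the column name "status" and the helper names batches and count_field are assumptions for illustration only.

```python
import csv
import io
from collections import Counter
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows, reading lazily from `rows`."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, size))
        if not batch:
            return
        yield batch


def count_field(stream, field, batch_size=10_000):
    """Tally the values of one column, one batch of rows at a time."""
    totals = Counter()
    reader = csv.DictReader(stream)  # reads rows on demand, not all at once
    for batch in batches(reader, batch_size):
        totals.update(row[field] for row in batch)
    return totals


if __name__ == "__main__":
    # A tiny in-memory stand-in for a file far too large to load whole.
    data = io.StringIO("id,status\n1,ok\n2,error\n3,ok\n")
    print(count_field(data, "status", batch_size=2))
```

This only works when each batch can be processed independently of the rest; a global sort or a join across batches is the kind of operation that would spill to disk.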

On CPUs... I suggest you look at the Intel models, unless you are scaling past 4 cores.

The AMD/Intel commodity processors are well compared on tomshardware.com by the way, i.e. they have graphs. Anandtech is another site. Although, like Tom's, they only deal with retail/commodity processors.
