Building Unorthodox Deep Learning GPU Machines

Today, I’m going to write a bit about cogito, the GPU cluster machine we run at Driveline Baseball. I’m going to timebox these blog posts so that I actually get them out (and save my sanity), so this first one will focus primarily on our setup and how I procured the parts.

Future posts will contain benchmarks of the PEX 8780 (PLX) cards, NVLink adapters, and more.

8x RTX 3090s form the backbone of cogito

eBay Sales are All You Need

When you look at building a deep learning machine, almost all of the information out there is based around consumer-grade parts that the typical gaming enthusiast is familiar with: Threadrippers, premium Intel CPUs, gaming boards catered to working with SLI bridges and PCIe risers, and so forth. These work, but there’s massive competition for them amongst gamers, power users, and deep learning enthusiasts alike.

You know what machines are NOT in high demand, and are specifically being decommissioned all the time? Last-gen rackmounted server hardware. There is literally a huge industry built around ripping these things out of racks and reselling them online and at local PC recyclers. Driveline Baseball is built around these last-gen machines – we have two full racks on-site in Kent, WA and a full rack in our co-location facility in downtown Seattle.

But because they’re arcane and not something your typical power user knows about, they go ignored.

Before I extol the virtues of these machines and this approach, let me first point out two very significant drawbacks:

  1. These machines are VERY loud – they are not built for comfort, they’re built to run 24/7 in a datacenter with people rarely near them
  2. They often require 208V power to run optimally, and in some cases need it to run at all (a standard US household outlet supplies 120V)

With that out of the way, let’s focus on the machines I ended up buying to run our cluster. Even for rackmounted hardware, these things were weird:

I ran across this eBay listing and was intrigued. The price changed off and on but was typically around $500. I had never heard of Cirrascale, and when I did some Internet sleuthing, I found out that it was a GPU rental company that used to build its own hardware, so this must have been some of its decommissioned hardware.

But the rails and sides didn’t look like anything I’d ever seen in a 4U setup. Without getting into the full story, I found out through trial and error that these things were meant to be mounted vertically, which was something new to me – though I eventually found technical documentation on their unique rack designs!

These were not merely rack mounted servers; in fact, if you look at the pictures on the eBay listing, you will see they have blade-style power connectors on the back. These machines were meant to be blade servers installed in custom enclosures and powered that way. This is why they were $500 and no one was buying them on eBay, despite the fact that their internals were worth WELL more than that:

  • CPU: 2x E5-2667
  • Motherboard: GA-7PESH4
  • RAM: 128 GB DDR3
  • PSU: 3x 1600W
  • GPU Riser Cables: 2x short, 2x long, all x16
  • PCIe Multiplexers: 2x custom PEX 8780-based cards, providing an additional 5x PCIe x16 slots each

While the CPUs immediately got thrown away and replaced by the highest-end Xeon CPUs that this board supported (a pair of E5-2697 CPUs for less than $100 total, 24c/48t combined), the motherboard, power supplies, and GPU cables were all well worth the price of admission. The PLX cards were probably worth at least $500 each alone – the PEX 8780 chip by itself is several hundred dollars.

A look at one of the original “installation” methods…

Curious, I bought one, and it ended up working fairly easily. Getting the GPUs into the custom carriers with the PLX cards was interesting, and I found the fascinating blog archive of Scott Ellis, a former Cirrascale engineer who wrote a lot about the architecture. He was kind enough to trade some emails with me and help me out with the BIOS settings to get it going.

These things were meant to have a ton of air blown through them at all times, which meant the CPUs and PLX cards were passively cooled and thus prone to overheating in a normal installation. For the PLX cards, we had to 3D print some shrouds to fit smaller fans on them (or just blow air from a box fan on them at all times), but for the CPUs, I custom drilled AM4 brackets to fit the Xeon CPUs using Corsair AIO water cooling blocks:

For the GPUs, I bought RTX 3090s from Facebook Marketplace and got them working without much fanfare using Ubuntu Server 22.04 LTS and various NVIDIA + CUDA install guides online. I then ran across someone selling decommissioned GPU rigs that had been used for crypto mining and secured another 12x RTX 3090s from him, along with open air rigs and 4x 1300W EVGA ATX-style PSUs.
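Once the driver is installed, a quick sanity check is to confirm that every card shows up. The following is a minimal sketch, not the exact procedure used on cogito: it assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output. The helper names (`parse_gpu_csv`, `list_gpus`) and the `SAMPLE` text are illustrative, not captured from the real machine.

```python
import subprocess

QUERY_FIELDS = "index,name,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Each line looks like: "0, NVIDIA GeForce RTX 3090, 24576 MiB"
        index, name, memory = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def list_gpus():
    """Run nvidia-smi and return the parsed list (needs the NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


# Illustrative sample text only; on a real machine call list_gpus() instead.
SAMPLE = """\
0, NVIDIA GeForce RTX 3090, 24576 MiB
1, NVIDIA GeForce RTX 3090, 24576 MiB
"""

if __name__ == "__main__":
    for gpu in parse_gpu_csv(SAMPLE):
        print(gpu["index"], gpu["name"], gpu["memory"])
```

If the number of entries returned by `list_gpus()` is lower than the number of cards physically installed, the usual suspects are a riser, a power connector, or a BIOS setting rather than CUDA itself.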

By this point I had bought the eBay seller’s entire stock, since it was such a good deal on parts. Combining the mining rig parts with a gutted Cirrascale machine, I was able to build this setup: 8x RTX 3090s in an open air rig, with 2x 1300W EVGA PSUs and 1x 2400W Delta crypto PSU (with a breakout board from Parallel Miner) to power many of the GPUs.

(You link the PSUs together using 24-pin short blocks to make each PSU think it is plugged into an ATX board, or you enable ACPI “last state” to always be on and use “power strips” for 208V. There are many ways to do this that Google can help with.)
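To get a feel for why the PSU count and the 208V question matter, here is a rough power-budget sketch. The PSU ratings come from the build above; the 350W per-card figure is NVIDIA's published board power for the RTX 3090 and is an assumption about stock power limits. It deliberately ignores the CPUs, PSU efficiency losses, and transient spikes, so treat the numbers as a lower bound and not as a measurement.

```python
GPU_COUNT = 8
GPU_BOARD_POWER_W = 350              # assumption: stock RTX 3090 power limit
PSU_RATINGS_W = [1300, 1300, 2400]   # 2x EVGA + 1x Delta, as listed in the build


def gpu_load_watts(count, watts_each):
    """Sustained GPU load if every card runs at its power limit."""
    return count * watts_each


def amps_at(watts, volts):
    """Current drawn for a given load and supply voltage (I = P / V)."""
    return watts / volts


if __name__ == "__main__":
    load = gpu_load_watts(GPU_COUNT, GPU_BOARD_POWER_W)
    capacity = sum(PSU_RATINGS_W)
    print(f"GPU load:     {load} W")
    print(f"PSU capacity: {capacity} W (headroom {capacity - load} W)")
    for volts in (120, 208):
        print(f"At {volts}V: {amps_at(load, volts):.1f} A for the GPUs alone")
```

With these assumptions the GPUs alone account for 2800W against 5000W of PSU capacity, and that load works out to about 23.3A at 120V versus about 13.5A at 208V, which is the practical argument for higher-voltage circuits.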

Note that it runs next to one of the Cirrascale machines still intact!

Here’s another picture of the build – it involves plywood, horse stall mats, and much more nonsense:

I’ve done some limited testing on it to figure out how to use NVLink bridges and what the limitations of the PLX cards are. I post the results to my Twitter, @drivelinekyle, so follow me there for more info. (I cannot stress enough how much you should read Scott Ellis’ blog that I linked above; it has a lot of worthwhile technical writeups.)

Can you spot the NVLinked pair of cards?

Lest you think these are the only good deals available on eBay and Reddit, I recently bought three more 2U rackmounted ASUS ESC4000 G4 machines for ~$550 each for on-site biomechanical modeling. Each has dual Intel Xeon Gold 6138 CPUs (40c/80t), 96 GB of DDR4 RAM, dual 10GbE ports, 2x 1600W PSUs, and GPU carriers that will fit one RTX 3090 each. They’ll run on both 120V (less efficient) and 208V (ideal): 208V is best for these machines, but you can find PSUs that will run on 120V.

The noise, however… can’t help you there.

As a parting note, these servers with dual CPU packages tend to have a ton of PCIe lanes, which is what you need if you want to run 4+ GPUs on your system. An x16 slot uses 16 PCIe lanes, so do the math on how many available lanes you need to run 8 GPUs…
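That lane math can be sketched in a few lines. The 40 lanes per CPU used here is an assumption based on Intel's spec for this generation of Xeon E5, so check your own CPU and board manual, since not every CPU lane is wired to a physical slot. The `oversubscription` helper and the "4 GPUs behind one switch" arrangement are hypothetical illustrations of what a PLX-style switch does, not a description of how cogito is actually wired.

```python
LANES_PER_X16_SLOT = 16


def lanes_needed(gpu_count, lanes_per_gpu=LANES_PER_X16_SLOT):
    """Lanes required to give every GPU its own full-width link."""
    return gpu_count * lanes_per_gpu


def cpu_lanes(sockets, lanes_per_cpu):
    """Total lanes the CPUs expose (boards rarely route all of them to slots)."""
    return sockets * lanes_per_cpu


def max_direct_x16(total_lanes):
    """How many full x16 devices the available lanes could host directly."""
    return total_lanes // LANES_PER_X16_SLOT


def oversubscription(gpus_behind_switch, uplink_lanes=LANES_PER_X16_SLOT):
    """Ratio of downstream GPU lanes to the switch's uplink lanes."""
    return gpus_behind_switch * LANES_PER_X16_SLOT / uplink_lanes


if __name__ == "__main__":
    needed = lanes_needed(8)
    available = cpu_lanes(sockets=2, lanes_per_cpu=40)  # assumed per-CPU figure
    print(f"8 GPUs at x16 need {needed} lanes; the CPUs offer {available}")
    print(f"Direct x16 devices possible: {max_direct_x16(available)}")
    # Hypothetical layout: 4 GPUs sharing one switch with an x16 uplink.
    print(f"Oversubscription with 4 GPUs per switch: {oversubscription(4)}:1")
```

Under these assumptions, 8 GPUs at full width would want 128 lanes against 80 from the CPUs, which is the gap PCIe switch cards fill: GPUs behind a switch can talk to each other at full speed but share the uplink to the host.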

In conclusion, there’s opportunity out there – just get familiar with server parts. Hammer ChatGPT with your questions. Learning more about computer architecture, PCIe lanes, and the fundamentals of computing will go a long way in reasoning from first principles rather than following what all the gamers buy.

And if you can’t afford RTX 3090s and think that this was a rich man’s build because of it – you should know that I first bought below-market Tesla P40s for $100 each, hacked small fans into their shrouds, and used those instead. There are plenty of accelerator cards on the market no one is looking at… be resourceful!

2 thoughts on “Building Unorthodox Deep Learning GPU Machines”

  1. Someone had the same idea as me!
    I bought one of these servers about 6 months ago for the two PCIe switch daughter boards, but had to shelve this while I’m in the process of buying and moving into a new house.
    I plan on getting a proper case for front-to-back airflow made up by Protocase. When I get a design created, I’ll pass it along if you want. It might end up being 5U and using ATX power supplies for the noise, though, as I don’t have a rack room.

    Also, mine were actually PEX 8796s. Are you sure you have the 8780 and not the 8796?
    Do you know anything about programming these switches? I haven’t found many resources online that really explain it well either, and I want to make sure the remaining x16 slot bifurcates to 4×4 automatically, but as far as I can tell that’s entirely down to how it’s been programmed.

    I plan on putting 4x Optane SSDs into the remaining 4 slots / 16 lanes and using GPUDirect Storage (GDS) / the CUDA cuFile API / DeepSpeed, and either assigning each SSD to a card as virtual memory expansion or just minimising the impact of memory limitations by splitting the model across them, so loading weights is more efficient by not hitting the host.
    I hope that has similar results to this: https://github.com/ggerganov/llama.cpp/discussions/638
    And allows me to use the sparse nature of models to run and train LARGE models or many models in parallel without the ridiculous GPU memory required to run and train them normally.

    Please don’t buy all the Optane up before I finish moving 😉
