Lessons learned from a very large server deployment

It’s that time of year again for us: the time of year when we deploy massive numbers of servers in a very short period of time. This year we deployed around 1000 servers to 23 locations. That’s a lot of work to get done in two weeks, but we managed to do it. There were things that went very well, things that merely went OK, and things that were terrible.

We’re on our third generation (well, third generation since I’ve been here) OS installation system. I’ve talked about how this works in other posts, but the short version is that without iPXE, none of this would have been possible. We ended up having a central image server per location, which made all the installs speedy (though we did manage to peg the install server’s gigabit Ethernet link multiple times). This also isolated us from any transient network issues that may have occurred. This entire system worked wonderfully. For the vast majority of servers, we didn’t have to do anything other than turn them on. The entire process from bare metal to fully configured machine was automated, and all the time spent refining that process was well spent.
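
To make that concrete, here’s a minimal sketch (in Python, and very much not our actual code) of the kind of thing iPXE enables: each machine’s firmware chainloads a script over HTTP, and that script points it at the local image server for its kernel, initrd, and automated-install config. The hostnames, paths, and kernel arguments below are hypothetical.

```python
# Minimal sketch of a per-location iPXE boot service, using only the
# Python standard library. All hostnames, paths, and kernel arguments
# are hypothetical placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

IMAGE_SERVER = "http://images.local.example"  # the per-location image server

# iPXE script template: fetch the installer kernel/initrd from the local
# image server and hand the installer an automated-install config URL
# keyed by the machine's MAC address.
IPXE_SCRIPT = """#!ipxe
kernel {mirror}/installer/vmlinuz auto=true url={mirror}/configs/${{net0/mac}}.cfg
initrd {mirror}/installer/initrd.img
boot
"""

class BootHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every machine that chainloads to this URL gets pointed at the
        # local image server; iPXE expands ${net0/mac} on the client side.
        body = IPXE_SCRIPT.format(mirror=IMAGE_SERVER).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), BootHandler).serve_forever()
```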

More recently than I would have liked, we’ve also developed similar systems to automatically configure all the IPMI controllers and to manage them all centrally. Having a ‘View KVM’ button within our server management system made troubleshooting broken installs that much simpler. There was no tracking down which IPMI controller went to which machine, and no need to remember to firewall off all the IPMI controllers from the internet. Having a list of all the new machines in a location, along with their status (configured or not), made confirming that the physical installs were done simple.
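
For flavor, here’s a rough sketch of the two halves of that IPMI automation, using ipmitool: configuring a controller in-band from the freshly installed host, and polling it out-of-band so the management system can mark it configured or not. The channel number, addresses, and credentials are placeholders, and our real tooling also handles users, firewalling, and the KVM launch.

```python
# Sketch of IPMI automation in two parts: give the local BMC a static
# address from the host itself, then poll it over the network so an
# inventory system can record its status. All addresses, the LAN channel
# number, and the credentials here are placeholders.
import subprocess

def configure_local_bmc(ip: str, netmask: str, gateway: str) -> None:
    """Run on the server itself (in-band) to give its BMC a static address."""
    for args in (
        ["lan", "set", "1", "ipsrc", "static"],
        ["lan", "set", "1", "ipaddr", ip],
        ["lan", "set", "1", "netmask", netmask],
        ["lan", "set", "1", "defgw", "ipaddr", gateway],
    ):
        subprocess.run(["ipmitool"] + args, check=True)

def bmc_is_reachable(ip: str, user: str, password: str) -> bool:
    """Poll a BMC out-of-band; True means it answers IPMI-over-LAN."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", ip, "-U", user, "-P", password,
         "chassis", "power", "status"],
        capture_output=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    # e.g. mark each new machine configured/unconfigured in inventory
    print(bmc_is_reachable("10.20.0.15", "admin", "example-password"))
```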

While we use local mirrors for the initial OS install, we were still relying on one central mirror for our other software. This ended up being a source of delays (23 locations, remember, and the vast majority of them were not close enough to get great download speeds), as well as causing some failures. Next year we’ll definitely have to distribute this content globally. I’ll probably end up investigating geo-replicated file systems to make this a more transparent process.
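
Until the content is properly geo-replicated, one stopgap (not something we run today, just a sketch) would be to have each location probe a handful of candidate mirrors and use whichever answers fastest. The mirror URLs below are hypothetical.

```python
# Rough sketch of picking the closest-feeling mirror from a location:
# probe each candidate and keep whichever responds fastest. The mirror
# URLs are hypothetical placeholders.
import time
import urllib.request

CANDIDATE_MIRRORS = [
    "http://mirror-us-east.example/packages/",
    "http://mirror-eu-west.example/packages/",
    "http://mirror-ap-south.example/packages/",
]

def fastest_mirror(urls, timeout=5.0):
    """Return the URL that responds quickest, or None if none respond."""
    best_url, best_time = None, None
    for url in urls:
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=timeout).read(1024)
        except OSError:
            continue  # unreachable mirror; skip it
        elapsed = time.monotonic() - start
        if best_time is None or elapsed < best_time:
            best_url, best_time = url, elapsed
    return best_url

if __name__ == "__main__":
    print(fastest_mirror(CANDIDATE_MIRRORS))
```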

Speaking of other software, we were surprised mid-deployment by a Cygwin update that broke sshd. We rely heavily on sshd for management and configuration, so this led to a panicked debugging session. We ended up rolling back to the previous Cygwin release by using the ‘Cygwin Time Machine’. The lesson here is to freeze all your software dependencies before starting a large rollout. This only ended up breaking around 50 machines, so it wasn’t a ton of wasted work.
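
The concrete version of that lesson: capture a manifest of package versions before the rollout starts, and have each machine flag any drift from it. Below is a minimal sketch of such a check; the manifest path is made up, and the use of cygcheck -c -d (which lists installed Cygwin packages and versions) is an assumption, since any command that prints ‘name version’ lines would do.

```python
# Minimal sketch of a pre-rollout version-freeze check: compare what is
# installed on this machine against a manifest captured before the
# deployment started. The manifest path is made up, and "cygcheck -c -d"
# (listing installed Cygwin packages and versions) is an assumption;
# substitute whatever lists your installed packages.
import subprocess
import sys

MANIFEST = "frozen-packages.txt"  # "name version" per line, captured pre-rollout

def installed_packages() -> dict:
    out = subprocess.run(["cygcheck", "-c", "-d"],
                         capture_output=True, text=True, check=True).stdout
    pkgs = {}
    for line in out.splitlines():
        parts = line.split()
        # Keep only "name version" rows, skipping the header lines.
        if len(parts) == 2 and parts[0] != "Package":
            pkgs[parts[0]] = parts[1]
    return pkgs

def frozen_packages() -> dict:
    with open(MANIFEST) as f:
        return dict(line.split()[:2] for line in f if line.strip())

if __name__ == "__main__":
    installed = installed_packages()
    drift = {name: (want, installed.get(name))
             for name, want in frozen_packages().items()
             if installed.get(name) != want}
    for name, (want, have) in sorted(drift.items()):
        print(f"DRIFT: {name} expected {want}, found {have}")
    sys.exit(1 if drift else 0)
```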

Unsurprisingly, the other big source of issues was the physical hardware itself. Most of our deployment was Supermicro. Supermicro (at least from our reseller) doesn’t seem to believe in issuing unique serial numbers for their hardware. Dell, for example, has the service tag prominently displayed on the outside of the machine. This means you can retrieve it from the OS and tell remote hands to look for it. Or you can retrieve a list of all the known ones and tell remote hands to go investigate any missing ones. With Supermicro, we ended up tracking all of this by MAC address. We keep track of the MAC addresses of all the servers, and use our networking gear to tell us which MAC is on which switch port. We can then tell which switch ports the broken machines are on, and have remote hands trace the cables. This process has its uses, but it’s not great for this kind of setup. I’m not certain how we can fix this in the future; it still requires some thought.
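
For the curious, here’s roughly what that MAC tracing looks like, assuming the switches expose the standard BRIDGE-MIB forwarding table over SNMPv2c. The community string, switch names, and example MAC are placeholders, and the bridge port number this returns still has to be translated to a physical port label before remote hands can use it.

```python
# Rough sketch of MAC-to-switch-port tracing via SNMP's BRIDGE-MIB
# (dot1dTpFdbPort). Assumes the switches answer SNMPv2c queries for that
# table; community string, switch names, and the example MAC are
# placeholders, and the bridge port still needs mapping to an interface
# name (dot1dBasePortIfIndex) before it means anything to remote hands.
import subprocess

DOT1D_TP_FDB_PORT = "1.3.6.1.2.1.17.4.3.1.2"

def mac_table(switch: str, community: str = "public") -> dict:
    """Return {mac_address: bridge_port} as learned by the given switch."""
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", community, "-On", switch, DOT1D_TP_FDB_PORT],
        capture_output=True, text=True, check=True,
    ).stdout
    table = {}
    for line in out.splitlines():
        oid, _, value = line.partition(" = INTEGER: ")
        if not value:
            continue
        # The last six OID components are the MAC address, in decimal.
        octets = oid.strip().split(".")[-6:]
        mac = ":".join(f"{int(o):02x}" for o in octets)
        table[mac] = int(value)
    return table

def find_port(mac: str, switches: list) -> tuple:
    """Locate a (possibly broken) machine's MAC on one of the switches."""
    for switch in switches:
        port = mac_table(switch).get(mac.lower())
        if port is not None:
            return switch, port
    return None, None

if __name__ == "__main__":
    print(find_port("00:25:90:01:02:03", ["switch1.example", "switch2.example"]))
```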

This has been the quickest and easiest deployment we’ve done so far. It’s not as quick and easy as I’d like, but we’re definitely moving in the right direction. We’ve come a long way from manually imaging, configuring, and shipping boxes. The only way forward is more automation!