In January, one of the storage controllers of the DDN system malfunctioned. While the system continued to operate normally, we had to schedule a replacement fairly quickly: if the backup controller had also failed before the replacement, it would have caused an unplanned break and posed a serious risk to the data. To ensure the safety of the replacement, we decided to do it during a maintenance break, scheduled for Tuesday, February 9th.
DDN storage controllers can be replaced live. They operate as a pair called a 'couplet' in DDN lingo. Each half of the couplet acts as the primary controller for half of the attached disk storage and as the backup controller for the other half.
Failover is completely seamless. They're really amazing beasts.
The controller replacement likely had nothing at all to do with this. This would have happened regardless of whether or not anything was replaced, by virtue of the shutdown for maintenance itself.
The DDN controller replacement went quite smoothly, and around 10 a.m. we were ready to bring the system back online. However, when restarting the Lustre filesystem, the metadata server reported anomalies in its filesystem and requested a filesystem check (fsck). These are fairly routine operations, especially when a filesystem has been up for a long time, and any problems the check finds are typically fixed automatically with no impact.
In this case, however, the tool could not fix all the problems it identified: a faulty inode persisted. Attempting to bring Lustre up resulted in a system crash (kernel panic), with this inode the most likely cause.
This is also where my heart dropped. They had lost the metadata. In a Lustre filesystem, the metadata is what describes the layout of the actual files on the object storage targets.
It seems the corruption wasn't necessarily in the underlying ldiskfs filesystem on the MDT, since they were still able to perform a file-level backup.
The impressive thing is how they managed to MacGyver together a 3 TB ramdisk to accelerate the data transfer.
As a workaround we created ramdisks on a number of Taito cluster compute nodes, mounted them via iSCSI over the high-speed InfiniBand network to a server and pooled them together to make a sufficiently large filesystem for our needs.
Initially this was considered somewhat of a long shot, but it paid off: the approach clearly outperformed the other experiments, copying the most difficult large directories in hours instead of weeks. Combined with running multiple copies in parallel, we were able to achieve well over 20k IOPS.
That is fucking genius! There is no other way to put that. Let this sink in for a second.
They took compute nodes from another cluster, created a bunch of ramdisks, mounted those disks on the MDS using ib_srp, and aggregated them into a single volume!
Give the guy who came up with that a fucking medal! I mean seriously, that is some weapons-grade lateral thinking.
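For my own future reference, the trick they describe could be sketched roughly like this. This is a hedged reconstruction, not CSC's actual commands: the ramdisk module options, device names, and volume group names are all my assumptions, and the SRP/iSCSI target export step is elided because the exact target configuration isn't in the report.

```shell
# On each donor compute node: carve out a ramdisk to export over
# InfiniBand. (brd's rd_size is in KiB; 256 GB per node here is a
# made-up figure for illustration.)
modprobe brd rd_nr=1 rd_size=$((256 * 1024 * 1024))
# ...then export /dev/ram0 as an SRP/iSER target over IB
# (target configuration omitted; site-specific).

# On the server doing the copy: log in to each node's target, then
# pool the imported block devices with LVM into one striped volume.
# /dev/sdb..sdd are hypothetical names for the imported ramdisks.
pvcreate /dev/sdb /dev/sdc /dev/sdd
vgcreate ramvg /dev/sdb /dev/sdc /dev/sdd
lvcreate -n ramlv -i 3 -l 100%FREE ramvg   # stripe across all three for IOPS
mkfs.ext4 /dev/ramvg/ramlv
mount /dev/ramvg/ramlv /mnt/scratch        # fast staging area for the file-level copy
```

Striping the LV across every imported ramdisk is what makes the small-file metadata workload fly: each node's memory bandwidth contributes, and the IB fabric keeps per-op latency low.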
I think I might play around with our Lustre testbed to try to implement an MDT snapshotting scheme via LVM. If I could snapshot the MDT, perform a file-level backup of the snapshotted target, and then destroy the snapshot, I'd be left with a map of the data on the filesystem: a DR artifact that's manageable in size.
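A first pass at that scheme might look like the following. This is a sketch of what I'd try on the testbed, not a tested procedure: the volume group and mount point names are hypothetical, it assumes the MDT already lives on an LVM logical volume, and mounting a snapshot as ldiskfs requires the Lustre ldiskfs modules on the backup host.

```shell
# Take a copy-on-write snapshot of the MDT volume; the -L size only
# needs to absorb changes made while the backup runs.
lvcreate -s -n mdt_snap -L 10G /dev/mdtvg/mdtlv

# Mount the snapshot read-only as plain ldiskfs (not as Lustre).
mkdir -p /mnt/mdt_snap
mount -t ldiskfs -o ro /dev/mdtvg/mdt_snap /mnt/mdt_snap

# File-level backup. The --xattrs flag matters: Lustre stores the
# file-to-OST-object mapping in extended attributes, and a backup
# without them is useless for restore.
tar --xattrs -czf /backup/mdt-$(date +%F).tar.gz -C /mnt/mdt_snap .

# Tear down the snapshot so it stops accumulating CoW overhead.
umount /mnt/mdt_snap
lvremove -f /dev/mdtvg/mdt_snap
```

The snapshot means the live MDT never has to pause, and the resulting tarball is exactly the "map of the data" I'd want sitting in the wings for DR.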
Aside from the downtime, this story has a happy ending. To my production-focused mind, the length of the downtime was unacceptable; the problem is that with systems of this scale, it can't really be avoided (unless they'd had a DR MDT backup waiting in the wings).
Impressive, though, nonetheless.