Why you should be careful managing your RAID

Shifting gears a bit from the normal fare, here!

I have computers at home. A few of them. There’s my desktop, which is an HP Pavilion case, but the HP motherboard died, so I swapped in a new one, so it really isn’t an HP Pavilion anymore. My girlfriend has a Lenovo mini-tower that’s probably 12 years old, but still gets the job done. My daughter has a Chromebook. There’s an old laptop hooked up to the TV in the basement, and then there’s my laptop.

But at the center of it all is the big boy in the basement. And by big boy, I mean it has a really big case. It’s not that terribly powerful. My desktop has the same CPU and amount of memory in it.

But the big boy is the server for the house. It runs Linux, boots off of a 300GB laptop drive, and had four 1TB drives configured in a RAID10. It runs KVM, the Linux virtualization system, and I have a few VMs I run on it to do things.

There’s a VM that houses all my MP3s and video files, and runs Plex Media Server so I can get to all my stuff from anywhere in the world. Then there are a few other utility partitions.

But this story isn’t about the cool software, it’s way down at the bottom of the stack, where the disks are. See, I picked RAID10 not for the screaming performance, but because it provided protection against up to half of my disks failing, as long as the “right” ones fail. Sure, RAID1 would have gotten me that, but RAID10 has a significant performance boost.
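To put a number on that protection, here is a minimal sketch that counts which two-disk failures a four-disk RAID10 survives. It assumes the four disks form two mirrored pairs (disks 0 and 1, disks 2 and 3) that are striped together; the pairing and the names are mine, for illustration, not pulled from the actual array.

```python
from itertools import combinations

# Assumption: four disks arranged as two mirrored pairs, striped together.
MIRROR_PAIRS = [frozenset({0, 1}), frozenset({2, 3})]


def raid10_survives(failed):
    """The array survives as long as no mirror pair has lost both members."""
    failed = set(failed)
    return not any(pair <= failed for pair in MIRROR_PAIRS)


# Every possible way to lose two of the four disks.
two_disk_failures = list(combinations(range(4), 2))
survivable = [f for f in two_disk_failures if raid10_survives(f)]

print(f"{len(survivable)} of {len(two_disk_failures)} two-disk failures are survivable")
```

Any single failure is fine, but a second failure is only safe if it lands in the other mirror pair.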

It’s worked well. But fast forward to today. I was running a two-disk mirror in my desktop. The other day, I replaced those with a single SSD, which made a huge difference in performance. Now having two extra disks, I figured it would be a good idea to slap them in the server and get some more space.

So, I shut the server down, and installed them. Booted up, and found them. I repartitioned them for the RAID software, did an OS update, and rebooted it again.

Then I went to add them to the RAID. No big deal, but they added in as spares. When I tried to grow the array onto the spares, it complained. The RAID10 wasn’t expandable! Crap!

But wait! I was only using 880GB of the 2TB available, which means I had enough space to make a new RAID1 with the two new drives, migrate all my data, then rebuild the other four disks as a RAID6.
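The back-of-the-napkin math behind that plan can be sketched like this. The function is a simplified model that assumes equal-sized 1TB disks, a RAID10 that keeps two copies of everything, and no metadata overhead, so real numbers will come out a little lower.

```python
def usable_tb(level, disks, disk_tb=1.0):
    """Usable capacity for equal-sized disks (ignores metadata overhead)."""
    if level == "raid1":
        return disk_tb                  # every disk holds a full copy
    if level == "raid10":
        return disks * disk_tb / 2      # two copies of everything
    if level == "raid6":
        return (disks - 2) * disk_tb    # two disks' worth of parity
    raise ValueError(f"unsupported level: {level}")


USED_TB = 0.88  # the 880GB in use

# Each stop the data makes along the way.
plan = [("raid10", 4), ("raid1", 2), ("raid6", 4), ("raid6", 6)]
for level, disks in plan:
    cap = usable_tb(level, disks)
    print(f"{level:6} x{disks}: {cap:.1f} TB usable, fits data: {USED_TB <= cap}")
```

The tight spot is the two-disk RAID1, which only offers about 1TB, so the plan works because the data in use stays under that.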

Then flip the data back over, and join the two new disks to the RAID6! Easy! Double disk failure protection maintained, data expansion achieved.

Except I got careless. After a reboot, the drives re-ordered themselves, and when I went to remove the spares, I accidentally removed one of the disks with data on it. The array instantly started rebuilding onto a spare. No biggie. Zero interruption, zero data loss.
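One way to avoid that mistake is to check which members are spares right before removing anything, since the sdX names can change between boots. Here is a rough sketch that picks the spares out of a /proc/mdstat-style line, where spares carry an (S) flag. The sample line is made up for illustration and is not the output from my server.

```python
import re

# Illustrative /proc/mdstat-style line; not the actual output from this server.
SAMPLE = "md0 : active raid10 sdf1[5](S) sde1[4](S) sdd1[3] sdc1[2] sdb1[1] sda1[0]"

# Matches entries such as "sdd1[3]" or "sdf1[5](S)".
MEMBER = re.compile(r"(\w+)\[(\d+)\](\([A-Z]\))?")


def classify_members(mdstat_line):
    """Split member devices into spares and everything else.

    Anything without the (S) flag is treated as active here; other
    flags, such as (F) for faulty, are not handled separately.
    """
    active, spares = [], []
    for name, _slot, flag in MEMBER.findall(mdstat_line):
        (spares if flag == "(S)" else active).append(name)
    return sorted(active), sorted(spares)


active, spares = classify_members(SAMPLE)
print("active:", active)
print("spares:", spares)
```

Only the devices in the spares list are safe to pull without kicking off a rebuild.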

MDADM rebuilding my array

But, this rebuild took about five hours, so I went off and did other stuff.

Once it was done rebuilding, I issued a pvmove to get everything from the first RAID10 to the new RAID1:

The PVmove has started

Now, the beauty of all of this is, after the initial reboot where I confused myself and pulled the wrong drive, the system (and the VMs running from disk images stored on this array) all stayed up. I could stream Infinity War no problemo.

Streaming movies off my media server while it moves itself.

So, once the data was moved onto my fancy new mirror, I broke down the original RAID10 and rebuilt it as a RAID6.

New RAID6

And now we move it all again. A pvmove from md1 to md0 ensued. Again, all the apps stayed up. Which is good, because I kicked off the data move before the RAID6 had fully initialized. So it took forever. Nearly 24 hours. Oops.

But it finished. Once it did, I tore down md1 and added the disks to md0. Voila! A six-disk RAID6 array with 3.7 terabytes of usable space!

So, you may ask, why go through all this trouble?

Because two reasons:

  1. My luck is bad. The RAID10 with four drives provided protection from at least one disk failure, and two disks if the “right” disks died. RAID6 protects me against a double disk failure under all conditions. Depending on used capacity, I can lose two drives, have it rebalance down onto the four remaining, and then lose ANOTHER drive.
  2. Expansion. The RAID10 couldn’t be grown. This can. So in the future, I can replace these 1TB disks with larger ones. As long as I do it one disk at a time, I keep the ability to lose a drive while it incorporates the new disk, and once all six have been replaced I can grow the RAID into the additional space.
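The reason the extra space only shows up after the last swap is that a RAID6 can use only as much of each disk as its smallest member offers. A quick sketch of that, assuming hypothetical 2TB replacements (I haven't picked a size yet) and ignoring metadata overhead:

```python
def raid6_usable_tb(disk_sizes_tb):
    """RAID6 uses only as much of each disk as the smallest member offers."""
    return (len(disk_sizes_tb) - 2) * min(disk_sizes_tb)


disks = [1.0] * 6      # the six 1TB drives in the array today
NEW_SIZE_TB = 2.0      # hypothetical replacement size

# Swap the disks out one at a time and see what the array could use.
for i in range(len(disks)):
    disks[i] = NEW_SIZE_TB
    print(f"after replacing {i + 1} disk(s): {raid6_usable_tb(disks):.1f} TB usable")
```

The number stays at 4TB through the first five swaps and only jumps once the sixth disk is in, and even then the array still has to be told to grow into it.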

So there you have it, the answer to the question nobody asks: “What does Andrew do when he’s not working on that car?”

Except I worked on the car, too. Check out my Instagram for photos of the polish and wax job I pulled off in the garage.
