Linux RAID Reference

First off, software RAID under Linux is handled by the kernel's 'md' (multiple device) driver. 'mdadm' is the main userspace utility used to manage it.

Secondly, if you aren't too familiar with RAID concepts, you should consult http://www.lascon.co.uk/d008005.htm - it's the single best illustration of RAID, and particularly RAID-5, that I've ever seen.

Vocabulary

  • Degraded array: A disk array in which one or more member devices have failed, but the array is still operational. A standard RAID-5 array can survive the loss of any one disk, though it then runs in a degraded state. In typical configurations (mine included), losing an additional drive before the first one has been replaced and rebuilt will destroy all data in the array.

Showing all MD devices on a system

Use 'mdadm --detail --scan' to iterate over the active md devices on the system and report their status.
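
The output looks roughly like the following, depending on your mdadm version (the array parameters and UUID here are placeholders, not real values):

[root@ares ~]# mdadm --detail --scan
ARRAY /dev/md1 level=raid5 num-devices=4 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx

For a quicker glance (including rebuild progress), 'cat /proc/mdstat' shows the same arrays in a terser form.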

Starting an array which is degraded

By default, an array which is missing devices will not start. If your intention is to start your array in a degraded state, then the command line below will work:

[root@ares ~]# mdadm --manage /dev/md2 --run
mdadm: started array /dev/md/2

Troubleshooting Lost but Uncorrupted Devices

When I upgraded to Fedora Core 12, my md devices didn't come up after reboot. fdisk confirmed that the drives were still there, with intact partition tables, yet the arrays couldn't be mounted or interacted with. After considerable tinkering, I discovered that /dev/sdc1 and friends weren't being created properly by udev, so (I presume) they weren't being associated with their arrays. If I fdisk'ed the drives and rewrote the partition tables, md could see the partitions on the freshly-rewritten drives. Likewise, if I ran partprobe (which forces a re-read of the partition tables), the md devices showed up correctly.

I experimented a bit with creating udev rules, without success. While I'm sure there's something in udev that would be the CORRECT answer, my hack that works was to edit /etc/rc.d/rc.sysinit and insert a call to /sbin/partprobe just above the mdadm line that scans and assembles the arrays. My Linux box now boots all the way and I have my data back. I'd hope someone has a better fix for this issue, but I offer this as an "it's stupid but it works" solution.
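
For illustration, the edit amounted to something like the following. The exact mdadm line varies by release, so treat the second line as approximate; the point is simply that partprobe runs first:

# excerpt from /etc/rc.d/rc.sysinit
/sbin/partprobe              # added: force a re-read of all partition tables
mdadm -As --auto=yes --run   # (approximate) existing line that scans and assembles arrays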
  • Checking partitions to infer array membership: use mdadm --examine /dev/device. This checks a partition or disk for md RAID superblock signatures.
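
For example (the device name below is just an illustration), this is how you'd ask a given partition whether it carries an md superblock:

[root@ares ~]# mdadm --examine /dev/sdc2

The output includes the RAID level, the array UUID, and this device's slot in the array; matching UUIDs across partitions tells you which ones belong together.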

Canaries in the Coal Mine

Over the years, I've lost count of how many friends I've seen suffer a catastrophic failure with data loss because they trusted their RAID implementation. RAID is trustworthy, provided that you promptly replace drives which fail. What isn't trustworthy is the assumption that you will know a drive died! Generally, RAID systems will run in a degraded state without a hiccup, trusting that you will check the status periodically, or that a pop-up message on the console will get noticed, and so on. The problem is that there's no guarantee the aforementioned status-checking is happening regularly, or even at all.

The best way to guarantee that you know about any and all drive failures is to create a small RAID-0 volume which will not survive the loss of any drive in your array. A filesystem mounted on this volume will crash your system when you lose a drive, which is (strangely enough) exactly what you want to have happen. This filesystem is the aforementioned canary in the coal mine. The entire point of its existence is to take down your system so that you are guaranteed to know that your primary array is degraded and in need of attention. You'll know because, when restarting your system, startup will fail due to the downed RAID volume.
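A minimal sketch of the canary, assuming a small first partition reserved on each of four drives (device names and the mount point are illustrative):

[root@ares ~]# mdadm --create /dev/md0 --level=raid0 --raid-disks=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
[root@ares ~]# mkfs.ext3 /dev/md0
[root@ares ~]# mkdir /canary
[root@ares ~]# echo '/dev/md0  /canary  ext3  defaults  1 2' >> /etc/fstab

The non-zero fsck pass number in the fstab entry matters: with it, boot will stop and complain when /dev/md0 can't be checked and mounted, which is exactly the alarm you're after.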

Spare me the "But I'll know immediately because my trustworthy system will send me an email, or I'll hear a clunking sound, or... something!" No: your best chance of knowing about a drive failure is a system designed to crash (in an easily recoverable way) as soon as a drive is lost.

Another consideration: many hardware RAID systems use their own proprietary on-disk format. If your hardware controller, be it motherboard or RAID card, dies and you can't secure a compatible replacement, you'll be stuck with a perfectly intact array of drives that you cannot recover! A truly redundant system cannot afford any single point of non-recoverable failure, and the controller is one which entirely too many people overlook.

While one redundancy solution is to purchase two controllers (cards, motherboards, or whatever) and keep the second as a spare, I prefer to sacrifice a bit of hardware performance to achieve complete recoverability. Linux's software RAID is more or less interface-agnostic, which makes it perfect for this application. Because it operates on block devices, you can mix and match IDE and SATA, or even spin up an array of drives sitting in USB enclosures. RAID'ing a bunch of USB enclosures together may be silly, but if it lets you recover your data, it's not that silly.

Building an Array

First, you must create partitions on the drives out of which you're going to assemble the array. The partition type ID you need is 'fd', which shows up as 'Linux raid autodetect' when displaying the partitions in fdisk. You should strive to make the partitions the same size, but it isn't strictly necessary: in a RAID-0 array the mismatch won't matter, while in a RAID-5 array each member contributes only as much space as the smallest partition, so the usable capacity is (number of members - 1) times the smallest partition's size. RAID-1 likewise gives you the size of the smallest member.
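
As a sketch, the fdisk session for each drive boils down to this sequence at the fdisk prompt (drive name illustrative):

[root@ares ~]# fdisk /dev/sdb
   n    (create a new partition, accepting or adjusting the size prompts)
   t    (change the partition's type ID)
   fd   ('Linux raid autodetect')
   w    (write the table and exit)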

[root@ares ~]# mdadm --create /dev/md1 --raid-disks=4 --level=raid5 /dev/sdb2 /dev/sdc2 /dev/sdd2 /dev/sde2

This creates the RAID device 'md1' at RAID level 5 (simple redundancy; it can survive the loss of any one drive), consisting of the second partition from each of the four named drives.

Stopping an Array

This is fairly simple:

[root@ares ~]# mdadm --stop /dev/md1

Checking a device's status

mdadm --detail /dev/md1: This shows you the status of the device md1. It's a virtual block device, which means you can actually 'mkfs' or 'fdisk' /dev/md1 directly. I'd be a prideful twerp if I didn't admit that yes, I do have to look this up (mdadm --help) whenever I meddle with MD/LVM.
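
The fields worth watching are 'State' and the per-device table at the bottom. Trimmed, and with illustrative values, a healthy four-disk array looks roughly like:

[root@ares ~]# mdadm --detail /dev/md1
        Raid Level : raid5
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0

'clean' or 'active' is what you want to see; 'degraded' means you're living on borrowed time.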

When I migrated my drives

A few months ago, one of my four RAID-5'ed 400GB hard drives died. My storage configuration, at the time, was to have one (relatively) small volume composed of 150GB from each of the four drives in the RAID-5 cluster. Then, I allocated the remaining 250GB of each drive in a RAID-0 configuration. This resulted in a ~450GB RAID-5 storage volume and a ~750GB RAID-0 storage volume. Everything worked out pretty well when I was doing a lot of DVD ripping/re-encoding, as the RAID-0 provided fast throughput for disk thrashing and the RAID-5 volume provided redundancy. As expected, when the drive failed I lost all the work area on the RAID-0 volume, and the RAID-5 volume was protected. However, my usage pattern has changed over time, and now I want a larger RAID-5 area.

Ultimately, I decided it was time to convert the whole thing into one RAID-5 volume, except for a small RAID-0 partition used strictly as a 'canary', as I wrote above. Because the only surviving array was the 450GB RAID-5 array, the transition to a single 1.1TB RAID-5 array was fairly easy, for reasons outlined below.

The steps of my migration were as follows:

  • I created an ext3 filesystem on my new SATA 500GB drive, /dev/sdb1. When I mounted it on /media/temp, I got 459G of usable space out of this, so life is good.
  • I then mv'ed /stash/* to /media/temp. This moved all the useful data off from the three surviving drives of my array, onto the one new 500GB drive.
  • I then repartitioned the three drives (sdc, sdd, and sde) with a minimally useful 5GB partition as the first entry, and allocated the remaining space to the second partition, which comes to about 395GB per drive. The ultimate usable space in the volume will be (395*3)GB, or roughly 1.1TB.
  • I assembled the new /dev/md1 volume as a degraded RAID-5 volume (because it had only three disks, it had no redundancy yet) and mounted it as an ext3 filesystem at the mount point '/stash'. The command I used was:

    mdadm --create /dev/md1 --raid-disks=4 --level=raid5 /dev/sdc2 /dev/sdd2 /dev/sde2 missing

    Note that, if you're playing with SATA, it's vitally important to check (via fdisk) that the drives you're twiddling with haven't been renumbered across reboots. If I weren't paranoid, I could easily have wiped out my just-backed-up data by trusting that /dev/sdb would remain the same drive it was last time. Also take note of the word 'missing' at the end of the device list. Leaving off 'missing' will cause mdadm to assume you meant to specify a fourth drive and simply forgot, and it won't build the array. Including it tells mdadm that you know what you want: a degraded four-device RAID-5 array built from three disk volumes. Functionally, at this point there is no difference between this setup and RAID-0. With the 'missing' placeholder, however, I can add a fourth volume into the array later; the array then remains degraded for a while as the fourth volume is striped with redundancy/checksum data.
  • mkfs.ext3 /dev/md1 - enough said.
  • Then, I mounted the 500GB drive and mv'ed the data back onto the newly-created, larger md array.
  • Finally, I repartitioned the 500GB drive. First, I created a 5GB partition, then added a partition of equivalent size to the 400GB drives' main data partitions, and lastly added a third, non-mirrored partition that uses the remaining 100GB of capacity. I'm not sure what I'll use this partition for, but if I lose and replace another 400GB drive, I can set up a small RAID-1 (mirrored) volume between the two larger drives. Originally I intended to use the larger drive as a boot volume, but I rather doubt that's going to work out.
  • mdadm --manage /dev/md1 --add /dev/sdb2 - here, I added the newly-created partition on the 500GB drive into the md1 disk array. Subsequently, running mdadm --detail /dev/md1 shows the status to be 'clean, degraded, recovering', and the fourth disk in the array (/dev/sdb2) is listed as 'spare rebuilding'. This process typically takes several hours, but at the end of it all you have a redundant RAID array again! As of this writing, it has taken 20 minutes to do 10% of the requisite disk thrashing to rebuild the redundancy data for an array consisting of four 395GB partitions. To check progress, look at the line titled 'Rebuild Status', just above the printout of the array's UUID (see the sketch after this list). Overall, the time isn't too horrible, especially when you consider that this is a live rebuild: the array remains usable (albeit not yet redundant) the whole time.
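
Two one-liners for watching that rebuild (output formats vary a bit across kernel and mdadm versions):

[root@ares ~]# cat /proc/mdstat
[root@ares ~]# watch -n 60 'mdadm --detail /dev/md1 | grep -i rebuild'

/proc/mdstat shows a progress bar with a completion percentage and a time estimate; the grep pulls out the same 'Rebuild Status' line mentioned above.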

Overall, I'm well aware that I'm doing a lot of things here that don't make sense in a production or mission-critical environment. But a significant part of my goal is to maintain and broaden my MD/LVM experience, and it's all being done on backed-up or non-critical data in the first place.

-- SeanNewton - 08 Jan 2008
