Mdadm


Overview

Several physical disks (/dev/sdX) or partitions (/dev/sdX1) of equal size are joined into a single array.

Creating a RAID array

  • (Recommended) Create a partition on each disk. Note:
    • Use optimal alignment, with "-a optimal" (this doesn't appear to have any obvious effect on behaviour though!)
    • Use the "GPT" partition table format (to handle disks > 2TB)
    • Name the partition "primary" (note that this is free text)
    • Use 0% for partition start (this will normally mean that the partition start will be at the 1MB boundary, which gives optimal alignment)
    • End 100MB before the end of the disk (this is to allow for slight variances in exact size of similar disks)
    • Set partition type to raid (0xFD00); this is optional, but may encourage some tools to avoid writing directly to the disk (and avoid corrupting the array)
parted -a optimal -s /dev/sdX -- mklabel gpt mkpart primary 0% -100MB set 1 raid on


  • Create a RAID 5 array over 3 partitions:
    • Note, the default metadata version is now 1.2 for create commands
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdX1 /dev/sdY1 /dev/sdZ1
  • Wait (potentially several days) for the array to be built
  • Once built, save the current raid setup to /etc, to allow for automounting on startup:
diff -u <(cat /etc/mdadm/mdadm.conf) <(/usr/share/mdadm/mkconf)
cp /etc/mdadm/mdadm.conf /etc/mdadm/mdadm.conf.bak
/usr/share/mdadm/mkconf > /etc/mdadm/mdadm.conf
  • Update the initial boot image for all current kernel versions to include the new mdadm.conf:
update-initramfs -u
  • Start the array:
mdadm --assemble /dev/md0 /dev/sdX1 /dev/sdY1 /dev/sdZ1
  • From this point, just treat the array (/dev/md0) as a normal physical disk.
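While waiting for the build, progress can be read from /proc/mdstat rather than guessed. The following is a minimal sketch, assuming the usual progress-line layout (the same one shown in the `check = 99.9%` sample further down this page). The function name `md_progress` is made up for illustration and is not part of mdadm.

```shell
# md_progress: hypothetical helper, not part of mdadm.
# Reads mdstat-formatted text on stdin and prints "<action> <percent>"
# for each resync/recovery/reshape/check that is in progress.
md_progress() {
    sed -n 's/.*] *\([a-z]*\) *= *\([0-9.]*\)%.*/\1 \2/p'
}

# Usage: md_progress < /proc/mdstat
```

Because it reads stdin, the same function can be tried on saved mdstat output before being pointed at the live file.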

Convert RAID 1 array to RAID 5

  • Create partition on the new disk as for creating a new array
  • Add the new partition to the array:
mdadm --add /dev/md0 /dev/sdX1
  • Convert the array to RAID 5, with the correct number of devices:
mdadm --grow /dev/md0 --level=5 --raid-devices=3
  • Wait (potentially several days) for the array to be reshaped
  • Grow the partition / volume on /dev/md0
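To judge how much space the final grow step will make available: RAID 1 exposes the capacity of a single member, while RAID 5 exposes the capacity of all members but one. A small worked example in shell arithmetic follows; the function names and the 3000 GB member size are purely illustrative.

```shell
# Usable capacity, in the same unit as the member size.
raid1_usable() { echo "$1"; }                   # $1 = member size
raid5_usable() { echo $(( ($1 - 1) * $2 )); }   # $1 = members, $2 = member size

raid1_usable 3000      # two-disk RAID 1 built from 3000 GB members
raid5_usable 3 3000    # the same members plus one more, as RAID 5
```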

Readding a disk marked faulty

If a disk in the array has been marked faulty for a spurious reason, then to readd it and rebuild the array, you'll first need to remove it. Run:

mdadm /dev/md0 --remove /dev/sdX1
mdadm /dev/md0 --add /dev/sdX1
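To find which members are currently flagged faulty before removing them, note that /proc/mdstat marks such members with `(F)`. Below is a sketch of a filter for that marker. `md_faulty_members` is a hypothetical name, and device names are assumed to be plain lowercase alphanumerics such as sdX1.

```shell
# md_faulty_members: hypothetical helper, not part of mdadm.
# Reads mdstat-formatted text on stdin and prints one faulty member per line.
md_faulty_members() {
    grep '^md[0-9]* :' | tr ' ' '\n' |
        sed -n 's/^\([a-z0-9]*\)\[[0-9]*](F)$/\1/p'
}

# Usage: md_faulty_members < /proc/mdstat
```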

Fixing a disk with Current_Pending_Sector count > 0

If a disk in the array has a Current_Pending_Sector count > 0, this suggests one or more blocks on the disk couldn't be read. To fix this, the disk's contents need to be rewritten from the rest of the array, which forces the pending sectors to be reallocated. This entails removing the disk from the array, zeroing the superblock (to ensure it can't just be recovered from the bitmap) and then re-adding it.

mdadm /dev/md0 --fail /dev/sdX1
mdadm /dev/md0 --remove /dev/sdX1
mdadm --zero-superblock /dev/sdX1
mdadm /dev/md0 --add /dev/sdX1
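To compare the pending count before and after the rebuild, it can be extracted from the attribute table printed by `smartctl -A`. This is a sketch that assumes the usual ten-column attribute layout with the raw value in the last column; `pending_sectors` is a hypothetical helper name.

```shell
# pending_sectors: hypothetical helper.
# Reads "smartctl -A" output on stdin and prints the raw pending-sector count.
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}

# Usage: smartctl -A /dev/sdX | pending_sectors
```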

Recovering from disk failure

  • Check the disk status in mdadm:
mdadm --detail /dev/md0
  • If the disk is already marked as failed, then skip this step. Otherwise:
mdadm /dev/md0 --fail /dev/sdX1
  • From this point, the array will continue to operate in "degraded" mode
  • Remove the failed disk:
mdadm /dev/md0 --remove /dev/sdX1
  • To more easily determine the disk for physical removal from the machine (once powered off), note down the serial number as reported by:
hdparm -i /dev/sdX | grep SerialNo
  • Add a replacement disk:
mdadm /dev/md0 --add /dev/sdY1
  • Wait (potentially several days) for the array to be resynced
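Whether the array is still degraded (both after the failure and once the resync should have completed) can be checked from /proc/mdstat, where a missing member shows as `_` in the `[UU_U]`-style status. A sketch follows, with `md_is_degraded` as a made-up helper name.

```shell
# md_is_degraded: hypothetical helper.
# Reads mdstat-formatted text on stdin; exits 0 if any array has a missing member.
md_is_degraded() {
    grep -q '\[[U_]*_[U_]*]'
}

# Usage: md_is_degraded < /proc/mdstat && echo "degraded"
```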

Recover from a dirty reboot of a degraded array

If the server shuts down uncleanly (eg. due to a power cut) when the array is degraded, it will refuse to automatically assemble the array on startup (with a dmesg error of the form "cannot start dirty degraded array"). This is because the data may be in an inconsistent state. In this situation:

  • Check that the good disks have the same number of events. If the numbers differ slightly, that suggests some of the data being written when the server shut down wasn't written fully, and is probably corrupt (hopefully this will just mean a logfile with some bad characters, or similar).
mdadm --examine /dev/sdX1 /dev/sdY1 /dev/sdZ1 | grep Events
  • Assuming the number of events is the same (or very similar), forcibly assemble the array.
mdadm --assemble --force /dev/md0 /dev/sdX1 /dev/sdY1 /dev/sdZ1
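Comparing the event counters by eye is error-prone with several members, so the spread between them can be computed instead. This sketch assumes version-1.x metadata, where `mdadm --examine` prints `Events : <integer>` for each member; `events_spread` is a hypothetical helper.

```shell
# events_spread: hypothetical helper.
# Reads "mdadm --examine" output on stdin and prints the difference
# between the highest and lowest Events count (0 means all members agree).
events_spread() {
    awk '$1 == "Events" && $2 == ":" {
             v = $3 + 0
             if (n == 0 || v < min) min = v
             if (n == 0 || v > max) max = v
             n++
         }
         END { if (n) print max - min }'
}

# Usage: mdadm --examine /dev/sdX1 /dev/sdY1 /dev/sdZ1 | events_spread
```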

Repairing failing disk on degraded array

If the raid5 array is in a good state, then simply removing and readding the faulty drive should be sufficient. However, if the array is already degraded (ie. there’s no redundancy), or the disk problems became apparent when rebuilding the array from a spare drive, any bad sectors on the failing drive will need to be overwritten with new data (probably just zeros) before the disk is good enough to be able to rebuild the array.

  1. Getting information on failing/failed sectors:
    smartctl -a /dev/sdX | grep Pending
    smartctl -l xerror /dev/sdX
    
  2. Analyze/recover data from failing disk:
    1. Ideally, copy all good data to a recovery file sdX.bin, and record details of good/bad sectors in sdX.map (this needs sufficient free space for sdX.bin). This needs to be run when no partitions on the array are mounted:
      ddrescue --ask --verbose --binary-prefixes --idirect /dev/sdX sdX.bin sdX.map
      
    2. If insufficient free space, merely analyze the disk to scan for all failing sectors (force is needed to allow writing to /dev/null). This can be run when partitions are mounted, since we don’t actually care about the data we’re reading, we just care about the bad sectors:
      ddrescue --ask --verbose --binary-prefixes --idirect --force /dev/sdX /dev/null sdX.map
      

      Note that the sdX.map file is human readable, and will generally be quite small. It keeps track of which sectors are good and bad, and can be reused for subsequent ddrescue runs to avoid re-reading good sectors.

  3. Recheck the number of failing sectors, since some may not have been read yet when smartctl was last run:
    smartctl -a /dev/sdX | grep Pending
    
  4. Forcibly re-assemble the array (after checking the number of events mismatching between array members):
    1. If the number of events is wildly different, then it’s possible there will be corrupted data on the array, but in general if the array was marked as failed then no file writes will have been successful, so event discrepancies might not be reflective of a real problem:
      mdadm --examine /dev/sd[XYZ] | grep Events
      
    2. Reassemble the array, if it’s in a good state (note the disk order isn’t important – mdadm will work out the correct order):
      mdadm --assemble --verbose --run /dev/md0 /dev/sd[XYZ]
      
    3. If reassembly was unsuccessful due to mismatched event numbers, then forcibly reassemble it (be very careful here, disk order doesn’t matter but do make sure the correct disk labels are used – check output of the previous assemble to make sure it looks reasonable):
      mdadm --assemble --verbose --run --force /dev/md0 /dev/sd[XYZ]
      
    4. Remount any affected partitions, or restart the machine to remount all on startup
  5. For mdadm raid5 + lvm arrays, there’s no easy way to determine which files inhabit which bad sectors. Instead, we need to read all files by hand to determine which are unreadable. For each partition which includes space on the bad drive (xdev ensures no other mount points are included):
    find /mountpoint -type f -xdev -exec echo {} \; -exec md5sum {} \; 2>&1 | tee mountpoint-files.log
    

    Note: Ensure mountpoint-files.log is written somewhere outside of the array


  6. It’s possible that reading a bad file with md5sum above will again mark the array as failed. If so, reassemble the array using the steps above. Then look in the mountpoint-files.log file for the first failed md5sum (probably logged with “Input/output error”).
    1. Write random data over the bad file, which should force the pending sector to be marked bad and reallocated from spare space on the disk:
      shred -v /path/to/bad/file
      
    2. Check that the number of failing sectors has decreased:
      smartctl -a /dev/sdX | grep Pending
      
    3. Assuming the number of pending sectors has decreased, it’s then ok to delete the bad file:
      rm /path/to/bad/file
      
  7. Repeat md5sum scanning and file deletion until all mountpoints using the disk are free of bad files
  8. Back up the original map file, then rescan the bad sectors to see which have been fixed by deleting files, reusing the previous known state of the drive. Note that “-r 1” is needed, otherwise the bad sectors will be treated as known bad from the previous state and won’t be retried at all:
    cp sdX.map sdX.initial.map
    ddrescue --ask --verbose --binary-prefixes --idirect --force -r 1 /dev/sdX /dev/null sdX.map
    
  9. If any bad sectors remain, then they must be in free space on the drive.
    1. List out all the bad block addresses, based on the ddrescue state file (after backing up the map file):
      ddrescuelog --list-blocks=- sdX.map
      
    2. For each of the bad blocks, check with dd that we’ve got the right block IDs. For each one of these reads we expect to see an error (and “0+0 records in”):
      for block in `ddrescuelog --list-blocks=- sdX.map`
      do
        dd if=/dev/sdX of=/dev/null count=1 bs=512 skip=$block
      done
      
    3. For each of the bad blocks, write zeros over the block to force it to be reallocated from spare space on the drive. Be careful here – getting it wrong will destroy data! Also note that when reading, “skip” is used to position the input stream, but here “seek” is used to position the output stream:
      for block in `ddrescuelog --list-blocks=- sdX.map`
      do
        dd if=/dev/zero of=/dev/sdX count=1 bs=512 seek=$block
      done
      
    4. It’s possible that dd will fail to write to the block, in which case try again with hdparm:
      1. First check that we’ve got the right sectors (we expect to see “SG_IO: bad/missing sense data” for each sector on stderr, so we pipe stdout to /dev/null to avoid noise):
        for block in `ddrescuelog --list-blocks=- sdX.map`
        do
          hdparm --read-sector $block /dev/sdX > /dev/null
        done
        
      2. Assuming we’ve seen the expected errors, write zeros over each of the bad sectors. Be careful here – getting it wrong will destroy data! You may be asked to add a “--yes-i-know-what-i-am-doing” flag.
        for block in `ddrescuelog --list-blocks=- sdX.map`
        do
          hdparm --write-sector $block /dev/sdX
        done
        
  10. Check ddrescue is showing all data as readable (after backing up the map file again):
    cp sdX.map sdX.postshred.map
    ddrescue --ask --verbose --binary-prefixes --idirect --force -r 1 /dev/sdX /dev/null sdX.map
    
  11. Check smartctl is showing no pending sectors:
    smartctl -a /dev/sdX | grep Pending
    
  12. Readd spare drive and start rebuilding array redundancy:
    mdadm --add /dev/md0 /dev/sdW
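
Since the ddrescue map file is plain text, the number of bad sectors it records can be tallied between runs without ddrescuelog. This is a sketch that assumes the standard GNU ddrescue map layout (comment lines starting with `#`, one status line, then `pos size status` rows in hexadecimal, with `-` marking bad areas) and 512-byte sectors. `map_bad_sectors` is a hypothetical name.

```shell
# map_bad_sectors: hypothetical helper.
# Reads a ddrescue map file on stdin and prints the number of 512-byte
# sectors currently marked bad ("-").
map_bad_sectors() {
    total=0
    while read -r pos size status; do
        case $pos in
            '#'*|'') continue ;;    # skip comments and blank lines
        esac
        # The status line carries a status character in its second field,
        # so its third field is never "-" and it is ignored here.
        if [ "$status" = "-" ]; then
            total=$(( total + $size / 512 ))
        fi
    done
    echo "$total"
}

# Usage: map_bad_sectors < sdX.map
```

Running it on sdX.initial.map and then on the current sdX.map gives a quick before/after comparison for the shred and dd steps.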
    

Reducing the number of disks in a RAID 5 array (including LVM)

To reduce the number of disks in an array (so that one can be removed safely):

  • Firstly, ensure we've got a saved copy of the current PV mappings and 'mdadm --detail' somewhere (not on the machine). This will be useful if we need to recover from something having gone wrong
  • If we want to choose which disk is going to be removed (rather than mdadm deciding for us), we need to remove that drive from the RAID 5 array before we start (which will put it into 'degraded' mode):
 mdadm /dev/md0 --fail /dev/sdX1
 mdadm /dev/md0 --remove /dev/sdX1
  • Unmount the LVM logical volume we're going to take the space from:
 umount /dev/VG/LV
  • Shrink the LVM logical volume (and the ext4 filesystem that's on it) from which we're going to be reclaiming the space. Make sure that the reduction in size is larger than the size of the disk we're going to remove (in this case, a 3TB drive). Units are 1024-based, so we know that 3T will be enough (since 3TiB > 3TB):
 lvresize --verbose --resizefs -L -3T /dev/VG/LV

This step will take a LONG time (about 15 hours for me).

  • Check the PV mappings - it's likely that the free space we've created won't be at the end of the physical volume:
 pvdisplay -m /dev/md0
  • Assuming it's not at the end, we need to move the LV segments around to ensure all free space is at the end of the volume (you may need to do this more than once). Choose a segment that's after the free space, and move its extents into a similarly sized space at the beginning of the free area:
 pvmove --alloc anywhere /dev/md0:3538424-3576823 /dev/md0:2751992-2790391

This step also takes several hours.

  • Check that we have some free PEs, and calculate the new size of the physical volume with `PE Size * (Alloc PE + 1)` from:
 vgdisplay VG

I'm not sure why we need the `+ 1`, but without it the next step warns us that we are shrinking by one too many extents (`cannot resize to 2790391 extents as 2790392 are allocated`).
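
As a worked instance of that calculation, the figures quoted on this page are consistent with a PE size of 4096 KiB (4 MiB). That PE size is inferred from the numbers rather than stated anywhere above, so treat it as an assumption and substitute the value from your own vgdisplay output.

```shell
pe_size_kib=4096     # assumed: inferred from the sizes used in these steps
alloc_pe=2790392     # allocated extents, as quoted in the warning above
echo "$(( pe_size_kib * (alloc_pe + 1) ))K"
```

This prints 11429449728K. One plausible reason for the extra extent (an assumption, not verified here) is that the size given to pvresize covers the whole PV, including the LVM metadata area at its start, which Alloc PE does not count.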

  • Shrink the physical volume. Use KB here so that we can compare with the size of the mdadm array in the next step. pvresize will warn you that the requested size is less than the real size; check that the requested size matches the Alloc Size from vgdisplay.
 pvresize --setphysicalvolumesize 11429449728K /dev/md0
  • Check that we now have 0 Free PEs:
 vgdisplay VG
  • Check what the new size of the array will be, and ensure that it's larger than the size of the PV as set by `pvresize`. Here `5` is the new number of disks in the array. Also, ensure the backup file is not stored on the array itself:
 mdadm --grow /dev/md0 -n5 --backup-file /var/log/raid.backup

This step will fail to resize, but will give us the new array size we need from the next step.

  • Transiently resize the array (which will change the size reported to the OS until the next reboot). Use the size reported by the previous step, but ensure it's larger than the physical volume size we set with `pvresize`:
 mdadm --grow /dev/md0 --array-size 11720540160
  • At this point, you may want to run e2fsck on all the volumes in the group to make sure we haven't accidentally truncated any by shrinking the reported size of the array.
  • Assuming all is well, go ahead and rerun the command to reshape the array with new number of disks:
 mdadm --grow /dev/md0 -n5 --backup-file /var/log/raid.backup

This will take a REALLY LONG time (several days for me). Whilst this is running though, we can run the next few steps (all except adding the spare drive back to the array).

  • First, we can grow the PV to get back the space we used as a buffer to make sure we shrank the PV more than we shrank the array:
 pvresize /dev/md0
  • Now grow one of the LVs to use the free space we got back from the step above (it doesn't need to be the same one we took space from at the beginning, and can be done while the LV is mounted). Note the free extents from vgdisplay and use that as the input for lvextend:
 vgdisplay VG
 lvextend -l +71067 /dev/VG/LV
 resize2fs /dev/VG/LV
  • Check that we've no more free extents:
 vgdisplay VG
  • Remount any unmounted partitions, and restart any services that were shut down to allow for the unmounting
  • Once the reshape has finished, the array will probably still be in degraded mode with one spare drive, so add that back into the array:
 mdadm /dev/md0 --add /dev/sdY1

This step will probably also take a day or more.
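
The buffer left between the shrunken PV and the reshaped array accounts for the extent count given to lvextend. The check below uses the sizes from the steps above and again assumes 4096 KiB extents (inferred from those sizes, not stated explicitly).

```shell
array_kib=11720540160    # from mdadm --grow --array-size
pv_kib=11429449728       # from pvresize --setphysicalvolumesize
pe_size_kib=4096         # assumed extent size

# The array must never end up smaller than the PV sitting on it.
[ "$array_kib" -gt "$pv_kib" ] || echo "array would be smaller than the PV" >&2

# Extents regained once the PV is grown back to fill the array.
echo $(( (array_kib - pv_kib) / pe_size_kib ))
```

This prints 71067, matching the value passed to lvextend.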

Reducing recovery time after unclean shutdown

The default (at least when I set up mdadm) consistency policy is `resync`, which means a full resync is needed if the machine shuts down uncleanly (eg. due to power loss). To see the current consistency policy:

 mdadm --detail /dev/md0 | grep Consistency

If it's set to `resync` then recoveries will be slow; `bitmap` means recoveries will be fast. To set a different consistency policy (eg. an internal bitmap), run:

 mdadm --grow --bitmap=internal /dev/md0

Changing the consistency policy only takes a few seconds.
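
For scripted checks (for example across several arrays), the policy value alone can be pulled out of the `mdadm --detail` output. A sketch, assuming the `Consistency Policy : <value>` line format that the grep above relies on; `consistency_policy` is a hypothetical helper name.

```shell
# consistency_policy: hypothetical helper.
# Reads "mdadm --detail" output on stdin and prints just the policy name.
consistency_policy() {
    sed -n 's/^ *Consistency Policy : *\([a-z-]*\).*/\1/p'
}

# Usage: mdadm --detail /dev/md0 | consistency_policy
```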

Cancel a hanging md check

Sometimes the monthly consistency check will hang. This can be seen with output like (note the finish and speed):

 md0 : active raid5 sdh1[5] sdi1[4] sdg1[1] sdd1[0]
     8790107136 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
     [===================>.]  check = 99.9% (2930035712/2930035712) finish=0.0min speed=0K/sec
     bitmap: 14/22 pages [56KB], 65536KB chunk

This can also lead to very high load warnings (>20).

Cancelling in the regular way will just hang too, so first we need to set the state to `active` rather than `write-pending` before cancelling:

 # cat /sys/block/md0/md/array_state
   write-pending
 # echo active > /sys/block/md0/md/array_state
 # cat /sys/devices/virtual/block/md0/md/sync_action
   check
 # echo idle > /sys/devices/virtual/block/md0/md/sync_action
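
The same sequence can be wrapped in a small guard so that it only writes when the array really is in the states shown above. This is a sketch: `cancel_hung_check` is a made-up name, and the md sysfs directory is taken as an argument (defaulting to /sys/block/md0/md) so the logic can be tried against a scratch directory first.

```shell
# cancel_hung_check: hypothetical helper. Must be run as root against real sysfs.
# $1 = md sysfs directory (defaults to /sys/block/md0/md).
cancel_hung_check() {
    md_dir=${1:-/sys/block/md0/md}
    # Only force the state when it is stuck in write-pending.
    if [ "$(cat "$md_dir/array_state")" = "write-pending" ]; then
        echo active > "$md_dir/array_state"
    fi
    # Only cancel when a check is actually the running action.
    if [ "$(cat "$md_dir/sync_action")" = "check" ]; then
        echo idle > "$md_dir/sync_action"
    fi
}

# Usage: cancel_hung_check /sys/block/md0/md
```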

Useful Commands

cat /proc/mdstat 
Display a summary of current raid status
mdadm --detail /dev/md0 
Display raid information on array md0
mdadm --examine /dev/sdf 
Display raid information on device/partition sdf