Mdadm

From Briki
Revision as of 19:46, 7 January 2021 by Andrew (talk | contribs) (Fixing a disk with Current_Pending_Sector > 0)

Overview

Several physical disks (/dev/sdX) or partitions (/dev/sdX1) of equal size are joined into a single array.

Creating a RAID array

  • (Recommended) Create a partition on each disk. Note:
    • Use optimal alignment, with "-a optimal" (this doesn't appear to have any obvious effect on behaviour though!)
    • Use the "GPT" partition table format (to handle disks > 2TB)
    • Name the partition "primary" (note that this is free text)
    • Use 0% for partition start (this will normally mean that the partition start will be at the 1MB boundary, which gives optimal alignment)
    • End 100MB before the end of the disk (this is to allow for slight variances in exact size of similar disks)
    • Set partition type to raid (0xFD00); this is optional, but may encourage some tools to avoid writing directly to the disk (and avoid corrupting the array)
parted -a optimal -s /dev/sdX -- mklabel gpt mkpart primary 0% -100MB set 1 raid on


  • Create a RAID 5 array over 3 partitions:
    • Note, the default metadata version is now 1.2 for create commands
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdX1 /dev/sdY1 /dev/sdZ1
  • Wait (potentially several days) for the array to be built
  • Once built, save the current raid setup to /etc, so that the array is assembled automatically on startup:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
  • Update the initial boot image for all current kernel versions to include the new mdadm.conf:
update-initramfs -u -k all
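  • Start the array (only needed if it isn't already running; a newly created array is normally started by the create command):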
mdadm --assemble /dev/md0 /dev/sdX1 /dev/sdY1 /dev/sdZ1
  • From this point, just treat the array (/dev/md0) as a normal physical disk.
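
The wait for the build can be monitored from /proc/mdstat (see Useful Commands). The following is a minimal sketch, assuming the usual mdstat progress line of the form "recovery = 8.5%"; the function names md_progress and md_busy are hypothetical helpers, not part of mdadm:

```shell
# Print the progress of any resync/recovery/reshape/check found in an mdstat file.
# Usage: md_progress [file]   (file defaults to /proc/mdstat)
md_progress() {
    # -o prints only the matching part of the line, e.g. "recovery =  8.5%"
    grep -Eo '(resync|recovery|reshape|check) *= *[0-9.]+%' "${1:-/proc/mdstat}"
}

# Succeed (exit 0) while such an operation is in progress, fail otherwise.
md_busy() {
    md_progress "${1:-/proc/mdstat}" > /dev/null
}

# Example (the 60 second polling interval is an arbitrary choice):
#   while md_busy; do md_progress; sleep 60; done
```

The same helpers should work for the reshape and resync waits described in the later sections, since those are reported in the same file.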

Convert RAID 1 array to RAID 5

  • Create partition on the new disk as for creating a new array
  • Add the new partition to the array:
mdadm --add /dev/md0 /dev/sdX1
  • Convert the array to RAID 5, with the correct number of devices:
mdadm --grow /dev/md0 --level=5 --raid-devices=3
  • Wait (potentially several days) for the array to be reshaped
  • Grow the partition / volume on /dev/md0

Readding a disk marked faulty

If a disk in the array has been marked faulty for a spurious reason, you'll need to remove it before you can re-add it and rebuild the array. Run:

mdadm /dev/md0 --remove /dev/sdX1
mdadm /dev/md0 --add /dev/sdX1

Fixing a disk with Current_Pending_Sector count > 0

If a disk in the array has a Current_Pending_Sector count > 0, this suggests one or more sectors on the disk couldn't be read. To fix this, the disk's contents need to be rebuilt from the rest of the array: rewriting the whole disk forces the pending sectors to be reallocated. This entails removing the disk from the array, zeroing the superblock (to ensure it can't just be recovered from the bitmap) and then re-adding it.

mdadm /dev/md0 --fail /dev/sdX1
mdadm /dev/md0 --remove /dev/sdX1
mdadm --zero-superblock /dev/sdX1
mdadm /dev/md0 --add /dev/sdX1
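
Since this sequence wipes the superblock of whichever partition is named, it can be worth printing the commands before running them. The following is a minimal sketch of a dry-run wrapper; the names run, rebuild_member and DRY_RUN are hypothetical and not part of mdadm:

```shell
# Run a command, or just print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Force a full rebuild of one member: fail, remove, zero the superblock, re-add.
# Stops at the first command that fails.
# Usage: rebuild_member /dev/md0 /dev/sdX1
rebuild_member() {
    array=$1
    member=$2
    run mdadm "$array" --fail "$member" &&
    run mdadm "$array" --remove "$member" &&
    run mdadm --zero-superblock "$member" &&
    run mdadm "$array" --add "$member"
}

# Example: review the commands first, then run them for real.
#   DRY_RUN=1; rebuild_member /dev/md0 /dev/sdX1
#   DRY_RUN=0; rebuild_member /dev/md0 /dev/sdX1
```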

Recovering from disk failure

  • Check the disk status in mdadm:
mdadm --detail /dev/md0
  • If the disk is already marked as failed, then skip this step. Otherwise:
mdadm /dev/md0 --fail /dev/sdX1
  • From this point, the array will continue to operate in "degraded" mode
  • Remove the failed disk:
mdadm /dev/md0 --remove /dev/sdX1
  • To more easily determine the disk for physical removal from the machine (once powered off), note down the serial number as reported by:
hdparm -i /dev/sdX | grep SerialNo
  • Add a replacement disk:
mdadm /dev/md0 --add /dev/sdY1
  • Wait (potentially several days) for the array to be resynced
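
To pick out which member needs failing or removing, the device table printed by mdadm --detail can be filtered. This is a minimal sketch, assuming each member row ends with the device path and that failed members have "faulty" in their state column; faulty_members is a hypothetical helper name:

```shell
# Read `mdadm --detail` output on stdin and print the device path of every
# member row that is marked faulty.
faulty_members() {
    awk '/faulty/ && $NF ~ /^\/dev\// { print $NF }'
}

# Example:
#   mdadm --detail /dev/md0 | faulty_members
```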

Recover from a dirty reboot of a degraded array

If the server shuts down uncleanly (eg. due to a power cut) when the array is degraded, it will refuse to automatically assemble the array on startup (with a dmesg error of the form "cannot start dirty degraded array"). This is because the data may be in an inconsistent state. In this situation:

  • Check that the good disks have the same number of events. If the numbers differ slightly, that suggests some of the data being written when the server shut down wasn't written fully, and is probably corrupt (hopefully this will just mean a logfile with some bad characters, or similar).
mdadm --examine /dev/sdX1 /dev/sdY1 /dev/sdZ1 | grep Events
  • Assuming the number of events is the same (or very similar), forcibly assemble the array.
mdadm --assemble --force /dev/md0 /dev/sdX1 /dev/sdY1 /dev/sdZ1
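
The event comparison above can also be done mechanically. The following is a minimal sketch, assuming mdadm --examine prints one line of the form "Events : N" per member; event_counts and events_match are hypothetical helper names:

```shell
# Read `mdadm --examine` output on stdin and print the distinct event counts,
# lowest first.
event_counts() {
    awk '$1 == "Events" && $2 == ":" { print $3 }' | sort -n -u
}

# Succeed (exit 0) only if every member reported the same event count.
events_match() {
    awk '$1 == "Events" && $2 == ":" { seen[$3] = 1 }
         END { n = 0; for (e in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Example:
#   mdadm --examine /dev/sdX1 /dev/sdY1 /dev/sdZ1 | events_match \
#       && echo "event counts agree"
```

If the counts differ, event_counts shows how far apart they are, which helps when deciding whether a forced assemble is reasonable.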

Repairing failing disk on degraded array

If the raid5 array is in a good state, then simply removing and readding the faulty drive should be sufficient. However, if the array is already degraded (ie. there’s no redundancy), or the disk problems became apparent when rebuilding the array from a spare drive, any bad sectors on the failing drive will need to be overwritten with new data (probably just zeros) before the disk is good enough to be able to rebuild the array.

  1. Getting information on failing/failed sectors:
    smartctl -a /dev/sdX | grep Pending
    smartctl -l xerror /dev/sdX
    
  2. Analyze/recover data from failing disk:
    1. Ideally, copy all good data to a recovery file sdX.bin, and record details of good/bad sectors in sdX.map (this needs sufficient free space for sdX.bin). This needs to be run when no partitions on the array are mounted:
      ddrescue --ask --verbose --binary-prefixes --idirect /dev/sdX sdX.bin sdX.map
      
    2. If insufficient free space, merely analyze the disk to scan for all failing sectors (force is needed to allow writing to /dev/null). This can be run when partitions are mounted, since we don’t actually care about the data we’re reading, we just care about the bad sectors:
      ddrescue --ask --verbose --binary-prefixes --idirect --force /dev/sdX /dev/null sdX.map
      

      Note that the sdX.map file is human readable, and will generally be quite small. It keeps track of which sectors are good and bad, and can be reused for subsequent ddrescue runs to avoid re-reading good sectors.

  3. Recheck the number of failing sectors, since some may not have been read yet when smartctl was last run:
    smartctl -a /dev/sdX | grep Pending
    
  4. Forcibly re-assemble the array (after checking the number of events mismatching between array members):
    1. If the number of events is wildly different, then it’s possible there will be corrupted data on the array, but in general if the array was marked as failed then no file writes will have been successful, so event discrepancies might not be reflective of a real problem:
      mdadm --examine /dev/sd[XYZ] | grep Events
      
    2. Reassemble the array, if it’s in a good state (note the disk order isn’t important – mdadm will work out the correct order):
      mdadm --assemble --verbose --run /dev/md0 /dev/sd[XYZ]
      
    3. If reassembly was unsuccessful due to mismatched event numbers, then forcibly reassemble it (be very careful here, disk order doesn’t matter but do make sure the correct disk labels are used – check output of the previous assemble to make sure it looks reasonable):
      mdadm --assemble --verbose --run --force /dev/md0 /dev/sd[XYZ]
      
    4. Remount any affected partitions, or restart the machine to remount all on startup
  5. For mdadm raid5 + lvm arrays, there’s no easy way to determine which files inhabit which bad sectors. Instead, we need to read all files by hand to determine which are unreadable. For each partition which includes space on the bad drive (xdev ensures no other mount points are included):
    find /mountpoint -xdev -type f -exec echo {} \; -exec md5sum {} \; 2>&1 | tee mountpoint-files.log
    

    Note: Ensure mountpoint-files.log is written somewhere outside of the array


  6. It’s possible that reading a bad file with md5sum above will again mark the array as failed. If so, reassemble the array using the steps above. Then look in the mountpoint-files.log file for the first failed md5sum (probably logged with “Input/output error”).
    1. Write random data over the bad file, which should force the pending sector to be marked bad and reallocated from spare space on the disk:
      shred -v /path/to/bad/file
      
    2. Check that the number of failing sectors has decreased:
      smartctl -a /dev/sdX | grep Pending
      
    3. Assuming the number of pending sectors has decreased, it’s then ok to delete the bad file:
      rm /path/to/bad/file
      
  7. Repeat md5sum scanning and file deletion until all mountpoints using the disk are free of bad files
  8. Rescan the bad sectors to see which have been fixed by deleting files, reusing the previous known state of the drive (after backing up the original map file). Note that we need “-r 1”, otherwise the bad sectors will be treated as known bad from the previous state and won’t be tried at all:
    cp sdX.map sdX.initial.map
    ddrescue --ask --verbose --binary-prefixes --idirect --force -r 1 /dev/sdX /dev/null sdX.map
    
  9. If any bad sectors remain, then they must be in free space on the drive.
    1. List out all the bad block addresses, based on the ddrescue state file (after backing up the map file):
      ddrescuelog --list-blocks=- sdX.map
      
    2. For each of the bad blocks, check with dd that we’ve got the right block IDs. For each one of these reads we expect to see an error (and “0+0 records in”):
      for block in `ddrescuelog --list-blocks=- sdX.map`
      do
        dd if=/dev/sdX of=/dev/null count=1 bs=512 skip=$block
      done
      
    3. For each of the bad blocks, write zeros over the block to force it to be reallocated from spare space on the drive. Be careful here – getting it wrong will destroy data! Also note that when reading, “skip” is used to position the input stream, but here “seek” is used to position the output stream:
      for block in `ddrescuelog --list-blocks=- sdX.map`
      do
        dd if=/dev/zero of=/dev/sdX count=1 bs=512 seek=$block
      done
      
    4. It’s possible that dd will fail to write to the block, in which case try again with hdparm:
      1. First check that we’ve got the right sectors (we expect to see “SG_IO: bad/missing sense data” for each sector on stderr, so we pipe stdout to /dev/null to avoid noise):
        for block in `ddrescuelog --list-blocks=- sdX.map`
        do
          hdparm --read-sector $block /dev/sdX > /dev/null
        done
        
      2. Assuming we’ve seen the expected errors, write zeros over each of the bad sectors. Be careful here – getting it wrong will destroy data! You may be asked to add a “--yes-i-know-what-i-am-doing” flag.
        for block in `ddrescuelog --list-blocks=- sdX.map`
        do
          hdparm --write-sector $block /dev/sdX
        done
        
  10. Check ddrescue is showing all data as readable (after backing up the map file again):
    cp sdX.map sdX.postshred.map
    ddrescue --ask --verbose --binary-prefixes --idirect --force -r 1 /dev/sdX /dev/null sdX.map
    
  11. Check smartctl is showing no pending sectors:
    smartctl -a /dev/sdX | grep Pending
    
  12. Readd spare drive and start rebuilding array redundancy:
    mdadm --add /dev/md0 /dev/sdW
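
Two parts of the procedure above lend themselves to small helpers: reading the pending sector count, and reviewing the zeroing commands before running anything destructive. The following is a minimal sketch, assuming the raw value is the last column of the Current_Pending_Sector row in the smartctl output, and that ddrescuelog lists one 512-byte block number per line; pending_sectors and print_zero_cmds are hypothetical helper names:

```shell
# Read `smartctl -a` output on stdin and print the raw Current_Pending_Sector
# value (assumed to be the last column of that attribute row).
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $NF }'
}

# Read block numbers (one per line) on stdin and print, WITHOUT running it,
# the dd command that would zero each block on the given disk.
# Usage: ddrescuelog --list-blocks=- sdX.map | print_zero_cmds /dev/sdX
print_zero_cmds() {
    disk=$1
    while read -r block; do
        [ -n "$block" ] || continue   # skip empty lines
        echo "dd if=/dev/zero of=$disk count=1 bs=512 seek=$block"
    done
}

# Example:
#   smartctl -a /dev/sdX | pending_sectors
```

Printing the dd commands first gives a chance to check the disk name and block numbers against the earlier read test before any data is overwritten.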
    

Reducing recovery time after unclean shutdown

The default (at least when I set up mdadm) consistency policy is `resync`, which means a full resync is needed if the machine shuts down uncleanly (eg. due to power loss). To see the current consistency policy:

 mdadm --detail /dev/md0 | grep Consistency

This slow resync can be avoided by setting a different consistency policy (eg. an internal bitmap) with:

 mdadm --grow --bitmap=internal /dev/md0

Changing the consistency policy only takes a few seconds.
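
The check and the change can be combined so that the bitmap is only added when the policy is still `resync`. This is a minimal sketch, assuming the detail output contains a line of the form "Consistency Policy : resync"; consistency_policy is a hypothetical helper name:

```shell
# Read `mdadm --detail` output on stdin and print the consistency policy value.
consistency_policy() {
    sed -n 's/^ *Consistency Policy : *//p'
}

# Example:
#   if [ "$(mdadm --detail /dev/md0 | consistency_policy)" = "resync" ]; then
#       mdadm --grow --bitmap=internal /dev/md0
#   fi
```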

Useful Commands

cat /proc/mdstat 
Display a summary of current raid status
mdadm --detail /dev/md0 
Display raid information on array md0
mdadm --examine /dev/sdf 
Display raid information on device/partition sdf