Thursday, 27 August 2015

Replace a hard disk in a software RAID

A hard disk is one of those components that will fail at some point. To protect against data loss in such a situation, you might have already configured a RAID array. Eventually, you will have to replace a malfunctioning hard disk, even if it has not died completely. This is how to replace a hard disk correctly.
If a hard disk has failed completely, it is obvious that it should be replaced. But even if the hard disk has not completely failed, it can be a wise decision to replace it. A malfunctioning hard disk in your RAID can cause sporadic rebuilds of the array, during which the content of the working disk gets copied to the faulty disk. A rebuild takes some time, and while it runs the read and write throughput of the RAID storage is slower (sometimes much slower) than under normal conditions. This can even cause performance problems in your service.
My advice is to monitor the health of the RAID as well as the S.M.A.R.T. values of your disks whenever possible.
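A minimal check along those lines, assuming mdadm and smartmontools are installed and the device names match the example below, could look like this:
cat /proc/mdstat        # overall state of all software RAID arrays
smartctl -H /dev/sdb    # S.M.A.R.T. health summary of a member disk
smartctl -A /dev/sdb    # detailed S.M.A.R.T. attribute values of that disk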

Remove failed disk from RAID

When the disk causes problems but has not yet failed completely, it is still part of the RAID. To replace the disk, its partitions need to be removed from the RAID array. Before they can be removed, the partitions have to be marked as failed manually.
mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md1 --fail /dev/sdb2
The above commands mark the two partitions of the disk /dev/sdb as failed. With the partitions marked as failed, they can be removed from the RAID configuration.
mdadm /dev/md0 --remove /dev/sdb1
mdadm /dev/md1 --remove /dev/sdb2
With the partitions removed from the RAID configuration, the disk is ready to be removed from the system.
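A quick way to confirm that the partitions really are gone from the arrays before pulling the disk (not part of the original walkthrough, device names as in the example above):
mdadm --detail /dev/md0    # /dev/sdb1 should no longer be listed as an active device
mdadm --detail /dev/md1    # same check for the second array
cat /proc/mdstat           # both arrays should now show up as degraded, e.g. [U_]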

Remove the bad disk and insert the new blank disk

Depending on the hardware capabilities of your system, you can now remove the disk and replace it with the new one. If your system does not support replacing the disk while running, you need to shut down the system for the replacement. If the system has to be shut down, make sure that it boots from the working RAID disk and not from the new disk. It has proven to be good practice in this situation to make sure the new disk is empty before you install it.
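Two small helpers for this step, assuming smartmontools and util-linux are available; /dev/sdX is a placeholder for the replacement disk on whatever machine you use to blank it:
smartctl -i /dev/sdb    # print model and serial number, to identify the physical drive to pull
wipefs -a /dev/sdX      # one way to clear old filesystem and RAID signatures from a reused replacement disk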

Copy the partitions to the new disk

With the new, empty disk installed, it does not yet have the partition layout needed for the RAID configuration. To prepare the disk, its partition layout has to match that of the working disk. The layout can be copied from the working disk to the new one with the following command.
sfdisk --dump /dev/sda | sfdisk --force /dev/sdb
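To verify that the layout was copied correctly, the partition tables of both disks can be listed and compared (a quick sanity check, assuming /dev/sda is the remaining healthy disk as in this example):
sfdisk -l /dev/sda    # partition table of the working disk
sfdisk -l /dev/sdb    # should now show the same layout on the new disk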

Add the new disk to the RAID

With the partition layout in place, the new disk can be added to the RAID configuration. If the system was shut down to replace the disk, check the device names again, as they could have changed during the restart.
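One way to confirm which device name the new disk received after a reboot (not part of the original commands) is to list the block devices and their stable ids:
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT    # overview of block devices and their partitions
ls -l /dev/disk/by-id/                # maps stable ids (including serial numbers) to the sdX names
With the device names confirmed, the partitions can be added back to the arrays.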
mdadm -a /dev/md0 /dev/sdb1
mdadm -a /dev/md1 /dev/sdb2
As soon as the disk’s partitions are added to the RAID, the rebuild starts copying the data to the new disk. Starting with the small boot partition is a good idea, so you can continue with the next steps without having to wait for the larger rebuild to finish.
With the following command you can check the progress of the RAID rebuild.
$ cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sdb1[3] sda1[2]
      204736 blocks super 1.0 [2/1] [U_]
      [======>..............]  recovery = 34.2% (70144/204736) finish=0.0min speed=70144K/sec
md1 : active raid1 sdb2[3] sda2[2]
      35309440 blocks super 1.1 [2/1] [U_]
      resync=DELAYED
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>
The above shows that the RAID partition md0 is in the process of rebuilding while the partition md1 is waiting for md0 to finish the rebuild.
$ cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sdb1[3] sda1[2]
      204736 blocks super 1.0 [2/2] [UU]
md1 : active raid1 sdb2[3] sda2[2]
      35309440 blocks super 1.1 [2/1] [U_]
      [==>..................]  recovery = 11.2% (3988672/35309440) finish=11.8min speed=44086K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>
Checking again after some time will show a different picture. The partition md0 has finished the rebuild process and md1 has started. Partition md0 is the boot-partition in my example, and so the next step can already be performed.
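Instead of re-running the command by hand, the rebuild progress can also be followed continuously (a small convenience, assuming the watch utility is available):
watch -n 5 cat /proc/mdstat    # refresh the RAID status every 5 seconds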

Install grub on the new disk

Rebuilding the RAID partitions only mirrors the content of those partitions. The newly installed disk is not yet bootable, as no boot manager is installed on it. Installing the boot manager requires the rebuild of the boot partition to be completed first.
$ grub
grub> find /grub/stage1
(hd0,0)
(hd1,0)
Executing the command “grub” drops you into the grub shell. The find command above lists the disks that contain the file /grub/stage1. To install grub on the new hard disk, enter the following commands into the grub shell.
grub> device (hd1) /dev/sdb
device (hd1) /dev/sdb
grub> root (hd1,0)
root (hd1,0)
Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd1)
setup (hd1)
Checking if "/boot/grub/stage1" exists... no
Checking if "/grub/stage1" exists... yes
Checking if "/grub/stage2" exists... yes
Checking if "/grub/e2fs_stage1_5" exists... yes
Running "embed /grub/e2fs_stage1_5 (hd1)"...  27 sectors are embedded.
succeeded
Running "install /grub/stage1 (hd1) (hd1)1+27 p (hd1,0)/grub/stage2 /grub/grub.conf"... succeeded
Done.
grub> quit
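The commands above are for the legacy grub shell. On a system that uses GRUB 2 instead, the equivalent step would be a single command (mentioned here as an alternative, not part of the original setup; adapt the device name as needed):
grub-install /dev/sdb    # install the GRUB 2 boot loader into the MBR of the new disk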
As soon as the rebuild of all RAID partitions has finished, the RAID is fully working again. You can test the setup by removing the old disk from the system and rebooting; now that the boot manager is installed on the new disk, the system should boot from it normally. But do keep in mind that the rebuild needs to be finished before you run this test!
