I agree that it may be related to stripe alignment. In my experience, creating an unaligned XFS filesystem on a 3x2TB RAID-0 takes ~5 minutes, but if it is aligned to the stripe size it takes ~10-15 seconds. Here is a command for aligning XFS to a 256KB stripe size:
mkfs.xfs -l internal,lazy-count=1,sunit=512 -d agsize=64g,sunit=512,swidth=1536 -b size=4096 /dev/vg10/lv00
BTW, the stripe width in my case is 3 units, which will be the same for you with 4 drives in RAID-5 (one drive's worth of capacity goes to parity, leaving 3 data-bearing drives).
Obviously, this also improves filesystem performance, so you'd better keep it aligned.
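If your geometry differs, the sunit/swidth pair can be derived with a little sector arithmetic - a sketch, assuming a 256KB chunk and 3 data disks as in the command above:

```shell
# Sketch: derive mkfs.xfs sunit/swidth from the array geometry.
# STRIPE_KB and DATA_DISKS are placeholders -- set them for your own array.
STRIPE_KB=256    # per-disk stripe (chunk) size in KB
DATA_DISKS=3     # data-bearing disks: 3 for a 3-disk RAID-0,
                 # and also 3 for a 4-disk RAID-5 (one disk's worth is parity)

# mkfs.xfs takes sunit/swidth in 512-byte sectors
SUNIT=$(( STRIPE_KB * 1024 / 512 ))
SWIDTH=$(( SUNIT * DATA_DISKS ))
echo "sunit=$SUNIT swidth=$SWIDTH"   # matches the sunit=512,swidth=1536 above
```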
The logic for when Linux applies read-ahead is complicated. Starting in 2.6.23 there's the really fancy On-Demand Readahead; before that, a simpler prediction mechanism was used. The design goals of read-ahead have always included not doing read-ahead unless the read access pattern justifies it. So the idea that the stripe size is a relevant piece of data here is fundamentally unsound. Individual reads at that end of the file I/O range, below the stripe size, aren't normally going to trigger the read-ahead logic anyway. Tiny read-ahead values effectively turn the feature off, and you don't want that.
When you really are doing sequential I/O to a large RAID10 array, the only way to reach full throughput on many systems is to have read-ahead working for you; otherwise Linux won't dispatch requests fast enough to keep the array reading at its full potential. The last few times I tested larger RAID10 arrays, in the 24-disk range, large read-ahead settings (>=4096 sectors = 2048KB) gave 50 to 100% performance gains on sequential I/O, as measured by dd or bonnie++. Try it yourself: run bonnie++, increase read-ahead a lot, and see what happens. If you have a large array, that will quickly dispel the idea that read-ahead numbers smaller than typical stripe sizes make any sense.
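A sketch of that experiment (/dev/md0 here is a placeholder device - run it against your own array, as root, and compare the dd throughput before and after the --setra):

```shell
# Sketch of the read-ahead experiment; /dev/md0 is a placeholder device.
RA=4096                                   # read-ahead, in 512-byte sectors
echo "that is $(( RA * 512 / 1024 ))KB of read-ahead"

DEV=/dev/md0
if [ "$(id -u)" -eq 0 ] && [ -b "$DEV" ]; then
    blockdev --setra "$RA" "$DEV"         # raise read-ahead on the device
    echo 3 > /proc/sys/vm/drop_caches     # drop the page cache for a fair test
    # sequential read test; compare the MB/s figure at different RA values
    dd if="$DEV" of=/dev/null bs=1M count=4096
fi
```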
The Linux kernel is so aware of this necessity that it even automatically increases read-ahead for you when you create some types of arrays. Check out this example from a system with a 2.6.32 kernel:
[root@toy ~]# blockdev --report
RO    RA   SSZ   BSZ  StartSec          Size  Device
rw   256   512  4096         0  905712320512  /dev/md1
rw   768   512   512         0  900026204160  /dev/md0
Why is read-ahead 256 (128KB) on md1 while it's 768 (384KB) on md0? That's because md0 is a 3-disk RAID0, and Linux increases read-ahead knowing it has no hope of achieving full speed across an array of that size with the default of 256. Even that's actually too low; it needs to be 2048 (1024KB) or larger to hit the maximum speed that small array is capable of.
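Those KB figures are just sector arithmetic - blockdev reports read-ahead in 512-byte sectors:

```shell
# blockdev reports RA in 512-byte sectors; convert the values above to KB
for ra in 256 768 2048; do
    echo "$ra sectors = $(( ra * 512 / 1024 ))KB"
done
```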
Much of the lore on low-level RAID settings like stripe sizes and alignment is just that: lore, not reality. Run some benchmarks at a few read-ahead settings yourself, see what happens, and then you'll have known good facts to work with instead.
Best Answer
The existing answers are quite outdated. Here in 2020, it's now possible to grow an mdadm software RAID 10 simply by adding 2 or more same-sized disks.

Creating the example RAID 10 array
For testing purposes, instead of physical drives, I created 6x 10GB LVM volumes, /dev/vg0/rtest1 to rtest6 - which mdadm had no complaints about.

Next, I created a RAID 10 mdadm array using the first 4 rtestX volumes.

Using mdadm -D (equal to --detail), we can see the array has 4x "drives", with a capacity of 20GB out of the 40GB of volumes, as is expected with RAID 10.

Expanding the RAID 10 with 2 new equal-sized volumes/disks
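A sketch of the creation steps just described - the /dev/md0 array name is my assumption, and the lvcreate/mdadm invocations are standard usage rather than the author's exact commands:

```shell
# Sketch of the setup above -- /dev/md0 is an assumed array name, and the
# lvcreate/mdadm flags are standard usage, not the author's exact commands.
if [ "$(id -u)" -eq 0 ] && [ -e /dev/vg0 ]; then
    for i in 1 2 3 4 5 6; do
        lvcreate -L 10G -n "rtest$i" vg0             # 6x 10GB test volumes
    done
    # build the RAID 10 from the first four volumes
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/vg0/rtest1 /dev/vg0/rtest2 /dev/vg0/rtest3 /dev/vg0/rtest4
    mdadm -D /dev/md0                                # --detail: 4 devices, ~20GB usable
fi
echo "expected usable capacity: $(( 4 * 10 / 2 ))GB" # RAID 10 halves raw capacity
```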
To grow the array, first you need to --add the pair(s) of disks to the array, then use --grow --raid-devices=X (where X is the new total number of disks in the RAID) to request that mdadm reshape the RAID 10 to use the 2 spare disks as part of the array.

Monitor the resync process
Here's the boring part - waiting for anywhere from minutes, to hours, to days or even weeks depending on how big your RAID is, until mdadm finishes reshaping around the new drives.
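The add/grow/monitor sequence above might look like this - again, /dev/md0 and the rtestX names are illustrative:

```shell
# Sketch of the grow-and-monitor steps -- /dev/md0 is an assumed array name.
NEW_DEVICES=6                                   # 4 existing + the 2 new volumes
if [ "$(id -u)" -eq 0 ] && [ -b /dev/md0 ]; then
    mdadm --add /dev/md0 /dev/vg0/rtest5 /dev/vg0/rtest6   # add the pair as spares
    mdadm --grow /dev/md0 --raid-devices="$NEW_DEVICES"    # reshape onto all 6
    cat /proc/mdstat                            # shows reshape progress and ETA
fi
echo "reshaping to $NEW_DEVICES devices"
```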
If we check mdadm -D, we can see the RAID is currently reshaping.

Enjoy your larger RAID 10 array!

Eventually, once mdadm finishes reshaping, we can see that the array size is ~30G instead of ~20G, meaning the reshape was successful and relatively painless to do :)
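The ~20G and ~30G figures are plain RAID 10 mirroring arithmetic - usable capacity is half the raw capacity:

```shell
# RAID 10 keeps two copies of everything, so usable capacity = raw / 2
DISK_GB=10
echo "before: $(( 4 * DISK_GB / 2 ))GB, after: $(( 6 * DISK_GB / 2 ))GB"
```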