Discussion:
RAID60/mdadm/xfs performance tuning
Paul Anderson
2011-12-05 18:50:58 UTC
I've set up a software RAID-60 array composed of 7 software RAID6's,
each with 32k chunks, 18 devices total (16 data, 2 parity), and in
theory appropriate setup parameters according to a nice white paper
written by Christoph and presented this last summer at LinuxCon.
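
For reference (if I have the math right), the stripe geometry works out
to:

  per-RAID6 stripe width = 16 data disks x 32k chunk = 512k
  full RAID60 width      = 7 RAID6s x 512k           = 3584k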

My question is, if the mdraid and XFS are all configured properly,
would I expect to see any read operations when doing a write-only
test? I would have assumed that I would not, since XFS should write
stripe-aligned sets of data, and in theory nothing needs to be read
(no read-modify-write going on, I would think).
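
One way to watch for this during the test (a rough check only, assuming
sysstat's iostat is installed) is to leave iostat running and look at
the r/s column for the member disks - it should stay near zero if the
writes really are full-stripe:

  iostat -x 5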

The performance is great, but I'm wondering if I need to keep looking.

Thanks,

Paul Anderson

Here are the details for kernel 2.6.38.5:

mdadm --detail /dev/md0 (md1, md2, md3, md4, md5, and md6 all the same)
/dev/md0:
Version : 01.02
Creation Time : Fri Dec 2 14:54:23 2011
Raid Level : raid6
Array Size : 31256214528 (29808.25 GiB 32006.36 GB)
Used Dev Size : 3907026816 (3726.03 GiB 4000.80 GB)
Raid Devices : 18
Total Devices : 18
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Mon Dec 5 13:38:52 2011
State : clean
Active Devices : 18
Working Devices : 18
Failed Devices : 0
Spare Devices : 0

Chunk Size : 32K

/dev/md8 is the RAID0 that stripes across the above RAID6's, making a
single RAID60:

mdadm --detail /dev/md8
/dev/md8:
Version : 01.02
Creation Time : Fri Dec 2 14:55:36 2011
Raid Level : raid0
Array Size : 218793480192 (208657.73 GiB 224044.52 GB)
Raid Devices : 7
Total Devices : 7
Preferred Minor : 8
Persistence : Superblock is persistent

Update Time : Fri Dec 2 14:55:36 2011
State : clean
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0

Chunk Size : 4096K (this is what the RAID0 container thinks, but
I ignore it for xfs)

xfs_info /exports/
meta-data=/dev/md8               isize=256    agcount=204, agsize=268435448 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=54698370048, imaxpct=1
         =                       sunit=8      swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

I made the filesystem like this:
mkfs.xfs -L $(hostname) -l su=32768 -d su=32768,sw=128 /dev/md8

mount options: inode64,largeio,swalloc,delaylog,logbsize=256k,logbufs=8,noatime,nodiratime

I intended to make it with an external log, but forgot.
Dave Chinner
2011-12-05 22:48:20 UTC
Post by Paul Anderson
I've set up a software RAID-60 array composed of 7 software RAID6's,
each with 32k chunks, 18 devices total (16 data, 2 parity), and in
theory appropriate setup parameters according to a nice white paper
written by Christoph and presented this last summer at LinuxCon.
My question is, if the mdraid and XFS are all configured properly,
would I expect to see any read operations when doing a write-only
test? I would have assumed that I would not, since XFS should write
stripe-aligned sets of data, and in theory nothing needs to be read
(no read-modify-write going on, I would think).
That depends. What's your "write only" test?
Post by Paul Anderson
The performance is great, but I'm wondering if I need to keep looking.
If performance is great, then what's the problem?
Post by Paul Anderson
Thanks,
Paul Anderson
mdadm --detail /dev/md0 (md1, md2, md3, md4, md5, and md6 all the same)
....
Post by Paul Anderson
Chunk Size : 32K
/dev/md8 is the RAID0 that stripes across the above RAID6's, making a
single RAID60:
mdadm --detail /dev/md8
....
Post by Paul Anderson
Chunk Size : 4096K (this is what the RAID0 container thinks, but
I ignore it for xfs)
You should set the RAID0 chunk size to the stripe width of the
underlying RAID6 volume (i.e. 512k).
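
Something like the following should do it (a sketch only - recreating
the RAID0 with a different chunk size lays the data out differently, so
you'd have to mkfs again afterwards; device names taken from your
layout above):

  mdadm --stop /dev/md8
  mdadm --create /dev/md8 --level=0 --chunk=512 --raid-devices=7 \
        /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6
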
Post by Paul Anderson
xfs_info /exports/
meta-data=/dev/md8               isize=256    agcount=204, agsize=268435448 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=54698370048, imaxpct=1
         =                       sunit=8      swidth=1024 blks
Because XFS has clearly not been configured correctly. You've given
it a stripe unit of 32k (the RAID6 chunk size), and a width of 4MB
(the RAID0 chunk size).

What you are doing is aligning allocation to individual disks in the
RAID6 volumes, but the filesystem doesn't know what the stripe width
of those volumes is, so it can't really align correctly to the RAID6
geometry. And because it is not set up as a sunit = 128 (512k), it
can't align to the RAID0 on top of it correctly, either.
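
(Those xfs_info values are in 4k filesystem blocks, so:

  sunit  =    8 blks x 4k = 32k   - the RAID6 chunk size
  swidth = 1024 blks x 4k = 4096k - the RAID0 chunk size)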

You need to align all layers of the stack to each other so the
filesystem has a consistent view of stripe unit and widths. In this
configuration, the RAID0 really needs a chunk size of 512k to match
the RAID6 stripe width. Then you can choose from two different valid
alignments for the filesystem - align to the underlying RAID6 or to
the top level RAID0.

If you have a small file intensive workload, then aligning to the
RAID6 is probably best so that small files can pack full RAID6
stripe widths. If you have a bandwidth intensive workload, then
aligning to the RAID0 is probably best so that large writes are
aligned to the full stripe width of the underlying RAID6 devices.
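
As a rough sketch (assuming the RAID0 chunk has been changed to 512k as
above), the two options for mkfs.xfs would look something like:

  # align to the RAID6 members: 32k stripe unit, 16 data disks
  mkfs.xfs -d su=32k,sw=16 /dev/md8

  # align to the top-level RAID0: 512k stripe unit, 7 RAID6 legs
  mkfs.xfs -d su=512k,sw=7 /dev/md8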

Either way, you need to understand and test your workload to improve
on whatever the default XFS settings give you.
Post by Paul Anderson
mkfs.xfs -L $(hostname) -l su=32768 -d su=32768,sw=128 /dev/md8
mount options: inode64,largeio,swalloc,delaylog,logbsize=256k,logbufs=8,noatime,nodiratime
Why largeio,swalloc? Have you determined that you're actually
getting hot disks in your array without them?

FWIW, delaylog and logbufs are the default so you don't need to set
them, and nodiratime is a subset of noatime, so you don't need to
specify that, either.
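
So a trimmed set of mount options - assuming largeio and swalloc turn
out not to be needed - would be something like:

  inode64,logbsize=256k,noatime
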
Post by Paul Anderson
I intended to make it with an external log, but forgot.
So you've determined an internal log is a performance bottleneck for
your workload?

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com