Discussion:
XFS blocked task in xlog_cil_force_lsn
Kevin Richter
2013-12-18 02:05:51 UTC
Hi,

Around April 2012 there was a similar thread on this list, which I
found via Google, so my subject is the same.

I have a RAID6 array with 5 disks (each 2TB, net: 6TB). While copying
under heavy load I always get these blocked tasks. At the bottom of this
message I have included some lines from the syslog.

Even a reboot is no longer possible, because the whole system
hangs while executing the "sync" command in one of the shutdown scripts.

So... first I thought that my disks were faulty.
But with smartmontools I ran a short and a long test on all of
the 5 disks: no errors.

Then I even recreated the whole array, but no improvement.

Details about my server: 3.2.0-57-generic, Ubuntu 12.04.3 LTS
Details about the array: soft array with mdadm v3.2.5, no hardware raid
controller in the server
$ cat /sys/block/sd[cdefg]/queue/scheduler
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
Any ideas what I can do?


Thanks,
Kevin
Dec 18 02:19:19 kernel: [390470.193560] INFO: task kswapd0:51 blocked for more than 120 seconds.
Dec 18 02:19:19 kernel: [390470.194541] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 18 02:19:19 kernel: [390470.195795] kswapd0 D ffffffff81806240 0 51 2 0x00000000
Dec 18 02:19:19 kernel: [390470.195804] ffff8804658099f0 0000000000000046 ffff880465809a74 ffff880465809a78
Dec 18 02:19:19 kernel: [390470.195813] ffff880465809fd8 ffff880465809fd8 ffff880465809fd8 00000000000137c0
Dec 18 02:19:19 kernel: [390470.195820] ffff880465dd1700 ffff880465818000 0000000000000000 ffff88046175aa28
Dec 18 02:19:19 kernel: [390470.195844] [<ffffffff8165f62f>] schedule+0x3f/0x60
Dec 18 02:19:19 kernel: [390470.195850] [<ffffffff81660437>] __mutex_lock_slowpath+0xd7/0x150
Dec 18 02:19:19 kernel: [390470.195856] [<ffffffff8166004a>] mutex_lock+0x2a/0x50
Dec 18 02:19:19 kernel: [390470.195904] [<ffffffffa029dcf8>] xfs_reclaim_inodes_ag+0x2d8/0x3a0 [xfs]
Dec 18 02:19:19 kernel: [390470.195928] [<ffffffffa029d83a>] ? __xfs_inode_set_reclaim_tag+0x4a/0x140 [xfs]
Dec 18 02:19:19 kernel: [390470.195951] [<ffffffffa029d96d>] ? xfs_inode_set_reclaim_tag+0x3d/0xa0 [xfs]
Dec 18 02:19:19 kernel: [390470.195973] [<ffffffffa029d9b4>] ? xfs_inode_set_reclaim_tag+0x84/0xa0 [xfs]
Dec 18 02:19:19 kernel: [390470.195995] [<ffffffffa029c6a5>] ? xfs_fs_destroy_inode+0x65/0x150 [xfs]
Dec 18 02:19:19 kernel: [390470.196003] [<ffffffff81194dfc>] ? destroy_inode+0x3c/0x70
Dec 18 02:19:19 kernel: [390470.196008] [<ffffffff81194f5a>] ? evict+0x12a/0x1c0
Dec 18 02:19:19 kernel: [390470.196016] [<ffffffff8166154e>] ? _raw_spin_lock+0xe/0x20
Dec 18 02:19:19 kernel: [390470.196037] [<ffffffffa029df63>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
Dec 18 02:19:19 kernel: [390470.196059] [<ffffffffa029a715>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
Dec 18 02:19:19 kernel: [390470.196067] [<ffffffff8117d71e>] prune_super+0x10e/0x1a0
Dec 18 02:19:19 kernel: [390470.196073] [<ffffffff8112b924>] shrink_slab+0x154/0x300
Dec 18 02:19:19 kernel: [390470.196079] [<ffffffff8112ec1a>] balance_pgdat+0x50a/0x6d0
Dec 18 02:19:19 kernel: [390470.196085] [<ffffffff8112ef01>] kswapd+0x121/0x220
Dec 18 02:19:19 kernel: [390470.196090] [<ffffffff8112ede0>] ? balance_pgdat+0x6d0/0x6d0
Dec 18 02:19:19 kernel: [390470.196097] [<ffffffff8108b64c>] kthread+0x8c/0xa0
Dec 18 02:19:19 kernel: [390470.196105] [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10
Dec 18 02:19:19 kernel: [390470.196111] [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0
Dec 18 02:19:19 kernel: [390470.196117] [<ffffffff8166bcf0>] ? gs_change+0x13/0x13
Dec 18 02:19:19 kernel: [390470.196121] INFO: task kswapd1:52 blocked for more than 120 seconds.
Dec 18 02:19:19 kernel: [390470.197086] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 18 02:19:19 kernel: [390470.198427] kswapd1 D 0000000000000007 0 52 2 0x00000000
Dec 18 02:19:19 kernel: [390470.198434] ffff88046580b840 0000000000000046 0000000000000001 00000000ffffffff
Dec 18 02:19:19 kernel: [390470.198441] ffff88046580bfd8 ffff88046580bfd8 ffff88046580bfd8 00000000000137c0
Dec 18 02:19:19 kernel: [390470.198447] ffff880865e8ae00 ffff880465819700 ffff88046580b830 0000000000000004
Dec 18 02:19:19 kernel: [390470.198462] [<ffffffff8165f62f>] schedule+0x3f/0x60
Dec 18 02:19:19 kernel: [390470.198498] [<ffffffffa02e4d17>] xlog_cil_force_lsn+0xf7/0x120 [xfs]
Dec 18 02:19:19 kernel: [390470.198510] [<ffffffff811643af>] ? unfreeze_partials+0x24f/0x2b0
Dec 18 02:19:19 kernel: [390470.198521] [<ffffffff810608e0>] ? try_to_wake_up+0x200/0x200
Dec 18 02:19:19 kernel: [390470.198554] [<ffffffffa02d1c65>] ? xfs_iunpin_wait+0x35/0xb0 [xfs]
Dec 18 02:19:19 kernel: [390470.198587] [<ffffffffa02e3367>] _xfs_log_force_lsn+0x57/0x2e0 [xfs]
Dec 18 02:19:19 kernel: [390470.198617] [<ffffffffa02ba398>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
Dec 18 02:19:19 kernel: [390470.198646] [<ffffffffa02ae972>] ? xfs_bmap_search_multi_extents+0x92/0x110 [xfs]
Dec 18 02:19:19 kernel: [390470.198677] [<ffffffffa02d1c65>] ? xfs_iunpin_wait+0x35/0xb0 [xfs]
Dec 18 02:19:19 kernel: [390470.198710] [<ffffffffa02e3608>] xfs_log_force_lsn+0x18/0x40 [xfs]
Dec 18 02:19:19 kernel: [390470.198740] [<ffffffffa02cfdd4>] xfs_iunpin_nowait+0x44/0xb0 [xfs]
Dec 18 02:19:19 kernel: [390470.198769] [<ffffffffa02d1c65>] xfs_iunpin_wait+0x35/0xb0 [xfs]
Dec 18 02:19:19 kernel: [390470.198780] [<ffffffff8104ef9b>] ? try_wait_for_completion+0x4b/0x60
Dec 18 02:19:19 kernel: [390470.198806] [<ffffffffa029cf72>] xfs_reclaim_inode+0xb2/0x2a0 [xfs]
Dec 18 02:19:19 kernel: [390470.198818] [<ffffffff81314a4c>] ? radix_tree_gang_lookup_tag.part.7+0x6c/0xf0
Dec 18 02:19:19 kernel: [390470.198844] [<ffffffffa029dc5f>] xfs_reclaim_inodes_ag+0x23f/0x3a0 [xfs]
Dec 18 02:19:19 kernel: [390470.198871] [<ffffffffa029df63>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
Dec 18 02:19:19 kernel: [390470.198895] [<ffffffffa029a715>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
Dec 18 02:19:19 kernel: [390470.198905] [<ffffffff8117d71e>] prune_super+0x10e/0x1a0
Dec 18 02:19:19 kernel: [390470.198913] [<ffffffff8112b924>] shrink_slab+0x154/0x300
Dec 18 02:19:19 kernel: [390470.198922] [<ffffffff8112ec1a>] balance_pgdat+0x50a/0x6d0
Dec 18 02:19:19 kernel: [390470.198931] [<ffffffff8112ef01>] kswapd+0x121/0x220
Dec 18 02:19:19 kernel: [390470.198938] [<ffffffff8112ede0>] ? balance_pgdat+0x6d0/0x6d0
Dec 18 02:19:19 kernel: [390470.198947] [<ffffffff8108b64c>] kthread+0x8c/0xa0
Dec 18 02:19:19 kernel: [390470.198956] [<ffffffff8105f750>] ? __migrate_task+0x120/0x120
Dec 18 02:19:19 kernel: [390470.198966] [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10
Dec 18 02:19:19 kernel: [390470.198974] [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0
Dec 18 02:19:19 kernel: [390470.198984] [<ffffffff8166bcf0>] ? gs_change+0x13/0x13
Dec 18 02:19:19 kernel: [390470.199017] INFO: task xfsaild/dm-0:19192 blocked for more than 120 seconds.
Dec 18 02:19:19 kernel: [390470.199828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 18 02:19:19 kernel: [390470.200262] xfsaild/dm-0 D 0000000000000001 0 19192 2 0x00000000
Dec 18 02:19:19 kernel: [390470.200268] ffff88084e847a60 0000000000000046 ffff88046454d500 0000000000000000
Dec 18 02:19:19 kernel: [390470.200274] ffff88084e847fd8 ffff88084e847fd8 ffff88084e847fd8 00000000000137c0
Dec 18 02:19:19 kernel: [390470.200281] ffff88045fdd0000 ffff880461b1dc00 ffff88084e847a50 ffff88085f47d080
Dec 18 02:19:19 kernel: [390470.200293] [<ffffffff8165f62f>] schedule+0x3f/0x60
Dec 18 02:19:19 kernel: [390470.200323] [<ffffffffa02e25da>] xlog_state_get_iclog_space+0xea/0x2e0 [xfs]
Dec 18 02:19:19 kernel: [390470.200330] [<ffffffff810608e0>] ? try_to_wake_up+0x200/0x200
Dec 18 02:19:19 kernel: [390470.200358] [<ffffffffa02e2b99>] xlog_write+0xe9/0x400 [xfs]
Dec 18 02:19:19 kernel: [390470.200382] [<ffffffffa02a2877>] ? kmem_zone_alloc+0x67/0xe0 [xfs]
Dec 18 02:19:19 kernel: [390470.200410] [<ffffffffa02e440c>] xlog_cil_push+0x20c/0x380 [xfs]
Dec 18 02:19:19 kernel: [390470.200438] [<ffffffffa02e4d2e>] xlog_cil_force_lsn+0x10e/0x120 [xfs]
Dec 18 02:19:19 kernel: [390470.200446] [<ffffffff810136e5>] ? __switch_to+0xf5/0x360
Dec 18 02:19:19 kernel: [390470.200473] [<ffffffffa02e3098>] _xfs_log_force+0x68/0x2a0 [xfs]
Dec 18 02:19:19 kernel: [390470.200481] [<ffffffff81078cb2>] ? try_to_del_timer_sync+0x92/0x130
Dec 18 02:19:19 kernel: [390470.200507] [<ffffffffa02e32e8>] xfs_log_force+0x18/0x40 [xfs]
Dec 18 02:19:19 kernel: [390470.200535] [<ffffffffa02e7743>] xfsaild_push+0x213/0x680 [xfs]
Dec 18 02:19:19 kernel: [390470.200540] [<ffffffff8165fb45>] ? schedule_timeout+0x175/0x320
Dec 18 02:19:19 kernel: [390470.200568] [<ffffffffa02e7c12>] xfsaild+0x62/0xc0 [xfs]
Dec 18 02:19:19 kernel: [390470.200593] [<ffffffffa02e7bb0>] ? xfsaild_push+0x680/0x680 [xfs]
Dec 18 02:19:19 kernel: [390470.200599] [<ffffffff8108b64c>] kthread+0x8c/0xa0
Dec 18 02:19:19 kernel: [390470.200605] [<ffffffff8166bcf4>] kernel_thread_helper+0x4/0x10
Dec 18 02:19:19 kernel: [390470.200611] [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0
Dec 18 02:19:19 kernel: [390470.200617] [<ffffffff8166bcf0>] ? gs_change+0x13/0x13
Stan Hoeppner
2013-12-18 03:38:17 UTC
Kevin Richter
2013-12-18 10:27:54 UTC
Thanks for your mails!
Post by Stefan Ring
This is unusual. How long have you waited?
For the reboot? One night.
After the copy process hangs: several hours. But mostly it recovers
after several minutes.
Post by Stefan Ring
1. Switch to deadline. CFQ is not suitable for RAID storage, and not
suitable for XFS. This may not be a silver bullet but it will help.
Can I switch it while my copy process (from a separate hd to this raid)
is running... without data loss? Otherwise I would wait a bit, because
it has now been running for 8 hours without kernel panics.
Post by Stefan Ring
2. Post your chunk size and RAID6 stripe_cache_size value. They may be
sub optimal for your workload.
$ cat /sys/block/md2/md/stripe_cache_size
256
$ mdadm --detail /dev/md2 | grep Chunk
Chunk Size : 512K
Post by Stefan Ring
3. Post 'xfs_info /dev/mdX'
There is a LUKS volume around /dev/md2, named '6tb'.
Post by Stefan Ring
$ xfs_info /dev/md2
xfs_info: /dev/md2 is not a mounted XFS filesystem
$ xfs_info /dev/mapper/6tb
meta-data=/dev/mapper/6tb isize=256 agcount=32, agsize=45631360 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=1460203520, imaxpct=5
= sunit=128 swidth=384 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
4. You're getting a lot of kswapd timeouts because you have swap and
the md/RAID6 array on the same disks. Relocate swap to disks that are
not part of this RAID6. Small SSDs are cheap and fast. Buy one and put
swap on it. Or install more RAM in the machine. Going the SSD route is
better as it gives flexibility. For instance, you can also relocate
your syslog files to it and anything else that does IO without eating
lots of space. This decreases the IOPS load on your rust.
No no, swap is not on any of the raid disks.
Post by Stefan Ring
# cat /proc/swaps
Filename Type Size Used Priority
/dev/sda3 partition 7812496 0 -1
sda is not in the raid. In the raid there are sd[cdefg].
Post by Stefan Ring
5. Describe in some detail the workload(s) causing the heavy IO, and
thus these timeouts.
cd /olddharddisk
cp -av . /raid/

/olddharddisk is a mounted 1TB old harddisk, /raid is the 6TB raid from
above.
Post by Stefan Ring
top - 11:13:37 up 4 days, 21:32, 2 users, load average: 12.95, 11.33, 10.32
Tasks: 155 total, 2 running, 153 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 5.7%sy, 0.0%ni, 82.1%id, 11.8%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 32916276k total, 32750240k used, 166036k free, 10076760k buffers
Swap: 7812496k total, 0k used, 7812496k free, 21221136k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
699 root 20 0 0 0 0 S 11 0.0 248:17.59 md2_raid6
Don't know what consumes all of this 32GB RAM... 'top' sorted by memory
consumption does not tell me. All entries are only 0.0% and 0.1%




Thanks,
Kevin
Post by Stefan Ring
Post by Kevin Richter
Hi,
around April 2012 there was a similar thread on this list which I have
found via Google, so my mail topic is the same.
I have a RAID6 array with 5 disks (each 2TB, net: 6TB). While copying
under heavy load there are always these blocks. At the bottom of this
message I have included some line from the syslog.
Even a reboot is now not possible anymore, because the whole system
hangs while executing the "sync" command in one of the shutdown scripts.
So... first I have thought that my disks are faulty.
But with smartmontools I have started a short and a long test on all of
the 5 disks: no errors
Then I have even recreated the whole array, but no improvement.
Details about my server: 3.2.0-57-generic, Ubuntu 12.04.3 LTS
Details about the array: soft array with mdadm v3.2.5, no hardware raid
controller in the server
$ cat /sys/block/sd[cdefg]/queue/scheduler
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
Any ideas what I can do?
Your workload is seeking the disks to death, which is why you're getting
these timeouts. The actuators simply can't keep up.
1. Switch to deadline. CFQ is not suitable for RAID storage, and not
suitable for XFS. This may not be a silver bullet but it will help.
2. Post your chunk size and RAID6 stripe_cache_size value. They may be
sub optimal for your workload. For the latter
$ cat /sys/block/mdX/md/stripe_cache_size
3. Post 'xfs_info /dev/mdX'
4. You're getting a lot of kswapd timeouts because you have swap and
the md/RAID6 array on the same disks. Relocate swap to disks that are
not part of this RAID6. Small SSDs are cheap and fast. Buy one and put
swap on it. Or install more RAM in the machine. Going the SSD route is
better as it gives flexibility. For instance, you can also relocate
your syslog files to it and anything else that does IO without eating
lots of space. This decreases the IOPS load on your rust.
5. Describe in some detail the workload(s) causing the heavy IO, and
thus these timeouts.
Stan Hoeppner
2013-12-19 14:11:42 UTC
Post by Kevin Richter
Thanks for your mails!
Post by Stefan Ring
This is unusual. How long have you waited?
For the reboot? One night.
After the copy process hangs: several hours. But mostly it recovers
after several minutes.
Post by Stefan Ring
1. Switch to deadline. CFQ is not suitable for RAID storage, and not
suitable for XFS. This may not be a silver bullet but it will help.
Can I switch it while my copy process (from a separate hd to this raid)
is running... without data loss? Otherwise I would wait a bit, because
now it is actually running for 8 hours without kernel panics.
Yes, you can safely change the elevator on the fly. Repeat the
following once for each disk, replacing X appropriately.

$ echo deadline > /sys/block/sdX/queue/scheduler

This on the fly change is not permanent. Add "elevator=deadline" to
your kernel command line in your bootloader config file. Or you can add
the commands to a startup script. Or if you build your own kernels
simply make it the default, and/or remove all the other elevators from
your kernel config, which is what I do.
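
For example, with GRUB2 on Ubuntu (just a sketch, adjust to whatever
options are already in your config) you would put something like

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=deadline"

into /etc/default/grub and then run

$ sudo update-grub

so the option is applied on every boot.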
Post by Kevin Richter
Post by Stefan Ring
2. Post your chunk size and RAID6 stripe_cache_size value. They may be
sub optimal for your workload.
$ cat /sys/block/md2/md/stripe_cache_size
256
256 is the default and it is way too low. This is limiting your write
throughput. Increase this to a minimum of 1024, which will give you a
20MB stripe cache buffer. This should become active immediately. Add
it to a startup script to make it permanent.

$ echo 1024 > /sys/block/md2/md/stripe_cache_size

(4096 * stripe_cache_size) * num_drives = RAM consumed for stripe cache
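
To put numbers on that for your setup (4096 byte pages, 5 drives):

(4096 * 1024) * 5 = 20,971,520 bytes, i.e. the ~20MB mentioned above
(4096 * 4096) * 5 = 83,886,080 bytes, i.e. roughly 80MB at the top end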

At some future point when the system is idle, play with FIO write
testing and multiple jobs (threads) with stripe_cache_size values up to
2048 to see which value gives the best throughput. Test 1024, 2048,
4096. Going higher probably won't gain anything. And at some point as
you go higher your throughput will decrease. When you see that, select
the previous value and make it permanent.
Post by Kevin Richter
$ mdadm --detail /dev/md2 | grep Chunk
Chunk Size : 512K
Post by Stefan Ring
3. Post 'xfs_info /dev/mdX'
There is a LUKS volume around /dev/md2, named '6tb'.
Ok, this adds a huge fly to the ointment. LUKS is single threaded per
device. You've encrypted the md device instead of the individual disks.
This limits your encryption throughput to a single CPU core. You've
not provided any data on your LUKS thread. Is it pegging a core during
large file IO? I'd bet it is. That would explain all of this.
Post by Kevin Richter
Post by Stefan Ring
$ xfs_info /dev/md2
xfs_info: /dev/md2 is not a mounted XFS filesystem
$ xfs_info /dev/mapper/6tb
meta-data=/dev/mapper/6tb isize=256 agcount=32, agsize=45631360 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=1460203520, imaxpct=5
= sunit=128 swidth=384 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
md2 is 5 spindle RAID6, geometry 512KB * 3 = 1.5MB stripe width. The
XFS alignment matches the md device. This may be hurting you with LUKS
between XFS and md. I don't know how LUKS handles 1.5MB inbound writes
and if it passes an aligned write to md after encrypting. I also don't
know how it handles write barriers passed down the stack by XFS, or if
it does at all.
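
For reference, the sunit/swidth values in the xfs_info output above are
in 4096-byte filesystem blocks, so they work out to:

sunit  = 128 blocks * 4096 bytes = 512KiB (one md chunk)
swidth = 384 blocks * 4096 bytes = 1536KiB = 512KiB * 3 data disks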
Post by Kevin Richter
Post by Stefan Ring
4. You're getting a lot of kswapd timeouts because you have swap and
the md/RAID6 array on the same disks. Relocate swap to disks that are
not part of this RAID6. Small SSDs are cheap and fast. Buy one and put
swap on it. Or install more RAM in the machine. Going the SSD route is
better as it gives flexibility. For instance, you can also relocate
your syslog files to it and anything else that does IO without eating
lots of space. This decreases the IOPS load on your rust.
No no, swap is not on any of the raid disks.
Post by Stefan Ring
# cat /proc/swaps
Filename Type Size Used Priority
/dev/sda3 partition 7812496 0 -1
sda is not in the raid. In the raid there are sd[cdefg].
I was wrong here. After reading a little about kswapd I now know it
doesn't simply move pages to/from disk. It frees memory in other ways
as well. In this case it's apparently waiting to free an inode being
reclaimed by xfs_reclaim_inode, which is stalled on lower level IO. So
kswapd times out as a result. A slow LUKS thread would explain this as
well.
Post by Kevin Richter
Post by Stefan Ring
5. Describe in some detail the workload(s) causing the heavy IO, and
thus these timeouts.
cd /olddharddisk
cp -av . /raid/
oldhardddisk is a mounted 1tb old harddisk, /raid is the 6tb raid from
above.
So you are simply doing a huge file copy from an apparently slower
single disk to an apparently faster striped RAID. But you get horrible
throughput and kernel timeout errors, and the RAID disks are apparently
fairly idle.
Model # of the CPUs so I can look up the specs?
Post by Kevin Richter
Post by Stefan Ring
top - 11:13:37 up 4 days, 21:32, 2 users, load average: 12.95, 11.33, 10.32
Tasks: 155 total, 2 running, 153 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 5.7%sy, 0.0%ni, 82.1%id, 11.8%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 32916276k total, 32750240k used, 166036k free, 10076760k buffers
Swap: 7812496k total, 0k used, 7812496k free, 21221136k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
699 root 20 0 0 0 0 S 11 0.0 248:17.59 md2_raid6
Don't know what consumes all of this 32GB RAM... 'top' sorted by memory
consumption does not tell me. All entries are only 0.0% and 0.1%
~30GB of your RAM is being eaten by filesystem metadata buffers (10GB)
and page cache (20GB). Linux uses nearly all free memory for cache and
buffers to prevent disk accesses, which speeds up access to frequently
used data considerably. This is normal. A huge copy like this is going
to cause a huge amount of caching.

I assume this 1TB file copy has finally completed. Start another large
test copy and paste the CPU burn for the LUKS thread so we can confirm
whether this is the root cause of your problem.
--
Stan
Post by Kevin Richter
Thanks,
Kevin
Post by Stefan Ring
Post by Kevin Richter
Hi,
around April 2012 there was a similar thread on this list which I have
found via Google, so my mail topic is the same.
I have a RAID6 array with 5 disks (each 2TB, net: 6TB). While copying
under heavy load there are always these blocks. At the bottom of this
message I have included some line from the syslog.
Even a reboot is now not possible anymore, because the whole system
hangs while executing the "sync" command in one of the shutdown scripts.
So... first I have thought that my disks are faulty.
But with smartmontools I have started a short and a long test on all of
the 5 disks: no errors
Then I have even recreated the whole array, but no improvement.
Details about my server: 3.2.0-57-generic, Ubuntu 12.04.3 LTS
Details about the array: soft array with mdadm v3.2.5, no hardware raid
controller in the server
$ cat /sys/block/sd[cdefg]/queue/scheduler
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
noop deadline [cfq]
Any ideas what I can do?
Your workload is seeking the disks to death, which is why you're getting
these timeouts. The actuators simply can't keep up.
1. Switch to deadline. CFQ is not suitable for RAID storage, and not
suitable for XFS. This may not be a silver bullet but it will help.
2. Post your chunk size and RAID6 stripe_cache_size value. They may be
sub optimal for your workload. For the latter
$ cat /sys/block/mdX/md/stripe_cache_size
3. Post 'xfs_info /dev/mdX'
4. You're getting a lot of kswapd timeouts because you have swap and
the md/RAID6 array on the same disks. Relocate swap to disks that are
not part of this RAID6. Small SSDs are cheap and fast. Buy one and put
swap on it. Or install more RAM in the machine. Going the SSD route is
better as it gives flexibility. For instance, you can also relocate
your syslog files to it and anything else that does IO without eating
lots of space. This decreases the IOPS load on your rust.
5. Describe in some detail the workload(s) causing the heavy IO, and
thus these timeouts.
Kevin Richter
2013-12-20 10:26:47 UTC
Thanks a lot for your huge replies!
Post by Dave Chinner
Oh, that's quite old. I'd suggest upgrading to a much more recent
kernel as we've fixed lots of issues in this area since then.
First I have switched to the newer kernel from Ubuntu Saucy:
$ uname -a
Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP

Thus, it seems that the default scheduler has been changed to deadline.
I did not change anything. After a reboot the schedulers of all disks
are now deadline.
Post by Dave Chinner
Model # of the CPUs so I can look up the specs?
Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz

I have prepared a folder with 60GB of data for the tests. This is nearly
twice the available memory, so the process should be forced to actually
write the data to disk - and not only hold it in memory.
Post by Dave Chinner
$ echo 256 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /6tb/foo1/
real 25m38.925s
user 0m0.595s
sys 1m23.182s
$ echo 1024 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo2/
real 7m32.824s
user 0m0.438s
sys 1m6.759s
$ echo 2048 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo3/
real 5m32.847s
user 0m0.418s
sys 1m5.671s
$ echo 4096 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo4/
real 5m54.554s
user 0m0.437s
sys 1m6.268s
The difference is really amazing! So 2048 seems to be the best choice.
60GB in 5.5 minutes is about 180MB/s. That sounds a bit high, doesn't it?
The RAID only consists of 5 SATA disks with 7200rpm.
Post by Dave Chinner
top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17
Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers
Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1
23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1
23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0
3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6
4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs
23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2
23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp
84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1
83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0
23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2
21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1
And best of all:
during all of these tests there were no warnings/panics in the syslog.


With best regards,
Kevin
Post by Dave Chinner
Post by Kevin Richter
Thanks for your mails!
Post by Stefan Ring
This is unusual. How long have you waited?
For the reboot? One night.
After the copy process hangs: several hours. But mostly it recovers
after several minutes.
Post by Stefan Ring
1. Switch to deadline. CFQ is not suitable for RAID storage, and not
suitable for XFS. This may not be a silver bullet but it will help.
Can I switch it while my copy process (from a separate hd to this raid)
is running... without data loss? Otherwise I would wait a bit, because
now it is actually running for 8 hours without kernel panics.
Yes, you can safely change the elevator on the fly. Repeat the
following once for each disk, replacing X appropriately.
$ echo deadline > /sys/block/sdX/queue/scheduler
This on the fly change is not permanent. Add "elevator=deadline" to
your kernel command line in your bootloader config file. Or you can add
the commands to a startup script. Or if you build your own kernels
simply make it the default, and/or remove all the other elevators from
your kernel config, which is what I do.
Post by Kevin Richter
Post by Stefan Ring
2. Post your chunk size and RAID6 stripe_cache_size value. They may be
sub optimal for your workload.
$ cat /sys/block/md2/md/stripe_cache_size
256
256 is the default and it is way too low. This is limiting your write
throughput. Increase this to a minimum of 1024, which will give you a
20MB stripe cache buffer. This should become active immediately. Add
it to a startup script to make it permanent.
$ echo 1024 > /sys/block/md2/md/stripe_cache_size
(4096 * stripe_cache_size) * num_drives = RAM consumed for stripe cache
At some future point when the system is idle, play with FIO write
testing and multiple jobs (threads) with stripe_cache_size values up to
2048 to see which value gives the best throughput. Test 1024, 2048,
4096. Going higher probably won't gain anything. And at some point as
you go higher your throughput will decrease. When you see that, select
the previous value and make it permanent.
Post by Kevin Richter
$ mdadm --detail /dev/md2 | grep Chunk
Chunk Size : 512K
Post by Stefan Ring
3. Post 'xfs_info /dev/mdX'
There is a LUKS volume around /dev/md2, named '6tb'.
Ok, this adds a huge fly to the ointment. LUKS is single threaded per
device. You've encrypted the md device instead of the individual disks.
This limits your encryption throughput to a single CPU core. You've
not provided any data on your LUKS thread. Is it pegging a core during
large file IO? I'd bet it is. That would explain all of this.
Post by Kevin Richter
Post by Stefan Ring
$ xfs_info /dev/md2
xfs_info: /dev/md2 is not a mounted XFS filesystem
$ xfs_info /dev/mapper/6tb
meta-data=/dev/mapper/6tb isize=256 agcount=32, agsize=45631360 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=1460203520, imaxpct=5
= sunit=128 swidth=384 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
md2 is 5 spindle RAID6, geometry 512KB * 3 = 1.5MB stripe width. The
XFS alignment matches the md device. This may be hurting you with LUKS
between XFS and md. I don't know how LUKS handles 1.5MB inbound writes
and if it passes an aligned write to md after encrypting. I also don't
know how it handles write barriers passed down the stack by XFS, or if
it does at all.
Post by Kevin Richter
Post by Stefan Ring
4. You're getting a lot of kswapd timeouts because you have swap and
the md/RAID6 array on the same disks. Relocate swap to disks that are
not part of this RAID6. Small SSDs are cheap and fast. Buy one and put
swap on it. Or install more RAM in the machine. Going the SSD route is
better as it gives flexibility. For instance, you can also relocate
your syslog files to it and anything else that does IO without eating
lots of space. This decreases the IOPS load on your rust.
No no, swap is not on any of the raid disks.
Post by Stefan Ring
# cat /proc/swaps
Filename Type Size Used Priority
/dev/sda3 partition 7812496 0 -1
sda is not in the raid. In the raid there are sd[cdefg].
I was wrong here. After reading a little about kswapd I now know it
doesn't simply move pages to/from disk. It frees memory in other ways
as well. In this case it's apparently waiting to free an inode being
reclaimed by xfs_reclaim_inode, which is stalled on lower level IO. So
kswapd times out as a result. A slow LUKS thread would explain this as
well.
Post by Kevin Richter
Post by Stefan Ring
5. Describe in some detail the workload(s) causing the heavy IO, and
thus these timeouts.
cd /olddharddisk
cp -av . /raid/
oldhardddisk is a mounted 1tb old harddisk, /raid is the 6tb raid from
above.
So you are simply doing a huge file copy from an apparently slower
single disk to an apparently faster striped RAID. But you get horrible
throughput and kernel timeout errors, and the RAID disks are apparently
fairly idle.
Model # of the CPUs so I can look up the specs?
Post by Kevin Richter
Post by Stefan Ring
top - 11:13:37 up 4 days, 21:32, 2 users, load average: 12.95, 11.33, 10.32
Tasks: 155 total, 2 running, 153 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 5.7%sy, 0.0%ni, 82.1%id, 11.8%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 32916276k total, 32750240k used, 166036k free, 10076760k buffers
Swap: 7812496k total, 0k used, 7812496k free, 21221136k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
699 root 20 0 0 0 0 S 11 0.0 248:17.59 md2_raid6
Don't know what consumes all of this 32GB RAM... 'top' sorted by memory
consumption does not tell me. All entries are only 0.0% and 0.1%
~30GB of your RAM is being eaten by filesystem metadata buffers (10GB)
and page cache (20GB). Linux uses nearly all free memory for cache and
buffers to prevent disk accesses, which speeds up access to frequently
used data considerably. This is normal. A huge copy like this is going
to cause a huge amount of caching.
I assume this 1TB file copy has finally completed. Start another large
test copy and paste the CPU burn for the LUKS thread so we can confirm
whether this is the root cause of your problem.
Stan Hoeppner
2013-12-20 12:36:33 UTC
Post by Kevin Richter
Thanks a lot for your huge replies!
You're welcome.
Post by Kevin Richter
Post by Dave Chinner
Oh, that's quite old. I'd suggest upgrading to a much more recent
kernel as we've fixed lots of issues in this area since then.
$ uname -a
Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP
Thus, it seems that the default scheduler has been changed to deadline.
I did not change anything. After a reboot the schedulers of all disks
are now deadline.
Good move Ubuntu.
Post by Kevin Richter
Post by Dave Chinner
Model # of the CPUs so I can look up the specs?
Strong CPUs.
Post by Kevin Richter
I have prepared a folder with 60GB for the tests. This is nearly twice
of the available memory, so the process should be forced to actually
write the stuff to the disk - and not only hold in the memory.
Post by Dave Chinner
$ echo 256 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /6tb/foo1/
real 25m38.925s
user 0m0.595s
sys 1m23.182s
$ echo 1024 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo2/
real 7m32.824s
user 0m0.438s
sys 1m6.759s
$ echo 2048 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo3/
real 5m32.847s
user 0m0.418s
sys 1m5.671s
$ echo 4096 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo4/
real 5m54.554s
user 0m0.437s
sys 1m6.268s
The difference is really amazing! So 2048 seems to be the best choice.
60GB in 5.5 minutes is about 180MB/s. That sounds a bit high, doesn't it?
The RAID only consists of 5 SATA disks with 7200rpm.
A lot of the source data is being cached between runs so these numbers
aren't accurate. The throughput of this copy operation will be limited
by the speed of the single source disk, not the array. To make the
elapsed times of this copy test accurate you need to execute something
like these commands after each run:

# sync
# echo 3 > /proc/sys/vm/drop_caches

But this copy test will not inform you about the potential peak
performance of your array. That's why I suggested you test with FIO,
the flexible IO tester.

# aptitude install fio
# man fio

Sample job file suitable for your system:

[global]
directory=/your/XFS/test/directory
zero_buffers
numjobs=8
group_reporting
blocksize=512k
ioengine=libaio
iodepth=16
direct=1
size=10g

[read]
rw=read
stonewall

[write]
rw=write
stonewall

This should give you a relatively accurate picture of the actual
potential throughput of your array and filesystem.
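
If you save that job file as, say, test.fio (any filename will do), you
run it with just

$ fio test.fio

The two stonewall options make the read group finish before the write
group starts, so the two results don't interfere with each other.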
Post by Kevin Richter
Post by Dave Chinner
top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17
Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers
Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1
23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1
23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0
3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6
4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs
23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2
23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp
84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1
83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0
23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2
21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1
Hmm, what's kworker/3:1? That's not a crypto thread eating 93% of a
SandyBridge core at only ~180 MB/s throughput is it?
Post by Kevin Richter
While all of these tests there are no warnings/panics in the syslog.
With best regards,
Kevin
Even though XFS wasn't the cause of the problem I'm glad we were able to
help you fix it nonetheless. I'm really curious to see what kind of
throughput you can achieve with FIO, and whether crypto is a bottleneck
at the 250-350 MB/s your array should be capable of. It would be great
if you would play around a bit with FIO and post some numbers.

--
Stan
Post by Kevin Richter
Post by Dave Chinner
Post by Kevin Richter
Thanks for your mails!
Post by Stefan Ring
This is unusual. How long have you waited?
For the reboot? One night.
After the copy process hangs: several hours. But mostly it recovers
after several minutes.
Post by Stefan Ring
1. Switch to deadline. CFQ is not suitable for RAID storage, and not
suitable for XFS. This may not be a silver bullet but it will help.
Can I switch it while my copy process (from a separate hd to this raid)
is running... without data loss? Otherwise I would wait a bit, because
now it is actually running for 8 hours without kernel panics.
Yes, you can safely change the elevator on the fly. Repeat the
following once for each disk, replacing X appropriately.
$ echo deadline > /sys/block/sdX/queue/scheduler
This on the fly change is not permanent. Add "elevator=deadline" to
your kernel command line in your bootloader config file. Or you can add
the commands to a startup script. Or if you build your own kernels
simply make it the default, and/or remove all the other elevators from
your kernel config, which is what I do.
Post by Kevin Richter
Post by Stefan Ring
2. Post your chunk size and RAID6 stripe_cache_size value. They may be
sub optimal for your workload.
$ cat /sys/block/md2/md/stripe_cache_size
256
256 is the default and it is way too low. This is limiting your write
throughput. Increase this to a minimum of 1024, which will give you a
20MB stripe cache buffer. This should become active immediately. Add
it to a startup script to make it permanent.
$ echo 1024 > /sys/block/md2/md/stripe_cache_size
(4096 * stripe_cache_size) * num_drives = RAM consumed for stripe cache
At some future point when the system is idle, play with FIO write
testing and multiple jobs (threads) with stripe_cache_size values up to
2048 to see which value gives the best throughput. Test 1024, 2048,
4096. Going higher probably won't gain anything. And at some point as
you go higher your throughput will decrease. When you see that, select
the previous value and make it permanent.
Post by Kevin Richter
$ mdadm --detail /dev/md2 | grep Chunk
Chunk Size : 512K
Post by Stefan Ring
3. Post 'xfs_info /dev/mdX'
There is a LUKS volume around /dev/md2, named '6tb'.
Ok, this adds a huge fly to the ointment. LUKS is single threaded per
device. You've encrypted the md device instead of the individual disks.
This limits your encryption throughput to a single CPU core. You've
not provided any data on your LUKS thread. Is it pegging a core during
large file IO? I'd bet it is. That would explain all of this.
Post by Kevin Richter
Post by Stefan Ring
$ xfs_info /dev/md2
xfs_info: /dev/md2 is not a mounted XFS filesystem
$ xfs_info /dev/mapper/6tb
meta-data=/dev/mapper/6tb isize=256 agcount=32, agsize=45631360 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=1460203520, imaxpct=5
= sunit=128 swidth=384 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
md2 is 5 spindle RAID6, geometry 512KB * 3 = 1.5MB stripe width. The
XFS alignment matches the md device. This may be hurting you with LUKS
between XFS and md. I don't know how LUKS handles 1.5MB inbound writes
and if it passes an aligned write to md after encrypting. I also don't
know how it handles write barriers passed down the stack by XFS, or if
it does at all.
Post by Kevin Richter
Post by Stefan Ring
4. You're getting a lot of kswapd timeouts because you have swap and
the md/RAID6 array on the same disks. Relocate swap to disks that are
not part of this RAID6. Small SSDs are cheap and fast. Buy one and put
swap on it. Or install more RAM in the machine. Going the SSD route is
better as it gives flexibility. For instance, you can also relocate
your syslog files to it and anything else that does IO without eating
lots of space. This decreases the IOPS load on your rust.
No no, swap is not on any of the raid disks.
Post by Stefan Ring
# cat /proc/swaps
Filename Type Size Used Priority
/dev/sda3 partition 7812496 0 -1
sda is not in the raid. In the raid there are sd[cdefg].
I was wrong here. After reading a little about kswapd I now know it
doesn't simply move pages to/from disk. It frees memory in other ways
as well. In this case it's apparently waiting to free an inode being
reclaimed by xfs_reclaim_inode, which is stalled on lower level IO. So
kswapd times out as a result. A slow LUKS thread would explain this as
well.
Post by Kevin Richter
Post by Stefan Ring
5. Describe in some detail the workload(s) causing the heavy IO, and
thus these timeouts.
cd /olddharddisk
cp -av . /raid/
oldhardddisk is a mounted 1tb old harddisk, /raid is the 6tb raid from
above.
So you are simply doing a huge file copy from an apparently slower
single disk to an apparently faster striped RAID. But you get horrible
throughput and kernel timeout errors, and the RAID disks are apparently
fairly idle.
Model # of the CPUs so I can look up the specs?
Post by Kevin Richter
Post by Stefan Ring
top - 11:13:37 up 4 days, 21:32, 2 users, load average: 12.95, 11.33, 10.32
Tasks: 155 total, 2 running, 153 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 5.7%sy, 0.0%ni, 82.1%id, 11.8%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 32916276k total, 32750240k used, 166036k free, 10076760k buffers
Swap: 7812496k total, 0k used, 7812496k free, 21221136k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
699 root 20 0 0 0 0 S 11 0.0 248:17.59 md2_raid6
Don't know what consumes all of this 32GB RAM... 'top' sorted by memory
consumption does not tell me. All entries are only 0.0% and 0.1%
~30GB of your RAM is being eaten by filesystem metadata buffers (10GB)
and page cache (20GB). Linux uses nearly all free memory for cache and
buffers to prevent disk accesses, which speeds up access to frequently
used data considerably. This is normal. A huge copy like this is going
to cause a huge amount of caching.
I assume this 1TB file copy has finally completed. Start another large
test copy and paste the CPU burn for the LUKS thread so we can confirm
whether this is the root cause of your problem.
Dave Chinner
2013-12-21 05:30:32 UTC
Post by Stan Hoeppner
Post by Dave Chinner
top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17
Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers
Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1
23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1
23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0
3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6
4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs
23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2
23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp
84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1
83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0
23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2
21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1
Hmm, what's kworker/3:1? That's not a crypto thread eating 93% of a
SandyBridge core at only ~180 MB/s throughput is it?
Kworkers are anonymous kernel worker threads that do work that
has been pushed to a workqueue. kworker/3:1 is the 2nd worker thread
on CPU 3 (3:0 is the first). The kworker is a thread pool that grows
and shrinks according to demand.

As the naming suggests, kworker threads are per-CPU, and there can
be hundreds of them per CPU if there is enough workqueue work that blocks
during execution of the work (e.g. on locks, waiting for IO, etc).
If there is little blocking occurring, there might only be a couple
of kworker threads that do all the work, and hence you see them
consuming huge amounts of CPU on behalf of other systems...
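
If you want a quick look at what a particular kworker is doing at a given
moment (most useful when it is blocked), you can dump its kernel stack,
e.g. for the kworker/3:1 PID from the top output above (needs root and
CONFIG_STACKTRACE; just an illustration):

# cat /proc/19524/stack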

XFS uses workqueues for lots of things, so it's not unusual to see
an IO or metadata heavy workload end up with huge numbers of
kworker threads doing work:

<run parallel fsmark test>
....
$ ps -ef |grep kworker |wc -l
91
$

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Stan Hoeppner
2013-12-22 09:18:05 UTC
Post by Dave Chinner
Post by Stan Hoeppner
Post by Dave Chinner
top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17
Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers
Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1
23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1
23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0
3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6
4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs
23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2
23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp
84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1
83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0
23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2
21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1
Hmm, what's kworker/3:1? That's not a crypto thread eating 93% of a
SandyBridge core at only ~180 MB/s throughput is it?
Kworkers are anonymous kernel worker threads that do work that
has been pushed to a workqueue. kworker/3:1 is the 2nd worker thread
on CPU 3 (3:0 is the first). The kworker is a thread pool that grows
and shrinks according to demand.
As the naming suggests, kworker threads are per-CPU, and there can
be hundreds of them per CPU if there is enough workqueue work that blocks
during execution of the work (e.g. on locks, waiting for IO, etc).
If there is little blocking occurring, there might only be a couple
of kworker threads that do all the work, and hence you see them
consuming huge amounts of CPU on behalf of other systems...
XFS uses workqueues for lots of things, so it's not unusual to see
an IO or metadata heavy workload end up with huge numbers of
<run parallel fsmark test>
....
$ ps -ef |grep kworker |wc -l
91
For the record, again, I've never used dm-crypt. I'm just trying to
work through the layers to identify the source of Kevin's problems. So
please don't clobber me too hard for going OT, or trying to speak
(somewhat) intelligently about something I'm just learning about...

So, how does one identify who is doing work inside a worker thread? In
this case we're trying to figure out if dm-crypt might be eating a CPU
core thus limiting its throughput. However...

I just read a thread on dm-devel about dm-crypt performance dated late
March 2013. It wasn't clear to me if dm-crypt yet uses unbound
workqueues, if it dispatches work to multiple queues, or if it still
uses the hand built dispatcher, as of kernel 3.11, which Kevin just
updated to.

What was clear is that as of March dm-crypt was not preserving IO
ordering. If this is still the case, that would imply RMW with md
parity arrays, which would tend to explain Kevin's write throughput
being ~1/3rd of his read, and less than a single disk.

Mikulas Patocka posted 400-620 MB/s dm-crypt throughput with a RAMdisk.
It was not stated which CPU/clock he was using. Even if it were a 4GHz
core, Kevin's 2.4GHz SandyBridge should be capable of far more than ~80
MB/s of encryption.

So if dm-crypt is contributing to Kevin's low throughput I'd think it
would be due to the IO ordering causing RMW, not due to CPU starvation.
Is there an easy way for a mere mortal to peek under the hood and see
if dm-crypt is causing RMW?

Christoph you were involved in that thread WRT IO ordering. Can you
shed more light on this, and if it may be a factor here.
--
Stan
Dave Chinner
2013-12-22 20:14:51 UTC
Post by Stan Hoeppner
Post by Dave Chinner
Post by Stan Hoeppner
Post by Dave Chinner
top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17
Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers
Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1
23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1
23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0
3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6
4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs
23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2
23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp
84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1
83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0
23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2
21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1
Hmm, what's kworker/3:1? That's not a crypto thread eating 93% of a
SandyBridge core at only ~180 MB/s throughput is it?
Kworkers are anonymous kernel worker threads that do work that
has been pushed to a workqueue. kworker/3:1 is the 2nd worker thread
on CPU 3 (3:0 is the first). The kworker is a thread pool that grows
and shrinks according to demand.
As the naming suggests, kworker threads are per-CPU, and there can
be hundreds of them per CPU if there is enough workqueue work that blocks
during execution of the work (e.g. on locks, waiting for IO, etc).
If there is little blocking occurring, there might only be a couple
of kworker threads that do all the work, and hence you see them
consuming huge amounts of CPU on behalf of other systems...
XFS uses workqueues for lots of things, so it's not unusual to see
an IO or metadata heavy workload end up with huge numbers of
<run parallel fsmark test>
....
$ ps -ef |grep kworker |wc -l
91
For the record, again, I've never used dm-crypt. I'm just trying to
work through the layers to identify the source of Kevin's problems. So
please don't clobber me too hard for going OT, or trying to speak
(somewhat) intelligently about something I'm just learning about...
So, how does one identify who is doing work inside a worker thread? In
this case we're trying to figure out if dm-crypt might be eating a CPU
core thus limiting its throughput. However...
perf top is your friend for finding out where the kernel is spending
its CPU time. It will tell you functions (and call chains) that are
consuming CPU time rather than what individual threads are doing.

If you really want to know what worker threads are doing,
/proc/sysrq-trigger dumps the kworker threads with the name of the
workqueue they are doing work on behalf of as well as the stack
trace.
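
In practice that is something like this (run as root; the task dump goes
to the kernel log, so follow it with dmesg):

# perf top -g
# echo t > /proc/sysrq-trigger
# dmesg | tail -n 200

In the sysrq task dump the kworker entries carry a "Workqueue:" line
naming the workqueue they are servicing, which answers the "who is doing
the work" question directly.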
Post by Stan Hoeppner
I just read a thread on dm-devel about dm-crypt performance dated late
March 2013. It wasn't clear to me if dm-crypt yet uses unbound
workqueues, if it dispatches work to multiple queues, or if it still
uses the hand built dispatcher, as of kernel 3.11, which Kevin just
updated to.
dm-crypt uses workqueues with a concurrency depth of 1, and kcryptd
is defined as CPU intensive and so is scheduled like a long running
kernel thread rather than a workqueue. Decryption of reads is done on
IO completion, so if you've only got one CPU processing IO
completions, then read decryption will effectively be single
threaded due to the nature of per-cpu work queues and a concurrency
depth of 1.
Post by Stan Hoeppner
What was clear is that as of March dm-crypt was not preserving IO
ordering. If this is still the case, that would imply RMW with md
parity arrays, which would tend to explain Kevin's write throughput
being ~1/3rd of his read, and less than a single disk.
encryption adds IO latency, so unless you have lots of concurrency
to hide the latency, dm-crypt will almost always be slower than the
same IO to an equivalent unencrypted disk.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Michael L. Semon
2013-12-22 21:01:14 UTC
Post by Stan Hoeppner
For the record, again, I've never used dm-crypt. I'm just trying to
work through the layers to identify the source of Kevin's problems. So
please don't clobber me too hard for going OT, or trying to speak
(somewhat) intelligently about something I'm just learning about...
This stuff isn't easy. There are a lot of security-vs-speed tradeoffs,
and kernel support varies by processor and architecture. Take this
example from x86:

# info from fdisk:
/dev/sda5 8404992 10502143 1048576 83 Linux

# cryptsetup status /dev/mapper/wCrypt
/dev/mapper/wCrypt is active.
type: LUKS1
cipher: aes-xts-essiv:sha256
keysize: 512 bits
device: /dev/sda5
offset: 4096 sectors
size: 2093056 sectors
mode: read/write

What this gibberish means from a speed standpoint is this:

aes: had assembly speedups in the kernel; NIST approved;

xts: almost as fast as cbc while being more secure;

essiv: "plain" is faster, but essiv is more secure;

sha256: it seemed like an ok hash;

512 bits: much slower than 256 or 128 bits, but hopefully more secure, fine
for a small partition like this.

A similar "cryptsetup status <cryptname>" from Kevin might help you
benchmark things, and Kevin might explain his rationale without blowing his
cover ;-) A `zcat /proc/config.gz | grep CRYPTO` might help you see what
kernel support he has as well.
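
If his cryptsetup is new enough (1.6.0 or later, I believe), there is also
a built-in benchmark that shows the raw cipher throughput the CPU can
manage, independent of any disks:

# cryptsetup benchmark

You can also pass --cipher and --key-size to test a specific combination.
That would show fairly quickly whether the encryption itself or the I/O
path is the limit.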

Dave's recent post about I/O latency with dm-crypt reads/writes might be
the answer that you're seeking. I just wanted to put in my two cents.

Good luck!

Michael
Kevin Richter
2013-12-22 02:35:57 UTC
Okay, well, I've run your fio config... but with so many results and
abbreviations I currently feel a bit overwhelmed :)
So please don't take it badly if I just paste the whole results.
I have tried with a stripe cache size of 1024, 2048 and 4096.

Btw, I also have /dev/md1, which is a 15GB unencrypted array using the
same 5 disks as my LUKS md2 array, so nearly the same just without
the LUKS layer. If helpful, I can run some fio tests on this
filesystem, too.


So long :)
Kevin


$ echo 1024 > /sys/block/md2/md/stripe_cache_size
Jobs: 1 (f=1): [____________W___] [99.7% done] [0K/99.24M /s] [0 /193 iops] [eta 00m:06s]
read: (groupid=0, jobs=8): err= 0: pid=12987
read : io=81920MB, bw=189835KB/s, iops=370 , runt=441890msec
slat (usec): min=32 , max=4561 , avg=76.28, stdev=28.66
clat (msec): min=5 , max=1115 , avg=334.19, stdev=151.18
lat (msec): min=5 , max=1115 , avg=334.26, stdev=151.18
bw (KB/s) : min= 0, max=261120, per=12.79%, avg=24288.95, stdev=11586.29
cpu : usr=0.05%, sys=0.50%, ctx=157180, majf=0, minf=16982
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=163840/0/0, short=0/0/0
lat (msec): 10=0.52%, 20=0.71%, 50=3.01%, 100=6.17%, 250=14.59%
lat (msec): 500=68.81%, 750=4.87%, 1000=1.08%, 2000=0.25%
write: (groupid=1, jobs=8): err= 0: pid=13202
write: io=81920MB, bw=58504KB/s, iops=114 , runt=1433851msec
slat (usec): min=45 , max=1729 , avg=212.20, stdev=56.68
clat (msec): min=14 , max=11691 , avg=1101.17, stdev=1116.82
lat (msec): min=14 , max=11691 , avg=1101.39, stdev=1116.82
bw (KB/s) : min= 0, max=106666, per=14.35%, avg=8395.94, stdev=6752.35
cpu : usr=0.28%, sys=0.10%, ctx=117451, majf=0, minf=3410
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/163840/0, short=0/0/0
lat (msec): 20=0.01%, 50=0.17%, 100=2.03%, 250=17.02%, 500=22.20%
lat (msec): 750=12.22%, 1000=8.66%, 2000=20.04%, >=2000=17.64%
READ: io=81920MB, aggrb=189834KB/s, minb=194390KB/s, maxb=194390KB/s, mint=441890msec, maxt=441890msec
WRITE: io=81920MB, aggrb=58504KB/s, minb=59908KB/s, maxb=59908KB/s, mint=1433851msec, maxt=1433851msec
dm-0: ios=327681/327756, merge=0/0, ticks=78591352/353235376, in_queue=431834680, util=100.00%, aggrios=327681/327922, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md2: ios=327681/327922, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=171660/222845, aggrmerge=4561629/9494657, aggrticks=16341417/4008187, aggrin_queue=20351472, aggrutil=85.78%
sdc: ios=181512/222455, merge=4583055/9505634, ticks=15650600/5944080, in_queue=21596560, util=85.78%
sdd: ios=180545/224362, merge=4526197/9587956, ticks=14356708/5542120, in_queue=19900820, util=85.42%
sde: ios=179853/224317, merge=4519718/9540999, ticks=13375156/5676828, in_queue=19053876, util=83.90%
sdf: ios=157605/222569, merge=4551205/9459549, ticks=18828608/1234632, in_queue=20065204, util=75.36%
sdg: ios=158787/220525, merge=4627970/9379150, ticks=19496016/1643276, in_queue=21140904, util=77.26%
$ echo 2048 > /sys/block/md2/md/stripe_cache_size
Jobs: 1 (f=1): [_________W______] [99.6% done] [0K/92182K /s] [0 /175 iops] [eta 00m:06s]
read: (groupid=0, jobs=8): err= 0: pid=6392
read : io=81920MB, bw=185893KB/s, iops=363 , runt=451259msec
slat (usec): min=32 , max=524 , avg=75.08, stdev=26.00
clat (msec): min=8 , max=1849 , avg=335.65, stdev=149.76
lat (msec): min=8 , max=1849 , avg=335.72, stdev=149.76
bw (KB/s) : min= 0, max=105860, per=13.08%, avg=24308.99, stdev=7467.64
cpu : usr=0.05%, sys=0.49%, ctx=157968, majf=0, minf=17171
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=163840/0/0, short=0/0/0
lat (msec): 10=0.01%, 20=0.06%, 50=2.32%, 100=7.44%, 250=15.36%
lat (msec): 500=69.09%, 750=4.50%, 1000=0.87%, 2000=0.35%
write: (groupid=1, jobs=8): err= 0: pid=6663
write: io=81920MB, bw=75377KB/s, iops=147 , runt=1112887msec
slat (usec): min=46 , max=6453.8K, avg=988.13, stdev=46664.22
clat (msec): min=10 , max=8631 , avg=854.18, stdev=683.37
lat (msec): min=10 , max=8631 , avg=855.17, stdev=684.82
bw (KB/s) : min= 0, max=100352, per=14.04%, avg=10581.97, stdev=6871.80
cpu : usr=0.34%, sys=0.11%, ctx=92502, majf=0, minf=1531
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/163840/0, short=0/0/0
lat (msec): 20=0.01%, 50=0.09%, 100=1.14%, 250=11.19%, 500=24.64%
lat (msec): 750=18.78%, 1000=13.40%, 2000=24.53%, >=2000=6.23%
READ: io=81920MB, aggrb=185893KB/s, minb=190354KB/s, maxb=190354KB/s, mint=451259msec, maxt=451259msec
WRITE: io=81920MB, aggrb=75376KB/s, minb=77186KB/s, maxb=77186KB/s, mint=1112887msec, maxt=1112887msec
dm-0: ios=327701/328169, merge=0/0, ticks=79402348/261890568, in_queue=341301888, util=100.00%, aggrios=327701/328481, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md2: ios=327701/328481, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=127112/146994, aggrmerge=3364061/7997232, aggrticks=18091347/5817941, aggrin_queue=23914356, aggrutil=94.20%
sdc: ios=135729/149357, merge=3341205/8134447, ticks=18312932/8398716, in_queue=26721324, util=92.77%
sdd: ios=135561/151584, merge=3312121/8238249, ticks=16877204/8190968, in_queue=25077332, util=92.44%
sde: ios=135741/146023, merge=3345948/7981968, ticks=17659792/8659780, in_queue=26322124, util=94.20%
sdf: ios=114396/143768, merge=3413295/7801050, ticks=18551976/1652924, in_queue=20207384, util=72.89%
sdg: ios=114134/144241, merge=3407738/7830447, ticks=19054832/2187320, in_queue=21243620, util=74.96%
$ echo 4096 > /sys/block/md2/md/stripe_cache_size
Jobs: 1 (f=1): [________W_______] [100.0% done] [0K/95848K /s] [0 /182 iops] [eta 00m:00s]
read: (groupid=0, jobs=8): err= 0: pid=11787
read : io=81920MB, bw=189274KB/s, iops=369 , runt=443200msec
slat (usec): min=31 , max=4511 , avg=75.47, stdev=29.74
clat (msec): min=5 , max=1338 , avg=336.39, stdev=155.14
lat (msec): min=5 , max=1338 , avg=336.47, stdev=155.14
bw (KB/s) : min= 0, max=253455, per=12.77%, avg=24162.01, stdev=11368.71
cpu : usr=0.05%, sys=0.49%, ctx=157193, majf=0, minf=17313
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=163840/0/0, short=0/0/0
lat (msec): 10=0.47%, 20=0.66%, 50=2.95%, 100=6.33%, 250=14.56%
lat (msec): 500=68.12%, 750=5.42%, 1000=1.15%, 2000=0.33%
write: (groupid=1, jobs=8): err= 0: pid=12060
write: io=81920MB, bw=64993KB/s, iops=126 , runt=1290687msec
slat (usec): min=61 , max=16991 , avg=197.22, stdev=110.87
clat (msec): min=14 , max=2820 , avg=980.92, stdev=366.56
lat (msec): min=14 , max=2821 , avg=981.12, stdev=366.56
bw (KB/s) : min= 0, max=103770, per=13.11%, avg=8517.92, stdev=3794.28
cpu : usr=0.28%, sys=0.08%, ctx=84352, majf=0, minf=723
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/163840/0, short=0/0/0
lat (msec): 20=0.01%, 50=0.03%, 100=0.28%, 250=0.22%, 500=5.37%
lat (msec): 750=22.02%, 1000=31.66%, 2000=39.27%, >=2000=1.16%
READ: io=81920MB, aggrb=189273KB/s, minb=193816KB/s, maxb=193816KB/s, mint=443200msec, maxt=443200msec
WRITE: io=81920MB, aggrb=64993KB/s, minb=66553KB/s, maxb=66553KB/s, mint=1290687msec, maxt=1290687msec
dm-0: ios=327681/327629, merge=0/0, ticks=78990724/301988444, in_queue=380991692, util=100.00%, aggrios=327681/327709, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md2: ios=327681/327709, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=120158/119573, aggrmerge=2858405/7663126, aggrticks=17291831/9119008, aggrin_queue=26414023, aggrutil=99.60%
sdc: ios=135579/119976, merge=2813832/7324879, ticks=13974928/2192484, in_queue=16167996, util=66.57%
sdd: ios=136115/127048, merge=2826584/7736191, ticks=12932248/2477796, in_queue=15410924, util=68.08%
sde: ios=136007/130908, merge=2844473/7936354, ticks=12642232/3141268, in_queue=15784336, util=71.86%
sdf: ios=78473/94458, merge=2882361/7865984, ticks=29053772/37421808, in_queue=66488856, util=99.60%
sdg: ios=114620/125479, merge=2924777/7452224, ticks=17855976/361684, in_queue=18218004, util=54.84%
Post by Kevin Richter
Thanks a lot for your huge replies!
You're welcome.
Post by Kevin Richter
Post by Dave Chinner
Oh, that's quite old. I'd suggest upgrading to a much more recent
kernel as we've fixed lots of issues in this area since then.
$ uname -a
Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP
Thus, it seems that the default scheduler has been changed to deadline.
I did not change anything. After a reboot the schedulers of all disks
are now deadline.
Good move Ubuntu.
Post by Kevin Richter
Post by Dave Chinner
Model # of the CPUs so I can look up the specs?
Strong CPUs.
Post by Kevin Richter
I have prepared a folder with 60GB for the tests. This is nearly twice
of the available memory, so the process should be forced to actually
write the stuff to the disk - and not only hold in the memory.
Post by Dave Chinner
$ echo 256 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /6tb/foo1/
real 25m38.925s
user 0m0.595s
sys 1m23.182s
$ echo 1024 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo2/
real 7m32.824s
user 0m0.438s
sys 1m6.759s
$ echo 2048 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo3/
real 5m32.847s
user 0m0.418s
sys 1m5.671s
$ echo 4096 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo4/
real 5m54.554s
user 0m0.437s
sys 1m6.268s
The difference is really amazing! So 2048 seems to be the best choice.
60GB in 5.5 minutes is 180MB/sec. That sounds a bit high, doesn't it?
The RAID only consist of 5 SATA disks with 7200rpm.
A lot of the source data is being cached between runs so these numbers
aren't accurate. The throughput of this copy operation will be limited
by the speed of the single source disk, not the array. To make the
elapsed times of this copy test accurate you need to execute something
like this between runs:
# sync
# echo 3 > /proc/sys/vm/drop_caches
But this copy test will not inform you about the potential peak
performance of your array. That's why I suggested you test with FIO,
the flexible IO tester.
# aptitude install fio
# man fio
[global]
directory=/your/XFS/test/directory
zero_buffers
numjobs=8
group_reporting
blocksize=512k
ioengine=libaio
iodepth=16
direct=1
size=10g
[read]
rw=read
stonewall
[write]
rw=write
stonewall
This should give you a relatively accurate picture of the actual
potential throughput of your array and filesystem.
Post by Kevin Richter
Post by Dave Chinner
top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17
Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers
Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1
23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1
23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0
3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6
4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs
23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2
23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp
84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1
83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0
23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2
21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1
Hmm, what's kworker/3:1? That's not a crypto thread eating 93% of a
SandyBridge core at only ~180 MB/s throughput is it?
Post by Kevin Richter
While all of these tests there are no warnings/panics in the syslog.
With best regards,
Kevin
Even though XFS wasn't the cause of the problem I'm glad we were able to
help you fix it nonetheless. I'm really curious to see what kind of
throughput you can achieve with FIO, and whether crypto is a bottleneck
at the 250-350 MB/s your array should be capable of. It would be great
if you would play around a bit with FIO and post some numbers.
Stan Hoeppner
2013-12-22 08:12:37 UTC
Permalink
Post by Kevin Richter
Okay, well, I've run your fio config... but with so many results and
I'm glad you changed to size=1G instead of size=10G; otherwise the test
would have read/written 80GB instead of 8GB and taken forever. I often
forget the size parameter is per job (thread), not global.
Post by Kevin Richter
abbreviations I currently feel a bit overwhelmed :)
So please don't take it badly if I just paste the full results.
I have tried with a stripe cache size of 1024, 2048 and 4096.
The full output is good. Going from 1024 to 2048 gained you ~17MB/s
throughput with this job file. But you didn't run with the default 256
for comparison, which would be very useful to know. That's ok, as you can
do so with the next set of test runs. Given the huge latencies, I'm
thinking this config may be seeking the disks too much. Modify your fio
job file as follows:

numjobs=4
blocksize=1536k
iodepth=4
size=1g

as this may yield somewhat better results. ~190MB/s read and 75MB/s
write is very low for three effective 7.2K RPM SATA spindles, especially
the write throughput. Compare to a similar FIO job with 2 threads,
iodepth 16, directio, run against a single 500GB 7.2K WD SATA drive, NCQ
disabled, w/deadline elevator:

Run status group 0 (all jobs):
READ: io=2048MB, aggrb=88663KB/s, minb=90791KB/s, maxb=90791KB/s
Run status group 1 (all jobs):
WRITE: io=2048MB, aggrb=108089KB/s, minb=110683KB/s, maxb=110683KB/s

88MB/s read, 108MB/s write. Write is faster apparently due to the small
16MB drive write cache. With your eight 2.4GHz SandyBridge cores and
md-RAID6 on much newer larger 7.2K drives w/larger caches, you should
have significantly greater write speed than this old single SATA drive.
Currently you're showing 30MB/s slower write throughput, some 40%
slower than this single drive. With aligned writes, thus no RMW, you
should be getting at least 2 spindles worth of throughput, or ~200MB/s.

So, run FIO again with the mods above, once with stripe_cache_size 256
and once with 2048. Something seems to be seriously dragging down your
throughput. I'm sure you'd like to know what, and how to fix it.
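
For clarity, here is the complete modified job file in one piece. It is
just the job file I posted earlier with the four parameters above swapped
in (keep whatever test directory you used before):

[global]
directory=/your/XFS/test/directory
zero_buffers
numjobs=4
group_reporting
blocksize=1536k
ioengine=libaio
iodepth=4
direct=1
size=1g
[read]
rw=read
stonewall
[write]
rw=write
stonewall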
Post by Kevin Richter
Btw, I also have /dev/md1, a 15GB unencrypted array using the same
5 disks as my LUKS-md2 array, so it's nearly the same setup just without
the LUKS layer. If helpful, I can run some fio tests on that
filesystem, too.
This is also a RAID6 array, without dm-crypt? If so this will afford a
great comparison, and may be very informative. Run FIO against this
filesystem (I assume it's also XFS) as well with stripe_cache_size 512
and 2048. Four FIO tests total including the two on the big array.

It'll be interesting to see where this leads. At this point the
candidates I see are a single dm-crypt thread peaking a core, dm-crypt
taking aligned writes and misaligning them causing all writes to be
RMWs, or AF drives with misaligned partitions causing every write to be
an internal RMW. Or maybe something I've not considered.
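
If you want a quick check on the first candidate while you're at it:
assuming your cryptsetup is new enough to have it (1.6.0 or later),

# cryptsetup benchmark

prints the rough single-thread throughput of each cipher, so you can see
whether the aes-xts line is comfortably above the 250-350 MB/s the array
should be capable of. Watching top with the per-CPU view (press 1) during
a write run will likewise show whether a single kcryptd/kworker thread is
pinning one core.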
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs
Kevin Richter
2013-12-22 14:10:53 UTC
Permalink
Post by Stan Hoeppner
I'm glad you changed to size=1G instead of size=10G
No no, I didn't change that. 160GB has been written to these disks: 80
for the read part, 80 for the write part. Each test took around 30 minutes.
Post by Stan Hoeppner
This is also a RAID6 array, without dm-crypt?
Yeah, right. The same 5 disks in RAID6 mode, just a different partition
on those disks. And yes, of course, XFS.

So, with your new config I have made 7 fio runs:
1) 15GB RAID6 array, stripe 256
2) 15GB RAID6 array, stripe 1024
3) 15GB RAID6 array, stripe 2048
4) 15GB RAID6 array, stripe 4096
5) 3TB LUKS volume (no raid, just one single disk)
6) 6TB RAID6+LUKS array, stripe 256
7) 6TB RAID6+LUKS array, stripe 2048

On the 15GB array a stripe_cache_size of 4096 seems to get slightly more
throughput than 2048.

See below for the results :-)

Perhaps I should give you some more information about the server
hardware, which might (but hopefully does not) explain the low performance.
The server is this one:
http://www.supermicro.com/products/system/2u/6027/ssg-6027r-e1r12l.cfm
with a 12-port SAS expander backplane, this one:
http://www.thomas-krenn.com/de/wiki/BPN-SAS-826EL1_SAS_Expander_Backplane
But the backplane should have a throughput of 3Gb/s.

However... I am very curious about your comments :-)

Bye,
Kevin


#################################
1) 15GB RAID6 array, stripe 256
read: (groupid=0, jobs=4): err= 0: pid=3959
read : io=4092.0MB, bw=217900KB/s, iops=141 , runt= 19230msec
slat (usec): min=190 , max=1325 , avg=231.32, stdev=77.13
clat (msec): min=7 , max=326 , avg=111.45, stdev=67.48
lat (msec): min=7 , max=326 , avg=111.69, stdev=67.48
bw (KB/s) : min=29425, max=109280, per=24.98%, avg=54423.27, stdev=15887.11
cpu : usr=0.03%, sys=0.90%, ctx=2614, majf=0, minf=6233
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 10=0.66%, 20=4.84%, 50=16.35%, 100=25.70%, 250=47.91%
lat (msec): 500=4.55%
write: (groupid=1, jobs=4): err= 0: pid=3964
write: io=4092.0MB, bw=54044KB/s, iops=35 , runt= 77533msec
slat (msec): min=9 , max=339 , avg=113.48, stdev=36.86
clat (msec): min=2 , max=744 , avg=339.51, stdev=70.02
lat (msec): min=23 , max=927 , avg=453.00, stdev=82.81
bw (KB/s) : min= 5475, max=20756, per=24.92%, avg=13467.27, stdev=2107.66
cpu : usr=0.58%, sys=6.26%, ctx=343060, majf=0, minf=550
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 4=0.04%, 10=0.04%, 50=0.11%, 100=0.26%, 250=5.94%
lat (msec): 500=91.50%, 750=2.13%
READ: io=4092.0MB, aggrb=217899KB/s, minb=223129KB/s, maxb=223129KB/s, mint=19230msec, maxt=19230msec
WRITE: io=4092.0MB, aggrb=54044KB/s, minb=55341KB/s, maxb=55341KB/s, mint=77533msec, maxt=77533msec
md1: ios=10912/10904, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=22836/29872, aggrmerge=364645/574313, aggrticks=216408/101340, aggrin_queue=317789, aggrutil=69.36%
sdc: ios=23542/29233, merge=377270/563331, ticks=217520/152300, in_queue=369860, util=67.99%
sdd: ios=23462/28930, merge=378915/557686, ticks=214104/142224, in_queue=356428, util=67.81%
sde: ios=22264/29920, merge=355831/576114, ticks=149184/99724, in_queue=248924, util=58.22%
sdf: ios=22240/30724, merge=352471/588412, ticks=126740/52916, in_queue=179696, util=55.44%
sdg: ios=22676/30557, merge=358741/586025, ticks=374496/59540, in_queue=434040, util=69.36%
#################################
2) 15GB RAID6 array, stripe 1024
read: (groupid=0, jobs=4): err= 0: pid=4334
read : io=4092.0MB, bw=237702KB/s, iops=154 , runt= 17628msec
slat (usec): min=196 , max=1469 , avg=232.84, stdev=88.56
clat (msec): min=6 , max=391 , avg=101.62, stdev=56.61
lat (msec): min=7 , max=391 , avg=101.85, stdev=56.61
bw (KB/s) : min=26482, max=107305, per=24.94%, avg=59289.94, stdev=14153.87
cpu : usr=0.04%, sys=0.98%, ctx=2581, majf=0, minf=6241
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 10=0.22%, 20=3.08%, 50=16.83%, 100=31.01%, 250=47.10%
lat (msec): 500=1.76%
write: (groupid=1, jobs=4): err= 0: pid=4340
write: io=4092.0MB, bw=120759KB/s, iops=78 , runt= 34699msec
slat (msec): min=1 , max=156 , avg=47.70, stdev=27.94
clat (msec): min=15 , max=520 , avg=152.37, stdev=53.34
lat (msec): min=28 , max=564 , avg=200.07, stdev=57.63
bw (KB/s) : min= 0, max=62270, per=25.04%, avg=30239.54, stdev=5435.51
cpu : usr=1.34%, sys=5.68%, ctx=134184, majf=0, minf=94
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 20=0.04%, 50=1.10%, 100=12.02%, 250=82.18%, 500=4.62%
lat (msec): 750=0.04%
READ: io=4092.0MB, aggrb=237701KB/s, minb=243406KB/s, maxb=243406KB/s, mint=17628msec, maxt=17628msec
WRITE: io=4092.0MB, aggrb=120758KB/s, minb=123656KB/s, maxb=123656KB/s, mint=34699msec, maxt=34699msec
md1: ios=10912/10901, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4062/6768, aggrmerge=109645/369712, aggrticks=157952/111924, aggrin_queue=270487, aggrutil=89.91%
sdc: ios=4070/6893, merge=109432/378682, ticks=173784/179392, in_queue=353780, util=89.88%
sdd: ios=4041/6961, merge=110017/380798, ticks=164960/160236, in_queue=325820, util=89.91%
sde: ios=4039/6727, merge=110658/366163, ticks=169804/166052, in_queue=336484, util=88.98%
sdf: ios=4080/6591, merge=108922/359343, ticks=39460/24636, in_queue=64820, util=52.33%
sdg: ios=4081/6668, merge=109197/363574, ticks=241752/29308, in_queue=271532, util=73.63%
#################################
3) 15GB RAID6 array, stripe 2048
read: (groupid=0, jobs=4): err= 0: pid=4102
read : io=4092.0MB, bw=207827KB/s, iops=135 , runt= 20162msec
slat (usec): min=122 , max=3904 , avg=230.75, stdev=99.04
clat (msec): min=5 , max=363 , avg=117.03, stdev=70.26
lat (msec): min=5 , max=363 , avg=117.27, stdev=70.25
bw (KB/s) : min=26533, max=104239, per=25.00%, avg=51947.24, stdev=13121.27
cpu : usr=0.05%, sys=0.83%, ctx=2640, majf=0, minf=6241
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 10=0.55%, 20=4.99%, 50=14.96%, 100=21.33%, 250=53.01%
lat (msec): 500=5.17%
write: (groupid=1, jobs=4): err= 0: pid=4108
write: io=4092.0MB, bw=145833KB/s, iops=94 , runt= 28733msec
slat (usec): min=768 , max=94184 , avg=1838.20, stdev=4140.24
clat (msec): min=15 , max=1155 , avg=145.02, stdev=128.70
lat (msec): min=16 , max=1156 , avg=146.86, stdev=128.77
bw (KB/s) : min= 6144, max=104448, per=29.22%, avg=42617.70, stdev=17118.84
cpu : usr=1.63%, sys=2.52%, ctx=4086, majf=0, minf=94
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 20=0.37%, 50=9.02%, 100=38.42%, 250=39.22%, 500=10.12%
lat (msec): 750=2.09%, 1000=0.59%, 2000=0.18%
READ: io=4092.0MB, aggrb=207827KB/s, minb=212814KB/s, maxb=212814KB/s, mint=20162msec, maxt=20162msec
WRITE: io=4092.0MB, aggrb=145832KB/s, minb=149332KB/s, maxb=149332KB/s, mint=28733msec, maxt=28733msec
md1: ios=10912/10873, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3018/4990, aggrmerge=79606/371231, aggrticks=152741/83319, aggrin_queue=236284, aggrutil=92.45%
sdc: ios=2988/5046, merge=79033/376737, ticks=129600/103552, in_queue=233412, util=84.14%
sdd: ios=3010/5140, merge=77101/390954, ticks=119032/103236, in_queue=222536, util=85.87%
sde: ios=3039/5009, merge=79935/374563, ticks=196200/159792, in_queue=356236, util=92.45%
sdf: ios=3021/4883, merge=80483/356643, ticks=29976/24956, in_queue=55100, util=49.53%
sdg: ios=3033/4872, merge=81481/357261, ticks=288900/25060, in_queue=314136, util=76.17%
#################################
4) 15GB RAID6 array, stripe 4096
read: (groupid=0, jobs=4): err= 0: pid=4411
read : io=4092.0MB, bw=214938KB/s, iops=139 , runt= 19495msec
slat (usec): min=192 , max=1468 , avg=226.97, stdev=87.18
clat (msec): min=4 , max=339 , avg=112.51, stdev=63.05
lat (msec): min=4 , max=339 , avg=112.73, stdev=63.05
bw (KB/s) : min= 0, max=96000, per=24.87%, avg=53457.49, stdev=13728.88
cpu : usr=0.02%, sys=0.89%, ctx=2591, majf=0, minf=6241
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 10=0.15%, 20=4.62%, 50=13.56%, 100=26.43%, 250=52.20%
lat (msec): 500=3.04%
write: (groupid=1, jobs=4): err= 0: pid=4419
write: io=4092.0MB, bw=146526KB/s, iops=95 , runt= 28597msec
slat (usec): min=822 , max=36057 , avg=1585.60, stdev=1174.08
clat (msec): min=14 , max=1132 , avg=142.49, stdev=126.05
lat (msec): min=15 , max=1134 , avg=144.08, stdev=126.08
bw (KB/s) : min= 0, max=83836, per=28.88%, avg=42315.69, stdev=18036.84
cpu : usr=1.42%, sys=2.85%, ctx=2937, majf=0, minf=94
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 20=0.18%, 50=10.08%, 100=37.61%, 250=40.62%, 500=8.50%
lat (msec): 750=2.46%, 1000=0.44%, 2000=0.11%
READ: io=4092.0MB, aggrb=214937KB/s, minb=220096KB/s, maxb=220096KB/s, mint=19495msec, maxt=19495msec
WRITE: io=4092.0MB, aggrb=146526KB/s, minb=150042KB/s, maxb=150042KB/s, mint=28597msec, maxt=28597msec
md1: ios=10912/10895, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3013/5053, aggrmerge=78083/371841, aggrticks=151707/88554, aggrin_queue=240554, aggrutil=92.72%
sdc: ios=2981/5124, merge=77153/379556, ticks=142800/116248, in_queue=259356, util=87.13%
sdd: ios=2990/5182, merge=75443/390196, ticks=128184/107752, in_queue=236264, util=86.87%
sde: ios=3034/5057, merge=77882/372755, ticks=185048/159352, in_queue=344636, util=92.72%
sdf: ios=3035/4941, merge=79423/356692, ticks=29196/28100, in_queue=57520, util=50.56%
sdg: ios=3026/4963, merge=80518/360009, ticks=273308/31320, in_queue=304996, util=76.24%
###################################################
5) 3TB LUKS volume (no raid, just one single disk)
read: (groupid=0, jobs=4): err= 0: pid=31359
read : io=4092.0MB, bw=125032KB/s, iops=81 , runt= 33513msec
slat (usec): min=86 , max=1404 , avg=125.10, stdev=97.21
clat (msec): min=26 , max=323 , avg=195.52, stdev=39.04
lat (msec): min=26 , max=323 , avg=195.65, stdev=39.03
bw (KB/s) : min=21982, max=41890, per=25.12%, avg=31405.62, stdev=3662.77
cpu : usr=0.02%, sys=0.31%, ctx=2738, majf=0, minf=6233
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 50=0.51%, 100=2.38%, 250=93.07%, 500=4.03%
write: (groupid=1, jobs=4): err= 0: pid=31364
write: io=4092.0MB, bw=158193KB/s, iops=102 , runt= 26488msec
slat (usec): min=181 , max=673 , avg=442.78, stdev=67.26
clat (msec): min=21 , max=555 , avg=154.33, stdev=45.35
lat (msec): min=21 , max=555 , avg=154.77, stdev=45.35
bw (KB/s) : min=19097, max=45988, per=25.04%, avg=39617.62, stdev=5714.79
cpu : usr=1.00%, sys=0.25%, ctx=2737, majf=0, minf=86
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 50=0.44%, 100=1.54%, 250=94.61%, 500=3.15%, 750=0.26%
READ: io=4092.0MB, aggrb=125032KB/s, minb=128033KB/s, maxb=128033KB/s, mint=33513msec, maxt=33513msec
WRITE: io=4092.0MB, aggrb=158192KB/s, minb=161989KB/s, maxb=161989KB/s, mint=26488msec, maxt=26488msec
dm-1: ios=8184/8196, merge=0/0, ticks=1521200/1235264, in_queue=2759104, util=99.58%, aggrios=8184/8225, aggrmerge=0/31, aggrticks=1421660/1137984, aggrin_queue=2559612, aggrutil=99.49%
sda: ios=8184/8225, merge=0/31, ticks=1421660/1137984, in_queue=2559612, util=99.49%
##############################
6) 6TB RAID6+LUKS array, stripe 256
read: (groupid=0, jobs=4): err= 0: pid=31121
read : io=4092.0MB, bw=137700KB/s, iops=89 , runt= 30430msec
slat (usec): min=123 , max=1353 , avg=157.35, stdev=91.53
clat (msec): min=15 , max=545 , avg=148.80, stdev=61.30
lat (msec): min=15 , max=545 , avg=148.96, stdev=61.29
bw (KB/s) : min= 0, max=129024, per=29.75%, avg=40959.72, stdev=21470.47
cpu : usr=0.01%, sys=0.51%, ctx=2797, majf=0, minf=6241
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 20=0.29%, 50=9.90%, 100=12.76%, 250=75.07%, 500=1.58%
lat (msec): 750=0.40%
write: (groupid=1, jobs=4): err= 0: pid=31131
write: io=4092.0MB, bw=41225KB/s, iops=26 , runt=101643msec
slat (usec): min=197 , max=1309 , avg=557.69, stdev=124.14
clat (msec): min=42 , max=2648 , avg=586.07, stdev=393.18
lat (msec): min=43 , max=2649 , avg=586.63, stdev=393.19
bw (KB/s) : min= 0, max=35583, per=26.36%, avg=10867.00, stdev=5232.75
cpu : usr=0.33%, sys=0.09%, ctx=2699, majf=0, minf=94
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 50=0.07%, 100=1.39%, 250=18.55%, 500=31.82%, 750=20.49%
lat (msec): 1000=12.57%, 2000=14.66%, >=2000=0.44%
READ: io=4092.0MB, aggrb=137699KB/s, minb=141004KB/s, maxb=141004KB/s, mint=30430msec, maxt=30430msec
WRITE: io=4092.0MB, aggrb=41224KB/s, minb=42214KB/s, maxb=42214KB/s, mint=101643msec, maxt=101643msec
dm-0: ios=10912/10906, merge=0/0, ticks=1385524/5596828, in_queue=6983364, util=99.89%, aggrios=10912/10940, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md2: ios=10912/10940, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=20710/27494, aggrmerge=358521/573067, aggrticks=208334/119186, aggrin_queue=327551, aggrutil=61.95%
sdc: ios=21005/27276, merge=366645/567533, ticks=224772/179916, in_queue=404728, util=61.95%
sdd: ios=21154/26926, merge=369215/564213, ticks=212876/160624, in_queue=373552, util=60.42%
sde: ios=20460/27449, merge=354440/571659, ticks=127008/102188, in_queue=229220, util=48.28%
sdf: ios=20464/27908, merge=350591/581422, ticks=241608/78180, in_queue=319784, util=55.24%
sdg: ios=20470/27915, merge=351718/580509, ticks=235408/75024, in_queue=310472, util=55.91%
###############################
7) 6TB RAID6+LUKS array, stripe 2048
read: (groupid=0, jobs=4): err= 0: pid=30910
read : io=4092.0MB, bw=237500KB/s, iops=154 , runt= 17643msec
slat (usec): min=134 , max=1511 , avg=165.25, stdev=96.45
clat (msec): min=16 , max=859 , avg=101.07, stdev=61.40
lat (msec): min=17 , max=859 , avg=101.23, stdev=61.40
bw (KB/s) : min=32807, max=97523, per=25.40%, avg=60325.02, stdev=15396.63
cpu : usr=0.02%, sys=0.76%, ctx=2709, majf=0, minf=6233
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 20=0.07%, 50=13.38%, 100=48.90%, 250=34.38%, 500=3.08%
lat (msec): 750=0.15%, 1000=0.04%
write: (groupid=1, jobs=4): err= 0: pid=30915
write: io=4092.0MB, bw=90652KB/s, iops=59 , runt= 46223msec
slat (usec): min=173 , max=983 , avg=599.60, stdev=141.06
clat (msec): min=34 , max=1185 , avg=258.48, stdev=135.97
lat (msec): min=34 , max=1185 , avg=259.08, stdev=135.98
bw (KB/s) : min= 0, max=88722, per=25.95%, avg=23524.06, stdev=9036.31
cpu : usr=0.77%, sys=0.22%, ctx=2576, majf=0, minf=86
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 50=0.48%, 100=7.00%, 250=49.01%, 500=37.72%, 750=5.10%
lat (msec): 1000=0.59%, 2000=0.11%
READ: io=4092.0MB, aggrb=237499KB/s, minb=243199KB/s, maxb=243199KB/s, mint=17643msec, maxt=17643msec
WRITE: io=4092.0MB, aggrb=90652KB/s, minb=92827KB/s, maxb=92827KB/s, mint=46223msec, maxt=46223msec
dm-0: ios=10912/10875, merge=0/0, ticks=544880/2481824, in_queue=3026996, util=99.57%, aggrios=10912/10930, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md2: ios=10912/10930, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=5058/6798, aggrmerge=153871/404343, aggrticks=195870/113487, aggrin_queue=309447, aggrutil=92.15%
sdc: ios=5024/7100, merge=152571/421782, ticks=332612/204936, in_queue=537660, util=92.15%
sdd: ios=4961/7109, merge=150591/422166, ticks=257308/161200, in_queue=418608, util=88.11%
sde: ios=5019/6697, merge=154781/395583, ticks=226120/152068, in_queue=378296, util=85.49%
sdf: ios=5118/6463, merge=156068/385905, ticks=65716/23932, in_queue=89724, util=55.85%
sdg: ios=5172/6624, merge=155347/396280, ticks=97596/25300, in_queue=122948, util=60.00%
Stan Hoeppner
2013-12-22 17:29:32 UTC
Permalink
Post by Kevin Richter
Post by Stan Hoeppner
I'm glad you changed to size=1G instead of size=10G
No no, I didn't change that. 160GB has been written to these disks: 80
for the read part, 80 for the write part. Each test took around 30 minutes.
I read this "read : io=81920MB" without the zero hence my incorrect
assumption. Again, my apologies for the length test runs. Use "1g"
from now on as that should be plenty to get good data, and run quickly.
Post by Kevin Richter
Post by Stan Hoeppner
This is also a RAID6 array, without dm-crypt?
Yeah, right. The same 5 disks in RAID6 mode, just a different partition
on those disks. And yes, of course, XFS.
1) 15GB RAID6 array, stripe 256
2) 15GB RAID6 array, stripe 1024
3) 15GB RAID6 array, stripe 2048
4) 15GB RAID6 array, stripe 4096
5) 3TB LUKS volume (no raid, just one single disk)
6) 6TB RAID6+LUKS array, stripe 256
7) 6TB RAID6+LUKS array, stripe 2048
On the 15GB array a stripe_cache_size of 4096 seems to get slightly more
throughput than 2048.
See below for the results :-)
Perhaps I should give you some more information about the server
http://www.supermicro.com/products/system/2u/6027/ssg-6027r-e1r12l.cfm
http://www.thomas-krenn.com/de/wiki/BPN-SAS-826EL1_SAS_Expander_Backplane
But the backplane should have a throughput of 3Gb/s.
The hardware is fine. You have 1.2GB/s <-> the backplane.
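(That figure is just the uplink arithmetic: assuming the usual single
4-lane SFF-8087 connection to that expander, each 3Gb/s lane carries
roughly 300MB/s after 8b/10b encoding, so 4 x 300MB/s = ~1.2GB/s, far
more than five 7.2K drives can push.)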
Post by Kevin Richter
However... I am very curious about your comments :-)
3) 15GB RAID6 array, stripe 2048:
READ: io=4092.0MB, aggrb=207827KB/s, minb=212814KB/s, maxb=212814KB/s
WRITE: io=4092.0MB, aggrb=145832KB/s, minb=149332KB/s, maxb=149332KB/s

~50MB/s per drive * 5 drives = ~250 MB/s
This may simply be the md-RAID6 max write you'll get from these drives.
But 145MB/s aggregate write throughput is still decidedly better than
90MB/s, some 62% faster than the encrypted array.


5) 3TB LUKS volume (no raid, just one single disk):
READ: io=4092.0MB, aggrb=125032KB/s, minb=128033KB/s, maxb=128033KB/s
WRITE: io=4092.0MB, aggrb=158192KB/s, minb=161989KB/s, maxb=161989KB/s

~160MB/s per drive, ~160MB/s dm-crypt throughput


7) 6TB LUKS array, stripe 2048:
READ: io=4092.0MB, aggrb=237499KB/s, minb=243199KB/s, maxb=243199KB/s
WRITE: io=4092.0MB, aggrb=90652KB/s, minb=92827KB/s, maxb=92827KB/s

~30MB/s per drive * 5 drives = ~150MB/s dm-crypt throughput.
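
(To spell out the arithmetic behind the per-drive figures: on a 5-drive
RAID6 every full stripe is 3 data + 2 parity chunks, so 145MB/s of data
is about 145 x 5/3 = ~250MB/s of physical writes, ~50MB/s per drive,
while 90MB/s of data is 90 x 5/3 = ~150MB/s, ~30MB/s per drive.)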

Write throughput is substantially lower on your RAID6 w/dm-crypt than
without (~91MB/s vs ~146MB/s). Read throughput however is ~30MB/s higher,
though these differences may not be as great with multiple runs and
normalized data. Even so, since the read throughput is roughly the same at
71-81MB/s per drive, both with and without encryption, I don't believe the
difference in write speed is due to platter placement of the filesystems
and test files.

So at this point I'd guess we need to look at dm-crypt, which may be
pegging a core or misaligning your IOs and causing RMW cycles. And look at
your partitions under the encrypted RAID6 to see if they're aligned.

We can identify the latter easily if you post 'fdisk -l' output for each
of the 5 drives. Whether partitions are misaligned or dm-crypt is the
problem, the fix in either case is very destructive and time consuming,
requiring a full backup/restore after redoing the stack.
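For reference, a quick way to check this (a sketch; device names taken from
the iostat output above) is to print the partition tables in sectors and
verify that every partition's start sector is divisible by 8, i.e.
4KiB-aligned on AF drives:

$ fdisk -lu /dev/sdc          # repeat for sdd..sdg; -u reports start/end in 512-byte sectors
$ parted /dev/sdc unit s print

A start sector of 2048 is fine; the old DOS-style start of 63 is not.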

--
Stan
Post by Kevin Richter
Bye,
Kevin
#################################
Post by Stan Hoeppner
read: (groupid=0, jobs=4): err= 0: pid=3959
read : io=4092.0MB, bw=217900KB/s, iops=141 , runt= 19230msec
slat (usec): min=190 , max=1325 , avg=231.32, stdev=77.13
clat (msec): min=7 , max=326 , avg=111.45, stdev=67.48
lat (msec): min=7 , max=326 , avg=111.69, stdev=67.48
bw (KB/s) : min=29425, max=109280, per=24.98%, avg=54423.27, stdev=15887.11
cpu : usr=0.03%, sys=0.90%, ctx=2614, majf=0, minf=6233
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 10=0.66%, 20=4.84%, 50=16.35%, 100=25.70%, 250=47.91%
lat (msec): 500=4.55%
write: (groupid=1, jobs=4): err= 0: pid=3964
write: io=4092.0MB, bw=54044KB/s, iops=35 , runt= 77533msec
slat (msec): min=9 , max=339 , avg=113.48, stdev=36.86
clat (msec): min=2 , max=744 , avg=339.51, stdev=70.02
lat (msec): min=23 , max=927 , avg=453.00, stdev=82.81
bw (KB/s) : min= 5475, max=20756, per=24.92%, avg=13467.27, stdev=2107.66
cpu : usr=0.58%, sys=6.26%, ctx=343060, majf=0, minf=550
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 4=0.04%, 10=0.04%, 50=0.11%, 100=0.26%, 250=5.94%
lat (msec): 500=91.50%, 750=2.13%
READ: io=4092.0MB, aggrb=217899KB/s, minb=223129KB/s, maxb=223129KB/s, mint=19230msec, maxt=19230msec
WRITE: io=4092.0MB, aggrb=54044KB/s, minb=55341KB/s, maxb=55341KB/s, mint=77533msec, maxt=77533msec
md1: ios=10912/10904, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=22836/29872, aggrmerge=364645/574313, aggrticks=216408/101340, aggrin_queue=317789, aggrutil=69.36%
sdc: ios=23542/29233, merge=377270/563331, ticks=217520/152300, in_queue=369860, util=67.99%
sdd: ios=23462/28930, merge=378915/557686, ticks=214104/142224, in_queue=356428, util=67.81%
sde: ios=22264/29920, merge=355831/576114, ticks=149184/99724, in_queue=248924, util=58.22%
sdf: ios=22240/30724, merge=352471/588412, ticks=126740/52916, in_queue=179696, util=55.44%
sdg: ios=22676/30557, merge=358741/586025, ticks=374496/59540, in_queue=434040, util=69.36%
#################################
Post by Stan Hoeppner
read: (groupid=0, jobs=4): err= 0: pid=4334
read : io=4092.0MB, bw=237702KB/s, iops=154 , runt= 17628msec
slat (usec): min=196 , max=1469 , avg=232.84, stdev=88.56
clat (msec): min=6 , max=391 , avg=101.62, stdev=56.61
lat (msec): min=7 , max=391 , avg=101.85, stdev=56.61
bw (KB/s) : min=26482, max=107305, per=24.94%, avg=59289.94, stdev=14153.87
cpu : usr=0.04%, sys=0.98%, ctx=2581, majf=0, minf=6241
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 10=0.22%, 20=3.08%, 50=16.83%, 100=31.01%, 250=47.10%
lat (msec): 500=1.76%
write: (groupid=1, jobs=4): err= 0: pid=4340
write: io=4092.0MB, bw=120759KB/s, iops=78 , runt= 34699msec
slat (msec): min=1 , max=156 , avg=47.70, stdev=27.94
clat (msec): min=15 , max=520 , avg=152.37, stdev=53.34
lat (msec): min=28 , max=564 , avg=200.07, stdev=57.63
bw (KB/s) : min= 0, max=62270, per=25.04%, avg=30239.54, stdev=5435.51
cpu : usr=1.34%, sys=5.68%, ctx=134184, majf=0, minf=94
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 20=0.04%, 50=1.10%, 100=12.02%, 250=82.18%, 500=4.62%
lat (msec): 750=0.04%
READ: io=4092.0MB, aggrb=237701KB/s, minb=243406KB/s, maxb=243406KB/s, mint=17628msec, maxt=17628msec
WRITE: io=4092.0MB, aggrb=120758KB/s, minb=123656KB/s, maxb=123656KB/s, mint=34699msec, maxt=34699msec
md1: ios=10912/10901, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4062/6768, aggrmerge=109645/369712, aggrticks=157952/111924, aggrin_queue=270487, aggrutil=89.91%
sdc: ios=4070/6893, merge=109432/378682, ticks=173784/179392, in_queue=353780, util=89.88%
sdd: ios=4041/6961, merge=110017/380798, ticks=164960/160236, in_queue=325820, util=89.91%
sde: ios=4039/6727, merge=110658/366163, ticks=169804/166052, in_queue=336484, util=88.98%
sdf: ios=4080/6591, merge=108922/359343, ticks=39460/24636, in_queue=64820, util=52.33%
sdg: ios=4081/6668, merge=109197/363574, ticks=241752/29308, in_queue=271532, util=73.63%
#################################
Post by Stan Hoeppner
read: (groupid=0, jobs=4): err= 0: pid=4102
read : io=4092.0MB, bw=207827KB/s, iops=135 , runt= 20162msec
slat (usec): min=122 , max=3904 , avg=230.75, stdev=99.04
clat (msec): min=5 , max=363 , avg=117.03, stdev=70.26
lat (msec): min=5 , max=363 , avg=117.27, stdev=70.25
bw (KB/s) : min=26533, max=104239, per=25.00%, avg=51947.24, stdev=13121.27
cpu : usr=0.05%, sys=0.83%, ctx=2640, majf=0, minf=6241
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 10=0.55%, 20=4.99%, 50=14.96%, 100=21.33%, 250=53.01%
lat (msec): 500=5.17%
write: (groupid=1, jobs=4): err= 0: pid=4108
write: io=4092.0MB, bw=145833KB/s, iops=94 , runt= 28733msec
slat (usec): min=768 , max=94184 , avg=1838.20, stdev=4140.24
clat (msec): min=15 , max=1155 , avg=145.02, stdev=128.70
lat (msec): min=16 , max=1156 , avg=146.86, stdev=128.77
bw (KB/s) : min= 6144, max=104448, per=29.22%, avg=42617.70, stdev=17118.84
cpu : usr=1.63%, sys=2.52%, ctx=4086, majf=0, minf=94
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 20=0.37%, 50=9.02%, 100=38.42%, 250=39.22%, 500=10.12%
lat (msec): 750=2.09%, 1000=0.59%, 2000=0.18%
READ: io=4092.0MB, aggrb=207827KB/s, minb=212814KB/s, maxb=212814KB/s, mint=20162msec, maxt=20162msec
WRITE: io=4092.0MB, aggrb=145832KB/s, minb=149332KB/s, maxb=149332KB/s, mint=28733msec, maxt=28733msec
md1: ios=10912/10873, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3018/4990, aggrmerge=79606/371231, aggrticks=152741/83319, aggrin_queue=236284, aggrutil=92.45%
sdc: ios=2988/5046, merge=79033/376737, ticks=129600/103552, in_queue=233412, util=84.14%
sdd: ios=3010/5140, merge=77101/390954, ticks=119032/103236, in_queue=222536, util=85.87%
sde: ios=3039/5009, merge=79935/374563, ticks=196200/159792, in_queue=356236, util=92.45%
sdf: ios=3021/4883, merge=80483/356643, ticks=29976/24956, in_queue=55100, util=49.53%
sdg: ios=3033/4872, merge=81481/357261, ticks=288900/25060, in_queue=314136, util=76.17%
#################################
Post by Stan Hoeppner
read: (groupid=0, jobs=4): err= 0: pid=4411
read : io=4092.0MB, bw=214938KB/s, iops=139 , runt= 19495msec
slat (usec): min=192 , max=1468 , avg=226.97, stdev=87.18
clat (msec): min=4 , max=339 , avg=112.51, stdev=63.05
lat (msec): min=4 , max=339 , avg=112.73, stdev=63.05
bw (KB/s) : min= 0, max=96000, per=24.87%, avg=53457.49, stdev=13728.88
cpu : usr=0.02%, sys=0.89%, ctx=2591, majf=0, minf=6241
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 10=0.15%, 20=4.62%, 50=13.56%, 100=26.43%, 250=52.20%
lat (msec): 500=3.04%
write: (groupid=1, jobs=4): err= 0: pid=4419
write: io=4092.0MB, bw=146526KB/s, iops=95 , runt= 28597msec
slat (usec): min=822 , max=36057 , avg=1585.60, stdev=1174.08
clat (msec): min=14 , max=1132 , avg=142.49, stdev=126.05
lat (msec): min=15 , max=1134 , avg=144.08, stdev=126.08
bw (KB/s) : min= 0, max=83836, per=28.88%, avg=42315.69, stdev=18036.84
cpu : usr=1.42%, sys=2.85%, ctx=2937, majf=0, minf=94
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 20=0.18%, 50=10.08%, 100=37.61%, 250=40.62%, 500=8.50%
lat (msec): 750=2.46%, 1000=0.44%, 2000=0.11%
READ: io=4092.0MB, aggrb=214937KB/s, minb=220096KB/s, maxb=220096KB/s, mint=19495msec, maxt=19495msec
WRITE: io=4092.0MB, aggrb=146526KB/s, minb=150042KB/s, maxb=150042KB/s, mint=28597msec, maxt=28597msec
md1: ios=10912/10895, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=3013/5053, aggrmerge=78083/371841, aggrticks=151707/88554, aggrin_queue=240554, aggrutil=92.72%
sdc: ios=2981/5124, merge=77153/379556, ticks=142800/116248, in_queue=259356, util=87.13%
sdd: ios=2990/5182, merge=75443/390196, ticks=128184/107752, in_queue=236264, util=86.87%
sde: ios=3034/5057, merge=77882/372755, ticks=185048/159352, in_queue=344636, util=92.72%
sdf: ios=3035/4941, merge=79423/356692, ticks=29196/28100, in_queue=57520, util=50.56%
sdg: ios=3026/4963, merge=80518/360009, ticks=273308/31320, in_queue=304996, util=76.24%
###################################################
Post by Stan Hoeppner
read: (groupid=0, jobs=4): err= 0: pid=31359
read : io=4092.0MB, bw=125032KB/s, iops=81 , runt= 33513msec
slat (usec): min=86 , max=1404 , avg=125.10, stdev=97.21
clat (msec): min=26 , max=323 , avg=195.52, stdev=39.04
lat (msec): min=26 , max=323 , avg=195.65, stdev=39.03
bw (KB/s) : min=21982, max=41890, per=25.12%, avg=31405.62, stdev=3662.77
cpu : usr=0.02%, sys=0.31%, ctx=2738, majf=0, minf=6233
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 50=0.51%, 100=2.38%, 250=93.07%, 500=4.03%
write: (groupid=1, jobs=4): err= 0: pid=31364
write: io=4092.0MB, bw=158193KB/s, iops=102 , runt= 26488msec
slat (usec): min=181 , max=673 , avg=442.78, stdev=67.26
clat (msec): min=21 , max=555 , avg=154.33, stdev=45.35
lat (msec): min=21 , max=555 , avg=154.77, stdev=45.35
bw (KB/s) : min=19097, max=45988, per=25.04%, avg=39617.62, stdev=5714.79
cpu : usr=1.00%, sys=0.25%, ctx=2737, majf=0, minf=86
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 50=0.44%, 100=1.54%, 250=94.61%, 500=3.15%, 750=0.26%
READ: io=4092.0MB, aggrb=125032KB/s, minb=128033KB/s, maxb=128033KB/s, mint=33513msec, maxt=33513msec
WRITE: io=4092.0MB, aggrb=158192KB/s, minb=161989KB/s, maxb=161989KB/s, mint=26488msec, maxt=26488msec
dm-1: ios=8184/8196, merge=0/0, ticks=1521200/1235264, in_queue=2759104, util=99.58%, aggrios=8184/8225, aggrmerge=0/31, aggrticks=1421660/1137984, aggrin_queue=2559612, aggrutil=99.49%
sda: ios=8184/8225, merge=0/31, ticks=1421660/1137984, in_queue=2559612, util=99.49%
##############################
Post by Stan Hoeppner
read: (groupid=0, jobs=4): err= 0: pid=31121
read : io=4092.0MB, bw=137700KB/s, iops=89 , runt= 30430msec
slat (usec): min=123 , max=1353 , avg=157.35, stdev=91.53
clat (msec): min=15 , max=545 , avg=148.80, stdev=61.30
lat (msec): min=15 , max=545 , avg=148.96, stdev=61.29
bw (KB/s) : min= 0, max=129024, per=29.75%, avg=40959.72, stdev=21470.47
cpu : usr=0.01%, sys=0.51%, ctx=2797, majf=0, minf=6241
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 20=0.29%, 50=9.90%, 100=12.76%, 250=75.07%, 500=1.58%
lat (msec): 750=0.40%
write: (groupid=1, jobs=4): err= 0: pid=31131
write: io=4092.0MB, bw=41225KB/s, iops=26 , runt=101643msec
slat (usec): min=197 , max=1309 , avg=557.69, stdev=124.14
clat (msec): min=42 , max=2648 , avg=586.07, stdev=393.18
lat (msec): min=43 , max=2649 , avg=586.63, stdev=393.19
bw (KB/s) : min= 0, max=35583, per=26.36%, avg=10867.00, stdev=5232.75
cpu : usr=0.33%, sys=0.09%, ctx=2699, majf=0, minf=94
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 50=0.07%, 100=1.39%, 250=18.55%, 500=31.82%, 750=20.49%
lat (msec): 1000=12.57%, 2000=14.66%, >=2000=0.44%
READ: io=4092.0MB, aggrb=137699KB/s, minb=141004KB/s, maxb=141004KB/s, mint=30430msec, maxt=30430msec
WRITE: io=4092.0MB, aggrb=41224KB/s, minb=42214KB/s, maxb=42214KB/s, mint=101643msec, maxt=101643msec
dm-0: ios=10912/10906, merge=0/0, ticks=1385524/5596828, in_queue=6983364, util=99.89%, aggrios=10912/10940, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md2: ios=10912/10940, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=20710/27494, aggrmerge=358521/573067, aggrticks=208334/119186, aggrin_queue=327551, aggrutil=61.95%
sdc: ios=21005/27276, merge=366645/567533, ticks=224772/179916, in_queue=404728, util=61.95%
sdd: ios=21154/26926, merge=369215/564213, ticks=212876/160624, in_queue=373552, util=60.42%
sde: ios=20460/27449, merge=354440/571659, ticks=127008/102188, in_queue=229220, util=48.28%
sdf: ios=20464/27908, merge=350591/581422, ticks=241608/78180, in_queue=319784, util=55.24%
sdg: ios=20470/27915, merge=351718/580509, ticks=235408/75024, in_queue=310472, util=55.91%
###############################
Post by Stan Hoeppner
read: (groupid=0, jobs=4): err= 0: pid=30910
read : io=4092.0MB, bw=237500KB/s, iops=154 , runt= 17643msec
slat (usec): min=134 , max=1511 , avg=165.25, stdev=96.45
clat (msec): min=16 , max=859 , avg=101.07, stdev=61.40
lat (msec): min=17 , max=859 , avg=101.23, stdev=61.40
bw (KB/s) : min=32807, max=97523, per=25.40%, avg=60325.02, stdev=15396.63
cpu : usr=0.02%, sys=0.76%, ctx=2709, majf=0, minf=6233
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=2728/0/0, short=0/0/0
lat (msec): 20=0.07%, 50=13.38%, 100=48.90%, 250=34.38%, 500=3.08%
lat (msec): 750=0.15%, 1000=0.04%
write: (groupid=1, jobs=4): err= 0: pid=30915
write: io=4092.0MB, bw=90652KB/s, iops=59 , runt= 46223msec
slat (usec): min=173 , max=983 , avg=599.60, stdev=141.06
clat (msec): min=34 , max=1185 , avg=258.48, stdev=135.97
lat (msec): min=34 , max=1185 , avg=259.08, stdev=135.98
bw (KB/s) : min= 0, max=88722, per=25.95%, avg=23524.06, stdev=9036.31
cpu : usr=0.77%, sys=0.22%, ctx=2576, majf=0, minf=86
IO depths : 1=0.1%, 2=0.3%, 4=99.6%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/2728/0, short=0/0/0
lat (msec): 50=0.48%, 100=7.00%, 250=49.01%, 500=37.72%, 750=5.10%
lat (msec): 1000=0.59%, 2000=0.11%
READ: io=4092.0MB, aggrb=237499KB/s, minb=243199KB/s, maxb=243199KB/s, mint=17643msec, maxt=17643msec
WRITE: io=4092.0MB, aggrb=90652KB/s, minb=92827KB/s, maxb=92827KB/s, mint=46223msec, maxt=46223msec
dm-0: ios=10912/10875, merge=0/0, ticks=544880/2481824, in_queue=3026996, util=99.57%, aggrios=10912/10930, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md2: ios=10912/10930, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=5058/6798, aggrmerge=153871/404343, aggrticks=195870/113487, aggrin_queue=309447, aggrutil=92.15%
sdc: ios=5024/7100, merge=152571/421782, ticks=332612/204936, in_queue=537660, util=92.15%
sdd: ios=4961/7109, merge=150591/422166, ticks=257308/161200, in_queue=418608, util=88.11%
sde: ios=5019/6697, merge=154781/395583, ticks=226120/152068, in_queue=378296, util=85.49%
sdf: ios=5118/6463, merge=156068/385905, ticks=65716/23932, in_queue=89724, util=55.85%
sdg: ios=5172/6624, merge=155347/396280, ticks=97596/25300, in_queue=122948, util=60.00%
Post by Kevin Richter
Okay, well, I've run your fio config... but with so many results and
I'm glad you changed to size=1G instead of size=10G, otherwise the test
would have read/written 80GB instead of 8GB and taken forever. I often
forget the size parameter is per job (thread), not global.
Post by Kevin Richter
abbreviations I currently feel a bit overwhelmed :)
So please don't take it badly if I just paste the whole results.
I have tried with a stripe cache size of 1024, 2048 and 4096.
The full output is good. Going from 1024 to 2048 gained you ~17MB/s
throughput with this job file. But you didn't run with the default 256
for comparison, which would be very useful to know. That's ok, as you can do
so with the next set of test runs. Given the huge latencies I'm
thinking this config may be seeking the disks too much. Modify your fio job file to use:
numjobs=4
blocksize=1536k
iodepth=4
size=1g
as this may yield somewhat better results. ~190MB/s read and 75MB/s
write is very low for three effective 7.2K RPM SATA spindles, especially
the write throughput. Compare to a similar FIO job with 2 threads,
iodepth 16, directio, run against a single 500GB 7.2K WD SATA drive with NCQ:
READ: io=2048MB, aggrb=88663KB/s, minb=90791KB/s, maxb=90791KB/s
WRITE: io=2048MB, aggrb=108089KB/s, minb=110683KB/s, maxb=110683KB/s
88MB/s read, 108MB/s write. Write is faster apparently due to the small
16MB drive write cache. With your eight 2.4GHz SandyBridge cores and
md-RAID6 on much newer larger 7.2K drives w/larger caches, you should
have significantly greater write speed than this old single SATA drive.
Currently you're showing ~33MB/s lower write throughput, some 30%
slower than this single drive. With aligned writes, thus no RMW, you
should be getting at least 2 spindles worth of throughput, or ~200MB/s.
So, run FIO again with the mods above, once with stripe_cache_size 256
and once with 2048. Something seems to be seriously dragging down your
throughput. I'm sure you'd like to know what, and how to fix it.
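For clarity, a minimal job file along those lines might look like this (a
sketch using the parameters above, not the exact job file used earlier in
this thread; the directory path and file name are placeholders):

$ cat seqrw.fio
[global]
directory=/raid/fiotest
ioengine=libaio
direct=1
blocksize=1536k
iodepth=4
numjobs=4
size=1g

[read]
rw=read

[write]
; wait for the read group to finish before starting the writes
stonewall
rw=write

$ echo 256 > /sys/block/md2/md/stripe_cache_size && fio seqrw.fio
$ echo 2048 > /sys/block/md2/md/stripe_cache_size && fio seqrw.fio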
Post by Kevin Richter
Btw, I also have /dev/md1, a 15GB unencrypted array using the same 5 disks
as my LUKS md2 array, so nearly the same, just without the LUKS layer. If
helpful, I can run some fio tests on this filesystem, too.
This is also a RAID6 array, without dm-crypt? If so this will afford a
great comparison, and may be very informative. Run FIO against this
filesystem (I assume it's also XFS) as well with stripe_cache_size 256
and 2048. Four FIO tests total including the two on the big array.
It'll be interesting to see where this leads. At this point the
candidates I see are a single dm-crypt thread pegging a core, dm-crypt
taking aligned writes and misaligning them causing all writes to be
RMWs, or AF drives with misaligned partitions causing every write to be
an internal RMW. Or maybe something I've not considered.
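A couple of quick checks in that direction (a sketch; 'md2_crypt' is just a
placeholder for whatever the dm-crypt mapping is called):

$ cryptsetup luksDump /dev/md2 | grep -i 'payload offset'   # data offset in 512-byte sectors
$ cryptsetup status md2_crypt                               # offset/cipher of the active mapping
$ top -H      # during the write phase, one kworker/kcryptd thread pinned near 100% points at the crypto path

Ideally the payload offset is a multiple of the full stripe width (chunk
size x number of data disks); if it isn't, writes that XFS aligns to stripe
boundaries arrive misaligned at md and degrade into RMW cycles.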
Post by Kevin Richter
So long :)
Kevin
$ echo 1024 > /sys/block/md2/md/stripe_cache_size
Jobs: 1 (f=1): [____________W___] [99.7% done] [0K/99.24M /s] [0 /193 iops] [eta 00m:06s]
read: (groupid=0, jobs=8): err= 0: pid=12987
read : io=81920MB, bw=189835KB/s, iops=370 , runt=441890msec
slat (usec): min=32 , max=4561 , avg=76.28, stdev=28.66
clat (msec): min=5 , max=1115 , avg=334.19, stdev=151.18
lat (msec): min=5 , max=1115 , avg=334.26, stdev=151.18
bw (KB/s) : min= 0, max=261120, per=12.79%, avg=24288.95, stdev=11586.29
cpu : usr=0.05%, sys=0.50%, ctx=157180, majf=0, minf=16982
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=163840/0/0, short=0/0/0
lat (msec): 10=0.52%, 20=0.71%, 50=3.01%, 100=6.17%, 250=14.59%
lat (msec): 500=68.81%, 750=4.87%, 1000=1.08%, 2000=0.25%
write: (groupid=1, jobs=8): err= 0: pid=13202
write: io=81920MB, bw=58504KB/s, iops=114 , runt=1433851msec
slat (usec): min=45 , max=1729 , avg=212.20, stdev=56.68
clat (msec): min=14 , max=11691 , avg=1101.17, stdev=1116.82
lat (msec): min=14 , max=11691 , avg=1101.39, stdev=1116.82
bw (KB/s) : min= 0, max=106666, per=14.35%, avg=8395.94, stdev=6752.35
cpu : usr=0.28%, sys=0.10%, ctx=117451, majf=0, minf=3410
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/163840/0, short=0/0/0
lat (msec): 20=0.01%, 50=0.17%, 100=2.03%, 250=17.02%, 500=22.20%
lat (msec): 750=12.22%, 1000=8.66%, 2000=20.04%, >=2000=17.64%
READ: io=81920MB, aggrb=189834KB/s, minb=194390KB/s, maxb=194390KB/s, mint=441890msec, maxt=441890msec
WRITE: io=81920MB, aggrb=58504KB/s, minb=59908KB/s, maxb=59908KB/s, mint=1433851msec, maxt=1433851msec
dm-0: ios=327681/327756, merge=0/0, ticks=78591352/353235376, in_queue=431834680, util=100.00%, aggrios=327681/327922, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md2: ios=327681/327922, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=171660/222845, aggrmerge=4561629/9494657, aggrticks=16341417/4008187, aggrin_queue=20351472, aggrutil=85.78%
sdc: ios=181512/222455, merge=4583055/9505634, ticks=15650600/5944080, in_queue=21596560, util=85.78%
sdd: ios=180545/224362, merge=4526197/9587956, ticks=14356708/5542120, in_queue=19900820, util=85.42%
sde: ios=179853/224317, merge=4519718/9540999, ticks=13375156/5676828, in_queue=19053876, util=83.90%
sdf: ios=157605/222569, merge=4551205/9459549, ticks=18828608/1234632, in_queue=20065204, util=75.36%
sdg: ios=158787/220525, merge=4627970/9379150, ticks=19496016/1643276, in_queue=21140904, util=77.26%
$ echo 2048 > /sys/block/md2/md/stripe_cache_size
Jobs: 1 (f=1): [_________W______] [99.6% done] [0K/92182K /s] [0 /175 iops] [eta 00m:06s]
read: (groupid=0, jobs=8): err= 0: pid=6392
read : io=81920MB, bw=185893KB/s, iops=363 , runt=451259msec
slat (usec): min=32 , max=524 , avg=75.08, stdev=26.00
clat (msec): min=8 , max=1849 , avg=335.65, stdev=149.76
lat (msec): min=8 , max=1849 , avg=335.72, stdev=149.76
bw (KB/s) : min= 0, max=105860, per=13.08%, avg=24308.99, stdev=7467.64
cpu : usr=0.05%, sys=0.49%, ctx=157968, majf=0, minf=17171
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=163840/0/0, short=0/0/0
lat (msec): 10=0.01%, 20=0.06%, 50=2.32%, 100=7.44%, 250=15.36%
lat (msec): 500=69.09%, 750=4.50%, 1000=0.87%, 2000=0.35%
write: (groupid=1, jobs=8): err= 0: pid=6663
write: io=81920MB, bw=75377KB/s, iops=147 , runt=1112887msec
slat (usec): min=46 , max=6453.8K, avg=988.13, stdev=46664.22
clat (msec): min=10 , max=8631 , avg=854.18, stdev=683.37
lat (msec): min=10 , max=8631 , avg=855.17, stdev=684.82
bw (KB/s) : min= 0, max=100352, per=14.04%, avg=10581.97, stdev=6871.80
cpu : usr=0.34%, sys=0.11%, ctx=92502, majf=0, minf=1531
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/163840/0, short=0/0/0
lat (msec): 20=0.01%, 50=0.09%, 100=1.14%, 250=11.19%, 500=24.64%
lat (msec): 750=18.78%, 1000=13.40%, 2000=24.53%, >=2000=6.23%
READ: io=81920MB, aggrb=185893KB/s, minb=190354KB/s, maxb=190354KB/s, mint=451259msec, maxt=451259msec
WRITE: io=81920MB, aggrb=75376KB/s, minb=77186KB/s, maxb=77186KB/s, mint=1112887msec, maxt=1112887msec
dm-0: ios=327701/328169, merge=0/0, ticks=79402348/261890568, in_queue=341301888, util=100.00%, aggrios=327701/328481, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md2: ios=327701/328481, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=127112/146994, aggrmerge=3364061/7997232, aggrticks=18091347/5817941, aggrin_queue=23914356, aggrutil=94.20%
sdc: ios=135729/149357, merge=3341205/8134447, ticks=18312932/8398716, in_queue=26721324, util=92.77%
sdd: ios=135561/151584, merge=3312121/8238249, ticks=16877204/8190968, in_queue=25077332, util=92.44%
sde: ios=135741/146023, merge=3345948/7981968, ticks=17659792/8659780, in_queue=26322124, util=94.20%
sdf: ios=114396/143768, merge=3413295/7801050, ticks=18551976/1652924, in_queue=20207384, util=72.89%
sdg: ios=114134/144241, merge=3407738/7830447, ticks=19054832/2187320, in_queue=21243620, util=74.96%
$ echo 4096 > /sys/block/md2/md/stripe_cache_size
Jobs: 1 (f=1): [________W_______] [100.0% done] [0K/95848K /s] [0 /182 iops] [eta 00m:00s]
read: (groupid=0, jobs=8): err= 0: pid=11787
read : io=81920MB, bw=189274KB/s, iops=369 , runt=443200msec
slat (usec): min=31 , max=4511 , avg=75.47, stdev=29.74
clat (msec): min=5 , max=1338 , avg=336.39, stdev=155.14
lat (msec): min=5 , max=1338 , avg=336.47, stdev=155.14
bw (KB/s) : min= 0, max=253455, per=12.77%, avg=24162.01, stdev=11368.71
cpu : usr=0.05%, sys=0.49%, ctx=157193, majf=0, minf=17313
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=163840/0/0, short=0/0/0
lat (msec): 10=0.47%, 20=0.66%, 50=2.95%, 100=6.33%, 250=14.56%
lat (msec): 500=68.12%, 750=5.42%, 1000=1.15%, 2000=0.33%
write: (groupid=1, jobs=8): err= 0: pid=12060
write: io=81920MB, bw=64993KB/s, iops=126 , runt=1290687msec
slat (usec): min=61 , max=16991 , avg=197.22, stdev=110.87
clat (msec): min=14 , max=2820 , avg=980.92, stdev=366.56
lat (msec): min=14 , max=2821 , avg=981.12, stdev=366.56
bw (KB/s) : min= 0, max=103770, per=13.11%, avg=8517.92, stdev=3794.28
cpu : usr=0.28%, sys=0.08%, ctx=84352, majf=0, minf=723
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/163840/0, short=0/0/0
lat (msec): 20=0.01%, 50=0.03%, 100=0.28%, 250=0.22%, 500=5.37%
lat (msec): 750=22.02%, 1000=31.66%, 2000=39.27%, >=2000=1.16%
READ: io=81920MB, aggrb=189273KB/s, minb=193816KB/s, maxb=193816KB/s, mint=443200msec, maxt=443200msec
WRITE: io=81920MB, aggrb=64993KB/s, minb=66553KB/s, maxb=66553KB/s, mint=1290687msec, maxt=1290687msec
dm-0: ios=327681/327629, merge=0/0, ticks=78990724/301988444, in_queue=380991692, util=100.00%, aggrios=327681/327709, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
md2: ios=327681/327709, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=120158/119573, aggrmerge=2858405/7663126, aggrticks=17291831/9119008, aggrin_queue=26414023, aggrutil=99.60%
sdc: ios=135579/119976, merge=2813832/7324879, ticks=13974928/2192484, in_queue=16167996, util=66.57%
sdd: ios=136115/127048, merge=2826584/7736191, ticks=12932248/2477796, in_queue=15410924, util=68.08%
sde: ios=136007/130908, merge=2844473/7936354, ticks=12642232/3141268, in_queue=15784336, util=71.86%
sdf: ios=78473/94458, merge=2882361/7865984, ticks=29053772/37421808, in_queue=66488856, util=99.60%
sdg: ios=114620/125479, merge=2924777/7452224, ticks=17855976/361684, in_queue=18218004, util=54.84%
Arkadiusz Miśkiewicz
2013-12-20 22:43:47 UTC
Permalink
Post by Kevin Richter
Post by Stan Hoeppner
Post by Kevin Richter
$ cat /sys/block/md2/md/stripe_cache_size
256
256 is the default and it is way too low. This is limiting your write
througput. Increase this to a minimum of 1024 which will give you a
20MB stripe cache buffer. This should become active immediately. Add
it to a startup script to make it permanent.
$ echo 256 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /6tb/foo1/
real 25m38.925s
user 0m0.595s
sys 1m23.182s
$ echo 1024 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo2/
real 7m32.824s
user 0m0.438s
sys 1m6.759s
$ echo 2048 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo3/
real 5m32.847s
user 0m0.418s
sys 1m5.671s
$ echo 4096 > /sys/block/md2/md/stripe_cache_size
$ time cp -a /olddisk/testfolder /raid/foo4/
real 5m54.554s
user 0m0.437s
sys 1m6.268s
The difference is really amazing! So 2048 seems to be the best choice.
60GB in 5.5 minutes is about 180MB/s. That sounds a bit high, doesn't it?
The RAID only consists of 5 SATA disks with 7200rpm.
I wonder why the kernel ships defaults that everyone repeatedly recommends
changing/increasing? Has anyone tried to file a bug report for the
stripe_cache_size case?
--
Arkadiusz Miśkiewicz, arekm / maven.pl
Stan Hoeppner
2013-12-21 11:18:42 UTC
Permalink
I renamed the subject as your question doesn't really apply to XFS, or
the OP, but to md-RAID.
Post by Arkadiusz Miśkiewicz
I wonder why kernel is giving defaults that everyone repeatly recommends to
change/increase? Has anyone tried to bugreport that for stripe_cache_size
case?
The answer is balancing default md-RAID5/6 write performance against
kernel RAM consumption, with more weight given to the latter. The formula:

(4096 * stripe_cache_size) * num_drives = RAM consumed for stripe cache

High stripe_cache_size values will cause the kernel to eat non-trivial
amounts of RAM for the stripe cache buffer. This table demonstrates the
effect today for typical RAID5/6 disk counts.

stripe_cache_size    drives    RAM consumed
      256               4           4 MB
                        8           8 MB
                       16          16 MB
      512               4           8 MB
                        8          16 MB
                       16          32 MB
     1024               4          16 MB
                        8          32 MB
                       16          64 MB
     2048               4          32 MB
                        8          64 MB
                       16         128 MB
     4096               4          64 MB
                        8         128 MB
                       16         256 MB
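As a worked example for the 5-drive array in this thread (a sketch of the
same arithmetic in shell):

$ echo $((4096 * 2048 * 5 / 1024 / 1024))    # stripe_cache_size=2048, 5 members
40

i.e. about 40MB of RAM pinned for the stripe cache.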

The powers that be, Linus in particular, are not fond of default
settings that create a lot of kernel memory structures. The default
md-RAID5/6 stripe_cache_size yields 1MB consumed per member device.

With SSDs becoming mainstream, and becoming ever faster, at some point
the md-RAID5/6 architecture will have to be redesigned because of the
memory footprint required for performance. Currently the required size
of the stripe cache appears directly proportional to the aggregate write
throughput of the RAID devices. Thus the optimal value will vary
greatly from one system to another depending on the throughput of the
drives.

For example, I assisted a user with 5x Intel SSDs back in January and
his system required 4096, or 80MB of RAM for stripe cache, to reach
maximum write throughput of the devices. This yielded 600MB/s or 60%
greater throughput than 2048, or 40MB RAM for cache. In his case 60MB
more RAM than the default was well worth the increase as the machine was
an iSCSI target server with 8GB RAM.

In the previous case with 5x rust RAID6 the 2048 value seemed optimal
(though not yet verified), requiring 40MB less RAM than the 5x Intel
SSDs. For a 3-drive modern rust RAID5 the default of 256, or 3MB, is close to
optimal but maybe a little low. Consider that 256 has been the default
for a very long time, and was selected back when average drive
throughput was much much lower, as in 50MB/s or less, SSDs hadn't yet
been invented, and system memories were much smaller.

Due to the massive difference in throughput between rust and SSD, any
meaningful change in the default really requires new code to sniff out
what type of devices constitute the array, if that's possible, and it
probably isn't, and set a lowish default accordingly. Again, SSDs
didn't exist when md-RAID was coded, nor when this default was set, and
this throws a big monkey wrench into these spokes.
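(For what it's worth, the block layer does export a per-device rotational
flag that such a heuristic could in principle consult:

$ cat /sys/block/sdc/queue/rotational
1

where 1 means spinning rust and 0 means SSD; whether md could pick a sane
default from that alone is another question.)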
--
Stan
Piergiorgio Sartor
2013-12-21 12:20:14 UTC
Permalink
Post by Stan Hoeppner
I renamed the subject as your question doesn't really apply to XFS, or
the OP, but to md-RAID.
Post by Arkadiusz Miśkiewicz
I wonder why kernel is giving defaults that everyone repeatly recommends to
change/increase? Has anyone tried to bugreport that for stripe_cache_size
case?
The answer is balancing default md-RAID5/6 write performance against
((4096*stripe_cache_size)*num_drives)= RAM consumed for stripe cache
High stripe_cache_size values will cause the kernel to eat non trivial
amounts of RAM for the stripe cache buffer. This table demonstrates the
effect today for typical RAID5/6 disk counts.
stripe_cache_size drives RAM consumed
256 4 4 MB
8 8 MB
16 16 MB
512 4 8 MB
8 16 MB
16 32 MB
1024 4 16 MB
8 32 MB
16 64 MB
2048 4 32 MB
8 64 MB
16 128 MB
4096 4 64 MB
8 128 MB
16 256 MB
The powers that be, Linus in particular, are not fond of default
settings that create a lot of kernel memory structures. The default
md-RAID5/6 stripe_cache-size yields 1MB consumed per member device.
With SSDs becoming mainstream, and becoming ever faster, at some point
the md-RAID5/6 architecture will have to be redesigned because of the
memory footprint required for performance. Currently the required size
of the stripe cache appears directly proportional to the aggregate write
throughput of the RAID devices. Thus the optimal value will vary
greatly from one system to another depending on the throughput of the
drives.
For example, I assisted a user with 5x Intel SSDs back in January and
his system required 4096, or 80MB of RAM for stripe cache, to reach
maximum write throughput of the devices. This yielded 600MB/s or 60%
greater throughput than 2048, or 40MB RAM for cache. In his case 60MB
more RAM than the default was well worth the increase as the machine was
an iSCSI target server with 8GB RAM.
In the previous case with 5x rust RAID6 the 2048 value seemed optimal
(though not yet verified), requiring 40MB less RAM than the 5x Intel
SSDs. For a 3 modern rust RAID5 the default of 256, or 3MB, is close to
optimal but maybe a little low. Consider that 256 has been the default
for a very long time, and was selected back when average drive
throughput was much much lower, as in 50MB/s or less, SSDs hadn't yet
been invented, and system memories were much smaller.
Due to the massive difference in throughput between rust and SSD, any
meaningful change in the default really requires new code to sniff out
what type of devices constitute the array, if that's possible, and it
probably isn't, and set a lowish default accordingly. Again, SSDs
didn't exist when md-RAID was coded, nor when this default was set, and
this throws a big monkey wrench into these spokes.
Hi Stan,

nice analytical report, as usual...

My dumb suggestion would be to simply use udev to set up the drives.
Everything (stripe_cache, read_ahead, stcerr, etc.) can, I suppose,
be configured by udev rules.
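Something along these lines might do it (an untested sketch; the rule file
name and the 2048 value are only examples):

$ cat /etc/udev/rules.d/60-md-stripe-cache.rules
SUBSYSTEM=="block", KERNEL=="md*", ACTION=="add|change", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}="2048"

$ udevadm control --reload-rules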

bye,
--
piergiorgio
Stan Hoeppner
2013-12-22 01:41:37 UTC
Permalink
On 12/21/2013 5:18 AM, Stan Hoeppner wrote:
...
Post by Stan Hoeppner
For example, I assisted a user with 5x Intel SSDs back in January and
his system required 4096, or 80MB of RAM for stripe cache, to reach
maximum write throughput of the devices. This yielded 600MB/s or 60%
greater throughput than 2048, or 40MB RAM for cache. In his case 60MB
more RAM than the default was well worth the increase as the machine was
an iSCSI target server with 8GB RAM.
Correction here. I said above that 80MB was 60MB greater than the
default for his 5 drives. This should have said 75MB greater than the
default which is 1MB per member device, or 5MB for 5 drives.
--
Stan
Christoph Hellwig
2013-12-26 08:55:10 UTC
Permalink
Post by Stan Hoeppner
The powers that be, Linus in particular, are not fond of default
settings that create a lot of kernel memory structures. The default
md-RAID5/6 stripe_cache-size yields 1MB consumed per member device.
The default sizing is stupid as it basically makes RAID unusable out
of the box; I always have to fix that up, along with setting a somewhat
reasonable chunk size for parity RAID, to make it usable. I'm also
pretty sure I complained about it at least once a while ago, but never
got a reply.
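For what it's worth, a sketch of fixing both up front at creation time
(device and partition names are only examples):

$ mdadm --create /dev/md2 --level=6 --raid-devices=5 --chunk=64 /dev/sd[c-g]1
$ echo 2048 > /sys/block/md2/md/stripe_cache_size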
Stan Hoeppner
2013-12-26 09:24:00 UTC
Permalink
Post by Christoph Hellwig
Post by Stan Hoeppner
The powers that be, Linus in particular, are not fond of default
settings that create a lot of kernel memory structures. The default
md-RAID5/6 stripe_cache-size yields 1MB consumed per member device.
The default sizing is stupid as it basically makes RAID unusable out
of the box, I always have to fix that up, as well as a somewhat
reasonable chunk size for parity RAID to make it usable. I'm also
pretty sure I complained about it at least once a while ago, but never
got a reply.
IIRC you, Dave C., and I all voiced criticism after the default chunk
size was changed from 64KB to 512KB. I guess we didn't make a strong
enough case to have it reduced, or maybe didn't use the right approach.

Maybe Neil is waiting for patches to be submitted for changing these
defaults, and to argue the merits in that context instead of pure
discussion? Dunno. Just guessing. Maybe he'll read this and jump in.
--
Stan
NeilBrown
2013-12-26 22:14:07 UTC
Permalink
Post by Stan Hoeppner
Post by Christoph Hellwig
Post by Stan Hoeppner
The powers that be, Linus in particular, are not fond of default
settings that create a lot of kernel memory structures. The default
md-RAID5/6 stripe_cache-size yields 1MB consumed per member device.
The default sizing is stupid as it basically makes RAID unusable out
of the box, I always have to fix that up, as well as a somewhat
reasonable chunk size for parity RAID to make it usable. I'm also
pretty sure I complained about it at least once a while ago, but never
got a reply.
IIRC you Dave C. and myself all voiced criticism after the default chunk
size was changed from 64KB to 512KB. I guess we didn't make a strong
enough case to have it reduced, or maybe didn't use the right approach.
Maybe Neil is waiting for patches to be submitted for changing these
defaults, and to argue the merits in that context instead of pure
discussion? Dunno. Just guessing. Maybe he'll read this and jump in.
Good guess.

NeilBrown

Stefan Ring
2013-12-18 08:33:22 UTC
Permalink
Post by Kevin Richter
Even a reboot is now not possible anymore, because the whole system
hangs while executing the "sync" command in one of the shutdown scripts.
This is unusual. How long have you waited?
Dave Chinner
2013-12-18 22:21:14 UTC
Permalink
Post by Kevin Richter
Hi,
around April 2012 there was a similar thread on this list which I have
found via Google, so my mail topic is the same.
I have a RAID6 array with 5 disks (each 2TB, net: 6TB). While copying
under heavy load there are always these blocks. At the bottom of this
message I have included some line from the syslog.
Even a reboot is now not possible anymore, because the whole system
hangs while executing the "sync" command in one of the shutdown scripts.
Of course it will hang - it's waiting for the filesystem to make
progress, and clearly it isn't....
Post by Kevin Richter
So... first I have thought that my disks are faulty.
But with smartmontools I have started a short and a long test on all of
the 5 disks: no errors
Then I have even recreated the whole array, but no improvement.
Details about my server: 3.2.0-57-generic, Ubuntu 12.04.3 LTS
Oh, that's quite old. I'd suggest upgrading to a much more recent
kernel as we've fixed lots of issues in this area since then.
Post by Kevin Richter
kswapd0 D ffffffff81806240 0 51 2 0x00000000
ffff8804658099f0 0000000000000046 ffff880465809a74 ffff880465809a78
ffff880465809fd8 ffff880465809fd8 ffff880465809fd8 00000000000137c0
ffff880465dd1700 ffff880465818000 0000000000000000 ffff88046175aa28
[<ffffffff8165f62f>] schedule+0x3f/0x60
[<ffffffff81660437>] __mutex_lock_slowpath+0xd7/0x150
[<ffffffff8166004a>] mutex_lock+0x2a/0x50
[<ffffffffa029dcf8>] xfs_reclaim_inodes_ag+0x2d8/0x3a0 [xfs]
So, reclaim is stalled, meaning there are other threads blocked
reclaiming inodes....
Post by Kevin Richter
INFO: task kswapd1:52 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd1 D 0000000000000007 0 52 2 0x00000000
ffff88046580b840 0000000000000046 0000000000000001 00000000ffffffff
ffff88046580bfd8 ffff88046580bfd8 ffff88046580bfd8 00000000000137c0
ffff880865e8ae00 ffff880465819700 ffff88046580b830 0000000000000004
[<ffffffff8165f62f>] schedule+0x3f/0x60
[<ffffffffa02e4d17>] xlog_cil_force_lsn+0xf7/0x120 [xfs]
[<ffffffffa02e3367>] _xfs_log_force_lsn+0x57/0x2e0 [xfs]
[<ffffffffa02e3608>] xfs_log_force_lsn+0x18/0x40 [xfs]
[<ffffffffa02cfdd4>] xfs_iunpin_nowait+0x44/0xb0 [xfs]
[<ffffffffa02d1c65>] xfs_iunpin_wait+0x35/0xb0 [xfs]
[<ffffffffa029cf72>] xfs_reclaim_inode+0xb2/0x2a0 [xfs]
Yup, there's the stalled reclaim, waiting on a log force in progress
to complete.
Post by Kevin Richter
Dec 18 02:19:19 kernel: [390470.199017] INFO: task xfsaild/dm-0:19192 blocked for more than 120 seconds.
Dec 18 02:19:19 kernel: [390470.199828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 18 02:19:19 kernel: [390470.200262] xfsaild/dm-0 D 0000000000000001 0 19192 2 0x00000000
Dec 18 02:19:19 kernel: [390470.200268] ffff88084e847a60 0000000000000046 ffff88046454d500 0000000000000000
Dec 18 02:19:19 kernel: [390470.200274] ffff88084e847fd8 ffff88084e847fd8 ffff88084e847fd8 00000000000137c0
Dec 18 02:19:19 kernel: [390470.200281] ffff88045fdd0000 ffff880461b1dc00 ffff88084e847a50 ffff88085f47d080
Dec 18 02:19:19 kernel: [390470.200293] [<ffffffff8165f62f>] schedule+0x3f/0x60
Dec 18 02:19:19 kernel: [390470.200323] [<ffffffffa02e25da>] xlog_state_get_iclog_space+0xea/0x2e0 [xfs]
Dec 18 02:19:19 kernel: [390470.200330] [<ffffffff810608e0>] ? try_to_wake_up+0x200/0x200
Dec 18 02:19:19 kernel: [390470.200358] [<ffffffffa02e2b99>] xlog_write+0xe9/0x400 [xfs]
Dec 18 02:19:19 kernel: [390470.200382] [<ffffffffa02a2877>] ? kmem_zone_alloc+0x67/0xe0 [xfs]
Dec 18 02:19:19 kernel: [390470.200410] [<ffffffffa02e440c>] xlog_cil_push+0x20c/0x380 [xfs]
Dec 18 02:19:19 kernel: [390470.200438] [<ffffffffa02e4d2e>] xlog_cil_force_lsn+0x10e/0x120 [xfs]
Dec 18 02:19:19 kernel: [390470.200446] [<ffffffff810136e5>] ? __switch_to+0xf5/0x360
Dec 18 02:19:19 kernel: [390470.200473] [<ffffffffa02e3098>] _xfs_log_force+0x68/0x2a0 [xfs]
Dec 18 02:19:19 kernel: [390470.200481] [<ffffffff81078cb2>] ? try_to_del_timer_sync+0x92/0x130
Dec 18 02:19:19 kernel: [390470.200507] [<ffffffffa02e32e8>] xfs_log_force+0x18/0x40 [xfs]
Dec 18 02:19:19 kernel: [390470.200535] [<ffffffffa02e7743>] xfsaild_push+0x213/0x680 [xfs]
Dec 18 02:19:19 kernel: [390470.200540] [<ffffffff8165fb45>] ? schedule_timeout+0x175/0x320
Dec 18 02:19:19 kernel: [390470.200568] [<ffffffffa02e7c12>] xfsaild+0x62/0xc0 [xfs]
And there's the log force that is in progress, and that is waiting
on log buffer space to become available to finish the log force.
That's waiting on IO completion to occur.

IOWs, there's nothing that points directly at an XFS issue here.
There might be a problem, but it's just as likely a hardware or IO
path bug that is preventing an IO completion from occurring and that
is what XFS is getting hung up on....
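For completeness, a few generic ways to look for a stalled IO path below
the filesystem (a sketch; device names taken from earlier in the thread):

$ cat /sys/block/sd[c-g]/inflight     # in-flight read/write counts that never drain point below XFS
$ iostat -xm 5                        # a device stuck at 100% util with no throughput is suspect
$ echo w > /proc/sysrq-trigger        # dump all blocked tasks to the kernel log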

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com