Discussion:
Problem about very high Average Read/Write Request Time
quanjun hu
2014-10-18 09:26:40 UTC
Permalink
Hi,
I am using XFS on a RAID5 array (~100TB) with the log on an external SSD
device; the mount information is:
/dev/sdc on /data/fhgfs/fhgfs_storage type xfs
(rw,relatime,attr2,delaylog,logdev=/dev/sdb1,sunit=512,swidth=15872,noquota).
When doing only reading or only writing the speed is very fast (~1.5 GB/s),
but when doing both at once the speed is very slow (~100 MB/s), with high
r_await (~160) and w_await (~200000).
1. How can I reduce the average request time?
2. Can I use an SSD as a read/write cache for XFS?

Best regards,
Quanjun
Emmanuel Florac
2014-10-18 12:38:48 UTC
Permalink
Post by quanjun hu
Hi,
I am using xfs on a raid 5 (~100TB) and put log on external ssd
/dev/sdc on /data/fhgfs/fhgfs_storage type xfs
(rw,relatime,attr2,delaylog,logdev=/dev/sdb1,sunit=512,swidth=15872,noquota).
when doing only reading / only writing , the speed is very
fast(~1.5G), but when do both the speed is very slow (100M), and high
r_await(160) and w_await(200000).
What are your kernel version, mount options and xfs_info output?
Post by quanjun hu
1. how can I reduce average request time?
2. can I use ssd as write/read cache for xfs?
Sure, using bcache and other similar tools.
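For the caching question, a minimal bcache sketch might look like the
following (the device names /dev/sdY1 and /dev/sdX are placeholders, and
note that make-bcache reformats the backing device, so this has to be done
before the filesystem is created, not retrofitted onto an existing one):

# make-bcache -C /dev/sdY1                                    (SSD partition becomes the cache device)
# make-bcache -B /dev/sdX                                     (empty RAID volume becomes the backing device)
# echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach    (attach the backing device to the cache set)
# echo writeback > /sys/block/bcache0/bcache/cache_mode       (optional: cache writes as well as reads)
# mkfs.xfs /dev/bcache0                                       (the filesystem then goes on the bcache device)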
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <***@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
Peter Grandi
2014-10-19 10:10:20 UTC
Permalink
Post by Emmanuel Florac
Post by quanjun hu
I am using xfs on a raid 5 (~100TB) and put log on external
ssd device, the mount information is: /dev/sdc on
/data/fhgfs/fhgfs_storage type xfs
(rw,relatime,attr2,delaylog,logdev=/dev/sdb1,sunit=512,swidth=15872,noquota).
when doing only reading / only writing , the speed is very
fast(~1.5G), but when do both the speed is very slow (100M),
and high r_await(160) and w_await(200000).
What are your kernel version, mount options and xfs_info output ?
Those are usually important details, but in this case the
information that matters is already present.

There is a ratio of 31 (thirty-one) between 'swidth' and 'sunit',
and assuming that this reflects the geometry of the RAID5 set,
and given commonly available disk sizes, it can be guessed that
with amazing "bravery" someone has configured a RAID5 out of 32
(thirty-two) high-capacity/low-IOPS 3TB drives, or something
similar.
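For reference, the arithmetic behind that guess, assuming the usual
convention that 'sunit' and 'swidth' in the mount output are expressed in
512-byte sectors:

$ echo $((512 * 512 / 1024))     # sunit: chunk size in KiB
256
$ echo $((15872 * 512 / 1024))   # swidth: full stripe width in KiB (7936 KiB = 7.75 MiB)
7936
$ echo $((15872 / 512))          # data spindles per stripe; plus one parity drive = 32 in total
31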

It is even "braver" than that: if the mount point
"/data/fhgfs/fhgfs_storage" is descriptive, this "brave"
RAID5 set is supposed to hold the object storage layer of a
BeeGFS (FhGFS) highly parallel filesystem, and therefore will
likely have mostly-random accesses.

This issue should be moved to the 'linux-raid' mailing list as
from the reported information it has nothing to do with XFS.

BTW the 100MB/s aggregate over 31 drives means around 3MB/s per
drive, which seems pretty good for a RW workload with
mostly-random accesses with high RMW correlation.

It is notable but not surprising that XFS works well even with
such a "brave" choice of block storage layer, untainted by any
"cowardly" consideration of the effects of RMW and using drives
designed for capacity rather than IOPS.
Bernd Schubert
2014-10-20 08:00:34 UTC
Permalink
Post by Peter Grandi
Post by Emmanuel Florac
Post by quanjun hu
I am using xfs on a raid 5 (~100TB) and put log on external
ssd device, the mount information is: /dev/sdc on
/data/fhgfs/fhgfs_storage type xfs
(rw,relatime,attr2,delaylog,logdev=/dev/sdb1,sunit=512,swidth=15872,noquota).
when doing only reading / only writing , the speed is very
fast(~1.5G), but when do both the speed is very slow (100M),
and high r_await(160) and w_await(200000).
What are your kernel version, mount options and xfs_info output ?
Those are usually important details, but in this case the
information that matters is already present.
There is a ratio of 31 (thirty one) between 'swidth' and 'sunit'
and assuming that this reflects the geometry of the RAID5 set
and given commonly available disk sizes it can be guessed that
with amazing "bravery" someone has configured a RAID5 out of 32
(thirty two) high capacity/low IOPS 3TB drives, or something
similar.
It is even "braver" than that: if the mount point
"/data/fhgfs/fhgfs_storage" is descriptive, this "brave"
RAID5 set is supposed to hold the object storage layer of a
BeeGFS (FhGFS) highly parallel filesystem, and therefore will
likely have mostly-random accesses.
Where do you get the assumption from that FhGFS/BeeGFS is going to do
random reads/writes, or that the application on top of it is going to do that?


Bernd
Peter Grandi
2014-10-21 18:27:26 UTC
Permalink
Post by Bernd Schubert
[ ... ] supposed to hold the object storage layer of a BeeGFS
(FhGFS) highly parallel filesystem, and therefore will likely have
mostly-random accesses.
Where do you get the assumption from that FhGFS/BeeGFS is
going to do random reads/writes or the application of top of
it is going to do that?
In this specific case it is not an assumption, thanks to the
prominent fact that the original poster was testing (locally, I
guess) and complaining about concurrent reads/writes, which
result in random-like arm movement even if each of the read and
write streams is entirely sequential. I even pointed this out:
Post by Bernd Schubert
when doing only reading / only writing , the speed is very
fast(~1.5G), but when do both the speed is very slow
(100M), and high r_await(160) and w_await(200000).
BTW the 100MB/s aggregate over 31 drives means around 3MB/s
per drive, which seems pretty good for a RW workload with
mostly-random accesses with high RMW correlation.

Also, if this testing was appropriate, it was because the
intended workload is indeed concurrent reads and writes to the
object store.

It is not a mere assumption in the general case either; it
is both commonly observed and a simple deduction, because of
the nature of distributed filesystems and in particular parallel
HPC ones like Lustre or BeeGFS, but also AFS and even NFS ones.

* Clients have caches. Therefore most of the locality in the
(read) access patterns will hopefully be filtered out by the
client cache. This applies (ideally) to any distributed
filesystem.
* HPC/parallel servers tend to have many clients (e.g. for an HPC
  installation it could be 10,000 clients and 500 object storage
  servers) and hopefully each client works on a different subset
  of the data tree, with the distribution of data objects onto
  servers hopefully random.
  Therefore it is likely that many clients will concurrently read
  and write many different files on the same server, resulting in
  many random "hotspots" in each server's load.
  Note that each client could be doing entirely sequential IO to
  each file it accesses, but the concurrent accesses to possibly
  widely scattered files will turn that into random IO at the
  server level.

Just about the only case where sequential client workloads don't
become random workloads at the server is when the client
workload is such that only one file is "hot" per server.

There is an additional issue favouring random access patterns:

* Typically large fileservers are setup with a lot of storage
because of anticipated lifetime usage, so they start mostly
empty.
* Most filesystems then allocate new data in regular patterns,
  often starting from the beginning of available storage, usually
  in an attempt to minimize arm travel time (XFS uses various
  heuristics, which are somewhat different depending on whether
  the option 'inode64' is specified or not).
* Unfortunately as the filetree becomes larger new allocations
have to be made farther away, resulting in longer travel
times and more apparent randomness at the storage server
level.
* Eventually, if the object server reaches a steady state where
  roughly as much data is deleted as created, the free storage
  areas will become widely scattered, leading to essentially
  random allocation; the more capacity is used, the more random
  it gets.

Leaving a significant percentage of capacity free, at least 10%
and more like 20%, greatly increases the chance of finding free
space in which to put new data near existing "related" data.
This increases locality, but only at the single-stream level;
therefore it usually does not help widely shared distributed
servers that much, and in particular does not apply that much to
object stores, because they usually obscure which data object is
related to which.

The above issues are pretty much "network and distributed
filesystems for beginners" notes, but in significant part they
also apply to widely shared non-network and non-distributed
servers on which XFS is often used, so they may be usefully
mentioned on this list.

Stan Hoeppner
2014-10-19 21:16:56 UTC
Permalink
Post by quanjun hu
Hi,
/dev/sdc on /data/fhgfs/fhgfs_storage type xfs (rw,relatime,attr2,delaylog,logdev=/dev/sdb1,sunit=512,swidth=15872,noquota).
when doing only reading / only writing , the speed is very fast(~1.5G), but when do both the speed is very slow (100M), and high r_await(160) and w_await(200000).
1. how can I reduce average request time?
2. can I use ssd as write/read cache for xfs?
You apparently have 31 effective SATA 7.2k RPM spindles with 256 KiB chunk, 7.75 MiB stripe width, in RAID5. That should yield 3-4.6 GiB/s of streaming throughput assuming no cable, expander, or HBA limitations. You're achieving only 1/3 to 1/2 of this. Which hardware RAID controller is this? What are the specs? Cache RAM, host and back-end cable count and type?
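That estimate appears to assume roughly 100-150 MiB/s of sequential throughput per 7.2k SATA drive across the 31 data spindles (my assumption, for illustration):

$ echo $((31 * 100)) $((31 * 150))   # aggregate MiB/s at 100 and 150 MiB/s per drive
3100 4650

which is on the order of the 3-4.6 GiB/s quoted above.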

When you say read or write is fast individually, but read+write is slow, what types of files are you reading and writing, and how many in parallel? This combined pattern is likely the cause of the slowdown due to excessive seeking in the drives.

As others mentioned this isn't an XFS problem. The problem is that your RAID geometry doesn't match your workload. Your very wide parity stripe is apparently causing excessive seeking with your read+write workload due to read-modify-write operations. To mitigate this, and to increase resiliency, you should switch to RAID6 with a smaller chunk. If you need maximum capacity make a single RAID6 array with 16 KiB chunk size. This will yield a 496 KiB stripe width, increasing the odds that all writes are a full stripe, and hopefully eliminating much of the RMW problem.

A better option might be making three 10-drive RAID6 arrays (two spares) with 32 KiB chunk, 256 KiB stripe width, and concatenating the 3 arrays with mdadm --linear. You'd have 24 spindles of capacity and throughput instead of 31, but no more RMW operations, or at least very few. You'd format the linear md device with

# mkfs.xfs -d su=32k,sw=8 /dev/mdX
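For completeness, the concatenation step itself might look something like this, where /dev/sdX, /dev/sdY and /dev/sdZ stand in for the three RAID6 LUNs and /dev/md0 for the resulting linear device (placeholder names, a sketch rather than a tested recipe):

# mdadm --create /dev/md0 --level=linear --raid-devices=3 /dev/sdX /dev/sdY /dev/sdZ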

As long as your file accesses are spread fairly evenly across at least 3 directories you should achieve excellent parallel throughput, though single-file streaming throughput will peak at 800-1200 MiB/s, that of 8 drives. With a little understanding of how this setup works, you can write two streaming files and read a third without any of the 3 competing with one another for disk seeks/bandwidth--which is your current problem. Or you could do one read and one write to each of 3 directories, and no pair would interfere with the other pairs. Scale up from here.

Basically what we're doing is mapping each RAID LUN to its own set of directories. When you write to one of those directories the file goes into only one of the 3 RAID arrays. Doing this isolates the RMWs for a given write to only a subset of your disks, and minimizes the number of seeks generated by parallel accesses.

Cheers,
Stan