Post by Bernd Schubert
[ ... ] supposed to hold the object storage layer of a BeeGFS
highly parallel filesystem, and therefore will likely have
mostly-random accesses.
Where do you get the assumption from that FhGFS/BeeGFS is
going to do random reads/writes or the application on top of
it is going to do that?
In this specific case it is not an assumption, thanks to the
prominent fact that the original poster was testing (locally I
guess) and complaining about concurrent read/writes, which
result in random-like arm movement even if each of the read
and write streams is entirely sequential. I even pointed this
out,
Post by Bernd Schubert
when doing only reading / only writing, the speed is very
fast(~1.5G), but when do both the speed is very slow
(100M), and high r_await(160) and w_await(200000).
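To illustrate why two concurrent but individually sequential
streams look like a random workload to the disk arm, here is a
toy model; it is entirely my own sketch (the offsets, the
128KiB request size and the naive alternation of requests are
all assumptions, and it ignores queueing and the IO scheduler,
so it says nothing about the huge 'await' figures, only about
arm travel):

  # Toy model: one sequential read stream and one sequential write
  # stream in different regions of the same disk. All numbers are
  # made up for illustration.
  READ_START, WRITE_START = 0, 500_000_000  # byte offsets of the streams
  REQ = 128 * 1024                          # assumed request size (128KiB)
  N = 1000                                  # requests per stream

  reads = [READ_START + i * REQ for i in range(N)]
  writes = [WRITE_START + i * REQ for i in range(N)]

  def head_travel(offsets):
      # Total distance the head moves between consecutive requests.
      return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))

  separate = head_travel(reads) + head_travel(writes)

  interleaved = []
  for r, w in zip(reads, writes):           # naive alternation of the streams
      interleaved += [r, w]

  print(f"travel, streams run one after the other: {separate / 1e9:7.2f} GB")
  print(f"travel, streams interleaved            : {head_travel(interleaved) / 1e9:7.2f} GB")

Each stream on its own makes the head sweep over a few hundred
MB; interleaved, the head ping-pongs between the two regions
and travels over three orders of magnitude farther.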
BTW the 100MB/s aggregate over 31 drives means around 3MB/s
per drive, which seems pretty good for a mostly-random
read/write workload with high RMW correlation.
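A rough check of that arithmetic (the request size here is
purely my assumption, it was not reported):

  aggregate_MBps = 100
  drives = 31
  per_drive_MBps = aggregate_MBps / drives        # ~3.2 MB/s per drive
  req_KiB = 128                                   # assumed request size
  reqs_per_sec = per_drive_MBps * 1024 / req_KiB  # ~26 requests/s per drive
  print(f"{per_drive_MBps:.1f} MB/s, ~{reqs_per_sec:.0f} requests/s per drive")

A couple of dozen requests per second per drive leaves room
for each request to cost a few seek-bound operations, which is
roughly what RMW plus concurrent reads on a seek-bound disk
would imply.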
Also, if this testing was appropriate, that is because the
intended workload was indeed concurrent reads and writes to
the object store.
It is not a mere assumption in the general case either; it
is both commonly observed and a simple deduction, because of
the nature of distributed filesystems and in particular parallel
HPC ones like Lustre or BeeGFS, but also AFS and even NFS ones.
* Clients have caches. Therefore most of the locality in the
(read) access patterns will hopefully be filtered out by the
client cache. This applies (ideally) to any distributed
filesystem.
* HPC/parallel servers tend to have many clients (e.g. for a
large installation it could be 10,000 clients and 500 object
storage servers) and hopefully each client works on a
different subset of the data tree, and the distribution of
data objects onto servers is hopefully random.
Therefore it is likely that many clients will concurrently
read and write many different files on the same server,
resulting in many random "hotspots" in each server's load.
Note that each client could be doing entirely sequential IO
to each file it accesses, but the concurrent accesses to
possibly widely scattered files will turn that into random IO
at the server level.
Just about the only case where sequential client workloads don't
become random workloads at the server is when the client
workload is such that only one file is "hot" per server.
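As a concrete (if very simplistic) illustration of that
deduction, here is a little model; it is entirely my own
sketch with made-up numbers (50 clients, contiguously laid out
files, strict round-robin service), not a measurement of
BeeGFS or any other filesystem:

  # Each of N_CLIENTS reads its own file strictly sequentially; the
  # server services the clients round-robin. How often is the block
  # address the server sees adjacent to the previous one it served?
  FILE_BLOCKS = 1000
  N_CLIENTS = 50

  # Each file is contiguous on disk, but the files are far apart.
  file_start = {c: c * FILE_BLOCKS * 10 for c in range(N_CLIENTS)}
  progress = dict.fromkeys(range(N_CLIENTS), 0)

  prev = None
  adjacent = total = 0
  for _ in range(FILE_BLOCKS):              # rounds of service
      for c in range(N_CLIENTS):            # round-robin over the clients
          addr = file_start[c] + progress[c]
          progress[c] += 1
          if prev is not None:
              total += 1
              if addr == prev + 1:          # adjacent to the last block served
                  adjacent += 1
          prev = addr

  print(f"sequential accesses as seen by the server: {adjacent}/{total} "
        f"({100.0 * adjacent / total:.2f}%)")

The answer is essentially never, even though every single
client stream is 100% sequential, which is why per-client
sequentiality says very little about what the server's disks
experience.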
There is an additional issue favouring random access patterns:
* Typically large fileservers are set up with a lot of storage
because of anticipated lifetime usage, so they start mostly
empty.
* Most filesystems then allocate new data in regular patterns,
often starting from the beginning of available storage, in
an attempt to minimize arm travel time (XFS uses various
heuristics, which are somewhat different depending on whether
the option 'inode64' is specified or not).
* Unfortunately as the filetree becomes larger new allocations
have to be made farther away, resulting in longer travel
times and more apparent randomness at the storage server
level.
* Eventually, if the object server reaches a steady state
where roughly as much data is deleted as is created, the free
storage areas will become widely scattered, leading to
essentially random allocation, and the more of the capacity
is used the more random it gets (a toy simulation of this is
sketched just below).
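To make that aging effect concrete, here is a toy free-space
model; it is entirely my own illustration, with a crude
"first fit" policy and made-up sizes, not a description of
XFS's actual allocator:

  import random

  # Toy volume: starts empty, gets filled to ~70%, then goes through
  # steady-state churn (free one random block, allocate a new one).
  BLOCKS = 5_000
  free = [True] * BLOCKS

  def first_fit():
      # Allocate the lowest-numbered free block (a crude locality heuristic).
      for b in range(BLOCKS):
          if free[b]:
              free[b] = False
              return b
      raise RuntimeError("volume full")

  def avg_gap(blocks):
      # Average distance between consecutively allocated blocks.
      return sum(abs(b - a) for a, b in zip(blocks, blocks[1:])) / (len(blocks) - 1)

  # Phase 1: fill ~70% of the volume while it is still pristine.
  allocated = [first_fit() for _ in range(int(BLOCKS * 0.7))]
  print(f"young volume, avg gap between consecutive allocations: {avg_gap(allocated):7.1f}")

  # Phase 2: steady-state churn.
  random.seed(0)
  churn = []
  for _ in range(5_000):
      victim = allocated.pop(random.randrange(len(allocated)))
      free[victim] = True
      block = first_fit()                   # reuses whatever was just freed
      allocated.append(block)
      churn.append(block)
  print(f"aged volume,  avg gap between consecutive allocations: {avg_gap(churn):7.1f}")

On the young, mostly-empty volume consecutive allocations are
adjacent; once churn sets in, each new allocation lands
wherever something happened to be deleted, so consecutive
allocations end up scattered over the whole used region, over
a thousand blocks apart on this toy volume.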
Leaving a significant percentage of capacity free, like at
least 10% and more like 20%, greatly increases the chance of
finding free space to put new data near to existing "related"
data. This increases locality, but only at the single-stream
level; therefore it usually does not help that much on widely
shared distributed servers; and in particular it does not
apply that much to object stores, because usually they
obscure which data object is related to which data object.
The above issues are pretty much "network and distributed
filesystems for beginners" notes, but in significant part they
also apply to the widely shared non-network, non-distributed
servers on which XFS is often used, so they may be usefully
mentioned on this list.