Discussion: allocsize mount option
Gim Leong Chin
2010-01-11 17:25:14 UTC
Hi,

Mount options for xfs
allocsize=size
Sets the buffered I/O end-of-file preallocation size when doing delayed allocation writeout (default size is 64KiB).
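
(For reference, allocsize is passed like any other mount option - e.g. something along these lines, where the device and mount point are just placeholders:

# mount -t xfs -o allocsize=64m /dev/sdb1 /data

or the equivalent entry in /etc/fstab.)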


I read that setting allocsize to a big value can be used to combat filesystem fragmentation when writing big files.

I do not understand how allocsize works. Say I set allocsize=1g, but my file size is only 1 MB or even smaller. Will the rest of the 1 GB file extent be allocated, resulting in wasted space and even a file fragmentation problem?

Does setting allocsize to a big value result in performance gain when writing big files? Is performance hurt by a big value setting when writing files smaller than the allocsize value?

I am setting up a system for HPC, where two different applications have different file size characteristics: one writes files of several GB up to 128 GB, the other writes files of a few MB to tens of MB.

I am not able to find documentation on the behaviour of the allocsize mount option.

Thank you.


Chin Gim Leong


Eric Sandeen
2010-01-11 18:16:08 UTC
Post by Gim Leong Chin
Hi,
Mount options for xfs: allocsize=size Sets the buffered I/O end-of-file preallocation size when doing delayed allocation writeout (default size is 64KiB).
I read that setting allocsize to a big value can be used to combat filesystem fragmentation when writing big files.
That's not universally necessary though, depending on how you are writing them. I've only used it in the very specific case of mythtv calling "sync" every couple seconds, and defeating delalloc.
Post by Gim Leong Chin
I do not understand how allocsize works. Say I set allocsize=1g, but my file size is only 1 MB or even smaller. Will the rest of the 1 GB file extent be allocated, resulting in wasted space and even a file fragmentation problem?
possibly :) It's only speculatively allocated, though, so you won't have 1g for every file; when it's closed the preallocation goes away, IIRC.
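E.g. a rough way to watch it while a writer still has the file open (just a sketch; "bigfile" stands for whatever file is being written):

$ ls -l bigfile    # logical file size
$ du -k bigfile    # space actually reserved on disk

While the file is held open, du will typically report more space than the file size suggests because of the EOF preallocation; once the file is closed the extra space gets trimmed back off.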
Post by Gim Leong Chin
Does setting allocsize to a big value result in performance gain when writing big files? Is performance hurt by a big value setting when writing files smaller than the allocsize value?
I am setting up a system for HPC, where two different applications have different file size characteristics: one writes files of several GB up to 128 GB, the other writes files of a few MB to tens of MB.
We should probably back up and say: are you seeing fragmentation problems -without- the mount option, and if so, what is your write pattern?

-Eric
Gim Leong Chin
2010-01-13 09:42:16 UTC
Hi,


The application is ANSYS, which writes 128 GB files.  The existing computer used for running ANSYS runs SUSE Linux Enterprise Desktop 11 and has two software RAID 0 devices made up of five 1 TB drives.  The /home partition is 4.5 TB, and it is now 4 TB full.  I see fragmentation > 19%.


I have just set up a new computer with 16 WD Caviar Black 1 TB drives connected to an Areca 1680ix-16 RAID card with 4 GB cache.  14 of these drives are in RAID 6 with 128 kB stripes.  The OS is also SLED 11.  The system has 16 GB memory and an AMD Phenom II X4 965 CPU.

I have done tests writing 100 x 30 MB files, and 1 GB, 10 GB and 20 GB files, with single and multiple instances.

There is a big difference in write speed for 20 GB files between using allocsize=1g and not using the option.  That is without the inode64 option, which gives further speed gains.

I use dd for writing the 1 GB, 10 GB and 20 GB files.

mkfs.xfs -f -b size=4k -d agcount=32,su=128k,sw=12 -i size=256,align=1,attr=2 -l version=2,su=128k,lazy-count=1 -n version=2 -s size=512 -L /data /dev/sdb1


defaults,nobarrier,usrquota,grpquota,noatime,nodiratime,allocsize=1g,logbufs=8,logbsize=256k,largeio,swalloc

The start of the partition has been set to LBA 3072 using GPT Fdisk to align the stripes.
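
To double-check that the stripe geometry actually ended up in the filesystem, something like the following can be used (assuming the filesystem is mounted at /data):

$ xfs_info /data

The sunit/swidth values in the output should correspond to su=128k, sw=12 (i.e. sunit=32 blks, swidth=384 blks with 4 kB blocks).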

The dd command is:

***@tsunami:/data/test/t2> dd if=/dev/zero of=bigfile20GB bs=1073741824 count=20

Single instance of 20 GB dd repeats were 214, 221, 123 MB/s with allocsize=1g, compared to 94, 126 MB/s without.

Two instances of 20 GB dd repeats were aggregate 331, 372 MB/s with allocsize=1g, compared to 336, 296 MB/s without.

Three instances of 20 GB dd was aggregate 400 MB/s with, 326 MB/s without.

Six instances of 20 GB dd was 606 MB/s with, 473 MB/s without.


My production configuration is

defaults,nobarrier,usrquota,grpquota,noatime,nodiratime,allocsize=1g,logbufs=8,logbsize=256k,largeio,swalloc,inode64

for which I got up to 297 MB/s for single instance 20 GB dd.



Chin Gim Leong
Subject: Re: allocsize mount option
Date: Tuesday, 12 January, 2010, 2:16 AM
Hi,
Mount options for xfs allocsize=size Sets  the
buffered I/O
end-of-file preallocation size when doing delayed
allocation writeout
(default size is 64KiB).
I read that setting allocsize to a big value can be
used to combat
filesystem fragmentation when writing big files.
That's not universally necessary though, depending on how
you are
writing them.  I've only used it in the very specific
case of mythtv
calling "sync" every couple seconds, and defeating
delalloc.
I do not understand how allocsize works.  Say I
set allocsize=1g, but
my file size is only 1 MB or even smaller.  Will
the rest of the 1 GB
file extent be allocated, resulting in wasted space
and even file
fragmentation problem?
possibly :)  It's only speculatively allocated,
though, so you won't
have 1g for every file; when it's closed the preallocation
goes
away, IIRC.
Does setting allocsize to a big value result in
performance gain when
writing big files?  Is performance hurt by a big
value setting when
writing files smaller than the allocsize value?
I am setting up a system for HPC, where two different
applications
have different file size characteristics, one writes
files of GBs and
even 128 GB, the other is in MBs to tens of MBs.
We should probably back up and say:  are you seeing
fragmentation
problems -without- the mount option, and if so, what is
your write pattern?
-Eric
I am not able to find documentation on the behaviour
of allocsize
mount option.
Thank you.
Chin Gim Leong
New Email names for you! Get the Email name you've
always wanted
someone else does!
http://mail.promotions.yahoo.com/newdomains/sg/
_______________________________________________ xfs
mailing list
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs
New Email addresses available on Yahoo!
Get the Email name you've always wanted on the new @ymail and @rocketmail.
Hurry before someone else does!
http://mail.promotions.yahoo.com/newdomains/sg/
Dave Chinner
2010-01-13 10:50:18 UTC
Post by Gim Leong Chin
Hi,
The application is ANSYS, which writes 128 GB files.  The existing computer used for running ANSYS runs SUSE Linux Enterprise Desktop 11 and has two software RAID 0 devices made up of five 1 TB drives.  The /home partition is 4.5 TB, and it is now 4 TB full.  I see fragmentation > 19%.
XFS will start to fragment when the filesystem gets beyond 85%
full - it seems that you are very close to that threshold.

That being said, if you've pulled the figure of 19% from the xfs_db
measure of fragmentation, that doesn't mean the filesystem is badly
fragmented, it just means that there are 19% more fragments
than the ideal. In 4TB of data with 1GB sized files, that would mean
there are 4800 extents (average length ~800MB, which is excellent)
instead of the perfect 4000 extents (@1GB each). Hence you can see
how misleading this "19% fragmentation" number can be on an extent
based filesystem...
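
(For reference, that number typically comes from something along the lines of

# xfs_db -r -c frag /dev/sdb1

which prints the actual and ideal extent counts and the percentage derived from them; the device name here is just an example.)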
Post by Gim Leong Chin
I have just set up a new computer with 16 WD Caviar Black 1 TB drives connected to an Areca 1680ix-16 RAID card with 4 GB cache.  14 of these drives are in RAID 6 with 128 kB stripes.  The OS is also SLED 11.  The system has 16 GB memory and an AMD Phenom II X4 965 CPU.
I have done tests writing 100 x 30 MB files, and 1 GB, 10 GB and 20 GB files, with single and multiple instances.
There is a big difference in write speed for 20 GB files between using allocsize=1g and not using the option.  That is without the inode64 option, which gives further speed gains.
I use dd for writing the 1 GB, 10 GB and 20 GB files.
mkfs.xfs -f -b size=4k -d agcount=32,su=128k,sw=12 -i size=256,align=1,attr=2 -l version=2,su=128k,lazy-count=1 -n version=2 -s size=512 -L /data /dev/sdb1
defaults,nobarrier,usrquota,grpquota,noatime,nodiratime,allocsize=1g,logbufs=8,logbsize=256k,largeio,swalloc
The start of the partition has been set to LBA 3072 using GPT Fdisk to align the stripes.
This all looks good - it certainly seems that you have done your
research. ;) The only thing I'd do differently is that if you have
only one partition on the drives, I wouldn't even put a partition on it.
I'd significantly reduce the size of that dd buffer (bs=1073741824, i.e. 1 GiB) - too large a buffer can slow down IO due to the memory it consumes and the TLB misses it causes. I'd typically use something like:

$ dd if=/dev/zero of=bigfile bs=1024k count=20k

Which does 20,480 (20k) writes of 1MB each and ensures the dd process doesn't consume over a GB of RAM.
Post by Gim Leong Chin
Single instance of 20 GB dd repeats were 214, 221, 123 MB/s with
allocsize=1g, compared to 94, 126 MB/s without.
This seems rather low for a buffered write on hardware that can
clearly go faster. SLED11 is based on 2.6.27, right? I suspect that
many of the buffered writeback issues that have been fixed since
2.6.30 are present in the SLED11 kernel, and if that is the case I
can see why the allocsize mount option makes such a big
difference.

It might be worth checking how well direct IO writes run to take
buffered writeback issues out of the equation. In that case, I'd use
stripe width multiple sized buffers like:

$ dd if=/dev/zero of=bigfile bs=3072k count=7k oflag=direct

I'd suggest that you might need to look at increasing the maximum IO
size for the block device (/sys/block/sdb/queue/max_sectors_kb),
maybe the request queue depth as well to get larger IOs to be pushed
to the raid controller. If you can, at least get it to the stripe
width of 1536k....
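
E.g. something along these lines (a sketch only; sdb and the values are examples to adapt to your setup):

# cat /sys/block/sdb/queue/max_hw_sectors_kb        # upper limit the hardware allows
# echo 1536 > /sys/block/sdb/queue/max_sectors_kb   # allow full stripe width IOs
# echo 512 > /sys/block/sdb/queue/nr_requests       # deeper block layer request queue

max_sectors_kb cannot be raised past max_hw_sectors_kb, so check that first.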

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Gim Leong Chin
2010-01-14 17:25:15 UTC
Hi Dave,
Post by Dave Chinner
That being said, if you've pulled the figure of 19% from the xfs_db measure of fragmentation, that doesn't mean the filesystem is badly fragmented, it just means that there are 19% more fragments than the ideal. In 4TB of data with 1GB sized files, that would mean there are 4800 extents (average length ~800MB, which is excellent) instead of the perfect 4000 extents (@1GB each). Hence you can see how misleading this "19% fragmentation" number can be on an extent based filesystem...
There are many files that are 128 GB.

When I did the tests with dd on this computer, the 20 GB files had up to > 50 extents.
Post by Dave Chinner
This all looks good - it certainly seems that you have done your research. ;) The only thing I'd do differently is that if you have only one partition on the drives, I wouldn't even put a partition on it.
I just learnt from you that I can have a filesystem without a partition table!  That takes care of having to calculate the start of the partition!  Are there any other benefits?  But are there any down sides to not having a partition table?
Post by Dave Chinner
I'd significantly reduce the size of that dd buffer (bs=1073741824, i.e. 1 GiB) - too large a buffer can slow down IO due to the memory it consumes and the TLB misses it causes. I'd typically use something like:
$ dd if=/dev/zero of=bigfile bs=1024k count=20k
Which does 20,480 (20k) writes of 1MB each and ensures the dd process doesn't consume over a GB of RAM.
I did try with 1 MB.  I have attached the raw test result file.  As you can see from line 261, in writing 10 GB with bs=1MB, the speed was no faster two out of three times, so I dropped it.  I could re-try that next time.
Post by Dave Chinner
This seems rather low for a buffered write on hardware that can clearly go faster. SLED11 is based on 2.6.27, right? I suspect that many of the buffered writeback issues that have been fixed since 2.6.30 are present in the SLED11 kernel, and if that is the case I can see why the allocsize mount option makes such a big difference.
Is it possible for the fixes in the 2.6.30 kernel to be backported to the 2.6.27 kernel in SLE 11?
If so, I would like to open a service request to Novell to do that to fix the performance issues in SLE 11.
Post by Dave Chinner
It might be worth checking how well direct IO writes run to take buffered writeback issues out of the equation. In that case, I'd use stripe width multiple sized buffers like:
$ dd if=/dev/zero of=bigfile bs=3072k count=7k oflag=direct
I would like to do that tomorrow when I go back to work, but on my openSUSE 11.1 AMD Turion RM-74 notebook with 2.6.27.39-0.2-default kernel, on the system WD Scorpio Black 7200 RPM drive, I get 62 MB/s with dd bs=1GB for writing 20 GB file with Direct IO, and 56 MB/s without Direct IO.  You are on to something!

As for the hardware performance potential, see below.
Post by Dave Chinner
I'd suggest that you might need to look at increasing the maximum IO size for the block device (/sys/block/sdb/queue/max_sectors_kb), maybe the request queue depth as well to get larger IOs to be pushed to the raid controller. If you can, at least get it to the stripe width of 1536k....
Could you give a good reference for performance tuning of these parameters?  I am at a total loss here.


As seen from the results file, I have tried different configurations of RAID 0, 5 and 6, with different number of drives.  I am pretty confused by the results I see, although only the 20 GB file writes were done with allocsize=1g. I also did not lock the CPU frequency governor at the top clock except for the RAID 6 tests.

I decided on the allocsize=1g after checking that the multiple instance 30 MB writes have only one extent for each file, without holes or unused space.
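
(I checked the extent counts with xfs_bmap, roughly like this, where the filename is just one of the test files:

$ xfs_bmap -v test030.dat

one numbered line per extent, so a single data line means the file sits in a single extent.)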

It appears that RAID 6 writes are faster than RAID 5!  And RAID 6 can even match RAID 0!  The system seems to thrive on throughput, when doing multiple instances of writes, for getting high aggregate bandwidth.

I will put the performance potential of the system in context by giving some details.

The system has four Kingston DDR2-800 MHz CL6 4 GB unbuffered ECC DIMMs, set to unganged mode, so each thread has up to 6.4 GB of memory bandwidth, from one of two independent memory channels.

The AMD Phenom II X4 965 has three levels of cache, and data from memory goes directly to the L1 caches. The four cores have dedicated L1 and L2 caches, and a shared 6 MB L3. Thread switching will result in cache misses if more than four threads are running.

The IO through the HyperTransport 3.0 from CPU to the AMD 790FX chipset is at 8 GB/s.  The Areca ARC-1680ix-16 is PCI-E Gen 1 x8, so the maximum bandwidth is 2 GB/s.  The cache is Kingston DDR-667 CL5 4 GB unbuffered ECC, although it runs at 533 MHz, so the maximum bandwidth is 4.2 GB/s.  The Intel IOP 348 1200 MHz on the card has two cores.

There are sixteen WD Caviar Black 1 TB drives in the Lian-Li PC-V2110 chassis.  For the folks reading this, please do not follow this set-up, as the Caviar Blacks are a mistake.  WD quietly disabled the WD time limited error recovery utility on Caviar Black drives manufactured since September 2009, so I have an array of drives that can pop out of the RAID at any time if I am unlucky, and I got screwed here.

There is a battery back-up module for the cache, and the drive caches are disabled.  Tests run with the drive caches enabled showed quite some bit of speed up in RAID 0.

We previously did tests of the Caviar Black 1 TB writing 100 MB chunks to the device without a file system, with the drive connected to the SATA ports on a Tyan Opteron motherboard with nVidia nForce 4 Professional chipset.  With the drive cache disabled, the sequential write speed was 30+ MB/s if I remember correctly, versus sub 100 MB/s with cache enabled.  That is a big fall-off in speed, and that was writing at the outer diameter of the platter; speed would be halved at the inner diameter.  It seems the controller firmware is meant to work with cache enabled for proper functioning.

The desktop Caviar Black also does not have rotary vibration compensation, unlike the Caviar RE nearline drives.  WD has a document showing the performance difference having rotary vibration compensation makes.  I am not trying to save pennies here, but the local distributor refuses to bring in the Caviar REs, and I am stuck in no man's land.

The system has sixteen hard drives, and ten fans of different sizes and purposes in total, so that is quite a bit of rotary vibration, which I can feel when I place my hand on the side panels.  I really do not know how badly the drive performance suffers as a result.  The drives are attached with rubber dampers on the mounting screws.

I did the 20 GB dd test on the RAID 1 system drive, also with XFS, and got 53 MB/s with disabled drive caches, 63 MB/s enabled.  That is pretty disappointing, but in light of all the above considerations, plus the kernel buffer issues, I do not really know if that is a good figure.

NCQ is enabled at depth 32.  NCQ should cause performance loss for single writes, but gains for multiple writes.

Areca has a document showing that this card can do 800 MB/s in RAID 6 with Seagate nearline drives, with the standard 512 MB cache.  That is in Windows Server.  I do not know if the caches are disabled.  The benchmark is Iometer workstation sequential write.  Iometer requires Windows for the front end, which causes me great difficulties, so I gave up trying to figure it out, and I do not understand what the workstation test does.  However, in writing 30 MB files, I already exceed 1 GB/s.
Eric Sandeen
2010-01-14 17:42:11 UTC
Post by Gim Leong Chin
Post by Dave Chinner
That being said, if you've pulled the figure of 19% from the xfs_db measure of fragmentation, that doesn't mean the filesystem is badly fragmented, it just means that there are 19% more fragments than the ideal. In 4TB of data with 1GB sized files, that would mean there are 4800 extents (average length ~800MB, which is excellent) instead of the perfect 4000 extents (@1GB each). Hence you can see how misleading this "19% fragmentation" number can be on an extent based filesystem...
There are many files that are 128 GB.
When I did the tests with dd on this computer, the 20 GB files had up to > 50 extents.
which is at least 400MB per extent, which is really not so bad.

-Eric
Dave Chinner
2010-01-14 23:28:09 UTC
Post by Gim Leong Chin
Hi Dave,
Post by Dave Chinner
That being said, if you've pulled the figure of 19% from the xfs_db measure of fragmentation, that doesn't mean the filesystem is badly fragmented, it just means that there are 19% more fragments than the ideal. In 4TB of data with 1GB sized files, that would mean there are 4800 extents (average length ~800MB, which is excellent) instead of the perfect 4000 extents (@1GB each). Hence you can see how misleading this "19% fragmentation" number can be on an extent based filesystem...
There are many files that are 128 GB.
When I did the tests with dd on this computer, the 20 GB files had up to > 50 extents.
That's still an average of 400MB extents, which is more than large
enough to guarantee optimal disk bandwidth when reading or writing them
on your setup....
Post by Gim Leong Chin
Post by Dave Chinner
This all looks good - it certainly seems that you have done your research. ;) The only thing I'd do differently is that if you have only one partition on the drives, I wouldn't even put a partition on it.
I just learnt from you that I can have a filesystem without a partition table!  That takes care of having to calculate the start of the partition!  Are there any other benefits?  But are there any down sides to not having a partition table?
That's the main benefit, though there are others like no limit on
the partition size (e.g. msdos partitions are a max of 2TB) but
you avoided most of those problems by using GPT labels.

There aren't any real downsides that I am aware of, except
maybe that future flexibility of the volume is reduced. e.g.
if you grow the volume, then you can still only have one filesystem
on it....
Post by Gim Leong Chin
Post by Dave Chinner
This seems rather low for a buffered write on hardware that can clearly go faster. SLED11 is based on 2.6.27, right? I suspect that many of the buffered writeback issues that have been fixed since 2.6.30 are present in the SLED11 kernel, and if that is the case I can see why the allocsize mount option makes such a big difference.
Is it possible for the fixes in the 2.6.30 kernel to be backported to the 2.6.27 kernel in SLE 11?
If so, I would like to open a service request to Novell to do that to fix the performance issues in SLE 11.
You'd have to get all the fixes from 2.6.30 to 2.6.32, and the
backport would be very difficult to get right. Better would
be just to upgrade the kernel to 2.6.32 ;)
Post by Gim Leong Chin
Post by Dave Chinner
I'd suggest that you might need to look at increasing the maximum IO size for the block device (/sys/block/sdb/queue/max_sectors_kb), maybe the request queue depth as well to get larger IOs to be pushed to the raid controller. If you can, at least get it to the stripe width of 1536k....
Could you give a good reference for performance tuning of these parameters?  I am at a total loss here.
Welcome to the black art of storage subsystem tuning ;)

I'm not sure there is a good reference for tuning the block device
parameters - most of what I know was handed down by word of mouth
from gurus on high mountains.

The overriding principle, though, is to try to ensure that the
stripe width sized IOs can be issued right through the IO stack to
the hardware, and that those IOs are correctly aligned to the
stripes. You've got the filesystem configuration and layout part
correct, now it's just tuning the block layer to pass the IO's
through.

I'd be looking in the Documentation/block directory
of the kernel source and googling for other documentation....
Post by Gim Leong Chin
As seen from the results file, I have tried different configurations of RAID 0, 5 and 6, with different number of drives.  I am pretty confused by the results I see, although only the 20 GB file writes were done with allocsize=1g.  I also did not lock the CPU frequency governor at the top clock except for the RAID 6 tests.
FWIW, your tests are not timing how long it takes for all the
data to hit the disk, only how long it takes to get into cache.
You really need to do for single threads:

$ time (dd if=/dev/zero of=<file> bs=XXX count=YYY; sync)

and something like this for multiple (N) threads:

time (
    for i in `seq 1 N`; do
        dd if=/dev/zero of=<file>.$i bs=XXX count=YYY &
    done
    wait
    sync
)

And that will give you a much more accurate measure across all file
sizes of the throughput rate. You'll need to manually calculate
the rate from the output of the time command and the amount of
data that the test runs.

Or, alternatively, you could just use direct IO which avoids
such cache effects by bypassing it....
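
(E.g., purely for illustration: if six dd threads each write 20 GiB and the whole run, including the final sync, takes 200 seconds, the aggregate rate is 6 x 20480 MiB / 200 s, i.e. roughly 614 MiB/s.)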
Post by Gim Leong Chin
I decided on the allocsize=1g after checking that the multiple instance 30 MB writes have only one extent for each file, without holes or unused space.
It appears that RAID 6 writes are faster than RAID 5!  And RAID 6 can even match RAID 0!  The system seems to thrive on throughput, when doing multiple instances of writes, for getting high aggregate bandwidth.
Given my above comments, that may not be true.

[....]
Post by Gim Leong Chin
We previously did tests of the Caviar Black 1 TB writing 100 MB chunks to the device without a file system, with the drive connected to the SATA ports on a Tyan Opteron motherboard with nVidia nForce 4 Professional chipset.  With the drive cache disabled, the sequential write speed was 30+ MB/s if I remember correctly, versus sub 100 MB/s with cache enabled.  That is a big fall-off in speed, and that was writing at the outer diameter of the platter; speed would be halved at the inner diameter.  It seems the controller firmware is meant to work with cache enabled for proper functioning.
That sounds wrong - it sounds like NCQ is not functioning properly
as with NCQ enabled, disabling the drive cache should not impact
throughput at all....

FWIW, for SAS and SCSI drives, I recommend turning the drive caches
off as the impact of filesystem issued barrier writes on performance
is worse than disabling the drive caches....
Post by Gim Leong Chin
The desktop Caviar Black also does not have rotary vibration compensation, unlike the Caviar RE nearline drives.  WD has a document showing the performance difference having rotary vibration compensation makes.  I am not trying to save pennies here, but the local distributor refuses to bring in the Caviar REs, and I am stuck in no man's land.
I'd suggest trying to find another distributor that will bring them
in for you. Putting that many drives in a single chassis is almost
certainly going to cause vibration problems, especially if you get
all the disk heads moving in close synchronisation (which is what
happens when you get all your IO sizing and alignment right).

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Gim Leong Chin
2010-01-15 03:08:22 UTC
Hi Dave,

Thank you for the advice!

I have done Direct IO dd tests writing the same 20 GB files.  The results are an eye opener!  (bs=1GB, count=20)

Single instance repeats of 830, 800 MB/s, compared to >100 to under 300 MB/s for buffered.

Two instances aggregate of 304 MB/s, six instances aggregate of 587 MB/s.

The system drive /home RAID 1 got 130 MB/s, compared to 51 MB/s buffered.

So the problem is with the buffered writes.
Post by Dave Chinner
You'd have to get all the fixes from 2.6.30 to 2.6.32, and the backport would be very difficult to get right. Better would be just to upgrade the kernel to 2.6.32 ;)
If I change the kernel, I would have no support from Novell.  I would try my luck and convince them.
Post by Dave Chinner
Post by Gim Leong Chin
Post by Dave Chinner
I'd suggest that you might need to look at increasing the maximum IO size for the block device (/sys/block/sdb/queue/max_sectors_kb), maybe the request queue depth as well to get larger IOs to be pushed to the raid controller. If you can, at least get it to the stripe width of 1536k....
Could you give a good reference for performance tuning of these parameters?  I am at a total loss here.
Welcome to the black art of storage subsystem tuning ;)
I'm not sure there is a good reference for tuning the block device parameters - most of what I know was handed down by word of mouth from gurus on high mountains.
The overriding principle, though, is to try to ensure that the stripe width sized IOs can be issued right through the IO stack to the hardware, and that those IOs are correctly aligned to the stripes. You've got the filesystem configuration and layout part correct, now it's just tuning the block layer to pass the IO's through.
Can I confirm that /sys/block/sdb/queue/max_sectors_kb should be set to the stripe width of 1536 kB?

Which parameter is the "request queue depth"?  What should the value be?
Post by Dave Chinner
FWIW, your tests are not timing how long it takes for all the data to hit the disk, only how long it takes to get into cache.
Thank you! I do know that XFS buffers writes extensively. The drive LEDs remain lighted long after the OS says the writes are completed. Plus some timings are physically impossible.
Post by Dave Chinner
That sounds wrong - it sounds like NCQ is not functioning properly as with NCQ enabled, disabling the drive cache should not impact throughput at all....
I do not remember clearly if NCQ is available on that motherboard (it runs 32-bit Ubuntu), but I do remember seeing the queue depth in the kernel.  I will check it out next week.

But what I read is that NCQ hurts single write performance. That is also what I found with another Areca SATA RAID in Windows XP.

What I found with all the drives we tested was that disabling the cache badly hurt sequential write performance (no file system, write data directly to designated LBA).
Post by Dave Chinner
I'd suggest trying to find another distributor that will bring them in for you. Putting that many drives in a single chassis is almost certainly going to cause vibration problems, especially if you get all the disk heads moving in close synchronisation (which is what happens when you get all your IO sizing and alignment right).
I am working on changing to the WD Caviar RE4 drives. Not sure if I can pull it off.



Chin Gim Leong


