Hi Dave,
Post by Dave Chinner
fragmented, it just means that there are 19% more fragments
than the ideal. In 4TB of data with 1GB sized files, that would mean
there are 4800 extents (average length ~800MB, which is excellent).
You can see how misleading this "19% fragmentation" number can be on
an extent based filesystem...
There are many files that are 128 GB.
When I did the tests with dd on this computer, the 20 GB files had as many as 50+ extents.
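In case it helps anyone following along, the extent counts and the fragmentation factor can be checked roughly like this (the device and file names are just placeholders for my setup):

$ xfs_bmap -v /raid/bigfile     # list the extents of a single file
$ xfs_db -r -c frag /dev/sdb    # report the filesystem-wide fragmentation factor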
Post by Dave Chinner
This all looks good - it certainly seems that you have done your
research. ;) The only thing I'd do differently is that if you have
only one partition on the drives, I wouldn't even put a partition on
it.
I just learnt from you that I can have a filesystem without a partition table! That takes care of having to calculate the start of the partition! Are there any other benefits? And are there any downsides to not having a partition table?
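If I understand you correctly, that means making the filesystem directly on the whole device and mounting it, something like this (/dev/sdb and /raid are just placeholders for my array and mount point):

$ mkfs.xfs /dev/sdb
$ mount /dev/sdb /raid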
Post by Dave Chinner
I'd significantly reduce the size of that buffer - too large a
buffer can slow down IO due to the memory it consumes and TLB misses
$ dd if=/dev/zero of=bigfile bs=1024k count=20k
Which does 20,000 writes of 1MB each and ensures the dd process
doesn't consume over a GB of RAM.
I did try with 1 MB. I have attached the raw test result file. As you can see from line 261, when writing 10 GB with bs=1MB the speed was no faster in two out of three runs, so I dropped it. I could re-try that next time.
Post by Dave Chinner
This seems rather low for a buffered write on hardware that can
clearly go faster. SLED11 is based on 2.6.27, right? I suspect that
many of the buffered writeback issues that have been fixed since
2.6.30 are present in the SLED11 kernel, and if that is the case I
can see why the allocsize mount option makes such a big difference.
Is it possible for the buffered writeback fixes from 2.6.30 onwards to be backported to the 2.6.27 kernel in SLE 11?
If so, I would like to open a service request with Novell to have that done, to fix the performance issues in SLE 11.
Post by Dave Chinner
It might be worth checking how well direct IO writes run to take
buffered writeback issues out of the equation. In that case, I'd use
$ dd if=/dev/zero of=bigfile bs=3072k count=7k oflag=direct
I would like to do that tomorrow when I go back to work. However, on my openSUSE 11.1 AMD Turion RM-74 notebook with the 2.6.27.39-0.2-default kernel, writing a 20 GB file with dd bs=1GB to the system WD Scorpio Black 7200 RPM drive, I get 62 MB/s with direct IO and 56 MB/s without it. You are on to something!
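To be precise, the notebook runs were essentially this pair of commands, with and without direct IO on the same 20 GB file:

$ dd if=/dev/zero of=bigfile bs=1GB count=20 oflag=direct
$ dd if=/dev/zero of=bigfile bs=1GB count=20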
As for the hardware performance potential, see below.
Post by Dave Chinner
I'd suggest that you might need to look at increasing the maximum IO
size for the block device (/sys/block/sdb/queue/max_sectors_kb),
maybe the request queue depth as well to get larger IOs to be pushed
to the raid controller. if you can, at least get it to the stripe
width of 1536k....
Could you give a good reference for performance tuning of these parameters? I am at a total loss here.
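In the meantime, my guess from your description is that the knobs are poked roughly like this; the values below are pure guesses on my part, not something I have tested, and I understand max_sectors_kb cannot be raised above max_hw_sectors_kb:

$ cat /sys/block/sdb/queue/max_sectors_kb          # current maximum IO size
$ echo 1536 > /sys/block/sdb/queue/max_sectors_kb  # try to match the 1536k stripe width
$ cat /sys/block/sdb/queue/nr_requests             # current request queue depth
$ echo 512 > /sys/block/sdb/queue/nr_requests      # allow more requests to queue up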
As seen from the results file, I have tried different configurations of RAID 0, 5 and 6, with different numbers of drives. I am pretty confused by the results I see, although only the 20 GB file writes were done with allocsize=1g. I also did not lock the CPU frequency governor at the top clock, except for the RAID 6 tests.
I decided on allocsize=1g after checking that the multiple-instance 30 MB writes produce only one extent per file, without holes or unused space.
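For completeness, this is the kind of thing I mean: mount with the allocsize option and then inspect each file's layout with xfs_bmap -v as above (the device, mount point and file name are placeholders):

$ mount -o allocsize=1g /dev/sdb /raid
$ xfs_bmap -v /raid/testfile.0   # expect a single extent and no holes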
It appears that RAID 6 writes are faster than RAID 5! And RAID 6 can even match RAID 0! The system seems to thrive on multiple instances of writes, which is how it gets its high aggregate bandwidth.
I will put the performance potential of the system in context by giving some details.
The system has four Kingston DDR2-800 MHz CL6 4 GB unbuffered ECC DIMMs, set to unganged mode, so each thread has up to 6.4 GB/s of memory bandwidth from one of two independent memory channels.
The AMD Phenom II X4 965 has three levels of cache, and data from memory goes directly to the L1 caches. The four cores have dedicated L1 and L2 caches, and a shared 6 MB L3. Thread switching will result in cache misses if more than four threads are running.
The IO through the HyperTransport 3.0 from CPU to the AMD 790FX chipset is at 8 GB/s. The Areca ARC-1680ix-16 is PCI-E Gen 1 x8, so the maximum bandwidth is 2 GB/s. The cache is Kingston DDR-667 CL5 4 GB unbuffered ECC, although it runs at 533 MHz, so the maximum bandwidth is 4.2 GB/s. The Intel IOP 348 1200 MHz on the card has two cores.
There are sixteen WD Caviar Black 1 TB drives in the Lian-Li PC-V2110 chassis. For the folks reading this, please do not copy this set-up, as the Caviar Blacks are a mistake: WD quietly disabled the time limited error recovery (TLER) utility on Caviar Black drives manufactured from September 2009 onwards, so I have an array of drives that can drop out of the RAID at any time if I am unlucky. I got screwed here.
There is a battery back-up module for the controller cache, and the drive caches are disabled. Tests run with the drive caches enabled showed quite a bit of speed-up in RAID 0.
We previously did tests of the Caviar Black 1 TB writing 100 MB chunks to the device without a file system, with the drive connected to the SATA ports on a Tyan Opteron motherboard with the nVidia nForce 4 Professional chipset. With the drive cache disabled, the sequential write speed was 30+ MB/s if I remember correctly, versus just under 100 MB/s with the cache enabled. That is a big fall-off in speed, and that was writing at the outer diameter of the platter; the speed would be halved at the inner diameter. It seems the controller firmware is meant to work with the cache enabled for proper functioning.
The desktop Caviar Black also does not have rotary vibration compensation, unlike the Caviar RE nearline drives. WD has a document showing the performance difference that rotary vibration compensation makes. I am not trying to save pennies here, but the local distributor refuses to bring in the Caviar REs, so I am stuck in no man's land.
The system has sixteen hard drives, and ten fans of different sizes and purposes in total, so there is quite a bit of rotary vibration, which I can feel when I place my hand on the side panels. I really do not know how badly the drive performance suffers as a result. The drives are attached with rubber dampers on the mounting screws.
I did the 20 GB dd test on the RAID 1 system drive, also with XFS, and got 53 MB/s with the drive caches disabled and 63 MB/s with them enabled. That is pretty disappointing, but in light of all the above considerations, plus the kernel buffer issues, I do not really know whether that is a good figure.
NCQ is enabled at depth 32. NCQ should cost some performance for a single write stream, but give gains for multiple concurrent writes.
Areca has a document showing that this card can do 800 MB/s in RAID 6 with Seagate nearline drives and the standard 512 MB cache. That was under Windows Server, and I do not know whether the caches were disabled. The benchmark is the IOmeter workstation sequential write test. IOmeter requires Windows for the front end, which causes me great difficulties, so I gave up trying to figure it out, and I do not understand what the workstation test does. However, in writing 30 MB files, I already exceed 1 GB/s.