Discussion:
Optimal XFS formatting options?
MikeJeezy
2012-01-14 17:44:34 UTC
Hi, I have a 4.9 TB iSCSI LUN on a RAID 6 array with twelve 2 TB SATA disks
(4.9T is only one of the logical volumes). It will contain several million
files of various sizes, but 80% of them will be less than 50 MB. I'm a
novice at best and I usually just use the default #mkfs.xfs /dev/sdx1

This server will be write heavy for about 8 hours a night, but every
morning there are many reads to the disk. There is rarely a time where it
will be write heavy and read heavy at the same time. Are there other XFS
format options that I could use to optimize performance?

Any input is greatly appreciated. Thank you.
Stan Hoeppner
2012-01-14 22:23:43 UTC
Post by MikeJeezy
Hi, I have a 4.9 TB iSCSI LUN on a RAID 6 array with twelve 2 TB SATA disks
(4.9T is only one of the logical volumes). It will contain several million
files of various sizes, but 80% of them will be less than 50 MB. I'm a
novice at best and I usually just use the default #mkfs.xfs /dev/sdx1
This server will be write heavy for about 8 hours a night, but every
morning there are many reads to the disk. There is rarely a time where it
will be write heavy and read heavy at the same time. Are there other XFS
format options that I could use to optimize performance?
sunit=value

This is used to specify the stripe unit for a RAID device or a logical
volume. The value has to be specified in 512-byte block units. Use the
su suboption to specify the stripe unit size in bytes. This suboption
ensures that data allocations will be stripe unit aligned when the
current end of file is being extended and the file size is larger than
512KiB. Also inode allocations and the internal log will be stripe unit
aligned.

su=value

This is an alternative to using sunit. The su suboption is used to
specify the stripe unit for a RAID device or a striped logical volume.
The value has to be specified in bytes, (usually using the m or g
suffixes). This value must be a multiple of the filesystem block size.

swidth=value

This is used to specify the stripe width for a RAID device or a striped
logical volume. The value has to be specified in 512-byte block units.
Use the sw suboption to specify the stripe width size in bytes. This
suboption is required if -d sunit has been specified and it has to be a
multiple of the -d sunit suboption.

sw=value

This suboption is an alternative to using swidth. The sw suboption is used to
specify the stripe width for a RAID device or striped logical volume.
The value is expressed as a multiplier of the stripe unit, usually the
same as the number of stripe members in the logical volume
configuration, or data disks in a RAID device.


Using su and sw is often easier because it requires fewer conversions.
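
For example (purely illustrative numbers, not the array discussed in this
thread), a 256KB chunk across 4 data spindles can be written either way;
sunit/swidth are given in 512-byte units, so 256KB = 512 units and
4 x 256KB = 2048 units:

$ mkfs.xfs -d su=256k,sw=4 /dev/sdX
$ mkfs.xfs -d sunit=512,swidth=2048 /dev/sdX

Both commands describe the same geometry.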

With a 12 drive RAID6 array your stripe width, or sw, is 10. You will
need to consult the array controller admin interface and documentation
to discover the su value if you don't already know it. Different
vendors call this parameter by different names. It could be "chunk
size" or "strip size" or other. Some/many vendors don't specify this
value at all, giving you only static pre-defined total stripe size
options for the array, such as 64KB, 128KB, 1MB, etc, only in power of 2
values. In this case, if you have a 64KB stripe size and divide it by the
10 drives in the stripe, you end up with a value that is not a multiple of
the filesystem block size: 6553.6 bytes. This presents serious problems for alignment.
In this case you must dig deep to find out exactly how your vendor
controller handles this situation when your effective RAID spindle count
is not a power of 2.
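
One quick, non-authoritative check is what the controller advertised to the
Linux block layer; on reasonably recent kernels these sysfs files exist,
though many arrays report zeros or misleading values (replace sdX with your
LUN's device name):

$ cat /sys/block/sdX/queue/minimum_io_size   # often the per-drive chunk (su)
$ cat /sys/block/sdX/queue/optimal_io_size   # often the full stripe (su x sw); 0 means unknown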

So let's assume your vendor does the smart thing and allows you
flexibility in specifying per drive strip size. Assume for example the
stripe unit (strip, chunk) of the array is 64KB, there are 10 stripe
spindles (12-2=10), and the local device name of the LUN is /dev/sdb.
To create an aligned XFS filesystem on this you would use something like:

$ mkfs.xfs -d su=64k,sw=10 /dev/sdb
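
As a quick way to verify what was actually chosen (a sketch; /mnt/data is a
hypothetical mount point), xfs_info reports the geometry in filesystem
blocks, so with the default 4096-byte block size su=64k,sw=10 should show up
as sunit=16 and swidth=160:

$ mount /dev/sdb /mnt/data
$ xfs_info /mnt/data | grep -E 'sunit|swidth'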

When using vendor array hardware that only allows one to define what XFS
calls swidth, it is best to use a power of 2 stripe spindle count to get
proper alignment. If you use a non power of 2 stripe spindle count the
vendor firmware will either round down or round up to create the stripe
unit size, and this formula is often not documented.

With such vendor hardware, for a RAID6 array you would want to have 6,
10, or 18 total drives in the array, giving you 4, 8, or 16 stripe
spindles. Alternatively, you need to know exactly how the firmware
rounds up or down to arrive at the strip block size (sunit).

If you find yourself in such a situation, and are unable to determine
the strip size the array firmware is using, you may be better off using
the mkfs.xfs defaults, vs guessing and ending up with unaligned writes.
--
Stan
MikeJeezy
2012-01-16 00:27:23 UTC
Post by Stan Hoeppner
So let's assume your vendor does the smart thing and allows you
flexibility in specifying per drive strip size. Assume for example the
stripe unit (strip, chunk) of the array is 64KB, there are 10 stripe
spindles (12-2=10), and the local device name of the LUN is /dev/sdb.
$ mkfs.xfs -d su=64k,sw=10 /dev/sdb
Great explanations! (some of it I am still trying to understand :-) In this
case on my HP P2000 G3, I do have a 64k chunk size so I will do:

$ mkfs.xfs -d su=64k,sw=10 /dev/sdd

Question: Does the above command assume I do not already have a partition
created? I was reading here
(http://www.fhgfs.com/wiki/wikka.php?wakka=PartitionAlignment) that the
easiest way to achieve partition alignment is to create the file system
directly on the storage device without any partitions - such as
$ mkfs.xfs /dev/sdd (and your example above also hints at this)

When I created my current partition, I used the following commands:

$ parted -a optimal /dev/sdd
(parted) mklabel gpt
(parted) mkpart primary 0 -0
(parted) q

I would like to align the partition as well, but I am not sure how to achieve
this using parted. This will be the only partition on the LUN, so not sure
if I even need to create one (although I do like to stay consistent with my
other volumes).

When printing the partition info with parted I see:

# (parted) p
Model: HP P2000 G3 iSCSI (scsi)
Disk /dev/sdd: 4900GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4900GB  4900GB  xfs          primary

but from reading, I suspect the Sector size should be more like:
(logical/physical): 512B/65536B. Any thoughts on partition alignment or
other thoughts in general? Thank you.
Stan Hoeppner
2012-01-16 04:56:22 UTC
Post by MikeJeezy
Post by Stan Hoeppner
So let's assume your vendor does the smart thing and allows you
flexibility in specifying per drive strip size. Assume for example the
stripe unit (strip, chunk) of the array is 64KB, there are 10 stripe
spindles (12-2=10), and the local device name of the LUN is /dev/sdb.
$ mkfs.xfs -d su=64k,sw=10 /dev/sdb
Great explanations! (some of it I am still trying to understand :-) In this
$ mkfs.xfs -d su=64k,sw=10 /dev/sdd
That should be fine.
Post by MikeJeezy
Question: Does the above command assume I do not already have a partition
created? I was reading here
(http://www.fhgfs.com/wiki/wikka.php?wakka=PartitionAlignment) that the
easiest way to achieve partition alignment is to create the file system
directly on the storage device without any partitions - such as
$ mkfs.xfs /dev/sdd (and your example above also hints at this)
That example and command assume you're not using partitions.
Post by MikeJeezy
$ parted -a optimal /dev/sdd
$ mklabel gpt
$ mkpart primary 0 -0
$ q
I would like to align the partition as well, but I am not sure how to achieve
this using parted. This will be the only partition on the LUN, so not sure
if I even need to create one (although I do like to stay consistent with my
other volumes).
If your drives have 512 byte physical sectors (not advanced format
drives with 4096 byte sectors) then there is no need to worry about
partition alignment. And in fact, if you plan to put a single
filesystem on this entire 4.9TB virtual drive, you don't need to
partition the disk device at all. Recall the dictionary definition of
"partition". You're not dividing the whole into smaller pieces here.
Post by MikeJeezy
# (parted) p
Model: HP P2000 G3 iSCSI (scsi)
Disk /dev/sdd: 4900GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4900GB  4900GB  xfs          primary
(logical/physical): 512B/65536B.
No, that 65536 figure is wrong. There are only two possibilities for
sector size (logical/physical): 512/512 and 512/4096. These are the
only two disk sector formats currently used on disk drives.
Partitioning utils look strictly at disk parameters, not RAID parameters.

Sectors deal with how many books (bytes) fit on each shelf (sector) in
the library, and which shelf (sector) we're going to store a given set
of books (bytes) on. RAID parameters, such as stripe unit, deal with
how many shelves (sectors) worth of books (bytes) we can carry most
efficiently down the aisle and place on the shelves at one time.

In short, sectors are a destination where we store bytes, much like
books on a shelf. A stripe unit acts as a book cart in which we carry a
fixed number of books, allowing us to fill a fixed number of shelves
most efficiently per cart transported down the aisle.
Post by MikeJeezy
Any thoughts on partition alignment or
other thoughts in general? Thank you.
Yes, don't use partitions if you don't need to divide your disk device
(LUN/virtual disk) into multiple pieces. Now, if you need to make use
of snapshots or other volume management features, you may want to create
an LVM device on top of the disk device (LUN) and then make your XFS on
top of the LVM device. If you have no need for LVM features, I'd say
directly format the LUN with XFS, no partition table necessary.
--
Stan
Dave Chinner
2012-01-16 23:11:21 UTC
Post by Stan Hoeppner
Post by MikeJeezy
I would like to align the partition as well, but I am not sure how to achieve
this using parted. This will be the only partition on the LUN, so not sure
if I even need to create one (although I do like to stay consistent with my
other volumes).
If your drives have 512 byte physical sectors (not advanced format
drives with 4096 byte sectors) then there is no need to worry about
partition alignment.
That is incorrect. Partitions need to be aligned to the underlying
stripe configuration, regardless of the sector size of the drives
that make up the stripe. If you do not align the partition to the
stripe, then the filesystem will be unaligned no matter how you
configure it. Every layer of the storage stack under the filesystem
needs to be correctly aligned and sized for filesystem alignment to
make any difference to performance.
Post by Stan Hoeppner
Post by MikeJeezy
Any thoughts on partition alignment or
other thoughts in general? Thank you.
Yes, don't use partitions if you don't need to divide your disk device
(LUN/virtual disk) into multiple pieces. Now, if you need to make use
of snapshots or other volume management features, you may want to create
an LVM device on top of the disk device (LUN) and then make your XFS on
top of the LVM device. If you have no need for LVM features, I'd say
directly format the LUN with XFS, no partition table necessary.
If you use LVM, then you need to ensure that it is slicing up the
device in a manner that is aligned correctly to the underlying
stripe, just like if you are using partitions to provide the same
functionality. Different technologies, same problem.
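
A minimal sketch of one way to do that with LVM, assuming the 64KB chunk x
10 data spindle geometry from earlier in the thread (the alignment value and
the names are illustrative; check your own chunk size and LVM version):

$ pvcreate --dataalignment 640k /dev/sdd    # start the PV data area on a full-stripe boundary
$ vgcreate vg_data /dev/sdd
$ lvcreate -n lv_data -l 100%FREE vg_data
$ mkfs.xfs -d su=64k,sw=10 /dev/vg_data/lv_data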

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Stan Hoeppner
2012-01-17 03:31:59 UTC
Post by Dave Chinner
Post by Stan Hoeppner
Post by MikeJeezy
I would like to align the partition as well, but I am not sure how to achieve
this using parted. This will be the only partition on the LUN, so not sure
if I even need to create one (although I do like to stay consistent with my
other volumes).
If your drives have 512 byte physical sectors (not advanced format
drives with 4096 byte sectors) then there is no need to worry about
partition alignment.
That is incorrect. Partitions need to be aligned to the underlying
stripe configuration, regardless of the sector size of the drives
that make up the stripe. If you do not align the partition to the
stripe, then the filesystem will be unaligned no matter how you
configure it. Every layer of the storage stack under the filesystem
needs to be correctly aligned and sized for filesystem alignment to
make any difference to performance.
Thanks for the correction/reminder Dave. So in this case the first
sector of the first partition would need to reside at LBA1280 in this
array (655360 byte stripe width, 1280 sectors/stripe), as the partition
table itself is going to occupy some sectors at the beginning of the
first stripe. By creating the partition at LBA1280 we make sure the
first sector of the XFS filesystem is aligned with the first sector of
the 2nd stripe.
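
If a partition table is required anyway, a sketch of how that might look
with parted, assuming the 640KiB (1280 sector) stripe width worked out
above; note that parted's align-check only trusts whatever geometry the
device advertises, so treat its verdict as a hint:

$ parted -s /dev/sdd mklabel gpt
$ parted -s /dev/sdd mkpart primary 1280s 100%
$ parted /dev/sdd align-check optimal 1
$ mkfs.xfs -d su=64k,sw=10 /dev/sdd1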

This exercise demonstrates why it's often preferable to directly format
the LUN. If you don't have a _need_ for a partition table, such as
cloning/backup software that works at the partition level, or something
of that nature, avoid partitions.
Post by Dave Chinner
Post by Stan Hoeppner
Post by MikeJeezy
Any thoughts on partition alignment or
other thoughts in general? Thank you.
Yes, don't use partitions if you don't need to divide your disk device
(LUN/virtual disk) into multiple pieces. Now, if you need to make use
of snapshots or other volume management features, you may want to create
an LVM device on top of the disk device (LUN) and then make your XFS on
top of the LVM device. If you have no need for LVM features, I'd say
directly format the LUN with XFS, no partition table necessary.
If you use LVM, then you need to ensure that it is slicing up the
device in a manner that is aligned correctly to the underlying
stripe, just like if you are using partitions to provide the same
functionality. Different technologies, same problem.
If he's doing a single LVM volume then alignment should be automatic
during mkfs.xfs shouldn't it?
--
Stan
Michael Monnerie
2012-01-17 09:19:55 UTC
Post by Stan Hoeppner
Thanks for the correction/reminder Dave. So in this case the first
sector of the first partition would need to reside at LBA1280 in this
array (655360 byte stripe width, 1280 sectors/stripe), as the
partition table itself is going to occupy some sectors at the
beginning of the first stripe. By creating the partition at LBA1280
we make sure the first sector of the XFS filesystem is aligned with
the first sector of the 2nd stripe.
There's one big problem with that: Many people will sooner or later
expand an existing array. If you add one drive, all your nice stripe
width alignment becomes bogus, and suddenly your performance will drop.

There's no real way out of that, but three solutions come to my mind:
- backup before expand / restore after expand with new alignment
- leave existing data, and just change mount options so that after expansion
at least new files are going to be aligned to the new stripe width (see the
sketch after this list)
- expand the array by factors of two. So if you have 10 data drives, add 10
data drives. But that creates other problems (probability of single
drive failure + time to recover a single broken disk)
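
For the second option, XFS also accepts sunit/swidth as mount options
(values in 512-byte units), which can be used to override the values set at
mkfs time; check the xfs mount documentation for the restrictions on
changing them after the fact. A sketch assuming the chunk stays 64KB and the
array grows from 10 to 11 data drives (device and mount point are
hypothetical):

/dev/sdd   /data   xfs   inode64,sunit=128,swidth=1408   0 0

Here sunit = 64KiB / 512 = 128 and swidth = 11 x 128 = 1408.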
--
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531
Emmanuel Florac
2012-01-17 11:17:49 UTC
On Tue, 17 Jan 2012 10:19:55 +0100
Post by Michael Monnerie
- expand array by factors of two. So if you have 10 data drives, add
10 data drives. But that creates other problems (probability of
single drive failure + time to recover a single broken disk)
From my experience 20 drives is OK for RAID-6. And rebuild time doesn't
change much with array size, anyway.

Misaligned partitions, on the other hand, can easily halve array
throughput from my own measurements.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <***@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
Stan Hoeppner
2012-01-17 11:34:26 UTC
Post by Michael Monnerie
Post by Stan Hoeppner
Thanks for the correction/reminder Dave. So in this case the first
sector of the first partition would need to reside at LBA1280 in this
array (655360 byte stripe width, 1280 sectors/stripe), as the
partition table itself is going to occupy some sectors at the
beginning of the first stripe. By creating the partition at LBA1280
we make sure the first sector of the XFS filesystem is aligned with
the first sector of the 2nd stripe.
There's one big problem with that: Many people will sooner or later
expand and existing array. If you add one drive, all your nice stripe
width alignment becomes bogus, and suddenly your performance will drop.
So to be clear, your issue with the above isn't with my partition
alignment math WRT the OP's P2000 array, but is with using XFS stripe
alignment in general, correct?
Post by Michael Monnerie
- backup before expand/restore after expand with new alignment
- leave existing data, just change mount options so after expansion at
least new files are going to be aligned to the new stripe width.
- expand array by factors of two. So if you have 10 data drives, add 10
data drives. But that creates other problems (probability of single
drive failure + time to recover a single broken disk)
There is one really simple way around this issue you describe: don't add
drives to an existing array. Simply create another array with new
disks, create a new aligned XFS on the array, and mount the filesystem
in an appropriate location. There is no 11th Commandment stating one
must have a single massive XFS atop all of one's disks. ;)

There is little to no application software today that can't be
configured to store its data files across multiple directories. So
there's no need to box oneself into the corner you describe above.
--
Stan
Michael Monnerie
2012-01-20 15:52:09 UTC
Post by Stan Hoeppner
So to be clear, your issue with the above isn't with my partition
alignment math WRT the OP's P2000 array, but is with using XFS stripe
alignment in general, correct?
Yes. I just wanted to document this as people often expand RAIDs and
forget to apply the changes to stripe width.
Post by Stan Hoeppner
There is one really simple way around this issue you describe: don't
add drives to an existing array. Simply create another array with
new disks, create a new aligned XFS on the array, and mount the
filesystem in an appropriate location. There is no 11th Commandment
stating one must have a single massive XFS atop all of one's disks.
;)
There is little to no application software today that can't be
configured to store its data files across multiple directories. So
there's no need to box oneself into the corner you describe above.
It's a management burden to do that. I've learned that systems usually
are strictly structured in their configuration, so it's often better to
extend a RAID and to keep the config, as this is cheaper in the end. At
least for the salaries of good admins here in Europe ;-)
--
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531
Stan Hoeppner
2012-01-20 22:44:46 UTC
Post by Michael Monnerie
Post by Stan Hoeppner
So to be clear, your issue with the above isn't with my partition
alignment math WRT the OP's P2000 array, but is with using XFS stripe
alignment in general, correct?
Yes. I just wanted to document this as people often expand RAIDs and
forget to apply the changes to stripe width.
Post by Stan Hoeppner
There is one really simple way around this issue you describe: don't
add drives to an existing array. Simply create another array with
new disks, create a new aligned XFS on the array, and mount the
filesystem in an appropriate location. There is no 11th Commandment
stating one must have a single massive XFS atop all of one's disks.
;)
There is little to no application software today that can't be
configured to store its data files across multiple directories. So
there's no need to box oneself into the corner you describe above.
It's a management burden to do that. I've learned that systems usually
are strictly structured in their configuration, so it's often better to
extend a RAID and to keep the config, as this is cheaper in the end. At
least for the salaries of good admins here in Europe ;-)
If ease (or cost) of filesystem administration is of that much greater
priority than performance, then why are you using XFS in the first place
instead of EXT?
--
Stan
Michael Monnerie
2012-01-24 10:31:07 UTC
Post by Stan Hoeppner
If ease (or cost) of filesystem administration is of that much
greater priority than performance, then why are you using XFS in the
first place instead of EXT?
Great experience recovering from disastrous filesystem problems on XFS. A
switch to another FS costs a lot of time, and why switch if it works
great? And administration comes down to mkfs, mount, maybe xfs_fsr,
xfs_repair in a disaster, and sometimes xfs_growfs. Basically nothing.

Also, this list has been of great help during the years, whenever there
were problems they got fixed. That's ease of administration :-)
--
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531
Peter Grandi
2012-01-15 01:14:54 UTC
[ ... ]
Post by MikeJeezy
Hi, I have a 4.9 TB iSCSI LUN on a RAID 6 array with twelve 2
TB SATA disks (4.9T is only one of the logical volumes). It
will contain several million files of various sizes, but 80%
of them will be less than 50 MB. I'm a novice at best and I
usually just use the default #mkfs.xfs /dev/sdx1
The default :-) advice in this list and in the XFS FAQ is that
in any recent edition of the XFS tools and XFS code in the
kernel the defaults are usually best, unless you have a special
situation, for example if the kernel cannot get storage geometry
from the storage layer.

Also, "several million" in a about 5,000,000MB filesystem
indicates an average file size of 1MB. That's not too small,
fortunately. Anyhow consider how long it will take to 'fsck' all
that if it gets damaged, or the extra load to backup the whole
filetree if backups scan the tree (e.g. RYNC based).
Post by MikeJeezy
This server will be write heavy for about 8 hours a night,
but every morning there are many reads to the disk. There is
rarely a time where it will be write heavy and read heavy at
the same time. Are there other XFS format options that I
could use to optimize performance? Any input is greatly
appreciated. Thank you.
As usual, the first note is that in general RAID6 is a bad idea,
with RMW and reliability (especially during rebuild) issues, but
salesmen and management usually love it because it embodies a
promise of something for nothing (let's say that the parity RAID
industry is the Wall Street of storage systems :->).

To mitigate these problems: in general, if you are doing a lot of
writing it is very important that the filesystem try to align to the
address/length of the full RAID stripe, but this should be
automatic if the relevant geometry is reported to the Linux
kernel. Otherwise there are many previous messages in this list
about that, and the FAQ, etc.

Things that you might want to double-check in case they matter
for you, as to non-'mkfs' options:

* XFS has several limitations on 32b kernels. Just make sure
you have a 64b kernel.

* Make really sure your partitions (or LUNs if unpartitioned)
are aligned, certainly to a multiple of stripe size, ideally
to something large, at least like 1MiB.

* Recent (let's say at least 2.6.32 or EL57) kernels and
editions of XFS tools and partitioning tools (if you use
any) are very improved. The newer usually the better.

* Usually, just in case, explicitly specify the 'inode64' option
at 'mount' (not 'mkfs') time; and the 'barrier' option
unless you really know better (and pray hard that your
storage layer supports it). The 'delaylog' option or its
opposite is also something to look carefully into.

* Check carefully whether your app is compatible with the
'noatime' and 'nodiratime' options and enable them if
possible, "just in case" :-).

* Look very attentively at the kernel page cache flusher
parameters to make it run more often (to prevent the
accumulation of very large gulps of unwritten data) but not
too often (to give a chance to the delayed allocator). (A
sketch of these mount options and flusher settings follows below.)
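
To make that concrete, a minimal sketch of the mount options and flusher
knobs mentioned above; the device, mount point, and numbers are illustrative
assumptions, not tuned recommendations:

# /etc/fstab entry ('barrier' is the default anyway)
/dev/sdd   /data   xfs   inode64,noatime,nodiratime   0 0

# make the flusher start writeback earlier, but not constantly
$ sysctl -w vm.dirty_background_ratio=5
$ sysctl -w vm.dirty_ratio=20
$ sysctl -w vm.dirty_expire_centisecs=1500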

As to proper 'mkfs' you may want to look into:

* Explicitly set the sector size because most storage layers
lie. In general if possible you should set it to 4096, just
in case :-). This also allegedly extends the range where
inodes can be stored if you cannot specify 'inode64' at
mount time.

* If you have a critically high rate of metadata work (like
file creation/deletion, and it seems your case overnight)
you may want to ensure that your log is not only aligned,
but perhaps on a separate device, and/or you have a host
adapter with a large battery backed cache. Logs are small,
so it should be easy either way.

* Depending on the degree of multithreading of your
application you may want more/less AGs, but usually on a
4.9TB filetree there will be plenty.

* You may want larger inodes than the default if you have lots
of ACLs or your files are written slowly and thus have many
extents. They are recommended also for small files but I
cannot remember whether XFS really stores small files or
directories into the inode (I remember that directories of
less than 8 entries are stored in the inode, but I don't
know whether that depends on its size).

Run first 'mkfs.xfs -N ...' so it will print out which
parameters it will use without actually doing anything.
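
For instance, a dry run combining several of the options above for the
64KB-chunk, 10-data-disk array discussed earlier (the sector, inode and log
values here are illustrative assumptions, not recommendations):

$ mkfs.xfs -N -d su=64k,sw=10 -s size=4096 -i size=512 -l su=64k /dev/sdd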
Linda Walsh
2012-01-20 09:03:40 UTC
Post by Peter Grandi
* XFS has several limitations on 32b kernels. Just make sure
you have a 64b kernel.
----
I was unaware that the block size was larger on 64b kernels.

Is that what you are referring to ?

(would be nice)...


One thing I have a Q on -- you (OP) said this was an 'iscsi' box?

That means hookup over a network, right?

You are planning on using a 10Gbit or faster network fabric, right?

a 1Gb ethernet will only get you 125MB/s max... doesn't take much
tuning to hit that speed.
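
A quick way to check the negotiated link speed (the interface name here is
just an example):

$ ethtool eth0 | grep -i speed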
Peter Grandi
2012-01-20 12:06:31 UTC
[ ... ]
Post by Linda Walsh
Post by Peter Grandi
* XFS has several limitations on 32b kernels. Just make sure
you have a 64b kernel.
[ ... ]
Post by Linda Walsh
I was unaware that the block size was larger on 64b kernels.
Is that what you are referring to ? (would be nice)...
Not as such; the maximum block size is limited by the Linux page
cache, that is, the hardware page size, which for the IA32 and AMD64
architectures is the same at 4KiB. However, other architectures
which are natively 64b allow bigger page sizes (notably IA64
[aka Itanium]), so the page cache, and thus XFS, can do larger
block sizes.

The limitations of XFS on 32b kernels come from limitations of
XFS itself in 32b mode, limitations of Linux in 32b mode, and
combined limitations. For example:

* There be 32b inode numbers, which limit inodes to the first
1TB of a filetree if sector size is 512B.

* The 32b block IO subsystem limits partition sizes to 16TiB.

* XFS tools scanning a large filesystem, usually for repair,
can run out of the available 32b address space (by default
around 2GiB).

Pages 5 and 6 here list some limits:

http://oss.sgi.com/projects/xfs/training/xfs_slides_02_overview.pdf
Michael Monnerie
2012-01-20 15:55:13 UTC
Post by Peter Grandi
* There be 32b inode numbers, which limit inodes to the first
1TB of a filetree if sector size is 512B.
* The 32b block IO subsystem limits partition sizes to 16TiB.
I thought those two had been removed by some updates? I think I
remember having read that. Not that it's too interesting; I've been
running 64b Linux everywhere since AMD put it in their
processors. Should be 10+ years or so.
--
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531
Dave Chinner
2012-01-23 04:21:28 UTC
Post by Peter Grandi
[ ... ]
Post by Linda Walsh
Post by Peter Grandi
* XFS has several limitations on 32b kernels. Just make sure
you have a 64b kernel.
[ ... ]
Post by Linda Walsh
I was unaware that the block size was larger on 64b kernels.
Is that what you are referring to ? (would be nice)...
Not as such, the maximum block size is limited by the Linux page
cache, that is hw page size, which is for IA32 and AMD64
architectures the same at 4KiB. However other architectures
which are natively 64b allow bigger page sizes (notably IA64
[aka Itanium]), so the page cache and thus XFS can do larger
blocks sizes.
The limitations of XFS on 32b kernels come from limitations of
XFS itself in 32b mode, limitations of Linux in 32b mode, and
* There be 32b inode numbers, which limit inodes to the first
1TB of a filetree if sector size is 512B.
Internally XFS still uses 64 bit inode numbers - the on-disk format
does not change just because the CPU arch has changed. If you use
the stat64() style interfaces, even on 32 bit machines you can
access the full 64 bit inode numbers.
Post by Peter Grandi
* The 32b block IO subsystem limits partition sizes to 16TiB.
The sector_t is a 64 bit number even on 32 bit systems. The
problem is that the page cache cannot index past offsets of 16TB.
Given that XFS no longer uses the page cache for its metadata
indexing, we could remove this limit in the kernel code if we
wanted to. And given that the userspace tools use direct IO, the
page cache limitation doesn't cause problems there, either, because
we bypass it.

So in theory we could lift this limit, but there really isn't much
demand for >16TB filesystems on 32 bit, because....
Post by Peter Grandi
* XFS tools scanning a large filesystem, usually for repair,
can run out of the available 32b address space (by default
around 2GiB).
.... you need 64 bit systems to handle the userspace memory
requirements tools like xfs_check and xfs_repair require to run. If
the filesystem is large enough that you can't run repair because it
needs more than 2GB of RAM, then you shouldn't be using a 32 bit
system.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com