[FAQ] XFS speculative preallocation
Brian Foster
2014-03-21 16:29:20 UTC
Hi all,

Eric had suggested we add an FAQ entry for speculative preallocation
since it seems to be a common question, so I offered to write something
up. I started with a single entry but split it into a couple Q's when it
turned into TL;DR fodder. ;)

The text is embedded below for review. Thoughts on the questions or
content are appreciated. Also, once folks are Ok with this... how does
one gain edit access to the wiki?

Brian

---

Q: Why do files on XFS use more data blocks than expected?

A:

The XFS speculative preallocation algorithm allocates extra blocks
beyond end of file (EOF) to combat fragmentation under parallel
sequential write workloads. This post-EOF block allocation is included
in 'st_blocks' counts via stat() system calls and is accounted as
globally allocated space by the filesystem. This is reported by various
userspace utilities (stat, du, df, ls) and thus provides a common source
of confusion for administrators. Post-EOF blocks are temporary in most
situations and are usually reclaimed via several possible mechanisms in
XFS.

See the FAQ entry on speculative preallocation for details.

Q: What is speculative preallocation? How can I manage it?

A:

XFS speculatively preallocates post-EOF blocks on file extending writes
in anticipation of future extending writes. The size of a preallocation
is dynamic and depends on the size of the previous extent in the file
(starting from 0 again if the write extends past a hole). As files grow
larger, so does the size of preallocations. Speculative preallocation is
not enabled for files smaller than a minimum size (64k by default, but
can vary depending on filesystem geometry and/or mount options).
Preallocations are capped at a maximum of 8GB on 4k block filesystems.
Preallocation is throttled automatically as the filesystem approaches
low free space conditions or other allocation limits on a file (such as
a quota).

In most cases, speculative preallocation is automatically reclaimed when
a file is closed. The preallocation may persist after file close if an
open, write, close pattern is repeated on a file. In this scenario,
post-EOF preallocation is trimmed once the inode is reclaimed from cache
or the filesystem unmounted.

Linux 3.8 (and later) includes a scanner to perform background trimming
of files with lingering post-EOF preallocations. The scanner bypasses
files that have been recently modified to not interfere with ongoing
writes. A 5 minute scan interval is used by default and can be adjusted
via the following file (value in seconds):

/proc/sys/fs/xfs/speculative_prealloc_lifetime
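
As a minimal sketch, the current setting can be read with a few lines of Python. The helper name `read_prealloc_lifetime` is made up for illustration, and returning None when the file is absent (for example, on a system without XFS) is a choice made for this example only.

```python
from pathlib import Path

# Path taken from the text above.
LIFETIME_PATH = "/proc/sys/fs/xfs/speculative_prealloc_lifetime"


def read_prealloc_lifetime(path=LIFETIME_PATH):
    """Return the scan interval in seconds, or None if the file is missing."""
    try:
        return int(Path(path).read_text().strip())
    except FileNotFoundError:
        return None
```

Calling `read_prealloc_lifetime()` with no arguments reads the live value. Changing it means writing a new number of seconds to the same file as root.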

Although speculative preallocation can lead to reports of excess space
usage, the preallocated space is not permanent unless explicitly made so
via fallocate or a similar interface. Preallocated space can also be
encoded permanently in situations where file size is extended beyond a
range of post-EOF blocks (i.e., via truncate). Otherwise, preallocated
blocks are reclaimed on file close, inode reclaim, unmount or in the
background once file write activity subsides.

Finally, the XFS block allocation algorithm can be configured to use a
fixed allocation size with the 'allocsize=' mount option. Note that
speculative preallocation does not occur when a fixed allocation size is
set and thus increases the potential for fragmentation via parallel
writes.
Shaun Gosse
2014-03-21 16:54:32 UTC
Brian,

FWIW, from my perspective as a newcomer to XFS that is quite clear and understandable and informative. Looks like a valuable addition.

I've got no idea how to get write access on the wiki personally, but hopefully that answer will arrive for you 'soon(tm)'.

Cheers,
-Shaun

Arkadiusz Miśkiewicz
2014-03-21 17:09:03 UTC
Post by Brian Foster
Hi all,
Eric had suggested we add an FAQ entry for speculative preallocation
since it seems to be a common question, so I offered to write something
up. I started with a single entry but split it into a couple Q's when it
turned into TL;DR fodder. ;)
The text is embedded below for review. Thoughts on the questions or
content are appreciated. Also, once folks are Ok with this... how does
one gain edit access to the wiki?
More questions or topics that can be converted to questions from me:

1) Before speculative preallocation, the kernel did things differently. AFAIK
it wasn't the same as allocsize=64k, was it? Is there a way to get the old
behaviour, or something similar to it?

2) Is there a way to see which file got some preallocation and how big that
preallocation is? Scenario - something ate free space due to preallocation and
from an admin's point of view it would be useful to know which app did that and
how many MB were due to preallocation (vs. real, written data).
Post by Brian Foster
Linux 3.8 (and later) includes a scanner to perform background trimming
of files with lingering post-EOF preallocations. The scanner bypasses
files that have been recently
How recent is "recently"? Does "modified" mean "file data modified" or "file
data or metadata modified"?
Post by Brian Foster
modified to not interfere with ongoing
writes.
In the case of an app that constantly writes to files (the apache web server
writing to its logs, for example), background trimming will never do
anything for these files, right?
Post by Brian Foster
A 5 minute scan interval is used by default and can be adjusted
/proc/sys/fs/xfs/speculative_prealloc_lifetime
Although speculative preallocation can lead to reports of excess space
usage, the preallocated space is not permanent unless explicitly made so
via fallocate or a similar interface. Preallocated space can also be
encoded permanently in situations where file size is extended beyond a
range of post-EOF blocks (i.e., via truncate). Otherwise, preallocated
blocks are reclaimed on file close, inode reclaim, unmount or in the
background once file write activity subsides.
So there is no mechanism that would shrink preallocations when free
space is (almost or entirely) gone on a fs? Case: apache causes xfs to preallocate
several GB for its /var/..../{access,error}_log (common problem here) and then
free space runs out on that fs, causing problems for every app that writes to /var.

Thanks!
--
Arkadiusz Miśkiewicz, arekm / maven.pl
Brian Foster
2014-03-21 18:02:41 UTC
Post by Arkadiusz Miśkiewicz
Post by Brian Foster
Hi all,
Eric had suggested we add an FAQ entry for speculative preallocation
since it seems to be a common question, so I offered to write something
up. I started with a single entry but split it into a couple Q's when it
turned into TL;DR fodder. ;)
The text is embedded below for review. Thoughts on the questions or
content are appreciated. Also, once folks are Ok with this... how does
one gain edit access to the wiki?
1) Before speculative preallocation, the kernel did things differently. AFAIK
it wasn't the same as allocsize=64k, was it? Is there a way to get the old
behaviour, or something similar to it?
Going from the commit log that introduced speculative preallocation, it
appears that the behavior was effectively allocsize=64k. For reference:

055388a3 xfs: dynamic speculative EOF preallocation
Post by Arkadiusz Miśkiewicz
2) Is there a way to see which file got some preallocation and how big that
preallocation is? Scenario - something ate free space due to preallocation and
from an admin's point of view it would be useful to know which app did that and
how many MB were due to preallocation (vs. real, written data).
The common scenario is when du/stat reports a larger block usage than
file size, so the question of how much extra space is allocated is just
the difference between the two. I suppose we could include a simple
example of that in the first Q.

This isn't necessarily true in the case of sparse files. xfs_bmap prints
the extent information for a file, so it should be possible to determine
how much post-EOF space exists from looking at the extent that covers
EOF. That said, this strikes me as more "user guide" material than FAQ.
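
A minimal sketch of that du/stat comparison in Python, assuming `st_blocks` is counted in 512-byte units (as it is on Linux). The function names are illustrative only. As noted above, the result is a rough estimate: it reads as zero for a sparse file whose holes outweigh any preallocation, and xfs_bmap remains the tool for an exact answer.

```python
import os


def excess_bytes(size, blocks, unit=512):
    """Bytes allocated beyond the file size.

    Returns 0 when the file occupies less space than its size (sparse file).
    """
    return max(0, blocks * unit - size)


def excess_allocation(path):
    """Apply excess_bytes() to the stat() results for a path."""
    st = os.stat(path)
    return excess_bytes(st.st_size, st.st_blocks)
```

For example, `excess_allocation("/var/log/some.log")` (a hypothetical path) reports how many bytes the file currently holds beyond its size, which includes both post-EOF preallocation and ordinary rounding up to the filesystem block size.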
Post by Arkadiusz Miśkiewicz
Post by Brian Foster
Linux 3.8 (and later) includes a scanner to perform background trimming
of files with lingering post-EOF preallocations. The scanner bypasses
files that have been recently
How recent is "recently"? Does "modified" mean "file data modified" or "file
data or metadata modified"?
I originally had something like "files that have not been modified since
last flushed to disk," which is the heuristic as I understand it. That
seemed too verbose and technical for an FAQ. I could replace "recently
modified" with "... bypasses files that are dirty ..." if that is more
useful..?
Post by Arkadiusz Miśkiewicz
Post by Brian Foster
modified to not interfere with ongoing
writes.
In the case of an app that constantly writes to files (the apache web server
writing to its logs, for example), background trimming will never do
anything for these files, right?
Most likely true. Though by the same logic, those files will eventually
use the preallocated space.
Post by Arkadiusz Miśkiewicz
Post by Brian Foster
A 5 minute scan interval is used by default and can be adjusted
/proc/sys/fs/xfs/speculative_prealloc_lifetime
Although speculative preallocation can lead to reports of excess space
usage, the preallocated space is not permanent unless explicitly made so
via fallocate or a similar interface. Preallocated space can also be
encoded permanently in situations where file size is extended beyond a
range of post-EOF blocks (i.e., via truncate). Otherwise, preallocated
blocks are reclaimed on file close, inode reclaim, unmount or in the
background once file write activity subsides.
So there is no mechanism that would shrink preallocations when free
space is (almost or entirely) gone on a fs? Case: apache causes xfs to preallocate
several GB for its /var/..../{access,error}_log (common problem here) and then
free space runs out on that fs, causing problems for every app that writes to /var.
I noted in the second answer that the preallocation is throttled as we
near allocation limits such as no free space or quota. I think that
should cover most cases. I still have some code lying around somewhere
that forces a scan and retry in EDQUOT scenarios though. I should dust
that off...

Thanks for the reviews!

Brian
Post by Arkadiusz Miśkiewicz
Thanks!
--
Arkadiusz Miśkiewicz, arekm / maven.pl
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs
Dave Chinner
2014-03-21 23:16:17 UTC
Post by Arkadiusz Miśkiewicz
Post by Brian Foster
Hi all,
Eric had suggested we add an FAQ entry for speculative preallocation
since it seems to be a common question, so I offered to write something
up. I started with a single entry but split it into a couple Q's when it
turned into TL;DR fodder. ;)
The text is embedded below for review. Thoughts on the questions or
content are appreciated. Also, once folks are Ok with this... how does
one gain edit access to the wiki?
1) Before speculative preallocation, the kernel did things differently. AFAIK
it wasn't the same as allocsize=64k, was it? Is there a way to get the old
behaviour, or something similar to it?
The old behaviour is exactly that of allocsize=64k.
Post by Arkadiusz Miśkiewicz
Post by Brian Foster
modified to not interfere with ongoing
writes.
In the case of an app that constantly writes to files (the apache web server
writing to its logs, for example), background trimming will never do
anything for these files, right?
If the inode is being constantly dirtied, then the speculative
prealloc will not be removed by the background scanner. It only
removes prealloc from clean inodes.
Post by Arkadiusz Miśkiewicz
Post by Brian Foster
A 5 minute scan interval is used by default and can be adjusted
/proc/sys/fs/xfs/speculative_prealloc_lifetime
Although speculative preallocation can lead to reports of excess space
usage, the preallocated space is not permanent unless explicitly made so
via fallocate or a similar interface. Preallocated space can also be
encoded permanently in situations where file size is extended beyond a
range of post-EOF blocks (i.e., via truncate). Otherwise, preallocated
blocks are reclaimed on file close, inode reclaim, unmount or in the
background once file write activity subsides.
So there is no mechanism that would shrink preallocations when free
space is (almost or entirely) gone on a fs?
Background space trimmer takes care of that. We could probably also
trigger it on ENOSPC, but once you are already at ENOSPC it's too
late....
Post by Arkadiusz Miśkiewicz
Case: apache causes xfs to preallocate
several GB for its /var/..../{access,error}_log (common problem here) and then
free space runs out on that fs, causing problems for every app that writes to /var.
Your log files would already have to be GB in size for your
apache logs to preallocate that much. If your log files are that
big, then /var needs to be much, much larger than what the
speculative prealloc for a handful of files could easily exhaust.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Florian Weimer
2014-03-21 20:11:29 UTC
Post by Brian Foster
Although speculative preallocation can lead to reports of excess space
usage, the preallocated space is not permanent unless explicitly made so
via fallocate or a similar interface.
How does an explicit allocation with posix_fallocate interact with
speculative preallocation? Does it disable it?

I see rather dramatic fragmentation of the systemd journal when it is
stored on XFS, and it calls posix_fallocate before writing data to the
file.
Dave Chinner
2014-03-21 23:10:33 UTC
Post by Florian Weimer
Post by Brian Foster
Although speculative preallocation can lead to reports of excess space
usage, the preallocated space is not permanent unless explicitly made so
via fallocate or a similar interface.
How does an explicit allocation with posix_fallocate interact with
speculative preallocation? Does it disable it?
fallocate is permanent preallocation using unwritten extents.
Speculative preallocation is an extension of delayed allocation that
is done when extending the file and the EOF falls into a hole. If
there are unwritten extents beyond EOF, speculative preallocation is
not performed.
Post by Florian Weimer
I see rather dramatic fragmentation of the systemd journal when it is
stored on XFS, and it calls posix_fallocate before writing data to the
file.
There's your problem - systemd is preventing delayed allocation, and
so it is fragmenting the file itself with its write pattern.
Basically, that's a bug in systemd, and not something the filesystem
can avoid because userspace is directly controlling block
allocation.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Eric Sandeen
2014-03-21 23:13:30 UTC
Post by Dave Chinner
Post by Florian Weimer
Post by Brian Foster
Although speculative preallocation can lead to reports of excess space
usage, the preallocated space is not permanent unless explicitly made so
via fallocate or a similar interface.
How does an explicit allocation with posix_fallocate interact with
speculative preallocation? Does it disable it?
fallocate is permanent preallocation using unwritten extents.
Speculative preallocation is an extension of delayed allocation that
is done when extending the file and the EOF falls into a hole. If
there are unwritten extents beyond EOF, speculative preallocation is
not performed.
Post by Florian Weimer
I see rather dramatic fragmentation of the systemd journal when it is
stored on XFS, and it calls posix_fallocate before writing data to the
file.
There's your problem - systemd is preventing delayed allocation, and
so it is fragmenting the file itself with its write pattern.
Basically, that's a bug in systemd, and not something the filesystem
can avoid because userspace is directly controlling block
allocation.
hohum, I guess we should look into this.

OTOH: nothing wrong with calling posix_fallocate() if you need the space
guarantees it provides for proper operation...

-Eric
Post by Dave Chinner
Cheers,
Dave.
Dave Chinner
2014-03-21 23:18:01 UTC
Post by Eric Sandeen
Post by Dave Chinner
Post by Florian Weimer
Post by Brian Foster
Although speculative preallocation can lead to reports of excess space
usage, the preallocated space is not permanent unless explicitly made so
via fallocate or a similar interface.
How does an explicit allocation with posix_fallocate interact with
speculative preallocation? Does it disable it?
fallocate is permanent preallocation using unwritten extents.
Speculative preallocation is an extension of delayed allocation that
is done when extending the file and the EOF falls into a hole. If
there are unwritten extents beyond EOF, speculative preallocation is
not performed.
Post by Florian Weimer
I see rather dramatic fragmentation of the systemd journal when it is
stored on XFS, and it calls posix_fallocate before writing data to the
file.
There's your problem - systemd is preventing delayed allocation, and
so it is fragmenting the file itself with its write pattern.
Basically, that's a bug in systemd, and not something the filesystem
can avoid because userspace is directly controlling block
allocation.
hohum, I guess we should look into this.
OTOH: nothing wrong with calling posix_fallocate() if you need the space
guarantees it provides for proper operation...
Right, but it's something that the filesystem has no real control
over. We've been asked to allocate blocks immediately by
fallocate(), and so we get what we get....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Christoph Hellwig
2014-03-22 13:32:52 UTC
Post by Florian Weimer
I see rather dramatic fragmentation of the systemd journal when it is
stored on XFS, and it calls posix_fallocate before writing data to the
file.
You mean it calls fallocate before each write? That's not very useful
behaviour and should be fixed. If it calls fallocate for the whole
expected file size (or large increments) it should not fragment the file,
and if it does there's a bug we'd need to look into.
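
A minimal sketch of the "large increments" pattern in Python, for illustration only; it does not describe what systemd does. The class name, the 8 MiB default increment and the bookkeeping are all assumptions. Because posix_fallocate() extends the file size to cover the reserved range, the sketch tracks the logical end of data itself and truncates the unused tail on close.

```python
import os

DEFAULT_CHUNK = 8 * 1024 * 1024  # hypothetical increment, not a recommendation


class ChunkedAppender:
    """Append to a file, reserving space one large chunk at a time."""

    def __init__(self, path, chunk=DEFAULT_CHUNK):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        self.chunk = chunk
        self.end = os.fstat(self.fd).st_size  # logical end of written data
        self.reserved = self.end              # bytes known to be reserved
        self.fallocate_calls = 0

    def append(self, data):
        needed = self.end + len(data)
        if needed > self.reserved:
            # Round the reservation up to the next multiple of the chunk size
            # so that fallocate runs once per chunk, not once per write.
            new_reserved = -(-needed // self.chunk) * self.chunk
            os.posix_fallocate(self.fd, self.reserved,
                               new_reserved - self.reserved)
            self.reserved = new_reserved
            self.fallocate_calls += 1
        # Short writes are not handled in this sketch.
        os.pwrite(self.fd, data, self.end)
        self.end = needed

    def close(self):
        os.ftruncate(self.fd, self.end)  # drop the unused reserved tail
        os.close(self.fd)
```

With a 64 KiB chunk, one hundred 1000-byte appends trigger two posix_fallocate() calls rather than one hundred.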

Dave Chinner
2014-03-21 23:05:53 UTC
Post by Brian Foster
Hi all,
Eric had suggested we add an FAQ entry for speculative preallocation
since it seems to be a common question, so I offered to write something
up. I started with a single entry but split it into a couple Q's when it
turned into TL;DR fodder. ;)
The text is embedded below for review. Thoughts on the questions or
content are appreciated. Also, once folks are Ok with this... how does
one gain edit access to the wiki?
Request an account and wait for one of us admins to ack it.

FWIW, what I'd really like is for the FAQ to be converted to an
asciidoc document in the xfs-documentation tree. The current FAQ has
lots of stuff that could do with updating, but editing a wiki
document that long in a browser is, well, painful. We can then
publish the built HTML version of the FAQ on the wiki...
Post by Brian Foster
Brian
---
Q: Why do files on XFS use more data blocks than expected?
The XFS speculative preallocation algorithm allocates extra blocks
beyond end of file (EOF) to combat fragmentation under parallel
sequential write workloads.
"minimise file fragmentation during buffered write workloads.
Workloads that benefit from this behaviour include slowly growing
files, concurrent writers and mixed reader/writer workloads. It
also provides fragmentation resistance in situations where memory
pressure prevents adequate buffering of dirty data to allow large
contiguous regions of dirty data to be formed in memory."
Post by Brian Foster
This post-EOF block allocation is included
"is accounted identically to blocks within EOF. It is visible..."
Post by Brian Foster
in 'st_blocks' counts via stat() system calls and is accounted as
globally allocated space by the filesystem. This is reported by various
userspace utilities (stat, du, df, ls) and thus provides a common source
of confusion for administrators. Post-EOF blocks are temporary in most
situations and are usually reclaimed via several possible mechanisms in
XFS.
Also accounted for in quotas.
Post by Brian Foster
See the FAQ entry on speculative preallocation for details.
Q: What is speculative preallocation? How can I manage it?
XFS speculatively preallocates post-EOF blocks on file extending writes
in anticipation of future extending writes. The size of a preallocation
is dynamic and depends on the size of the previous extent in the file
(starting from 0 again if the write extends past a hole).
I'd keep specific heuristics out of the description. Heuristics
change....
Post by Brian Foster
As files grow
larger, so does the size of preallocations. Speculative preallocation is
not enabled for files smaller than a minimum size (64k by default, but
can vary depending on filesystem geometry and/or mount options).
Again, actual numbers should probably be avoided, because we can
change that at will...
Post by Brian Foster
Preallocations are capped at a maximum of 8GB on 4k block filesystems.
"capped at a single extent of the maximum supported size of the filesystem"
Post by Brian Foster
Preallocation is throttled automatically as the filesystem approaches
low free space conditions or other allocation limits on a file (such as
a quota).
"Preallocation size is throttled..."
Post by Brian Foster
In most cases, speculative preallocation is automatically reclaimed when
a file is closed. The preallocation may persist after file close if an
open, write, close pattern is repeated on a file. In this scenario,
post-EOF preallocation is trimmed once the inode is reclaimed from cache
or the filesystem unmounted.
I'd rewrite this slightly differently, saying that preallocation "may
persist beyond the lifecycle of any given file descriptor." And then
describe the reason for this - that certain application behaviours
(like slowly growing files, or file servers) can cause fragmentation
if we remove the preallocation on fd close. These behaviours are
automatically detected, and result in "delayed removal" of the
preallocation.

Q: How can I speed up or avoid delayed removal of speculative preallocation?

A. Removing the inode from the VFS cache or unmounting the
filesystem will remove speculative preallocations associated with an
inode.
Post by Brian Foster
Linux 3.8 (and later) includes a scanner to perform background trimming
of files with lingering post-EOF preallocations. The scanner bypasses
files that have been recently modified to not interfere with ongoing
writes. A 5 minute scan interval is used by default and can be adjusted
/proc/sys/fs/xfs/speculative_prealloc_lifetime
Q: Is speculative preallocation permanent?
Post by Brian Foster
Although speculative preallocation can lead to reports of excess space
usage, the preallocated space is not permanent unless explicitly made so
via fallocate or a similar interface. Preallocated space can also be
encoded permanently in situations where file size is extended beyond a
range of post-EOF blocks (i.e., via truncate). Otherwise, preallocated
blocks are reclaimed on file close, inode reclaim, unmount or in the
background once file write activity subsides.
Q: My workload has known characteristics - can I tune speculative
preallocation to be an optimal fixed size?

A.
Post by Brian Foster
Finally, the XFS block allocation algorithm can be configured to use a
fixed allocation size with the 'allocsize=' mount option. Note that
speculative preallocation does not occur when a fixed allocation size is
set and thus increases the potential for fragmentation via parallel
writes.
This should say "dynamic resizing of speculative preallocation does
not occur" rather than "speculative preallocation does not occur",
because allocsize only determines the size of the speculative
preallocation beyond EOF that is done - it doesn't turn it off...
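
To check which XFS mounts have a fixed size set, something like the following Python sketch could be used. It assumes that the kernel lists an `allocsize=` option in /proc/mounts when one is in effect. The function name and the sample lines are hypothetical and are not taken from a real system.

```python
def xfs_allocsize(mounts_text):
    """Map each XFS mount point to its allocsize= value, or None if unset."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts format: device, mount point, fs type, options, ...
        if len(fields) < 4 or fields[2] != "xfs":
            continue
        value = None
        for opt in fields[3].split(","):
            if opt.startswith("allocsize="):
                value = opt.split("=", 1)[1]
        result[fields[1]] = value
    return result


# Hypothetical sample in /proc/mounts format.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /var xfs rw,relatime,attr2,allocsize=64k,noquota 0 0
/dev/sdc1 /data xfs rw,relatime,attr2,inode64,noquota 0 0
"""
```

On a live system the input would come from reading /proc/mounts. A mount that maps to None is using the default dynamic sizing.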

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com