Discussion:
[PATCH v4] xfs: rework zero range to prevent invalid i_size updates
Brian Foster
2014-10-17 20:20:20 UTC
The zero range operation is analogous to fallocate with the exception
that it also converts the range to zeroes. I.e., it attempts to allocate
zeroed blocks over the range specified by the caller. The XFS implementation
kills all delalloc blocks currently over the aligned range, converts the
range to allocated zero blocks (unwritten extents) and handles the
partial pages at the ends of the range by sending writes through the
pagecache.
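For context, from userspace this is the path reached via fallocate(2) with
the FALLOC_FL_ZERO_RANGE mode, optionally combined with FALLOC_FL_KEEP_SIZE
(which is what xfs_io's "fzero -k" issues in the reproducer below). A minimal
illustrative sketch, not part of this patch (file name made up):

	/* Illustrative userspace sketch only (assumes Linux >= 3.15 headers). */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/falloc.h>
	#include <stdio.h>

	int main(void)
	{
		int fd = open("testfile", O_RDWR | O_CREAT, 0644);

		if (fd < 0)
			return 1;
		/* zero 54321 bytes at 1m without changing i_size */
		if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
			      1024 * 1024, 54321) < 0)
			perror("fallocate");
		return 0;
	}
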
The current implementation suffers from several problems associated with
inode size. If the aligned range covers an extending I/O, said I/O is
discarded and an inode size update from a previous write never makes it
to disk. Further, if an unaligned zero range extends beyond eof, the
page write induced for the partial end page can itself increase the
inode size, even if the zero range request is not supposed to update
i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
The latter behavior not only incorrectly increases the inode size, but
can lead to stray delalloc blocks on the inode. Typically, post-eof
preallocation blocks are either truncated on release or inode eviction,
or explicitly written to by xfs_zero_eof() on natural file size
extension. If the inode size increases due to zero range, however,
associated blocks leak into the address space having never been
converted or mapped to pagecache pages. A direct I/O to such an
uncovered range cannot convert the extent via writeback and will BUG().
$ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
...
$ xfs_io -d -c "pread 128k 128k" <file>
<BUG>
If the entire delalloc extent happens to have no page coverage
whatsoever (e.g., delalloc conversion couldn't find a large enough free
space extent), even a full file writeback won't convert what's left of
the extent and we'll assert on inode eviction.
Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
Use the existing hole punch and prealloc mechanisms as primitives for
zero range. We punch out the pagecache beforehand to eliminate
unnecessary writeback. The hole punch mechanism handles partial block
zeroing for us and facilitates the use of a single prealloc call over
the entire range, which increases the odds of contiguous allocation.
---
This is v4, but effectively an alternate implementation to v3 that
reduces zero range to a hole punch and preallocate operation. I
originally reproduced a hang with v3 + my unwritten conversion helper
patch that I thought was somehow related to said helper, but I have seen
what appears to be the same thing with this patch after a couple hundred
cycles of generic/269. That leads me to believe that there is a more
general problem perhaps related to the increased extent manipulation
going on in either implementation (whether for conversion or removal). I
don't reproduce such a failure with focused hole punch testing on tot,
but I do reproduce other assert failures that I suspect just mask the
ability to reproduce this one.
In short, all three of these configurations pass basic correctness tests
for me. v3 has been the most reliable in stress tests. v3+unwritten
conversion and v4 seem to reproduce this hang after a while. v4 is
probably the simplest implementation of the three.
Brian
v4:
- Simplify the implementation to use hole punch.
v3: http://oss.sgi.com/archives/xfs/2014-10/msg00149.html
- Pass length to xfs_alloc_file_space() rather than end offset.
- Split up start/end page writeback branches.
- Fix up a bunch of comments.
v2: http://oss.sgi.com/archives/xfs/2014-10/msg00138.html
- Refactor the logic to punch out pagecache/delalloc first and do
allocation last to prevent stray delalloc on ENOSPC.
v1: http://oss.sgi.com/archives/xfs/2014-10/msg00052.html
fs/xfs/xfs_bmap_util.c | 95 +++++++++++++++++++++++++++++---------------------
1 file changed, 55 insertions(+), 40 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 92e8f99..8d178fc 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1338,7 +1338,10 @@ xfs_free_file_space(
goto out;
}
-
+/*
+ * Preallocate and zero a range of a file. This mechanism has the allocation
+ * semantics of fallocate and in addition converts data in the range to zeroes.
+ */
int
xfs_zero_file_space(
struct xfs_inode *ip,
@@ -1346,65 +1349,77 @@ xfs_zero_file_space(
xfs_off_t len)
{
struct xfs_mount *mp = ip->i_mount;
- uint granularity;
+ uint blksize;
xfs_off_t start_boundary;
xfs_off_t end_boundary;
int error;
+ loff_t eof;
trace_xfs_zero_file_space(ip);
- granularity = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
+ blksize = 1 << mp->m_sb.sb_blocklog;
/*
- * Round the range of extents we are going to convert inwards. If the
- * offset is aligned, then it doesn't get changed so we zero from the
- * start of the block offset points to.
+ * Align the range inward to page size. This represents the range of
+ * pages that can be tossed, even if dirty.
*/
- start_boundary = round_up(offset, granularity);
- end_boundary = round_down(offset + len, granularity);
+ start_boundary = round_up(offset, PAGE_CACHE_SIZE);
+ end_boundary = round_down(offset + len, PAGE_CACHE_SIZE);
ASSERT(start_boundary >= offset);
ASSERT(end_boundary <= offset + len);
- if (start_boundary < end_boundary - 1) {
+ /*
+ * If the range covers one or more full pages, punch out the pagecache
+ * and any delalloc blocks over the range. This is an optimization to
+ * prevent writeback and delalloc extent conversion over the zeroed
+ * range via the hole punch or prealloc calls.
+ *
+ * We only handle the page aligned range here because this function does
+ * not handle the partial block zeroing necessary to keep the cache and
+ * on-disk data consistent. That is the responsibility of the
+ * xfs_free_file_space() call below.
+ */
+ if (end_boundary > start_boundary) {
/*
- * Writeback the range to ensure any inode size updates due to
- * appending writes make it to disk (otherwise we could just
- * punch out the delalloc blocks).
+ * Flush the eof page first if it falls within the range so we
+ * do not lose i_size updates.
*/
- error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
- start_boundary, end_boundary - 1);
- if (error)
- goto out;
+ eof = round_down(i_size_read(VFS_I(ip)) - 1, PAGE_CACHE_SIZE);
+ if (i_size_read(VFS_I(ip)) > ip->i_d.di_size &&
+ eof >= start_boundary && eof <= end_boundary)
+ filemap_write_and_wait_range(VFS_I(ip)->i_mapping, eof,
+ -1);
truncate_pagecache_range(VFS_I(ip), start_boundary,
end_boundary - 1);
- /* convert the blocks */
- error = xfs_alloc_file_space(ip, start_boundary,
- end_boundary - start_boundary - 1,
- XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT);
- if (error)
- goto out;
-
- /* We've handled the interior of the range, now for the edges */
- if (start_boundary != offset) {
- error = xfs_iozero(ip, offset, start_boundary - offset);
- if (error)
- goto out;
- }
-
- if (end_boundary != offset + len)
- error = xfs_iozero(ip, end_boundary,
- offset + len - end_boundary);
-
- } else {
- /*
- * It's either a sub-granularity range or the range spanned lies
- * partially across two adjacent blocks.
- */
- error = xfs_iozero(ip, offset, len);
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ error = xfs_bmap_punch_delalloc_range(ip,
+ XFS_B_TO_FSBT(mp, start_boundary),
+ XFS_B_TO_FSB(mp, end_boundary - start_boundary));
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
}
It appears that the tp->t_blk_res overrun assert is also related to
delalloc punching. I'm basically seeing a writepage attempted for a page
over a delalloc extent (imap in writepage reflects this), but at some
point before we get into the allocation (I suspect between where we
release and reacquire ilock in xfs_map_blocks() to
iomap_write_allocate()) the delalloc blocks that relate this particular
page to the extent disappear. An in-core extent dump after doing the
allocation but before inserting into the extent tree shows the range to
be allocated as a hole. Indeed, the allocator saw a hole (wasdel == 0)
and ended up doing the full allocation while the transaction had a
reservation for a delalloc conversion, which we end up overrunning.

It's not quite clear to me how this is happening. I would expect
truncate_pagecache_range() to serialize against writeback via page lock.
Regardless, the scenario of writing back pages into unallocated space
that writeback thinks is covered by delayed allocation suggests
something racing with delalloc punching. The problem seems to disappear
if I comment it out from zero range, so we might want to kill the above
hunk entirely for the time being.
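For illustration only (not a tested change): dropping that hunk while
keeping the eof flush would reduce the page-aligned branch to roughly just
the pagecache truncate:

	if (end_boundary > start_boundary) {
		/*
		 * Toss whole pages; the partial start/end blocks are zeroed
		 * by the xfs_free_file_space() call that follows.
		 */
		truncate_pagecache_range(VFS_I(ip), start_boundary,
					 end_boundary - 1);
	}
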

Brian
+ /*
+ * Punch a hole and prealloc the range. We use hole punch rather than
+ * unwritten extent conversion for two reasons:
+ *
+ * 1.) Hole punch handles partial block zeroing for us. Note that we've
+ * already tossed pagecache over the aligned range so we won't write
+ * back too much data (only unaligned start/end pages to avoid races
+ * with uncached I/O).
+ *
+ * 2.) If prealloc returns ENOSPC, the file range is still zero-valued
+ * by virtue of the hole punch.
+ */
+ error = xfs_free_file_space(ip, offset, len);
+ if (error)
+ goto out;
+
+ error = xfs_alloc_file_space(ip, round_down(offset, blksize),
+ round_up(offset + len, blksize) -
+ round_down(offset, blksize),
+ XFS_BMAPI_PREALLOC);
return error;
--
1.8.3.1
Dave Chinner
2014-10-20 01:25:16 UTC
Post by Brian Foster
It appears that the tp->t_blk_res overrun assert is also related to
delalloc punching. I'm basically seeing a writepage attempted for a page
over a delalloc extent (imap in writepage reflects this), but at some
point before we get into the allocation (I suspect between where we
release and reacquire ilock in xfs_map_blocks() to
iomap_write_allocate()) the delalloc blocks that relate this particular
page to the extent disappear. An in-core extent dump after doing the
allocation but before inserting into the extent tree shows the range to
be allocated as a hole. Indeed, the allocator saw a hole (wasdel == 0)
and ended up doing the full allocation while the transaction had a
reservation for a delalloc conversion, which we end up overrunning.
It's not quite clear to me how this is happening. I would expect
truncate_pagecache_range() to serialize against writeback via page lock.
It does. But the problem won't be truncate_pagecache_range(), it will be
that punching delalloc blocks doesn't serialise against the page
lock.

i.e. we run truncate_pagecache_range/truncate_setsize without
holding inode locks, and they serialise against writeback via page
locks. However, we run xfs_bmap_punch_delalloc_range() holding inode
locks, but no page locks. Hence both can serialise against
writeback, but not against each other. Therefore:

writeback                           zero range

page_lock
page is delalloc
XFS_ILOCK_SHARED                    XFS_ILOCK_EXCL
map delalloc                        <blocks>
xfs_iunlock
                                    punch delalloc range
                                    xfs_iunlock
XFS_ILOCK_EXCL                      pagecache_truncate_range
allocate HOLE!!!!                   page_lock
xfs_iunlock                         <blocks>
starts page writeback
unlocks page
                                    waits on writeback
                                    removes page from cache

Basically, we're back to the fundamental problem that we can't
manipulate extent state directly if there are *unlocked* pages in
the page cache over that range because we hold meaningful writeback
state on the page.
Post by Brian Foster
Regardless, the scenario of writing back pages into unallocated space
that writeback thinks is covered by delayed allocation suggests
something racing with delalloc punching. The problem seems to disappear
if I comment it out from zero range, so we might want to kill the above
hunk entirely for the time being.
Yes, but I think I see the way forward now. xfs_vm_writepage() needs
to unconditionally map the page it is being passed rather than
relying on the state of bufferheads attached to the page. We can't
serialise the page cache state against direct extent manipulations,
therefore we have to make all decisions based on one set of state we
can serialise access to sanely.

The issue is that we can't currently serialise extent manipulations
against writeback easily, but I think I see a neat and easy way to
do this, and it mirrors a technique we already use(*): add a cursor
to track the active map we are operating on so that we know if we
are about to modify an extent we are currently writing to.

Actually, a cursor solves a couple of other problems in this
mapping code - see the big comment in xfs_iomap_write_allocate()
about using a single map to avoid lookup/unlock/lock/allocate race
conditions with truncate/hole-punch.
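To illustrate the idea only -- the structure and field names below are
hypothetical, not an existing XFS API -- such a cursor might look something
like this, tracked per inode and invalidated by anyone who manipulates
extents under XFS_ILOCK_EXCL:

	/* Hypothetical sketch, not existing code. */
	struct xfs_imap_cursor {
		struct list_head	ic_list;	/* active cursors on this inode */
		struct xfs_bmbt_irec	ic_imap;	/* mapping writeback is working on */
		bool			ic_stale;	/* set by punch/truncate/hole-punch
							 * while holding XFS_ILOCK_EXCL */
	};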

So I think I have a solution that moves us towards the goal of "die,
bufferheads, die" and solves these writeback vs page cache vs extent
manipulation issues - let me do a bit more thinking about it and
I'll write something up.

Cheers,

Dave.

(*) we use a cursor for a similar purpose during AIL traversal: to
detect item deletion while the AIL lock was dropped that would cause
us to follow an invalid list pointer.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-10-20 12:42:49 UTC
Post by Dave Chinner
Post by Brian Foster
It appears that the tp->t_blk_res overrun assert is also related to
delalloc punching. I'm basically seeing a writepage attempted for a page
over a delalloc extent (imap in writepage reflects this), but at some
point before we get into the allocation (I suspect between where we
release and reacquire ilock in xfs_map_blocks() to
iomap_write_allocate()) the delalloc blocks that relate this particular
page to the extent disappear. An in-core extent dump after doing the
allocation but before inserting into the extent tree shows the range to
be allocated as a hole. Indeed, the allocator saw a hole (wasdel == 0)
and ended up doing the full allocation while the transaction had a
reservation for a delalloc conversion, which we end up overrunning.
It's not quite clear to me how this is happening. I would expect
truncate_pagecache_range() to serialize against writeback via page lock.
It does. But the problem won't be truncate_pagecache_range(), it will be
that punching delalloc blocks doesn't serialise against the page
lock.
Ok...
Post by Dave Chinner
i.e. we run truncate_pagecache_range/truncate_setsize without
holding inode locks, and they serialise against writeback via page
locks. However, we run xfs_bmap_punch_delalloc_range() holding inode
locks, but no page locks. Hence both can serialise against
writeback, but not against each other. Therefore:

writeback                           zero range

page_lock
page is delalloc
XFS_ILOCK_SHARED                    XFS_ILOCK_EXCL
map delalloc                        <blocks>
xfs_iunlock
This pretty much matches the observed behavior and what I suspect is
happening at a basic level (i.e., this ilock cycle opens a race window).
What still isn't clear is how this gets past the pagecache truncate in
such a state. The zero range path truncates pagecache before the
delalloc punch, which does serialize on the page lock. If it gets the
lock it completely tosses the page, which it looks like the mapping
checks in writepage should handle. If it doesn't get the lock or the
page is under writeback, it looks like it waits on writeback.

This is an fsx workload so it is single threaded I/O. I'm not quite
seeing how we get through the writeback/pagecache truncate in a state
with a delalloc page.
Post by Dave Chinner
                                    punch delalloc range
                                    xfs_iunlock
XFS_ILOCK_EXCL                      pagecache_truncate_range
allocate HOLE!!!!                   page_lock
xfs_iunlock                         <blocks>
Ah, so this makes sense according to the code as it is today because we
punch delalloc blocks and then truncate pagecache. Note that this zero
range rework does the pagecache truncate first, then punches out the
delalloc blocks. Hence my expectation of either seeing the page tossed
or the extent converted before getting to the delalloc punch.
Post by Dave Chinner
starts page writeback
unlocks page
                                    waits on writeback
                                    removes page from cache
Basically, we're back to the fundamental problem that we can't
manipulate extent state directly if there are *unlocked* pages in
the page cache over that range because we hold meaningful writeback
state on the page.
Post by Brian Foster
Regardless, the scenario of writing back pages into unallocated space
that writeback thinks is covered by delayed allocation suggests
something racing with delalloc punching. The problem seems to disappear
if I comment it out from zero range, so we might want to kill the above
hunk entirely for the time being.
Yes, but I think I see the way forward now. xfs_vm_writepage() needs
to unconditionally map the page it is being passed rather than
relying on the state of bufferheads attached to the page. We can't
serialise the page cache state against direct extent manipulations,
therefore we have to make all decisions based on one set of state we
can serialise access to sanely.
Sounds interesting. I'm all for condensing our page state management
towards killing buffer_head. Tracking down some of these issues has
helped clarify the general problem for me and also shows that instances
are quite difficult to root cause. This one in particular is a zero
range operation that presumably works (no fsx failures), only to cause
an implicit error in an asynchronous path (writeback) sometime later.
That makes it difficult to detect and equally difficult to try and
narrow down.
Post by Dave Chinner
The issue is that we can't currently serialise extent manipulations
against writeback easily, but I think I see a neat and easy way to
do this, and it mirrors a technique we already use(*): add a cursor
to track the active map we are operating on so that we know if we
are about to modify an extent we are currently writing to.
Ok, so I take it this would allow us to mark an extent as "under
modification" or "in-use" in the sense of its state is significant for a
code sequence larger than an ilock critical section (i.e., an imap,
extent-conversion, writeback sequence). Therefore, one side can
invalidate another if a modification is made, or conversely the other
side could simply block on the operation if that is more appropriate. I
_think_ I get where you're going there, but I await more details. ;)

I'll post a new version of this patch regardless because, as mentioned
previously, I'd like to get some kind of checkpoint into the tree that
works correctly (no asserts), even if the implementation is not ideal.
I've not been able to reproduce any asserts so far without the delalloc
punch, but I'd also like to see if any of your subsequent testing
uncovers anything else.

Brian
Post by Dave Chinner
Actually, a cursor solves a couple of other problems in this
mapping code - see the big comment in xfs_iomap_write_allocate()
about using a single map to avoid lookup/unlock/lock/allocate race
conditions with truncate/hole-punch.
So I think I have a solution that moves us towards the goal of "die,
bufferheads, die" and solves these writeback vs page cache vs extent
manipulation issues - let me do a bit more thinking about it and
I'll write something up.
Cheers,
Dave.
(*) we use a cursor for a similar purpose during AIL traversal: to
detect item deletion while the AIL lock was dropped that would cause
us to follow an invalid list pointer.
--
Dave Chinner