Discussion:
[PATCH v2 1/5] xfs: track collapse via file offset rather than extent index
Brian Foster
2014-09-10 13:20:27 UTC
The collapse range implementation uses a transaction per extent shift.
The progress of the overall operation is tracked via the current extent
index of the in-core extent list. This is racy because the ilock must be
dropped and reacquired for each transaction according to locking and log
reservation rules. Therefore, writeback to prior regions of the file is
possible and can change the extent count. This changes the extent to
which the current index refers and causes the collapse to fail
mid-operation. To avoid this problem, the entire file is currently written
back before the collapse operation starts.

To eliminate the need to flush the entire file, use the file offset
(fsb) to track the progress of the overall extent shift operation rather
than the extent index. Modify xfs_bmap_shift_extents() to
unconditionally convert the start_fsb parameter to an extent index and
return the file offset of the extent where the shift left off, if
further extents exist. The bulk of this function can remain based on
extent index, as the ilock is held by the caller. xfs_collapse_file_space()
now uses the fsb output as the starting point for the subsequent shift.
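
For illustration, a minimal sketch of the resulting caller pattern
follows (a hand-written approximation, not the kernel code verbatim;
transaction setup, ilock cycling and free list handling are elided
into comments):

xfs_fileoff_t next_fsb = XFS_B_TO_FSB(mp, offset + len);
xfs_fileoff_t shift_fsb = XFS_B_TO_FSB(mp, len);
int done = 0;

while (!done) {
    /* alloc transaction, reserve log space, take the ilock */
    start_fsb = next_fsb;
    error = xfs_bmap_shift_extents(tp, ip, start_fsb, shift_fsb,
                                   &done, &next_fsb, &first_block,
                                   &free_list,
                                   XFS_BMAP_MAX_SHIFT_EXTENTS);
    if (error)
        break;
    /*
     * Finish the free list and commit; this cycles the ilock, but
     * resuming by file offset (next_fsb) remains valid even if
     * writeback changes the extent count in the meantime.
     */
}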

Signed-off-by: Brian Foster <***@redhat.com>
Reviewed-by: Dave Chinner <***@redhat.com>
---
fs/xfs/libxfs/xfs_bmap.c | 85 +++++++++++++++++++++++++-----------------------
fs/xfs/libxfs/xfs_bmap.h | 7 ++--
fs/xfs/xfs_bmap_util.c | 12 +++----
3 files changed, 53 insertions(+), 51 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 86df952..4b3f1b9 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5406,20 +5406,21 @@ error0:
/*
* Shift extent records to the left to cover a hole.
*
- * The maximum number of extents to be shifted in a single operation
- * is @num_exts, and @current_ext keeps track of the current extent
- * index we have shifted. @offset_shift_fsb is the length by which each
- * extent is shifted. If there is no hole to shift the extents
- * into, this will be considered invalid operation and we abort immediately.
+ * The maximum number of extents to be shifted in a single operation is
+ * @num_exts. @start_fsb specifies the file offset to start the shift and the
+ * file offset where we've left off is returned in @next_fsb. @offset_shift_fsb
+ * is the length by which each extent is shifted. If there is no hole to shift
+ * the extents into, this will be considered invalid operation and we abort
+ * immediately.
*/
int
xfs_bmap_shift_extents(
struct xfs_trans *tp,
struct xfs_inode *ip,
- int *done,
xfs_fileoff_t start_fsb,
xfs_fileoff_t offset_shift_fsb,
- xfs_extnum_t *current_ext,
+ int *done,
+ xfs_fileoff_t *next_fsb,
xfs_fsblock_t *firstblock,
struct xfs_bmap_free *flist,
int num_exts)
@@ -5431,6 +5432,7 @@ xfs_bmap_shift_extents(
struct xfs_mount *mp = ip->i_mount;
struct xfs_ifork *ifp;
xfs_extnum_t nexts = 0;
+ xfs_extnum_t current_ext;
xfs_fileoff_t startoff;
int error = 0;
int i;
@@ -5451,7 +5453,8 @@ xfs_bmap_shift_extents(
if (XFS_FORCED_SHUTDOWN(mp))
return -EIO;

- ASSERT(current_ext != NULL);
+ ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
+ ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));

ifp = XFS_IFORK_PTR(ip, whichfork);
if (!(ifp->if_flags & XFS_IFEXTENTS)) {
@@ -5462,20 +5465,18 @@ xfs_bmap_shift_extents(
}

/*
- * If *current_ext is 0, we would need to lookup the extent
- * from where we would start shifting and store it in gotp.
+ * Look up the extent index for the fsb where we start shifting. We can
+ * henceforth iterate with current_ext as extent list changes are locked
+ * out via ilock.
+ *
+ * gotp can be null in 2 cases: 1) if there are no extents or 2)
+ * start_fsb lies in a hole beyond which there are no extents. Either
+ * way, we are done.
*/
- if (!*current_ext) {
- gotp = xfs_iext_bno_to_ext(ifp, start_fsb, current_ext);
- /*
- * gotp can be null in 2 cases: 1) if there are no extents
- * or 2) start_fsb lies in a hole beyond which there are
- * no extents. Either way, we are done.
- */
- if (!gotp) {
- *done = 1;
- return 0;
- }
+ gotp = xfs_iext_bno_to_ext(ifp, start_fsb, &current_ext);
+ if (!gotp) {
+ *done = 1;
+ return 0;
}

if (ifp->if_flags & XFS_IFBROOT) {
@@ -5487,36 +5488,32 @@ xfs_bmap_shift_extents(

/*
* There may be delalloc extents in the data fork before the range we
- * are collapsing out, so we cannot
- * use the count of real extents here. Instead we have to calculate it
- * from the incore fork.
+ * are collapsing out, so we cannot use the count of real extents here.
+ * Instead we have to calculate it from the incore fork.
*/
total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
- while (nexts++ < num_exts && *current_ext < total_extents) {
+ while (nexts++ < num_exts && current_ext < total_extents) {

- gotp = xfs_iext_get_ext(ifp, *current_ext);
+ gotp = xfs_iext_get_ext(ifp, current_ext);
xfs_bmbt_get_all(gotp, &got);
startoff = got.br_startoff - offset_shift_fsb;

/*
- * Before shifting extent into hole, make sure that the hole
- * is large enough to accomodate the shift.
+ * Before shifting extent into hole, make sure that the hole is
+ * large enough to accommodate the shift.
*/
- if (*current_ext) {
- xfs_bmbt_get_all(xfs_iext_get_ext(ifp,
- *current_ext - 1), &left);
-
+ if (current_ext > 0) {
+ xfs_bmbt_get_all(xfs_iext_get_ext(ifp, current_ext - 1),
+ &left);
if (startoff < left.br_startoff + left.br_blockcount)
error = -EINVAL;
} else if (offset_shift_fsb > got.br_startoff) {
/*
- * When first extent is shifted, offset_shift_fsb
- * should be less than the stating offset of
- * the first extent.
+ * When first extent is shifted, offset_shift_fsb should
+ * be less than the stating offset of the first extent.
*/
error = -EINVAL;
}
-
if (error)
goto del_cursor;

@@ -5531,7 +5528,7 @@ xfs_bmap_shift_extents(
}

/* Check if we can merge 2 adjacent extents */
- if (*current_ext &&
+ if (current_ext &&
left.br_startoff + left.br_blockcount == startoff &&
left.br_startblock + left.br_blockcount ==
got.br_startblock &&
@@ -5539,7 +5536,7 @@ xfs_bmap_shift_extents(
left.br_blockcount + got.br_blockcount <= MAXEXTLEN) {
blockcount = left.br_blockcount +
got.br_blockcount;
- xfs_iext_remove(ip, *current_ext, 1, 0);
+ xfs_iext_remove(ip, current_ext, 1, 0);
logflags |= XFS_ILOG_CORE;
if (cur) {
error = xfs_btree_delete(cur, &i);
@@ -5551,7 +5548,7 @@ xfs_bmap_shift_extents(
}
XFS_IFORK_NEXT_SET(ip, whichfork,
XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
- gotp = xfs_iext_get_ext(ifp, --*current_ext);
+ gotp = xfs_iext_get_ext(ifp, --current_ext);
xfs_bmbt_get_all(gotp, &got);

/* Make cursor point to the extent we will update */
@@ -5585,13 +5582,18 @@ xfs_bmap_shift_extents(
logflags |= XFS_ILOG_DEXT;
}

- (*current_ext)++;
+ current_ext++;
total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
}

/* Check if we are done */
- if (*current_ext == total_extents)
+ if (current_ext == total_extents)
*done = 1;
+ else if (next_fsb) {
+ gotp = xfs_iext_get_ext(ifp, current_ext);
+ xfs_bmbt_get_all(gotp, &got);
+ *next_fsb = got.br_startoff;
+ }

del_cursor:
if (cur)
@@ -5600,5 +5602,6 @@ del_cursor:

if (logflags)
xfs_trans_log_inode(tp, ip, logflags);
+
return error;
}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index b879ca5..44db6db 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -178,9 +178,8 @@ int xfs_check_nostate_extents(struct xfs_ifork *ifp, xfs_extnum_t idx,
xfs_extnum_t num);
uint xfs_default_attroffset(struct xfs_inode *ip);
int xfs_bmap_shift_extents(struct xfs_trans *tp, struct xfs_inode *ip,
- int *done, xfs_fileoff_t start_fsb,
- xfs_fileoff_t offset_shift_fsb, xfs_extnum_t *current_ext,
- xfs_fsblock_t *firstblock, struct xfs_bmap_free *flist,
- int num_exts);
+ xfs_fileoff_t start_fsb, xfs_fileoff_t offset_shift_fsb,
+ int *done, xfs_fileoff_t *next_fsb, xfs_fsblock_t *firstblock,
+ struct xfs_bmap_free *flist, int num_exts);

#endif /* __XFS_BMAP_H__ */
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1707980..1e96d77 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1456,18 +1456,18 @@ xfs_collapse_file_space(
struct xfs_mount *mp = ip->i_mount;
struct xfs_trans *tp;
int error;
- xfs_extnum_t current_ext = 0;
struct xfs_bmap_free free_list;
xfs_fsblock_t first_block;
int committed;
xfs_fileoff_t start_fsb;
+ xfs_fileoff_t next_fsb;
xfs_fileoff_t shift_fsb;

ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));

trace_xfs_collapse_file_space(ip);

- start_fsb = XFS_B_TO_FSB(mp, offset + len);
+ next_fsb = XFS_B_TO_FSB(mp, offset + len);
shift_fsb = XFS_B_TO_FSB(mp, len);

/*
@@ -1525,10 +1525,10 @@ xfs_collapse_file_space(
* We are using the write transaction in which max 2 bmbt
* updates are allowed
*/
- error = xfs_bmap_shift_extents(tp, ip, &done, start_fsb,
- shift_fsb, &current_ext,
- &first_block, &free_list,
- XFS_BMAP_MAX_SHIFT_EXTENTS);
+ start_fsb = next_fsb;
+ error = xfs_bmap_shift_extents(tp, ip, start_fsb, shift_fsb,
+ &done, &next_fsb, &first_block, &free_list,
+ XFS_BMAP_MAX_SHIFT_EXTENTS);
if (error)
goto out;
--
1.8.3.1
Brian Foster
2014-09-10 13:20:31 UTC
xfs_free_file_space() only affects the range of the file for which space
is being freed. It currently writes back and truncates the page cache from
the start offset of the free to EOF.

Modify xfs_free_file_space() to write back and truncate the page cache of
just the range being freed.
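
As a quick worked example of the new bounds (hypothetical numbers;
round_down()/round_up() below are userspace stand-ins for the kernel
helpers):

#include <stdio.h>

/* stand-ins for the kernel helpers; y must be a power of two */
#define round_down(x, y) ((x) & ~((y) - 1))
#define round_up(x, y)   (((x) + (y) - 1) & ~((y) - 1))

int main(void)
{
    long offset = 5000, len = 3000; /* example free request */
    long rounding = 4096;           /* max(block size, page size) */

    /* prints "writeback/truncate [4096, 8191]": just the pages
     * covering the freed range, not everything out to EOF */
    printf("writeback/truncate [%ld, %ld]\n",
           round_down(offset, rounding),
           round_up(offset + len, rounding) - 1);
    return 0;
}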

Signed-off-by: Brian Foster <***@redhat.com>
Reviewed-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_bmap_util.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index eae763f..809ae7d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1205,6 +1205,7 @@ xfs_free_file_space(
xfs_bmap_free_t free_list;
xfs_bmbt_irec_t imap;
xfs_off_t ioffset;
+ xfs_off_t iendoffset;
xfs_extlen_t mod=0;
xfs_mount_t *mp;
int nimap;
@@ -1233,12 +1234,13 @@ xfs_free_file_space(
inode_dio_wait(VFS_I(ip));

rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
- ioffset = offset & ~(rounding - 1);
- error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
- ioffset, -1);
+ ioffset = round_down(offset, rounding);
+ iendoffset = round_up(offset + len, rounding) - 1;
+ error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, ioffset,
+ iendoffset);
if (error)
goto out;
- truncate_pagecache_range(VFS_I(ip), ioffset, -1);
+ truncate_pagecache_range(VFS_I(ip), ioffset, iendoffset);

/*
* Need to zero the stuff we're not freeing, on disk.
--
1.8.3.1
Brian Foster
2014-09-10 13:20:30 UTC
The collapse range operation currently writes the entire file before
starting the collapse to avoid changes in the in-core extent list due to
writeback causing the extent count to change. Now that collapse range is
fsb based rather than extent index based, it can sustain changes in the
extent list during the shift sequence without disruption.

Modify xfs_collapse_file_space() to write back and invalidate pages
associated with the range of the file to be shifted.
xfs_free_file_space() currently has similar behavior, but the space
free need only affect the region of the file that is freed, and this
could change in the future.

Also update the comments to reflect the current implementation. We
retain the eofblocks trim permanently as the best option for dealing with
delalloc extents. We don't shift delalloc extents because this scenario
only occurs with post-eof preallocation (since data must be flushed such
that the cache can be invalidated and data can be shifted). That means
said space must also be initialized before being shifted into the
accessible region of the file, only to be immediately truncated off as
the last part of the collapse. In other words, the eofblocks trim will
happen anyway; we just run it first to ensure the file remains in a
consistent state throughout the collapse.

Finally, detect and fail explicitly in the event of a delalloc extent
during the extent shift. The implementation does not support delalloc
extents, and the caller is expected to prevent this scenario in advance,
as is done by collapse.

Signed-off-by: Brian Foster <***@redhat.com>
---
fs/xfs/libxfs/xfs_bmap.c | 4 ++++
fs/xfs/xfs_bmap_util.c | 32 +++++++++++++++++++-------------
2 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 69bf8d8..79c9819 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5543,6 +5543,10 @@ xfs_bmse_shift_one(
xfs_bmbt_get_all(gotp, &got);
startoff = got.br_startoff - offset_shift_fsb;

+ /* delalloc extents should be prevented by caller */
+ XFS_WANT_CORRUPTED_GOTO(!isnullstartblock(got.br_startblock),
+ out_error);
+
/*
* If this is the first extent in the file, make sure there's enough
* room at the start of the file and jump right to the shift as there's
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1e96d77..eae763f 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1470,27 +1470,33 @@ xfs_collapse_file_space(
next_fsb = XFS_B_TO_FSB(mp, offset + len);
shift_fsb = XFS_B_TO_FSB(mp, len);

- /*
- * Writeback the entire file and force remove any post-eof blocks. The
- * writeback prevents changes to the extent list via concurrent
- * writeback and the eofblocks trim prevents the extent shift algorithm
- * from running into a post-eof delalloc extent.
- *
- * XXX: This is a temporary fix until the extent shift loop below is
- * converted to use offsets and lookups within the ILOCK rather than
- * carrying around the index into the extent list for the next
- * iteration.
- */
- error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
+ error = xfs_free_file_space(ip, offset, len);
if (error)
return error;
+
+ /*
+ * Trim eofblocks to avoid shifting uninitialized post-eof preallocation
+ * into the accessible region of the file.
+ */
if (xfs_can_free_eofblocks(ip, true)) {
error = xfs_free_eofblocks(mp, ip, false);
if (error)
return error;
}

- error = xfs_free_file_space(ip, offset, len);
+ /*
+ * Writeback and invalidate cache for the remainder of the file as we're
+ * about to shift down every extent from the collapse range to EOF. The
+ * free of the collapse range above might have already done some of
+ * this, but we shouldn't rely on it to do anything outside of the range
+ * that was freed.
+ */
+ error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
+ offset + len, -1);
+ if (error)
+ return error;
+ error = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
+ (offset + len) >> PAGE_CACHE_SHIFT, -1);
if (error)
return error;
--
1.8.3.1
Brian Foster
2014-09-10 13:20:28 UTC
The extent shift mechanism in xfs_bmap_shift_extents() is complicated
and handles several different, non-deterministic scenarios. These
include extent shifts, extent merges and potential btree updates in
either of the former scenarios.

Refactor the code to be more linear and readable. The loop logic in
xfs_bmap_shift_extents() and some initial error checking are adjusted
slightly. The associated btree lookup and update/delete operations are
condensed into single blocks of code. This reduces the number of
btree-specific blocks and facilitates the separation of the merge
operation into new xfs_bmse_merge() and xfs_bmse_can_merge() helpers.

This is a code refactor only. The behavior of extent shift and collapse
range is not modified.

Signed-off-by: Brian Foster <***@redhat.com>
---
fs/xfs/libxfs/xfs_bmap.c | 242 ++++++++++++++++++++++++++++++++---------------
1 file changed, 167 insertions(+), 75 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4b3f1b9..532c4aa 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5404,6 +5404,120 @@ error0:
}

/*
+ * Determine whether an extent shift can be accomplished by a merge with the
+ * extent that precedes the target hole of the shift.
+ */
+STATIC bool
+xfs_bmse_can_merge(
+ struct xfs_bmbt_irec *left, /* preceding extent */
+ struct xfs_bmbt_irec *got, /* current extent to shift */
+ xfs_fileoff_t shift) /* shift fsb */
+{
+ xfs_fileoff_t startoff;
+
+ startoff = got->br_startoff - shift;
+
+ /*
+ * The extent, once shifted, must be adjacent in-file and on-disk with
+ * the preceding extent.
+ */
+ if ((left->br_startoff + left->br_blockcount != startoff) ||
+ (left->br_startblock + left->br_blockcount != got->br_startblock) ||
+ (left->br_state != got->br_state) ||
+ (left->br_blockcount + got->br_blockcount > MAXEXTLEN))
+ return false;
+
+ return true;
+}
+
+/*
+ * A bmap extent shift adjusts the file offset of an extent to fill a preceding
+ * hole in the file. If an extent shift would result in the extent being fully
+ * adjacent to the extent that currently precedes the hole, we can merge with
+ * the preceding extent rather than do the shift.
+ *
+ * This function assumes the caller has verified a shift-by-merge is possible
+ * with the provided extents via xfs_bmse_can_merge().
+ */
+STATIC int
+xfs_bmse_merge(
+ struct xfs_inode *ip,
+ int whichfork,
+ xfs_fileoff_t shift, /* shift fsb */
+ int current_ext, /* idx of gotp */
+ struct xfs_bmbt_rec_host *gotp, /* extent to shift */
+ struct xfs_bmbt_rec_host *leftp, /* preceding extent */
+ struct xfs_btree_cur *cur,
+ int *logflags) /* output */
+{
+ struct xfs_ifork *ifp;
+ struct xfs_bmbt_irec got;
+ struct xfs_bmbt_irec left;
+ xfs_filblks_t blockcount;
+ int error, i;
+
+ ifp = XFS_IFORK_PTR(ip, whichfork);
+ xfs_bmbt_get_all(gotp, &got);
+ xfs_bmbt_get_all(leftp, &left);
+ blockcount = left.br_blockcount + got.br_blockcount;
+
+ ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
+ ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+ ASSERT(xfs_bmse_can_merge(&left, &got, shift));
+
+ /*
+ * Merge the in-core extents. Note that the host record pointers and
+ * current_ext index are invalid once the extent has been removed via
+ * xfs_iext_remove().
+ */
+ xfs_bmbt_set_blockcount(leftp, blockcount);
+ xfs_iext_remove(ip, current_ext, 1, 0);
+
+ /*
+ * Update the on-disk extent count, the btree if necessary and log the
+ * inode.
+ */
+ XFS_IFORK_NEXT_SET(ip, whichfork,
+ XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
+ *logflags |= XFS_ILOG_CORE;
+ if (!cur) {
+ *logflags |= XFS_ILOG_DEXT;
+ return 0;
+ }
+
+ /* lookup and remove the extent to merge */
+ error = xfs_bmbt_lookup_eq(cur, got.br_startoff, got.br_startblock,
+ got.br_blockcount, &i);
+ if (error)
+ goto out_error;
+ XFS_WANT_CORRUPTED_GOTO(i == 1, out_error);
+
+ error = xfs_btree_delete(cur, &i);
+ if (error)
+ goto out_error;
+ XFS_WANT_CORRUPTED_GOTO(i == 1, out_error);
+
+ /* lookup and update size of the previous extent */
+ error = xfs_bmbt_lookup_eq(cur, left.br_startoff, left.br_startblock,
+ left.br_blockcount, &i);
+ if (error)
+ goto out_error;
+ XFS_WANT_CORRUPTED_GOTO(i == 1, out_error);
+
+ left.br_blockcount = blockcount;
+
+ error = xfs_bmbt_update(cur, left.br_startoff, left.br_startblock,
+ left.br_blockcount, left.br_state);
+ if (error)
+ goto out_error;
+
+ return 0;
+
+out_error:
+ return error;
+}
+
+/*
* Shift extent records to the left to cover a hole.
*
* The maximum number of extents to be shifted in a single operation is
@@ -5427,6 +5541,7 @@ xfs_bmap_shift_extents(
{
struct xfs_btree_cur *cur = NULL;
struct xfs_bmbt_rec_host *gotp;
+ struct xfs_bmbt_rec_host *leftp;
struct xfs_bmbt_irec got;
struct xfs_bmbt_irec left;
struct xfs_mount *mp = ip->i_mount;
@@ -5438,7 +5553,6 @@ xfs_bmap_shift_extents(
int i;
int whichfork = XFS_DATA_FORK;
int logflags = 0;
- xfs_filblks_t blockcount = 0;
int total_extents;

if (unlikely(XFS_TEST_ERROR(
@@ -5464,6 +5578,13 @@ xfs_bmap_shift_extents(
return error;
}

+ if (ifp->if_flags & XFS_IFBROOT) {
+ cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
+ cur->bc_private.b.firstblock = *firstblock;
+ cur->bc_private.b.flist = flist;
+ cur->bc_private.b.flags = 0;
+ }
+
/*
* Look up the extent index for the fsb where we start shifting. We can
* henceforth iterate with current_ext as extent list changes are locked
@@ -5476,14 +5597,17 @@ xfs_bmap_shift_extents(
gotp = xfs_iext_bno_to_ext(ifp, start_fsb, &current_ext);
if (!gotp) {
*done = 1;
- return 0;
+ goto del_cursor;
}
+ xfs_bmbt_get_all(gotp, &got);

- if (ifp->if_flags & XFS_IFBROOT) {
- cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
- cur->bc_private.b.firstblock = *firstblock;
- cur->bc_private.b.flist = flist;
- cur->bc_private.b.flags = 0;
+ /*
+ * If the first extent is shifted, offset_shift_fsb cannot be larger
+ * than the starting offset of the first extent.
+ */
+ if (current_ext == 0 && got.br_startoff < offset_shift_fsb) {
+ error = -EINVAL;
+ goto del_cursor;
}

/*
@@ -5493,30 +5617,41 @@ xfs_bmap_shift_extents(
*/
total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
while (nexts++ < num_exts && current_ext < total_extents) {
-
- gotp = xfs_iext_get_ext(ifp, current_ext);
- xfs_bmbt_get_all(gotp, &got);
startoff = got.br_startoff - offset_shift_fsb;

- /*
- * Before shifting extent into hole, make sure that the hole is
- * large enough to accommodate the shift.
- */
+ /* grab the left extent and check for a potential merge */
if (current_ext > 0) {
- xfs_bmbt_get_all(xfs_iext_get_ext(ifp, current_ext - 1),
- &left);
- if (startoff < left.br_startoff + left.br_blockcount)
+ leftp = xfs_iext_get_ext(ifp, current_ext - 1);
+ xfs_bmbt_get_all(leftp, &left);
+
+ /* make sure hole is large enough for shift */
+ if (startoff < left.br_startoff + left.br_blockcount) {
error = -EINVAL;
- } else if (offset_shift_fsb > got.br_startoff) {
- /*
- * When first extent is shifted, offset_shift_fsb should
- * be less than the stating offset of the first extent.
- */
- error = -EINVAL;
+ goto del_cursor;
+ }
+
+ if (xfs_bmse_can_merge(&left, &got, offset_shift_fsb)) {
+ error = xfs_bmse_merge(ip, whichfork,
+ offset_shift_fsb, current_ext, gotp,
+ leftp, cur, &logflags);
+ if (error)
+ goto del_cursor;
+
+ /*
+ * The extent was merged so adjust the extent
+ * index and move onto the next.
+ */
+ current_ext--;
+ goto next;
+ }
}
- if (error)
- goto del_cursor;

+ /*
+ * We didn't merge the extent so do the shift. Update the start
+ * offset in the in-core extent and btree, if necessary.
+ */
+ xfs_bmbt_set_startoff(gotp, startoff);
+ logflags |= XFS_ILOG_CORE;
if (cur) {
error = xfs_bmbt_lookup_eq(cur, got.br_startoff,
got.br_startblock,
@@ -5525,53 +5660,8 @@ xfs_bmap_shift_extents(
if (error)
goto del_cursor;
XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
- }
-
- /* Check if we can merge 2 adjacent extents */
- if (current_ext &&
- left.br_startoff + left.br_blockcount == startoff &&
- left.br_startblock + left.br_blockcount ==
- got.br_startblock &&
- left.br_state == got.br_state &&
- left.br_blockcount + got.br_blockcount <= MAXEXTLEN) {
- blockcount = left.br_blockcount +
- got.br_blockcount;
- xfs_iext_remove(ip, current_ext, 1, 0);
- logflags |= XFS_ILOG_CORE;
- if (cur) {
- error = xfs_btree_delete(cur, &i);
- if (error)
- goto del_cursor;
- XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
- } else {
- logflags |= XFS_ILOG_DEXT;
- }
- XFS_IFORK_NEXT_SET(ip, whichfork,
- XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
- gotp = xfs_iext_get_ext(ifp, --current_ext);
- xfs_bmbt_get_all(gotp, &got);
-
- /* Make cursor point to the extent we will update */
- if (cur) {
- error = xfs_bmbt_lookup_eq(cur, got.br_startoff,
- got.br_startblock,
- got.br_blockcount,
- &i);
- if (error)
- goto del_cursor;
- XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
- }

- xfs_bmbt_set_blockcount(gotp, blockcount);
- got.br_blockcount = blockcount;
- } else {
- /* We have to update the startoff */
- xfs_bmbt_set_startoff(gotp, startoff);
got.br_startoff = startoff;
- }
-
- logflags |= XFS_ILOG_CORE;
- if (cur) {
error = xfs_bmbt_update(cur, got.br_startoff,
got.br_startblock,
got.br_blockcount,
@@ -5582,18 +5672,20 @@ xfs_bmap_shift_extents(
logflags |= XFS_ILOG_DEXT;
}

- current_ext++;
+next:
+ /* update total extent count and grab the next record */
total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
+ if (++current_ext >= total_extents)
+ break;
+ gotp = xfs_iext_get_ext(ifp, current_ext);
+ xfs_bmbt_get_all(gotp, &got);
}

/* Check if we are done */
if (current_ext == total_extents)
*done = 1;
- else if (next_fsb) {
- gotp = xfs_iext_get_ext(ifp, current_ext);
- xfs_bmbt_get_all(gotp, &got);
+ else if (next_fsb)
*next_fsb = got.br_startoff;
- }

del_cursor:
if (cur)
--
1.8.3.1
Brian Foster
2014-09-10 13:20:29 UTC
xfs_bmap_shift_extents() has a variety of conditions and error checks
that make the logic difficult to follow and indent heavy. Refactor the
loop body of this function into a new xfs_bmse_shift_one() helper. This
simplifies the error checks, eliminates the index-decrement-on-merge hack by
pushing the index increment down into the helper, and makes the code
more readable by reducing multiple levels of indentation.

This is a code refactor only. The behavior of extent shift and collapse
range is not modified.

Signed-off-by: Brian Foster <***@redhat.com>
---
fs/xfs/libxfs/xfs_bmap.c | 164 ++++++++++++++++++++++++++---------------------
1 file changed, 91 insertions(+), 73 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 532c4aa..69bf8d8 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5518,6 +5518,88 @@ out_error:
}

/*
+ * Shift a single extent.
+ */
+STATIC int
+xfs_bmse_shift_one(
+ struct xfs_inode *ip,
+ int whichfork,
+ xfs_fileoff_t offset_shift_fsb,
+ int *current_ext,
+ struct xfs_bmbt_rec_host *gotp,
+ struct xfs_btree_cur *cur,
+ int *logflags)
+{
+ struct xfs_ifork *ifp;
+ xfs_fileoff_t startoff;
+ struct xfs_bmbt_rec_host *leftp;
+ struct xfs_bmbt_irec got;
+ struct xfs_bmbt_irec left;
+ int error;
+ int i;
+
+ ifp = XFS_IFORK_PTR(ip, whichfork);
+
+ xfs_bmbt_get_all(gotp, &got);
+ startoff = got.br_startoff - offset_shift_fsb;
+
+ /*
+ * If this is the first extent in the file, make sure there's enough
+ * room at the start of the file and jump right to the shift as there's
+ * no left extent to merge.
+ */
+ if (*current_ext == 0) {
+ if (got.br_startoff < offset_shift_fsb)
+ return -EINVAL;
+ goto shift_extent;
+ }
+
+ /* grab the left extent and check for a large enough hole */
+ leftp = xfs_iext_get_ext(ifp, *current_ext - 1);
+ xfs_bmbt_get_all(leftp, &left);
+
+ if (startoff < left.br_startoff + left.br_blockcount)
+ return -EINVAL;
+
+ /* check whether to merge the extent or shift it down */
+ if (!xfs_bmse_can_merge(&left, &got, offset_shift_fsb))
+ goto shift_extent;
+
+ return xfs_bmse_merge(ip, whichfork, offset_shift_fsb, *current_ext,
+ gotp, leftp, cur, logflags);
+
+shift_extent:
+ /*
+ * Increment the extent index for the next iteration, update the start
+ * offset of the in-core extent and update the btree if applicable.
+ */
+ (*current_ext)++;
+ xfs_bmbt_set_startoff(gotp, startoff);
+ *logflags |= XFS_ILOG_CORE;
+ if (!cur) {
+ *logflags |= XFS_ILOG_DEXT;
+ return 0;
+ }
+
+ error = xfs_bmbt_lookup_eq(cur, got.br_startoff, got.br_startblock,
+ got.br_blockcount, &i);
+ if (error)
+ return error;
+ XFS_WANT_CORRUPTED_GOTO(i == 1, out_error);
+
+ got.br_startoff = startoff;
+ error = xfs_bmbt_update(cur, got.br_startoff, got.br_startblock,
+ got.br_blockcount, got.br_state);
+ if (error)
+ return error;
+
+ return 0;
+
+out_error:
+ return error;
+}
+
+/*
* Shift extent records to the left to cover a hole.
*
* The maximum number of extents to be shifted in a single operation is
@@ -5541,16 +5623,12 @@ xfs_bmap_shift_extents(
{
struct xfs_btree_cur *cur = NULL;
struct xfs_bmbt_rec_host *gotp;
- struct xfs_bmbt_rec_host *leftp;
struct xfs_bmbt_irec got;
- struct xfs_bmbt_irec left;
struct xfs_mount *mp = ip->i_mount;
struct xfs_ifork *ifp;
xfs_extnum_t nexts = 0;
xfs_extnum_t current_ext;
- xfs_fileoff_t startoff;
int error = 0;
- int i;
int whichfork = XFS_DATA_FORK;
int logflags = 0;
int total_extents;
@@ -5599,16 +5677,6 @@ xfs_bmap_shift_extents(
*done = 1;
goto del_cursor;
}
- xfs_bmbt_get_all(gotp, &got);
-
- /*
- * If the first extent is shifted, offset_shift_fsb cannot be larger
- * than the starting offset of the first extent.
- */
- if (current_ext == 0 && got.br_startoff < offset_shift_fsb) {
- error = -EINVAL;
- goto del_cursor;
- }

/*
* There may be delalloc extents in the data fork before the range we
@@ -5617,75 +5685,25 @@ xfs_bmap_shift_extents(
*/
total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
while (nexts++ < num_exts && current_ext < total_extents) {
- startoff = got.br_startoff - offset_shift_fsb;
-
- /* grab the left extent and check for a potential merge */
- if (current_ext > 0) {
- leftp = xfs_iext_get_ext(ifp, current_ext - 1);
- xfs_bmbt_get_all(leftp, &left);
-
- /* make sure hole is large enough for shift */
- if (startoff < left.br_startoff + left.br_blockcount) {
- error = -EINVAL;
- goto del_cursor;
- }
-
- if (xfs_bmse_can_merge(&left, &got, offset_shift_fsb)) {
- error = xfs_bmse_merge(ip, whichfork,
- offset_shift_fsb, current_ext, gotp,
- leftp, cur, &logflags);
- if (error)
- goto del_cursor;
-
- /*
- * The extent was merged so adjust the extent
- * index and move onto the next.
- */
- current_ext--;
- goto next;
- }
- }
-
- /*
- * We didn't merge the extent so do the shift. Update the start
- * offset in the in-core extent and btree, if necessary.
- */
- xfs_bmbt_set_startoff(gotp, startoff);
- logflags |= XFS_ILOG_CORE;
- if (cur) {
- error = xfs_bmbt_lookup_eq(cur, got.br_startoff,
- got.br_startblock,
- got.br_blockcount,
- &i);
- if (error)
- goto del_cursor;
- XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
-
- got.br_startoff = startoff;
- error = xfs_bmbt_update(cur, got.br_startoff,
- got.br_startblock,
- got.br_blockcount,
- got.br_state);
- if (error)
- goto del_cursor;
- } else {
- logflags |= XFS_ILOG_DEXT;
- }
+ error = xfs_bmse_shift_one(ip, whichfork, offset_shift_fsb,
+ &current_ext, gotp, cur, &logflags);
+ if (error)
+ goto del_cursor;

-next:
/* update total extent count and grab the next record */
total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
- if (++current_ext >= total_extents)
+ if (current_ext >= total_extents)
break;
gotp = xfs_iext_get_ext(ifp, current_ext);
- xfs_bmbt_get_all(gotp, &got);
}

/* Check if we are done */
- if (current_ext == total_extents)
+ if (current_ext == total_extents) {
*done = 1;
- else if (next_fsb)
+ } else if (next_fsb) {
+ xfs_bmbt_get_all(gotp, &got);
*next_fsb = got.br_startoff;
+ }

del_cursor:
if (cur)
--
1.8.3.1
Dave Chinner
2014-09-11 04:42:43 UTC
Hi all,
Here's v2 of the collapse clean up. We refactor a bit more via the
insertion of patch 3, otherwise it's similar to v1. This will see some
continued testing, but it survived ~500m fsx operations overnight.
Brian
I'm not sure about the invalidation patch now. On a 1k block size
filesystem, generic/127 fails with:

+ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
+collapse range: 1000 to 3000
+do_collapse_range: fallocate: Device or resource busy

which indicates we had an invalidation failure. This is probably
exposing some other bug, but I haven't had time to look into it yet
so I don't know.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-09-11 15:20:13 UTC
Post by Dave Chinner
Hi all,
Here's v2 of the collapse clean up. We refactor a bit more via the
insertion of patch 3, otherwise it's similar to v1. This will see some
continued testing, but it survived ~500m fsx operations overnight.
Brian
I'm not sure about the invalidation patch now. On a 1k block size
filesystem, generic/127 fails with:
+ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
+collapse range: 1000 to 3000
+do_collapse_range: fallocate: Device or resource busy
which indicates we had an invalidation failure. This is probably
exposing some other bug, but I haven't had time to look into it yet
so I don't know.
Yeah, I can reproduce this as well, thanks. I think you're referring to
the xfs_free_file_space() patch (5/5)..? FWIW, I don't see the problem
without that patch, so it appears that the full pagecache truncate is
still covering up a problem somewhere. I'll try to dig into it...

Brian
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2014-09-11 21:19:15 UTC
Post by Brian Foster
Post by Dave Chinner
Hi all,
Here's v2 of the collapse clean up. We refactor a bit more via the
insertion of patch 3, otherwise it's similar to v1. This will see some
continued testing, but it survived ~500m fsx operations overnight.
Brian
I'm not sure about the invalidation patch now. On a 1k block size
filesystem, generic/127 fails with:
+ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
+collapse range: 1000 to 3000
+do_collapse_range: fallocate: Device or resource busy
which indicates we had an invalidation failure. This is probably
exposing some other bug, but I haven't had time to look into it yet
so I don't know.
Yeah, I can reproduce this as well, thanks. I think you're referring to
the xfs_free_file_space() patch (5/5)..?
*nod*
Post by Brian Foster
FWIW, I don't see the problem
without that patch, so it appears that the full pagecache truncate is
still covering up a problem somewhere. I'll try to dig into it...
It's likely that it is leaving a dirty buffer on the page beyond EOF
as a result of the truncate zeroing the remainder of the page in
memory.

If I get a chance I'll look at it this afternoon, as this patchset
also seems to be causing a marked increase in the number of fsstress
failures due to stray delalloc blocks in files even on 4k block size
filesystems (e.g. at unmount).

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-09-12 19:51:29 UTC
Post by Dave Chinner
Post by Brian Foster
Post by Dave Chinner
Hi all,
Here's v2 of the collapse clean up. We refactor a bit more via the
insertion of patch 3, otherwise it's similar to v1. This will see some
continued testing, but it survived ~500m fsx operations overnight.
Brian
I'm not sure about the invalidation patch now. On a 1k block size
filesystem, generic/127 fails with:
+ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
+collapse range: 1000 to 3000
+do_collapse_range: fallocate: Device or resource busy
which indicates we had an invalidation failure. This is probably
exposing some other bug, but I haven't had time to look into it yet
so I don't know.
Yeah, I can reproduce this as well, thanks. I think you're referring to
the xfs_free_file_space() patch (5/5)..?
*nod*
Post by Brian Foster
FWIW, I don't see the problem
without that patch, so it appears that the full pagecache truncate is
still covering up a problem somewhere. I'll try to dig into it...
It's likely that it is leaving a dirty buffer on the page beyond EOF
as a result of the truncate zeroing the remainder of the page in
memory.
If I get a chance I'll look at it this afternoon, as this patchset
also seems to be causing a marked increase in the number of fsstress
failures due to stray delalloc blocks in files even on 4k block size
filesystems (e.g. at unmount).
I think I have a bit of an idea of what's going on here. I'm not sure
that this one is post-eof delalloc related. For reference, here's the
command that reproduces for me 100% of the time on a 1k block fs
(derived from the fsx trace):

xfs_io -fc "pwrite 185332 55756" \
-c "fcollapse 28672 40960" \
-c "pwrite 133228 63394" \
-c "fcollapse 0 4096" /mnt/file

The first write extends the file to 241088 bytes and the first collapse
targets a hole, but shrinks the file down to 200128 bytes. The last page
offset of the file is now 196608 and with 1k blocks, eof falls within
the last block of this page (e.g., within bh offsets 199680-200704). The
second write falls just into the first block of this page.
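
(As a sanity check of that arithmetic, here's a small standalone
program using only the constants from the xfs_io commands above:)

#include <stdio.h>

int main(void)
{
    long eof1 = 185332 + 55756;           /* 241088, after pwrite 1 */
    long eof2 = eof1 - 40960;             /* 200128, after collapse 1 */
    long last_page = (eof2 - 1) & ~4095L; /* 196608 */
    long eof_block = (eof2 - 1) & ~1023L; /* 199680 */
    long write2_end = 133228 + 63394;     /* 196622 */

    printf("eof after collapse 1: %ld\n", eof2);
    printf("last page offset:     %ld\n", last_page);
    printf("block holding eof:    [%ld, %ld)\n", eof_block,
           eof_block + 1024);
    printf("second pwrite ends:   %ld\n", write2_end);
    return 0;
}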

What I see occur on the final collapse is that the
invalidate_inode_pages2_range() call in the collapse op fails due to
finding a dirty page at offset 196608. We don't have a launder_page()
callback, so this results in -EBUSY.

Given the above, it seems like the filemap_write_and_wait_range() call
should handle this. Digging further, what I see is one or two
writepage()->xfs_cluster_write() sequences along the range of the
previous write. xfs_convert_page() eventually attempts to handle page
offset 196608 and breaks out at offset 197632 (the second bh in the
page) near the top of the bh loop:

...
if (!(PageUptodate(page) || buffer_uptodate(bh))) {
done = 1;
break;
}
...

I suspect the reason the page/buffer is in this particular state is due
to the previous invalidation (collapse 1), which would have removed the
page from the mapping. This seems reasonable to me since we only wrote
into the first bytes of the page since the prior invalidation. The problem is
that the page somehow has the following buffer_head state at the point of the
EBUSY inval failure:

invalidate_inode_pages2_range(648): state 0x29 block 84 size 1024
invalidate_inode_pages2_range(648): state 0x0 block 18446744073709551615 size 1024
invalidate_inode_pages2_range(648): state 0x0 block 18446744073709551615 size 1024
invalidate_inode_pages2_range(648): state 0x2b block 87 size 1024

The first block is BH_Uptodate|BH_Req|BH_Mapped. I read the next two as
simply not up to date..? The last block is the same as the first plus
BH_Dirty. This last block is offset 199680 and the previous write goes
to 196622, so I'm not sure how this ends up dirty. I think the write
path to this range of the file might be the spot to dig next...
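
(For reference, those state masks decode as below, assuming the
standard bh_state_bits ordering from include/linux/buffer_head.h;
quick userspace decoder:)

#include <stdio.h>

/* bh_state_bits order: 0 Uptodate, 1 Dirty, 2 Lock, 3 Req,
 * 4 Uptodate_Lock, 5 Mapped */
static void decode(unsigned long state)
{
    static const char *bits[] = { "Uptodate", "Dirty", "Lock", "Req",
                                  "Uptodate_Lock", "Mapped" };
    int i;

    printf("0x%lx =", state);
    for (i = 0; i < 6; i++)
        if (state & (1UL << i))
            printf(" BH_%s", bits[i]);
    printf("\n");
}

int main(void)
{
    decode(0x29); /* BH_Uptodate BH_Req BH_Mapped */
    decode(0x2b); /* same, plus BH_Dirty */
    return 0;
}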

Brian

P.S., As a datapoint from experimentation, this problem doesn't occur if
I ensure that writepage() handles this particular page rather than
xfs_convert_page(). I can do that by either jumping out of
xfs_convert_page() sooner, before the page is set for writeback, or
hacking xfs_start_page_writeback() to use __test_set_page_writeback()
and keep the write tag such that write_cache_pages() can still find the
page. This calls out an interesting bit of behavior in
xfs_convert_page() since commit a49935f2: if we handle a part of a
page, break and mark the page for writeback without handling the rest,
we'll still clear the PAGECACHE_TAG_TOWRITE tag and writeback won't
finish off the page during the current iteration (for WB_SYNC_ALL).

It's not clear to me if this is contributing to the problem in this
particular case, but it seems like an independent bug at the very
least. Thoughts?
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Brian Foster
2014-09-12 20:05:36 UTC
Post by Brian Foster
Post by Dave Chinner
Post by Brian Foster
Post by Dave Chinner
Hi all,
Here's v2 of the collapse clean up. We refactor a bit more via the
insertion of patch 3, otherwise it's similar to v1. This will see some
continued testing, but it survived ~500m fsx operations overnight.
Brian
I'm not sure about the invalidation patch now. On a 1k block size
filesystem, generic/127 fails with:
+ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
+collapse range: 1000 to 3000
+do_collapse_range: fallocate: Device or resource busy
which indicates we had an invalidation failure. This is probably
exposing some other bug, but I haven't had time to look into it yet
so I don't know.
Yeah, I can reproduce this as well, thanks. I think you're referring to
the xfs_free_file_space() patch (5/5)..?
*nod*
Post by Brian Foster
FWIW, I don't see the problem
without that patch, so it appears that the full pagecache truncate is
still covering up a problem somewhere. I'll try to dig into it...
It's likely that it is leaving a dirty buffer on the page beyond EOF
as a result of the truncate zeroing the remainder of the page in
memory.
If I get a chance I'll look at it this afternoon, as this patchset
also seems to be causing a marked increase in the number of fsstress
failures due to stray delalloc blocks in files even on 4k block size
filesystems (e.g. at unmount).
I think I have a bit of an idea of what's going on here. I'm not sure
that this one is post-eof delalloc related. For reference, here's the
command that reproduces for me 100% of the time on a 1k block fs
(derived from the fsx trace):
xfs_io -fc "pwrite 185332 55756" \
-c "fcollapse 28672 40960" \
-c "pwrite 133228 63394" \
-c "fcollapse 0 4096" /mnt/file
The first write extends the file to 241088 bytes and the first collapse
targets a hole, but shrinks the file down to 200128 bytes. The last page
offset of the file is now 196608 and with 1k blocks, eof falls within
the last block of this page (e.g., within bh offsets 199680-200704). The
second write falls just into the first block of this page.
What I see occur on the final collapse is that the
invalidate_inode_pages2_range() call in the collapse op fails due to
finding a dirty page at offset 196608. We don't have a launder_page()
callback, so this results in -EBUSY.
Given the above, it seems like the filemap_write_and_wait_range() call
should handle this. Digging further, what I see is one or two
writepage()->xfs_cluster_write() sequences along the range of the
previous write. xfs_convert_page() eventually attempts to handle page
offset 196608 and breaks out at offset 197632 (the second bh in the
page) near the top of the bh loop:
...
if (!(PageUptodate(page) || buffer_uptodate(bh))) {
done = 1;
break;
}
...
I suspect the reason the page/buffer is in this particular state is due
to the previous invalidation (collapse 1), which would have removed the
page from the mapping. This seems reasonable to me since we only wrote
into the first bytes of the page since the prior invalidation. The problem is
that the page somehow has the following buffer_head state at the point of the
EBUSY inval failure:
invalidate_inode_pages2_range(648): state 0x29 block 84 size 1024
invalidate_inode_pages2_range(648): state 0x0 block 18446744073709551615 size 1024
invalidate_inode_pages2_range(648): state 0x0 block 18446744073709551615 size 1024
invalidate_inode_pages2_range(648): state 0x2b block 87 size 1024
The first block is BH_Uptodate|BH_Req|BH_Mapped. I read the next two as
simply not up to date..? The last block is the same as the first plus
BH_Dirty. This last block is offset 199680 and the previous write goes
to 196622, so I'm not sure how this ends up dirty. I think the write
path to this range of the file might be the spot to dig next...
... and it just hit me that truncate dirties the buffer. ;) So I'm
wondering if we have something like the following sequence of events for
this particular page:

- first pwrite writes to complete page
- first collapse:
- flush
- invalidate
- truncate -> dirty last buffer of page
- second pwrite writes to first buffer in page (dirty first buffer)
- flush
- xfs_convert_page() hits the first buffer, breaks out
and causes the last buffer to be passed over due to
the issue below
- invalidate
- finds dirty buffer, error!

Brian
Post by Brian Foster
Brian
P.S., As a datapoint from experimentation, this problem doesn't occur if
I ensure that writepage() handles this particular page rather than
xfs_convert_page(). I can do that by either jumping out of
xfs_convert_page() sooner, before the page is set for writeback, or
hacking xfs_start_page_writeback() to use __test_set_page_writeback()
and keep the write tag such that write_cache_pages() can still find the
page. This calls out an interesting bit of behavior in
xfs_convert_page() since commit a49935f2: if we handle a part of a
page, break and mark the page for writeback without handling the rest,
we'll still clear the PAGECACHE_TAG_TOWRITE tag and writeback won't
finish off the page during the current iteration (for WB_SYNC_ALL).
It's not clear to me if this is contributing to the problem in this
particular case, but it seems like an independent bug at the very
least. Thoughts?
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2014-09-15 01:46:54 UTC
Post by Brian Foster
Post by Brian Foster
Post by Dave Chinner
Post by Dave Chinner
Hi all,
Here's v2 of the collapse clean up. We refactor a bit more via the
insertion of patch 3, otherwise it's similar to v1. This will see some
continued testing, but it survived ~500m fsx operations overnight.
Brian
I'm not sure about the invalidation patch now. On a 1k block size
filesystem, generic/127 fails with:
+ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
+collapse range: 1000 to 3000
+do_collapse_range: fallocate: Device or resource busy
which indicates we had an invalidation failure. This is probably
exposing some other bug, but I haven't had time to look into it yet
so I don't know.
....
Post by Brian Foster
Post by Brian Foster
Post by Dave Chinner
It's likely that it is leaving a dirty buffer on the page beyond EOF
as a result of the truncate zeroing the remainder of the page in
memory.
.....
Post by Brian Foster
... and it just hit me that truncate dirties the buffer. ;)
Exactly. ;)
Post by Brian Foster
So I'm
wondering if we have something like the following sequence of events for
- first pwrite writes to complete page
xfs_collapse_file_space: dev 7:0 ino 0x43
Post by Brian Foster
- flush
xfs_map_blocks_alloc: dev 7:0 ino 0x43 size 0x0 offset 0x2d000 count 1024 type startoff 0xb4 startblock 32 blockcount 0x38
xfs_extlist: dev 7:0 ino 0x43 state idx 0 offset 180 block 32 count 56 flag 0 caller xfs_iextents_copy
Post by Brian Foster
- invalidate
xfs_releasepage: dev 7:0 ino 0x43 pgoff 0x2d000 size 0x3adc0 offset 0 length 0 delalloc 0 unwritten 0
....
xfs_releasepage: dev 7:0 ino 0x43 pgoff 0x3a000 size 0x3adc0 offset 0 length 0 delalloc 0 unwritten 0
Post by Brian Foster
- truncate -> dirty last buffer of page
xfs_setattr: dev 7:0 ino 0x43
....
xfs_invalidatepage: dev 7:0 ino 0x43 pgoff 0x30000 size 0x30dc0 offset dc0 length 240 delalloc 0 unwritten 0
....
xfs_itruncate_extents_start: dev 7:0 ino 0x43 size 0x30dc0 new_size 0x30dc0
xfs_extlist: dev 7:0 ino 0x43 state idx 0 offset 140 block 32 count 56 flag 0 caller xfs_iextents_copy

And so we have truncate dirtying the last buffer in the page (the
offset/len indicated by the xfs_invalidatepage tracepoint).
Post by Brian Foster
- second pwrite writes to first buffer in page (dirty first buffer)
- flush
- xfs_convert_page() hits the first buffer, breaks out
and causes the last buffer to be passed over due to
the issue below
- invalidate
- finds dirty buffer, error!
Brian
Post by Brian Foster
Brian
P.S., As a datapoint from experimentation, this problem doesn't occur if
I ensure that writepage() handles this particular page rather than
xfs_convert_page(). I can do that by either jumping out of
xfs_convert_page() sooner, before the page is set for writeback, or
hacking xfs_start_page_writeback() to use __test_set_page_writeback()
and keep the write tag such that write_cache_pages() can still find the
page.
writepage is still supposed to find the page again, because we
haven't fully cleaned it. Indeed, the code used to come back and
write this page: it was still dirty, so the write_cache_pages()
iteration would then see it and write it again.
Post by Brian Foster
Post by Brian Foster
This calls out an interesting bit of behavior in
xfs_convert_page() since commit a49935f2: if we handle a part of a
page, break and mark the page for writeback without handling the rest,
we'll still clear the PAGECACHE_TAG_TOWRITE tag and writeback won't
finish off the page during the current iteration (for WB_SYNC_ALL).
Right, commit a49935f2 made the assumption that the page being left
dirty was sufficient to have the page written as the current
writeback sweep went past it. That *used* to be the way the generic
writeback code worked.

And this is instructive: this same assumption was found in ext4 back
in May, i.e. commit 1c8349a ("ext4: fix data integrity sync in
ordered mode"), which introduced the new functions like
__test_set_page_writeback(). That fix wasn't cc'd to linux-fsdevel,
which might have got our attention and helped us notice the bug earlier...
Post by Brian Foster
Post by Brian Foster
It's not clear to me if this is contributing to the problem in this
particular case, but it seems like an independent bug at the very
least. Thoughts?
We definitely need to use set_page_writeback_keepwrite() for partial
page writes that leave the page dirty. Patch below.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com

xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly

From: Dave Chinner <***@redhat.com>

XFS has been having trouble with stray delayed allocation extents
beyond EOF for a long time. Recent changes to the collapse range
code have triggered erroneous EBUSY errors on page invalidation for
block size smaller than page size filesystems. These
have been caused by dirty buffers beyond EOF on a partial page which
do not get written to disk during a sync.

The issue is that write-ahead in xfs_cluster_write() finds such a
partial page and handles it by leaving the page dirty but pushing it
into a writeback state. This used to work just fine, as the
write_cache_pages() code would then find the dirty partial page in
the next mapping tree lookup as the dirty tag is still set.

Unfortunately, when we moved to a mark and sweep approach to
writeback to fix other writeback sync issues, we broke this. The
act of marking the page as under writeback now clears the TOWRITE
tag in the radix tree, even though the page is still dirty. This
causes the TOWRITE tag to be cleared, and hence the next lookup on
the mapping tree does not find the dirty partial page and so doesn't
try to write it again.

This same writeback bug was found recently in ext4 and fixed in
commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
without communication to the wider filesystem community. We can use
exactly the same fix here so the TOWRITE flag is not cleared on
partial page writes.

cc: ***@vger.kernel.org # dependent on 1c8349a17137b93f0a83f276c764a6df1b9a116e
Root-cause-found-by: Brian Foster <***@redhat.com>
Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_aops.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b984647..2f50253 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -434,10 +434,22 @@ xfs_start_page_writeback(
{
ASSERT(PageLocked(page));
ASSERT(!PageWriteback(page));
- if (clear_dirty)
+
+ /*
+ * if the page was not fully cleaned, we need to ensure that the higher
+ * layers come back to it correctly. That means we need to keep the page
+ * dirty, and for WB_SYNC_ALL writeback we need to ensure the
+ * PAGECACHE_TAG_TOWRITE index mark is not removed so another attempt to
+ * write this page in this writeback sweep will be made.
+ */
+ if (clear_dirty) {
clear_page_dirty_for_io(page);
- set_page_writeback(page);
+ set_page_writeback(page);
+ } else
+ set_page_writeback_keepwrite(page);
+
unlock_page(page);
+
/* If no buffers on the page are to be written, finish it here */
if (!buffers)
end_page_writeback(page);
Brian Foster
2014-09-15 13:18:22 UTC
Post by Dave Chinner
Post by Brian Foster
Post by Brian Foster
Post by Dave Chinner
Post by Dave Chinner
Hi all,
Here's v2 of the collapse clean up. We refactor a bit more via the
insertion of patch 3, otherwise it's similar to v1. This will see some
continued testing, but it survived ~500m fsx operations overnight.
Brian
I'm not sure about the invalidation patch now. On a 1k block size
filesystem, generic/127 fails with:
+ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
+collapse range: 1000 to 3000
+do_collapse_range: fallocate: Device or resource busy
which indicates we had an invalidation failure. This is probably
exposing some other bug, but I haven't had time to look into it yet
so I don't know.
....
Post by Brian Foster
Post by Brian Foster
Post by Dave Chinner
It's likely that it is leaving a dirty buffer on the page beyond EOF
as a result of the truncate zeroing the remainder of the page in
memory.
.....
Post by Brian Foster
... and it just hit me that truncate dirties the buffer. ;)
Exactly. ;)
Post by Brian Foster
So I'm
wondering if we have something like the following sequence of events for
- first pwrite writes to complete page
xfs_collapse_file_space: dev 7:0 ino 0x43
Post by Brian Foster
- flush
xfs_map_blocks_alloc: dev 7:0 ino 0x43 size 0x0 offset 0x2d000 count 1024 type startoff 0xb4 startblock 32 blockcount 0x38
xfs_extlist: dev 7:0 ino 0x43 state idx 0 offset 180 block 32 count 56 flag 0 caller xfs_iextents_copy
Post by Brian Foster
- invalidate
xfs_releasepage: dev 7:0 ino 0x43 pgoff 0x2d000 size 0x3adc0 offset 0 length 0 delalloc 0 unwritten 0
....
xfs_releasepage: dev 7:0 ino 0x43 pgoff 0x3a000 size 0x3adc0 offset 0 length 0 delalloc 0 unwritten 0
Post by Brian Foster
- truncate -> dirty last buffer of page
xfs_setattr: dev 7:0 ino 0x43
....
xfs_invalidatepage: dev 7:0 ino 0x43 pgoff 0x30000 size 0x30dc0 offset dc0 length 240 delalloc 0 unwritten 0
....
xfs_itruncate_extents_start: dev 7:0 ino 0x43 size 0x30dc0 new_size 0x30dc0
xfs_extlist: dev 7:0 ino 0x43 state idx 0 offset 140 block 32 count 56 flag 0 caller xfs_iextents_copy
And so we have truncate dirtying the last buffer in the page (the
offset/len indicated by the xfs_invalidatepage tracepoint).
Post by Brian Foster
- second pwrite writes to first buffer in page (dirty first buffer)
- flush
- xfs_convert_page() hits the first buffer, breaks out
and causes the last buffer to be passed over due to
the issue below
- invalidate
- finds dirty buffer, error!
Brian
Post by Brian Foster
Brian
P.S., As a datapoint from experimentation, this problem doesn't occur if
I ensure that writepage() handles this particular page rather than
xfs_convert_page(). I can do that by either jumping out of
xfs_convert_page() sooner, before the page is set for writeback, or
hacking xfs_start_page_writeback() to use __test_set_page_writeback()
and keep the write tag such that write_cache_pages() can still find the
page.
writepage is still supposed to find the page again, because we
haven't fully cleaned it. Indeed, the code used to come back and
write this page because it was still dirty and the
write_cache_pages() iteration would then see it and write it again
because it is dirty.
Post by Brian Foster
Post by Brian Foster
This calls out an interesting bit of behavior in
xfs_convert_page() since commit a49935f2: if we handle a part of a
page, break and mark the page for writeback without handling the rest,
we'll still clear the PAGECACHE_TAG_TOWRITE tag and writeback won't
finish off the page during the current iteration (for WB_SYNC_ALL).
Right, commit a49935f2 made the assumption that the page being left
dirty was sufficient to have the page written as the current
writeback sweep went past it. That *used* to be the way the generic
writeback code worked.
And this is instructive: this same assumption was found in ext4 back
in May i.e. commit 1c8349a ("ext4: fix data integrity sync in
ordered mode") and that introduced the new functions like
__test_set_page_writeback(). That fix wasn't cc'd to linux-fsdevel
which might have got our attention and so noticed the bug earlier...
Post by Brian Foster
Post by Brian Foster
It's not clear to me if this is contributing to the problem in this
particular case, but it seems like an independent bug at the very
least. Thoughts?
We definitely need to use set_page_writeback_keepwrite() for partial
page writes that leave the page dirty. Patch below.
Cheers,
Dave.
--
Dave Chinner
xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly
XFS has been having trouble with stray delayed allocation extents
beyond EOF for a long time. Recent changes to the collapse range
code have triggered erroneous EBUSY errors on page invalidation for
block size smaller than page size filesystems. These
have been caused by dirty buffers beyond EOF on a partial page which
do not get written to disk during a sync.
The issue is that write-ahead in xfs_cluster_write() finds such a
partial page and handles it by leaving the page dirty but pushing it
into a writeback state. This used to work just fine, as the
write_cache_pages() code would then find the dirty partial page in
the next mapping tree lookup as the dirty tag is still set.
Unfortunately, when we moved to a mark and sweep approach to
writeback to fix other writeback sync issues, we broke this. The
act of marking the page as under writeback now clears the TOWRITE
tag in the radix tree, even though the page is still dirty. This
causes the TOWRITE tag to be cleared, and hence the next lookup on
the mapping tree does not find the dirty partial page and so doesn't
try to write it again.
This same writeback bug was found recently in ext4 and fixed in
commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
without communication to the wider filesystem community. We can use
exactly the same fix here so the TOWRITE flag is not cleared on
partial page writes.
---
Looks good and fixes the collapse failure in my test.

Reviewed-by: Brian Foster <***@redhat.com>

I suppose we should prepend the collapse rework series with this patch
to avoid the regression as it pertains to collapse (obviously the
failure to retain towrite goes further back).

I'll continue testing with this. Are you still seeing an increase in
such failures with the xfs_free_file_space() patch or has this quieted
those down?

Brian
Post by Dave Chinner
fs/xfs/xfs_aops.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b984647..2f50253 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -434,10 +434,22 @@ xfs_start_page_writeback(
{
ASSERT(PageLocked(page));
ASSERT(!PageWriteback(page));
- if (clear_dirty)
+
+ /*
+ * if the page was not fully cleaned, we need to ensure that the higher
+ * layers come back to it correctly. That means we need to keep the page
+ * dirty, and for WB_SYNC_ALL writeback we need to ensure the
+ * PAGECACHE_TAG_TOWRITE index mark is not removed so another attempt to
+ * write this page in this writeback sweep will be made.
+ */
+ if (clear_dirty) {
clear_page_dirty_for_io(page);
- set_page_writeback(page);
+ set_page_writeback(page);
+ } else
+ set_page_writeback_keepwrite(page);
+
unlock_page(page);
+
/* If no buffers on the page are to be written, finish it here */
if (!buffers)
end_page_writeback(page);
Dave Chinner
2014-09-15 22:55:35 UTC
Post by Brian Foster
Post by Dave Chinner
xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly
XFS has been having trouble with stray delayed allocation extents
beyond EOF for a long time. Recent changes to the collapse range
code have triggered erroneous EBUSY errors on page invalidation for
block size smaller than page size filesystems. These
have been caused by dirty buffers beyond EOF on a partial page which
do not get written to disk during a sync.
The issue is that write-ahead in xfs_cluster_write() finds such a
partial page and handles it by leaving the page dirty but pushing it
into a writeback state. This used to work just fine, as the
write_cache_pages() code would then find the dirty partial page in
the next mapping tree lookup as the dirty tag is still set.
Unfortunately, when we moved to a mark and sweep approach to
writeback to fix other writeback sync issues, we broke this. The
act of marking the page as under writeback now clears the TOWRITE
tag in the radix tree, even though the page is still dirty. This
causes the TOWRITE tag to be cleared, and hence the next lookup on
the mapping tree does not find the dirty partial page and so doesn't
try to write it again.
This same writeback bug was found recently in ext4 and fixed in
commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
without communication to the wider filesystem community. We can use
exactly the same fix here so the TOWRITE flag is not cleared on
partial page writes.
---
Looks good and fixes the collapse failure in my test.
I suppose we should prepend the collapse rework series with this patch
to avoid the regression as it pertains to collapse (obviously the
failure to retain towrite goes further back).
Agreed, I will do that.
Post by Brian Foster
I'll continue testing with this. Are you still seeing an increase in
such failures with the xfs_free_file_space() patch or has this quieted
those down?
Too early to say for sure, but signs are good - I've had xfstests
actually complete without any stray delalloc asserts occurring on
1k block size filesystems for the first time in a couple of weeks.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com