[PATCH 1/2] xfs: fix swapext ilock deadlock

Discussion:

Dave Chinner

2014-07-31 06:12:07 UTC

From: Dave Chinner <***@redhat.com>

xfs_swap_extents() holds the ilock over a call to
filemap_write_and_wait(), which can then try to write data and take
the ilock. That causes a self-deadlock.

Fix the deadlock and clean up the code by separating the locking
appropriately. Add a lockflags variable to track what locks we are
holding as we gain and drop them and cleanup the error handling to
always use "out_unlock" with the lockflags variable.

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_bmap_util.c | 33 ++++++++++++++++++---------------
1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index bbc748a..3c60c43 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1633,6 +1633,7 @@ xfs_swap_extents(
int aforkblks = 0;
int taforkblks = 0;
__uint64_t tmp;
+ int lock_flags;

tempifp = kmem_alloc(sizeof(xfs_ifork_t), KM_MAYFAIL);
if (!tempifp) {
@@ -1641,13 +1642,13 @@ xfs_swap_extents(
}

/*
- * we have to do two separate lock calls here to keep lockdep
- * happy. If we try to get all the locks in one call, lock will
- * report false positives when we drop the ILOCK and regain them
- * below.
+ * Lock up the inodes against other IO and truncate to begin with.
+ * Then we can ensure the inodes are flushed and have no page cache
+ * safely. Once we have done this we can take the ilocks and do the rest
+ * of the checks.
*/
+ lock_flags = XFS_IOLOCK_EXCL;
xfs_lock_two_inodes(ip, tip, XFS_IOLOCK_EXCL);
- xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);

/* Verify that both files have the same format */
if ((ip->i_d.di_mode & S_IFMT) != (tip->i_d.di_mode & S_IFMT)) {
@@ -1666,6 +1667,9 @@ xfs_swap_extents(
goto out_unlock;
truncate_pagecache_range(VFS_I(tip), 0, -1);

+ xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
+ lock_flags |= XFS_ILOCK_EXCL;
+
/* Verify O_DIRECT for ftmp */
if (VFS_I(tip)->i_mapping->nrpages) {
error = -EINVAL;
@@ -1720,6 +1724,7 @@ xfs_swap_extents(

xfs_iunlock(ip, XFS_ILOCK_EXCL);
xfs_iunlock(tip, XFS_ILOCK_EXCL);
+ lock_flags &= ~XFS_ILOCK_EXCL;

/*
* There is a race condition here since we gave up the
@@ -1732,13 +1737,11 @@ xfs_swap_extents(

tp = xfs_trans_alloc(mp, XFS_TRANS_SWAPEXT);
error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ichange, 0, 0);
- if (error) {
- xfs_iunlock(ip, XFS_IOLOCK_EXCL);
- xfs_iunlock(tip, XFS_IOLOCK_EXCL);
- xfs_trans_cancel(tp, 0);
- goto out;
- }
+ if (error)
+ goto out_trans_cancel;
+
xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
+ lock_flags |= XFS_ILOCK_EXCL;

/*
* Count the number of extended attribute blocks
@@ -1757,8 +1760,8 @@ xfs_swap_extents(
goto out_trans_cancel;
}

- xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
- xfs_trans_ijoin(tp, tip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
+ xfs_trans_ijoin(tp, ip, lock_flags);
+ xfs_trans_ijoin(tp, tip, lock_flags);

/*
* Before we've swapped the forks, lets set the owners of the forks
@@ -1887,8 +1890,8 @@ out:
return error;

out_unlock:
- xfs_iunlock(ip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
- xfs_iunlock(tip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
+ xfs_iunlock(ip, lock_flags);
+ xfs_iunlock(tip, lock_flags);
goto out;

out_trans_cancel:

--
2.0.0

Dave Chinner

2014-07-31 06:12:08 UTC

Permalink

From: Dave Chinner <***@redhat.com>

We need to treat both inodes identically from a page cache point of
view when prepareing them for extent swapping. We don't do this
right now - we assume that one of the inodes empty, because that's
what xfs_fsr currently does. Remove this assumption from the code.

While factoring out the flushing and related checks, move the
transactions reservation to immeidately after the flushes so that we
don't need to pick up and then drop the ilock to do the transaction
reservation. There are no issues with aborting the transaction it if
the checks fail before we join the inodes to the transaction and
dirty them, so this is a safe change to make.

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_bmap_util.c | 81 +++++++++++++++++++++++---------------------------
1 file changed, 37 insertions(+), 44 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 3c60c43..2f1e30d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1619,6 +1619,30 @@ xfs_swap_extents_check_format(
}

int
+xfs_swap_extent_flush(
+ struct xfs_inode *ip)
+{
+ int error;
+
+ error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
+ if (error)
+ return error;
+ truncate_pagecache_range(VFS_I(ip), 0, -1);
+
+ /* Verify O_DIRECT for ftmp */
+ if (VFS_I(ip)->i_mapping->nrpages)
+ return -EINVAL;
+
+ /*
+ * Don't try to swap extents on mmap()d files because we can't lock
+ * out races against page faults safely.
+ */
+ if (mapping_mapped(VFS_I(ip)->i_mapping))
+ return -EBUSY;
+ return 0;
+}
+
+int
xfs_swap_extents(
xfs_inode_t *ip, /* target inode */
xfs_inode_t *tip, /* tmp inode */
@@ -1662,26 +1686,28 @@ xfs_swap_extents(
goto out_unlock;
}

- error = filemap_write_and_wait(VFS_I(tip)->i_mapping);
+ error = xfs_swap_extent_flush(ip);
+ if (error)
+ goto out_unlock;
+ error = xfs_swap_extent_flush(tip);
if (error)
goto out_unlock;
- truncate_pagecache_range(VFS_I(tip), 0, -1);
-
- xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
- lock_flags |= XFS_ILOCK_EXCL;

- /* Verify O_DIRECT for ftmp */
- if (VFS_I(tip)->i_mapping->nrpages) {
- error = -EINVAL;
+ tp = xfs_trans_alloc(mp, XFS_TRANS_SWAPEXT);
+ error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ichange, 0, 0);
+ if (error) {
+ xfs_trans_cancel(tp, 0);
goto out_unlock;
}
+ xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
+ lock_flags |= XFS_ILOCK_EXCL;

/* Verify all data are being swapped */
if (sxp->sx_offset != 0 ||
sxp->sx_length != ip->i_d.di_size ||
sxp->sx_length != tip->i_d.di_size) {
error = -EFAULT;
- goto out_unlock;
+ goto out_trans_cancel;
}

trace_xfs_swap_extent_before(ip, 0);
@@ -1693,7 +1719,7 @@ xfs_swap_extents(
xfs_notice(mp,
"%s: inode 0x%llx format is incompatible for exchanging.",
__func__, ip->i_ino);
- goto out_unlock;
+ goto out_trans_cancel;
}

/*
@@ -1708,41 +1734,8 @@ xfs_swap_extents(
(sbp->bs_mtime.tv_sec != VFS_I(ip)->i_mtime.tv_sec) ||
(sbp->bs_mtime.tv_nsec != VFS_I(ip)->i_mtime.tv_nsec)) {
error = -EBUSY;
- goto out_unlock;
- }
-
- /* We need to fail if the file is memory mapped. Once we have tossed
- * all existing pages, the page fault will have no option
- * but to go to the filesystem for pages. By making the page fault call
- * vop_read (or write in the case of autogrow) they block on the iolock
- * until we have switched the extents.
- */
- if (mapping_mapped(VFS_I(ip)->i_mapping)) {
- error = -EBUSY;
- goto out_unlock;
- }
-
- xfs_iunlock(ip, XFS_ILOCK_EXCL);
- xfs_iunlock(tip, XFS_ILOCK_EXCL);
- lock_flags &= ~XFS_ILOCK_EXCL;
-
- /*
- * There is a race condition here since we gave up the
- * ilock. However, the data fork will not change since
- * we have the iolock (locked for truncation too) so we
- * are safe. We don't really care if non-io related
- * fields change.
- */
- truncate_pagecache_range(VFS_I(ip), 0, -1);
-
- tp = xfs_trans_alloc(mp, XFS_TRANS_SWAPEXT);
- error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ichange, 0, 0);
- if (error)
goto out_trans_cancel;
-
- xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
- lock_flags |= XFS_ILOCK_EXCL;
-
+ }
/*
* Count the number of extended attribute blocks
*/

--
2.0.0

Christoph Hellwig

2014-07-31 17:16:02 UTC

Permalink

Post by Dave Chinner
int
+xfs_swap_extent_flush(
+ struct xfs_inode *ip)
+{
+ int error;

nipick: shouldn't the arguments and local variables align to the same
level?

Also a local struct inode variable instead of using VFS_I(ip) 4 times
would be nice.

Otherwise looks good,

Reviewed-by: Christoph Hellwig <***@lst.de>

Dave Chinner

2014-07-31 23:02:12 UTC

Permalink

Post by Christoph Hellwig

Post by Dave Chinner
int
+xfs_swap_extent_flush(
+ struct xfs_inode *ip)
+{
+ int error;

nipick: shouldn't the arguments and local variables align to the same
level?

*nod*

Post by Christoph Hellwig
Also a local struct inode variable instead of using VFS_I(ip) 4 times
would be nice.

Will fix.

Post by Christoph Hellwig
Otherwise looks good,

Thanks.

-Dave.

--
Dave Chinner
***@fromorbit.com

Brian Foster

2014-08-01 12:44:02 UTC

Permalink

Post by Dave Chinner
We need to treat both inodes identically from a page cache point of
view when prepareing them for extent swapping. We don't do this
right now - we assume that one of the inodes empty, because that's
what xfs_fsr currently does. Remove this assumption from the code.
While factoring out the flushing and related checks, move the
transactions reservation to immeidately after the flushes so that we
don't need to pick up and then drop the ilock to do the transaction
reservation. There are no issues with aborting the transaction it if
the checks fail before we join the inodes to the transaction and
dirty them, so this is a safe change to make.
---

Both of these looked fine to me, but I couldn't apply this one to
for-next or master...

Brian

Post by Dave Chinner
fs/xfs/xfs_bmap_util.c | 81 +++++++++++++++++++++++---------------------------
1 file changed, 37 insertions(+), 44 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 3c60c43..2f1e30d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1619,6 +1619,30 @@ xfs_swap_extents_check_format(
}
int
+xfs_swap_extent_flush(
+ struct xfs_inode *ip)
+{
+ int error;
+
+ error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
+ if (error)
+ return error;
+ truncate_pagecache_range(VFS_I(ip), 0, -1);
+
+ /* Verify O_DIRECT for ftmp */
+ if (VFS_I(ip)->i_mapping->nrpages)
+ return -EINVAL;
+
+ /*
+ * Don't try to swap extents on mmap()d files because we can't lock
+ * out races against page faults safely.
+ */
+ if (mapping_mapped(VFS_I(ip)->i_mapping))
+ return -EBUSY;
+ return 0;
+}
+
+int
xfs_swap_extents(
xfs_inode_t *ip, /* target inode */
xfs_inode_t *tip, /* tmp inode */
@@ -1662,26 +1686,28 @@ xfs_swap_extents(
goto out_unlock;
}
- error = filemap_write_and_wait(VFS_I(tip)->i_mapping);
+ error = xfs_swap_extent_flush(ip);
+ if (error)
+ goto out_unlock;
+ error = xfs_swap_extent_flush(tip);
if (error)
goto out_unlock;
- truncate_pagecache_range(VFS_I(tip), 0, -1);
-
- xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
- lock_flags |= XFS_ILOCK_EXCL;
- /* Verify O_DIRECT for ftmp */
- if (VFS_I(tip)->i_mapping->nrpages) {
- error = -EINVAL;
+ tp = xfs_trans_alloc(mp, XFS_TRANS_SWAPEXT);
+ error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ichange, 0, 0);
+ if (error) {
+ xfs_trans_cancel(tp, 0);
goto out_unlock;
}
+ xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
+ lock_flags |= XFS_ILOCK_EXCL;
/* Verify all data are being swapped */
if (sxp->sx_offset != 0 ||
sxp->sx_length != ip->i_d.di_size ||
sxp->sx_length != tip->i_d.di_size) {
error = -EFAULT;
- goto out_unlock;
+ goto out_trans_cancel;
}
trace_xfs_swap_extent_before(ip, 0);
@@ -1693,7 +1719,7 @@ xfs_swap_extents(
xfs_notice(mp,
"%s: inode 0x%llx format is incompatible for exchanging.",
__func__, ip->i_ino);
- goto out_unlock;
+ goto out_trans_cancel;
}
/*
@@ -1708,41 +1734,8 @@ xfs_swap_extents(
(sbp->bs_mtime.tv_sec != VFS_I(ip)->i_mtime.tv_sec) ||
(sbp->bs_mtime.tv_nsec != VFS_I(ip)->i_mtime.tv_nsec)) {
error = -EBUSY;
- goto out_unlock;
- }
-
- /* We need to fail if the file is memory mapped. Once we have tossed
- * all existing pages, the page fault will have no option
- * but to go to the filesystem for pages. By making the page fault call
- * vop_read (or write in the case of autogrow) they block on the iolock
- * until we have switched the extents.
- */
- if (mapping_mapped(VFS_I(ip)->i_mapping)) {
- error = -EBUSY;
- goto out_unlock;
- }
-
- xfs_iunlock(ip, XFS_ILOCK_EXCL);
- xfs_iunlock(tip, XFS_ILOCK_EXCL);
- lock_flags &= ~XFS_ILOCK_EXCL;
-
- /*
- * There is a race condition here since we gave up the
- * ilock. However, the data fork will not change since
- * we have the iolock (locked for truncation too) so we
- * are safe. We don't really care if non-io related
- * fields change.
- */
- truncate_pagecache_range(VFS_I(ip), 0, -1);
-
- tp = xfs_trans_alloc(mp, XFS_TRANS_SWAPEXT);
- error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ichange, 0, 0);
- if (error)
goto out_trans_cancel;
-
- xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
- lock_flags |= XFS_ILOCK_EXCL;
-
+ }
/*
* Count the number of extended attribute blocks
*/
--
2.0.0
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs

Dave Chinner

2014-08-02 03:19:31 UTC

Permalink

Post by Brian Foster

Both of these looked fine to me, but I couldn't apply this one to
for-next or master...

It's actually in my working branch, which means it's based on
3.16-rc5 + random-outside-xfs-patches + for-next + verifier fixes +
sb discombobulation and then this patch set. I didn't check that it
applied directly against for-next - do you want me to rebase and
resend it?

Cheers,

Dave.

--
Dave Chinner
***@fromorbit.com

Brian Foster

2014-08-02 11:24:06 UTC

Permalink

Post by Dave Chinner

Post by Brian Foster

Both of these looked fine to me, but I couldn't apply this one to
for-next or master...

Ok, no worries. I was just trying to compile test and verify that
nothing was unexpected with the patch as posted. I'm good as long we get
some lead team for testing and whatnot when this gets merged to a dev.
branch. :)

Brian

Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner

Christoph Hellwig

2014-07-31 17:15:00 UTC

Permalink

Post by Dave Chinner
xfs_swap_extents() holds the ilock over a call to
filemap_write_and_wait(), which can then try to write data and take
the ilock. That causes a self-deadlock.
Fix the deadlock and clean up the code by separating the locking
appropriately. Add a lockflags variable to track what locks we are
holding as we gain and drop them and cleanup the error handling to
always use "out_unlock" with the lockflags variable.

Looks good,

Reviewed-by: Christoph Hellwig <***@lst.de>