Discussion:
[PATCH 3/9] xfs: rework xfs_buf_bio_endio error handling
Dave Chinner
2014-08-15 06:39:01 UTC
From: Dave Chinner <***@redhat.com>

Currently the report of a bio error from completion
immediately marks the buffer with an error. The issue is that this
is racy w.r.t. synchronous IO - the submitter can see b_error being
set before the IO is complete, and hence we cannot differentiate
between submission failures and completion failures.

Add an internal b_io_error field protected by the b_lock to catch IO
completion errors, and only propagate that to the buffer during
final IO completion handling. Hence we can tell in xfs_buf_iorequest
if we've had a submission failure by checking bp->b_error before
dropping our b_io_remaining reference - that reference will prevent
b_io_error values from being propagated to b_error in the event that
completion races with submission.

In doing so, allow xfs_buf_iorequest to return an error. That way,
the caller can check for submission errors safely if required, and
easily distinguish them from completion errors that come from
xfs_buf_iowait().
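
For illustration, a synchronous caller can then distinguish the two
error classes like this (a sketch of the intended usage, not a hunk
from this patch):

	error = xfs_buf_iorequest(bp);	/* non-zero: submission failure */
	if (!error)
		error = xfs_buf_iowait(bp);	/* non-zero: completion failure */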

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_buf.c | 18 ++++++++++++++++--
fs/xfs/xfs_buf.h | 1 +
2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1b7f0bc..58ae34c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1008,6 +1008,13 @@ xfs_buf_ioend(

bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);

+ /*
+ * Pull in IO completion errors now. We are guaranteed to be running
+ * single threaded, so we don't need the lock to read b_io_error.
+ */
+ if (!bp->b_error && bp->b_io_error)
+ xfs_buf_ioerror(bp, bp->b_io_error);
+
if (!bp->b_error) {
bp->b_flags |= XBF_DONE;

@@ -1186,8 +1193,12 @@ xfs_buf_bio_end_io(
* don't overwrite existing errors - otherwise we can lose errors on
* buffers that require multiple bios to complete.
*/
- if (!bp->b_error)
- xfs_buf_ioerror(bp, error);
+ if (error) {
+ spin_lock(&bp->b_lock);
+ if (!bp->b_io_error)
+ bp->b_io_error = error;
+ spin_unlock(&bp->b_lock);
+ }

if (!bp->b_error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
@@ -1373,6 +1384,9 @@ xfs_buf_iorequest(
if (bp->b_flags & XBF_WRITE)
xfs_buf_wait_unpin(bp);

+ /* clear the internal error state to avoid spurious errors */
+ bp->b_io_error = 0;
+
/*
* Take references to the buffer. For XBF_ASYNC buffers, holding a
* reference for as long as submission takes is all that is necessary
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 4585c15..44db8cd 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -158,6 +158,7 @@ typedef struct xfs_buf {
struct list_head b_lru; /* lru list */
spinlock_t b_lock; /* internal state lock */
unsigned int b_state; /* internal state flags */
+ int b_io_error; /* internal IO error state */
wait_queue_head_t b_waiters; /* unpin waiters */
struct list_head b_list;
struct xfs_perag *b_pag; /* contains rbtree root */
--
2.0.0
Dave Chinner
2014-08-15 06:39:04 UTC
From: Dave Chinner <***@redhat.com>

There is only one caller now - xfs_trans_read_buf_map() - and it has
very well defined call semantics - read, synchronous, and b_iodone
is NULL. Hence it's pretty clear what error handling is necessary
for this case. The bigger problem of untangling
xfs_trans_read_buf_map error handling is left to a future patch.

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_buf.c | 39 ---------------------------------------
fs/xfs/xfs_buf.h | 2 --
fs/xfs/xfs_trans_buf.c | 8 +++++---
3 files changed, 5 insertions(+), 44 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index f444285..352e9219 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1071,45 +1071,6 @@ xfs_buf_ioerror_alert(
(__uint64_t)XFS_BUF_ADDR(bp), func, -bp->b_error, bp->b_length);
}

-/*
- * Same as xfs_bioerror, except that we are releasing the buffer
- * here ourselves, and avoiding the xfs_buf_ioend call.
- * This is meant for userdata errors; metadata bufs come with
- * iodone functions attached, so that we can track down errors.
- */
-int
-xfs_bioerror_relse(
- struct xfs_buf *bp)
-{
- int64_t fl = bp->b_flags;
- /*
- * No need to wait until the buffer is unpinned.
- * We aren't flushing it.
- *
- * chunkhold expects B_DONE to be set, whether
- * we actually finish the I/O or not. We don't want to
- * change that interface.
- */
- XFS_BUF_UNREAD(bp);
- XFS_BUF_DONE(bp);
- xfs_buf_stale(bp);
- bp->b_iodone = NULL;
- if (!(fl & XBF_ASYNC)) {
- /*
- * Mark b_error and B_ERROR _both_.
- * Lot's of chunkcache code assumes that.
- * There's no reason to mark error for
- * ASYNC buffers.
- */
- xfs_buf_ioerror(bp, -EIO);
- complete(&bp->b_iowait);
- } else {
- xfs_buf_relse(bp);
- }
-
- return -EIO;
-}
-
int
xfs_bwrite(
struct xfs_buf *bp)
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 44db8cd..d8f57f6 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -297,8 +297,6 @@ extern void xfs_buf_iomove(xfs_buf_t *, size_t, size_t, void *,
#define xfs_buf_zero(bp, off, len) \
xfs_buf_iomove((bp), (off), (len), NULL, XBRW_ZERO)

-extern int xfs_bioerror_relse(struct xfs_buf *);
-
/* Buffer Utility Routines */
extern xfs_caddr_t xfs_buf_offset(xfs_buf_t *, size_t);

diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 96c898e..758c07d 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -324,11 +324,13 @@ xfs_trans_read_buf_map(
*/
if (XFS_FORCED_SHUTDOWN(mp)) {
trace_xfs_bdstrat_shut(bp, _RET_IP_);
- xfs_bioerror_relse(bp);
- } else {
- xfs_buf_iorequest(bp);
+ bp->b_flags &= ~(XBF_READ | XBF_DONE);
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+ return -EIO;
}

+ xfs_buf_iorequest(bp);
error = xfs_buf_iowait(bp);
if (error) {
xfs_buf_ioerror_alert(bp, __func__);
--
2.0.0
Christoph Hellwig
2014-08-29 00:32:57 UTC
Post by Dave Chinner
index 96c898e..758c07d 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -324,11 +324,13 @@ xfs_trans_read_buf_map(
*/
if (XFS_FORCED_SHUTDOWN(mp)) {
trace_xfs_bdstrat_shut(bp, _RET_IP_);
- xfs_bioerror_relse(bp);
- } else {
- xfs_buf_iorequest(bp);
+ bp->b_flags &= ~(XBF_READ | XBF_DONE);
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+ return -EIO;
}
This is a large change of behavior as it doesn't hit the error
path after the xfs_buf_iowait anymore. While I don't think that
that path was entirely correct, this version seems to be even less so
by not releasing the buffer reference or forcing the shutdown.
Dave Chinner
2014-08-29 01:12:36 UTC
Post by Christoph Hellwig
Post by Dave Chinner
index 96c898e..758c07d 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -324,11 +324,13 @@ xfs_trans_read_buf_map(
*/
if (XFS_FORCED_SHUTDOWN(mp)) {
trace_xfs_bdstrat_shut(bp, _RET_IP_);
- xfs_bioerror_relse(bp);
- } else {
- xfs_buf_iorequest(bp);
+ bp->b_flags &= ~(XBF_READ | XBF_DONE);
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+ return -EIO;
}
This is a large change of behavior as it doesn't hit the error
path after the xfs_buf_iowait anymore. While I don't think that
that path was entirely correct, this version seems to be even less so
by not releasing the buffer reference or forcing the shutdown.
The IO is synchronous, so the previous behaviour did not release
the buffer here. But, yes, it needs to because we're not running the
IO wait on it anymore. And this happens only during a shutdown, so I
don't see any need to trigger a shutdown ;)
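
In code, the missing release in the new shutdown branch would look
something like this (sketch only, based on the hunk above):

	if (XFS_FORCED_SHUTDOWN(mp)) {
		trace_xfs_bdstrat_shut(bp, _RET_IP_);
		bp->b_flags &= ~(XBF_READ | XBF_DONE);
		xfs_buf_ioerror(bp, -EIO);
		xfs_buf_stale(bp);
		/* drop the reference we'd otherwise drop after iowait */
		xfs_buf_relse(bp);
		return -EIO;
	}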

As it is, I think this gets properly fixed by the next patch....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Christoph Hellwig
2014-08-29 18:26:10 UTC
Post by Dave Chinner
Post by Christoph Hellwig
This is a large change of behavior as it doesn't hit the error
path after the xfs_buf_iowait anymore. While I don't think that
that path was entirely correct, this version seems to be even less so
by not releasing the buffer reference or forcing the shutdown.
The IO is synchronous, so the previous behaviour did not release
the buffer here. But, yes, it needs to because we're not running the
IO wait on it anymore. And this happens only during a shutdown, so I
don't see any need to trigger a shutdown ;)
As it is, I think this gets properly fixed by the next patch....
Do you have any scenario that might actually exercise this path? I
enabled xfs_trans_read_buf_io trace events and ran xfstests as well
as a few other workloads and couldn't hit it at all. And when you
think of it: when would we do a trans_get_buf, then not actually
update it with data, and then recurse into a trans_read_buf in
the same transaction?

Maybe it's just time to do a bit more of an audit and put this whole
code path to rest.
Dave Chinner
2014-08-30 00:05:47 UTC
Post by Christoph Hellwig
Post by Dave Chinner
Post by Christoph Hellwig
This is a large change of behavior as it doesn't hit the error
path after the xfs_buf_iowait anymore. While I don't think that
that path was entirely correct, this version seems to be even less so
by not releasing the buffer reference or forcing the shutdown.
The IO is synchronous, so the previous behaviour did not release
the buffer here. But, yes, it needs to because we're not running the
IO wait on it anymore. And this happens only during a shutdown, so I
don't see any need to trigger a shutdown ;)
As it is, I think this gets properly fixed by the next patch....
Do you have any scenario that might actually exercise this path? I
enabled xfs_trans_read_buf_io trace events and ran xfstests as well
as a few other workloads and couldn't hit it at all. And when you
think of it: when would we do a trans_get_buf, then not actually
update it with data, and then recurse into a trans_read_buf in
the same transaction?
*nod*. This path has been there since the Irix days, and I suspect that
it was there to handle some weird corner case to do with async
buffer reads which, IIRC, could once be done through this code...
Post by Christoph Hellwig
Maybe it's just time to do a bit more of an audit and put this whole
code path to rest.
Perhaps so.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Dave Chinner
2014-08-15 06:39:03 UTC
From: Dave Chinner <***@redhat.com>

Internal buffer write error handling is a mess due to the unnatural
split between xfs_bioerror and xfs_bioerror_relse(). The buffer
reference counting is also wrong for the xfs_bioerror path for
xfs_bwrite - it uses sync IO and so xfs_buf_ioend will release a
hold count that was never taken.

Further, xfs_bwrite only does sync IO and determines the handler to
call based on b_iodone, so for this caller the only difference
between xfs_bioerror() and xfs_bioerror_relse() is the XBF_DONE
flag. We don't care what the XBF_DONE flag state is because we stale
the buffer in both paths - the next buffer lookup will clear
XBF_DONE anyway because XBF_STALE is set. Hence we can use common
error handling for xfs_bwrite().

__xfs_buf_delwri_submit() is similar - it's only ever called
on writes, sync or async, and again there's no reason to
handle them any differently at all.

Clean up the nasty error handling and remove xfs_bioerror().

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_buf.c | 67 +++++++++++++++++---------------------------------------
1 file changed, 20 insertions(+), 47 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 98fcf5a..f444285 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1072,36 +1072,6 @@ xfs_buf_ioerror_alert(
}

/*
- * Called when we want to stop a buffer from getting written or read.
- * We attach the EIO error, muck with its flags, and call xfs_buf_ioend
- * so that the proper iodone callbacks get called.
- */
-STATIC int
-xfs_bioerror(
- xfs_buf_t *bp)
-{
-#ifdef XFSERRORDEBUG
- ASSERT(XFS_BUF_ISREAD(bp) || bp->b_iodone);
-#endif
-
- /*
- * No need to wait until the buffer is unpinned, we aren't flushing it.
- */
- xfs_buf_ioerror(bp, -EIO);
-
- /*
- * We're calling xfs_buf_ioend, so delete XBF_DONE flag.
- */
- XFS_BUF_UNREAD(bp);
- XFS_BUF_UNDONE(bp);
- xfs_buf_stale(bp);
-
- xfs_buf_ioend(bp);
-
- return -EIO;
-}
-
-/*
* Same as xfs_bioerror, except that we are releasing the buffer
* here ourselves, and avoiding the xfs_buf_ioend call.
* This is meant for userdata errors; metadata bufs come with
@@ -1149,19 +1119,19 @@ xfs_bwrite(
ASSERT(xfs_buf_islocked(bp));

bp->b_flags |= XBF_WRITE;
- bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q | XBF_WRITE_FAIL);
+ bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q |
+ XBF_WRITE_FAIL | XBF_DONE);

if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
trace_xfs_bdstrat_shut(bp, _RET_IP_);

- /*
- * Metadata write that didn't get logged but written anyway.
- * These aren't associated with a transaction, and can be
- * ignored.
- */
- if (!bp->b_iodone)
- return xfs_bioerror_relse(bp);
- return xfs_bioerror(bp);
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+
+ /* sync IO, xfs_buf_ioend is going to remove a ref here */
+ xfs_buf_hold(bp);
+ xfs_buf_ioend(bp);
+ return -EIO;
}

xfs_buf_iorequest(bp);
@@ -1848,16 +1818,19 @@ __xfs_buf_delwri_submit(
if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
trace_xfs_bdstrat_shut(bp, _RET_IP_);

+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+
/*
- * Metadata write that didn't get logged but written anyway.
- * These aren't associated with a transaction, and can be
- * ignored.
+ * if sync IO, xfs_buf_ioend is going to remove a
+ * ref here. We need to add that so we can collect
+ * all the buffer errors in the wait loop.
*/
- if (!bp->b_iodone)
- return xfs_bioerror_relse(bp);
- return xfs_bioerror(bp);
- }
- xfs_buf_iorequest(bp);
+ if (wait)
+ xfs_buf_hold(bp);
+ xfs_buf_ioend(bp);
+ } else
+ xfs_buf_iorequest(bp);
}
blk_finish_plug(&plug);
--
2.0.0
Brian Foster
2014-08-15 14:35:41 UTC
Post by Dave Chinner
Internal buffer write error handling is a mess due to the unnatural
split between xfs_bioerror and xfs_bioerror_relse(). The buffer
reference counting is also wrong for the xfs_bioerror path for
xfs_bwrite - it uses sync IO and so xfs_buf_ioend will release a
hold count that was never taken.
Further, xfs_bwrite only does sync IO and determines the handler to
call based on b_iodone, so for this caller the only difference
between xfs_bioerror() and xfs_bioerror_relse() is the XBF_DONE
flag. We don't care what the XBF_DONE flag state is because we stale
the buffer in both paths - the next buffer lookup will clear
XBF_DONE anyway because XBF_STALE is set. Hence we can use common
error handling for xfs_bwrite().
__xfs_buf_delwri_submit() is similar - it's only ever called
on writes, sync or async, and again there's no reason to
handle them any differently at all.
Clean up the nasty error handling and remove xfs_bioerror().
---
fs/xfs/xfs_buf.c | 67 +++++++++++++++++---------------------------------------
1 file changed, 20 insertions(+), 47 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 98fcf5a..f444285 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1072,36 +1072,6 @@ xfs_buf_ioerror_alert(
}
/*
- * Called when we want to stop a buffer from getting written or read.
- * We attach the EIO error, muck with its flags, and call xfs_buf_ioend
- * so that the proper iodone callbacks get called.
- */
-STATIC int
-xfs_bioerror(
- xfs_buf_t *bp)
-{
-#ifdef XFSERRORDEBUG
- ASSERT(XFS_BUF_ISREAD(bp) || bp->b_iodone);
-#endif
-
- /*
- * No need to wait until the buffer is unpinned, we aren't flushing it.
- */
- xfs_buf_ioerror(bp, -EIO);
-
- /*
- * We're calling xfs_buf_ioend, so delete XBF_DONE flag.
- */
- XFS_BUF_UNREAD(bp);
- XFS_BUF_UNDONE(bp);
- xfs_buf_stale(bp);
-
- xfs_buf_ioend(bp);
-
- return -EIO;
-}
-
-/*
* Same as xfs_bioerror, except that we are releasing the buffer
* here ourselves, and avoiding the xfs_buf_ioend call.
* This is meant for userdata errors; metadata bufs come with
@@ -1149,19 +1119,19 @@ xfs_bwrite(
ASSERT(xfs_buf_islocked(bp));
bp->b_flags |= XBF_WRITE;
- bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q | XBF_WRITE_FAIL);
+ bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q |
+ XBF_WRITE_FAIL | XBF_DONE);
if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
trace_xfs_bdstrat_shut(bp, _RET_IP_);
- /*
- * Metadata write that didn't get logged but written anyway.
- * These aren't associated with a transaction, and can be
- * ignored.
- */
- if (!bp->b_iodone)
- return xfs_bioerror_relse(bp);
- return xfs_bioerror(bp);
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+
+ /* sync IO, xfs_buf_ioend is going to remove a ref here */
+ xfs_buf_hold(bp);
Looks correct, but that's kind of ugly. The reference serves no purpose
except to appease the error sequence. It gives the impression that the
reference management should be handled at a higher level (and with truly
synchronous I/O primitives/mechanisms, it is ;).

At the very least, can we reorganize the ioend side of things to handle
this? We already have the duplicate execution issue previously mentioned
that has to get resolved. Some duplicate code is fine if it improves
clarity, imo.

Brian
Post by Dave Chinner
+ xfs_buf_ioend(bp);
+ return -EIO;
}
xfs_buf_iorequest(bp);
@@ -1848,16 +1818,19 @@ __xfs_buf_delwri_submit(
if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
trace_xfs_bdstrat_shut(bp, _RET_IP_);
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+
/*
- * Metadata write that didn't get logged but written anyway.
- * These aren't associated with a transaction, and can be
- * ignored.
+ * if sync IO, xfs_buf_ioend is going to remove a
+ * ref here. We need to add that so we can collect
+ * all the buffer errors in the wait loop.
*/
- if (!bp->b_iodone)
- return xfs_bioerror_relse(bp);
- return xfs_bioerror(bp);
- }
- xfs_buf_iorequest(bp);
+ if (wait)
+ xfs_buf_hold(bp);
+ xfs_buf_ioend(bp);
+ } else
+ xfs_buf_iorequest(bp);
}
blk_finish_plug(&plug);
--
2.0.0
Dave Chinner
2014-08-15 23:27:41 UTC
Post by Brian Foster
Post by Dave Chinner
@@ -1149,19 +1119,19 @@ xfs_bwrite(
ASSERT(xfs_buf_islocked(bp));
bp->b_flags |= XBF_WRITE;
- bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q | XBF_WRITE_FAIL);
+ bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q |
+ XBF_WRITE_FAIL | XBF_DONE);
if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
trace_xfs_bdstrat_shut(bp, _RET_IP_);
- /*
- * Metadata write that didn't get logged but written anyway.
- * These aren't associated with a transaction, and can be
- * ignored.
- */
- if (!bp->b_iodone)
- return xfs_bioerror_relse(bp);
- return xfs_bioerror(bp);
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+
+ /* sync IO, xfs_buf_ioend is going to remove a ref here */
+ xfs_buf_hold(bp);
Looks correct, but that's kind of ugly. The reference serves no purpose
except to appease the error sequence. It gives the impression that the
reference management should be handled at a higher level (and with truly
synchronous I/O primitives/mechanisms, it is ;).
Oh, yeah, it's nasty ugly, but that goes away with patch 8. This is
just a temporary state that then gets factored neatly when the new
interfaces are introduced.
Post by Brian Foster
At the very least, can we reorganize the ioend side of things to handle
this? We already have the duplicate execution issue previously mentioned
that has to get resolved. Some duplicate code is fine if it improves
clarity, imo.
patch 8 does that - this is just preparing the code for being easy
to factor.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Christoph Hellwig
2014-08-29 00:28:25 UTC
Post by Dave Chinner
Internal buffer write error handling is a mess due to the unnatural
split between xfs_bioerror and xfs_bioerror_relse(). The buffer
reference counting is also wrong for the xfs_bioerror path for
xfs_bwrite - it uses sync IO and so xfs_buf_ioend will release a
hold count that was never taken.
Which seems to be caused by patch 1 in the series, and thus the
additional hold for the error case should be introduced there.

Otherwise this looks good to me.
Dave Chinner
2014-08-29 01:05:54 UTC
Post by Christoph Hellwig
Post by Dave Chinner
Internal buffer write error handling is a mess due to the unnatural
split between xfs_bioerror and xfs_bioerror_relse(). The buffer
reference counting is also wrong for the xfs_bioerror path for
xfs_bwrite - it uses sync IO and so xfs_buf_ioend will release a
hold count that was never taken.
Which seems to be caused by patch 1 in the series, and thus the
additional hold for the error case should be introduced there.
will fix.
--
Dave Chinner
***@fromorbit.com
Dave Chinner
2014-08-15 06:39:00 UTC
From: Dave Chinner <***@redhat.com>

We do some work in xfs_buf_ioend, and some work in
xfs_buf_iodone_work, but much of that functionality is the same.
This work can all be done in a single function, leaving
xfs_buf_iodone just a wrapper to determine if we should execute it
by workqueue or directly. Hence rename xfs_buf_iodone_work to
xfs_buf_ioend(), and add a new xfs_buf_ioend_async() for places that
need async processing.

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_buf.c | 79 +++++++++++++++++++++---------------------------
fs/xfs/xfs_buf.h | 2 +-
fs/xfs/xfs_buf_item.c | 4 +--
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_log.c | 2 +-
fs/xfs/xfs_log_recover.c | 2 +-
6 files changed, 40 insertions(+), 51 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 5d86bbd..1b7f0bc 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -999,54 +999,49 @@ xfs_buf_wait_unpin(
*/

STATIC void
-xfs_buf_iodone_work(
- struct work_struct *work)
+xfs_buf_ioend(
+ struct xfs_buf *bp)
{
- struct xfs_buf *bp =
- container_of(work, xfs_buf_t, b_iodone_work);
- bool read = !!(bp->b_flags & XBF_READ);
+ bool read = !!(bp->b_flags & XBF_READ);
+
+ trace_xfs_buf_iodone(bp, _RET_IP_);

bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);

- /* only validate buffers that were read without errors */
- if (read && bp->b_ops && !bp->b_error && (bp->b_flags & XBF_DONE))
- bp->b_ops->verify_read(bp);
+ if (!bp->b_error) {
+ bp->b_flags |= XBF_DONE;
+
+ /* only validate buffers that were read without errors */
+ if (read && bp->b_ops)
+ bp->b_ops->verify_read(bp);
+ }

if (bp->b_iodone)
(*(bp->b_iodone))(bp);
else if (bp->b_flags & XBF_ASYNC)
xfs_buf_relse(bp);
else {
- ASSERT(read && bp->b_ops);
complete(&bp->b_iowait);
xfs_buf_rele(bp);
}
}

-void
-xfs_buf_ioend(
- struct xfs_buf *bp,
- int schedule)
+static void
+xfs_buf_ioend_work(
+ struct work_struct *work)
{
- bool read = !!(bp->b_flags & XBF_READ);
-
- trace_xfs_buf_iodone(bp, _RET_IP_);
+ struct xfs_buf *bp =
+ container_of(work, xfs_buf_t, b_iodone_work);

- if (bp->b_error == 0)
- bp->b_flags |= XBF_DONE;
+ xfs_buf_ioend(bp);
+}

- if (bp->b_iodone || (read && bp->b_ops) || (bp->b_flags & XBF_ASYNC)) {
- if (schedule) {
- INIT_WORK(&bp->b_iodone_work, xfs_buf_iodone_work);
- queue_work(xfslogd_workqueue, &bp->b_iodone_work);
- } else {
- xfs_buf_iodone_work(&bp->b_iodone_work);
- }
- } else {
- bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
- complete(&bp->b_iowait);
- xfs_buf_rele(bp);
- }
+void
+xfs_buf_ioend_async(
+ struct xfs_buf *bp)
+{
+ INIT_WORK(&bp->b_iodone_work, xfs_buf_ioend_work);
+ queue_work(xfslogd_workqueue, &bp->b_iodone_work);
}

void
@@ -1094,7 +1089,7 @@ xfs_bioerror(
XFS_BUF_UNDONE(bp);
xfs_buf_stale(bp);

- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);

return -EIO;
}
@@ -1181,15 +1176,6 @@ xfs_bwrite(
}

STATIC void
-_xfs_buf_ioend(
- xfs_buf_t *bp,
- int schedule)
-{
- if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
- xfs_buf_ioend(bp, schedule);
-}
-
-STATIC void
xfs_buf_bio_end_io(
struct bio *bio,
int error)
@@ -1206,7 +1192,8 @@ xfs_buf_bio_end_io(
if (!bp->b_error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));

- _xfs_buf_ioend(bp, 1);
+ if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
+ xfs_buf_ioend_async(bp);
bio_put(bio);
}

@@ -1425,10 +1412,12 @@ xfs_buf_iorequest(
* waiting, and in the synchronous IO case it avoids unnecessary context
* switches and latency for high-performance devices.
*/
- if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
- _xfs_buf_ioend(bp, 0);
- else
- _xfs_buf_ioend(bp, 1);
+ if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
+ if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
+ xfs_buf_ioend(bp);
+ else
+ xfs_buf_ioend_async(bp);
+ }

xfs_buf_rele(bp);
}
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index c753183..4585c15 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -286,7 +286,7 @@ extern void xfs_buf_unlock(xfs_buf_t *);

/* Buffer Read and Write Routines */
extern int xfs_bwrite(struct xfs_buf *bp);
-extern void xfs_buf_ioend(xfs_buf_t *, int);
+extern void xfs_buf_ioend(struct xfs_buf *bp);
extern void xfs_buf_ioerror(xfs_buf_t *, int);
extern void xfs_buf_ioerror_alert(struct xfs_buf *, const char *func);
extern void xfs_buf_iorequest(xfs_buf_t *);
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 76007de..4fd41b5 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -491,7 +491,7 @@ xfs_buf_item_unpin(
xfs_buf_ioerror(bp, -EIO);
XFS_BUF_UNDONE(bp);
xfs_buf_stale(bp);
- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);
}
}

@@ -1115,7 +1115,7 @@ do_callbacks:
xfs_buf_do_callbacks(bp);
bp->b_fspriv = NULL;
bp->b_iodone = NULL;
- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);
}

/*
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index fea3c92..00d210b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3056,7 +3056,7 @@ cluster_corrupt_out:
XFS_BUF_UNDONE(bp);
xfs_buf_stale(bp);
xfs_buf_ioerror(bp, -EIO);
- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);
} else {
xfs_buf_stale(bp);
xfs_buf_relse(bp);
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 8eaa8f5..e4665db 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1689,7 +1689,7 @@ xlog_bdstrat(
if (iclog->ic_state & XLOG_STATE_IOERROR) {
xfs_buf_ioerror(bp, -EIO);
xfs_buf_stale(bp);
- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);
/*
* It would seem logical to return EIO here, but we rely on
* the log state machine to propagate I/O errors instead of
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 1fd5787..4ba19bf 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -383,7 +383,7 @@ xlog_recover_iodone(
SHUTDOWN_META_IO_ERROR);
}
bp->b_iodone = NULL;
- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);
}

/*
--
2.0.0
Brian Foster
2014-08-15 13:18:21 UTC
Post by Dave Chinner
We do some work in xfs_buf_ioend, and some work in
xfs_buf_iodone_work, but much of that functionality is the same.
This work can all be done in a single function, leaving
xfs_buf_iodone just a wrapper to determine if we should execute it
by workqueue or directly. Hence rename xfs_buf_iodone_work to
xfs_buf_ioend(), and add a new xfs_buf_ioend_async() for places that
need async processing.
---
fs/xfs/xfs_buf.c | 79 +++++++++++++++++++++---------------------------
fs/xfs/xfs_buf.h | 2 +-
fs/xfs/xfs_buf_item.c | 4 +--
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_log.c | 2 +-
fs/xfs/xfs_log_recover.c | 2 +-
6 files changed, 40 insertions(+), 51 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 5d86bbd..1b7f0bc 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -999,54 +999,49 @@ xfs_buf_wait_unpin(
*/
STATIC void
-xfs_buf_iodone_work(
- struct work_struct *work)
+xfs_buf_ioend(
+ struct xfs_buf *bp)
Compile failure here due to STATIC.
Post by Dave Chinner
{
- struct xfs_buf *bp =
- container_of(work, xfs_buf_t, b_iodone_work);
- bool read = !!(bp->b_flags & XBF_READ);
+ bool read = !!(bp->b_flags & XBF_READ);
+
+ trace_xfs_buf_iodone(bp, _RET_IP_);
bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
- /* only validate buffers that were read without errors */
- if (read && bp->b_ops && !bp->b_error && (bp->b_flags & XBF_DONE))
- bp->b_ops->verify_read(bp);
+ if (!bp->b_error) {
+ bp->b_flags |= XBF_DONE;
+
+ /* only validate buffers that were read without errors */
+ if (read && bp->b_ops)
+ bp->b_ops->verify_read(bp);
+ }
Probably not a cause of errors, but this code is now executed twice for
I/O with b_iodone callbacks. Once for the initial call from bio_end_io,
again from the callback via the b_iodone handler. The flags bits are
probably fine, but we don't want to be running the verifiers multiple
times unnecessarily.
Post by Dave Chinner
if (bp->b_iodone)
(*(bp->b_iodone))(bp);
else if (bp->b_flags & XBF_ASYNC)
xfs_buf_relse(bp);
else {
- ASSERT(read && bp->b_ops);
complete(&bp->b_iowait);
xfs_buf_rele(bp);
}
}
-void
-xfs_buf_ioend(
- struct xfs_buf *bp,
- int schedule)
+static void
+xfs_buf_ioend_work(
+ struct work_struct *work)
{
- bool read = !!(bp->b_flags & XBF_READ);
-
- trace_xfs_buf_iodone(bp, _RET_IP_);
+ struct xfs_buf *bp =
+ container_of(work, xfs_buf_t, b_iodone_work);
- if (bp->b_error == 0)
- bp->b_flags |= XBF_DONE;
+ xfs_buf_ioend(bp);
+}
- if (bp->b_iodone || (read && bp->b_ops) || (bp->b_flags & XBF_ASYNC)) {
- if (schedule) {
- INIT_WORK(&bp->b_iodone_work, xfs_buf_iodone_work);
- queue_work(xfslogd_workqueue, &bp->b_iodone_work);
- } else {
- xfs_buf_iodone_work(&bp->b_iodone_work);
- }
- } else {
- bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
- complete(&bp->b_iowait);
- xfs_buf_rele(bp);
- }
+void
+xfs_buf_ioend_async(
+ struct xfs_buf *bp)
+{
+ INIT_WORK(&bp->b_iodone_work, xfs_buf_ioend_work);
+ queue_work(xfslogd_workqueue, &bp->b_iodone_work);
}
void
@@ -1094,7 +1089,7 @@ xfs_bioerror(
XFS_BUF_UNDONE(bp);
xfs_buf_stale(bp);
- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);
return -EIO;
}
@@ -1181,15 +1176,6 @@ xfs_bwrite(
}
STATIC void
-_xfs_buf_ioend(
- xfs_buf_t *bp,
- int schedule)
-{
- if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
- xfs_buf_ioend(bp, schedule);
-}
-
-STATIC void
xfs_buf_bio_end_io(
struct bio *bio,
int error)
@@ -1206,7 +1192,8 @@ xfs_buf_bio_end_io(
if (!bp->b_error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
- _xfs_buf_ioend(bp, 1);
+ if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
+ xfs_buf_ioend_async(bp);
bio_put(bio);
}
@@ -1425,10 +1412,12 @@ xfs_buf_iorequest(
* waiting, and in the synchronous IO case it avoids unnecessary context
* switches and latency for high-performance devices.
*/
- if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
- _xfs_buf_ioend(bp, 0);
- else
- _xfs_buf_ioend(bp, 1);
+ if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
+ if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
+ xfs_buf_ioend(bp);
+ else
+ xfs_buf_ioend_async(bp);
+ }
This looks cleaner, but the comment is out of whack at this point.

Brian
Post by Dave Chinner
xfs_buf_rele(bp);
}
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index c753183..4585c15 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -286,7 +286,7 @@ extern void xfs_buf_unlock(xfs_buf_t *);
/* Buffer Read and Write Routines */
extern int xfs_bwrite(struct xfs_buf *bp);
-extern void xfs_buf_ioend(xfs_buf_t *, int);
+extern void xfs_buf_ioend(struct xfs_buf *bp);
extern void xfs_buf_ioerror(xfs_buf_t *, int);
extern void xfs_buf_ioerror_alert(struct xfs_buf *, const char *func);
extern void xfs_buf_iorequest(xfs_buf_t *);
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 76007de..4fd41b5 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -491,7 +491,7 @@ xfs_buf_item_unpin(
xfs_buf_ioerror(bp, -EIO);
XFS_BUF_UNDONE(bp);
xfs_buf_stale(bp);
- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);
}
}
xfs_buf_do_callbacks(bp);
bp->b_fspriv = NULL;
bp->b_iodone = NULL;
- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);
}
/*
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index fea3c92..00d210b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
XFS_BUF_UNDONE(bp);
xfs_buf_stale(bp);
xfs_buf_ioerror(bp, -EIO);
- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);
} else {
xfs_buf_stale(bp);
xfs_buf_relse(bp);
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 8eaa8f5..e4665db 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1689,7 +1689,7 @@ xlog_bdstrat(
if (iclog->ic_state & XLOG_STATE_IOERROR) {
xfs_buf_ioerror(bp, -EIO);
xfs_buf_stale(bp);
- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);
/*
* It would seem logical to return EIO here, but we rely on
* the log state machine to propagate I/O errors instead of
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 1fd5787..4ba19bf 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -383,7 +383,7 @@ xlog_recover_iodone(
SHUTDOWN_META_IO_ERROR);
}
bp->b_iodone = NULL;
- xfs_buf_ioend(bp, 0);
+ xfs_buf_ioend(bp);
}
/*
--
2.0.0
Dave Chinner
2014-08-15 23:21:35 UTC
Post by Brian Foster
Post by Dave Chinner
We do some work in xfs_buf_ioend, and some work in
xfs_buf_iodone_work, but much of that functionality is the same.
This work can all be done in a single function, leaving
xfs_buf_iodone just a wrapper to determine if we should execute it
by workqueue or directly. Hence rename xfs_buf_iodone_work to
xfs_buf_ioend(), and add a new xfs_buf_ioend_async() for places that
need async processing.
---
fs/xfs/xfs_buf.c | 79 +++++++++++++++++++++---------------------------
fs/xfs/xfs_buf.h | 2 +-
fs/xfs/xfs_buf_item.c | 4 +--
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_log.c | 2 +-
fs/xfs/xfs_log_recover.c | 2 +-
6 files changed, 40 insertions(+), 51 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 5d86bbd..1b7f0bc 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -999,54 +999,49 @@ xfs_buf_wait_unpin(
*/
STATIC void
-xfs_buf_iodone_work(
- struct work_struct *work)
+xfs_buf_ioend(
+ struct xfs_buf *bp)
Compile failure here due to STATIC.
Post by Dave Chinner
{
- struct xfs_buf *bp =
- container_of(work, xfs_buf_t, b_iodone_work);
- bool read = !!(bp->b_flags & XBF_READ);
+ bool read = !!(bp->b_flags & XBF_READ);
+
+ trace_xfs_buf_iodone(bp, _RET_IP_);
bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
- /* only validate buffers that were read without errors */
- if (read && bp->b_ops && !bp->b_error && (bp->b_flags & XBF_DONE))
- bp->b_ops->verify_read(bp);
+ if (!bp->b_error) {
+ bp->b_flags |= XBF_DONE;
+
+ /* only validate buffers that were read without errors */
+ if (read && bp->b_ops)
+ bp->b_ops->verify_read(bp);
+ }
Probably not a cause of errors, but this code is now executed twice for
I/O with b_iodone callbacks.
reads don't have b_iodone callbacks.
Post by Brian Foster
Once for the initial call from bio_end_io,
again from the callback via the b_iodone handler. The flags bits are
probably fine, but we don't want to be running the verifiers multiple
times unnecessarily.
Which we don't ;)
Post by Brian Foster
Post by Dave Chinner
@@ -1425,10 +1412,12 @@ xfs_buf_iorequest(
* waiting, and in the synchronous IO case it avoids unnecessary context
* switches and latency for high-performance devices.
*/
- if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
- _xfs_buf_ioend(bp, 0);
- else
- _xfs_buf_ioend(bp, 1);
+ if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
+ if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
+ xfs_buf_ioend(bp);
+ else
+ xfs_buf_ioend_async(bp);
+ }
This looks cleaner, but the comment is out of whack at this point.
The code is functionally identical, so the comment didn't get
changed. As it is, the behaviour that exists in this patch goes away
in later patches, so it's mostly irrelevant that a comment is
absolutely correct at an intermediate point within the patch set.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-08-18 14:15:54 UTC
Post by Dave Chinner
Post by Brian Foster
Post by Dave Chinner
We do some work in xfs_buf_ioend, and some work in
xfs_buf_iodone_work, but much of that functionality is the same.
This work can all be done in a single function, leaving
xfs_buf_iodone just a wrapper to determine if we should execute it
by workqueue or directly. Hence rename xfs_buf_iodone_work to
xfs_buf_ioend(), and add a new xfs_buf_ioend_async() for places that
need async processing.
---
fs/xfs/xfs_buf.c | 79 +++++++++++++++++++++---------------------------
fs/xfs/xfs_buf.h | 2 +-
fs/xfs/xfs_buf_item.c | 4 +--
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_log.c | 2 +-
fs/xfs/xfs_log_recover.c | 2 +-
6 files changed, 40 insertions(+), 51 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 5d86bbd..1b7f0bc 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -999,54 +999,49 @@ xfs_buf_wait_unpin(
*/
STATIC void
-xfs_buf_iodone_work(
- struct work_struct *work)
+xfs_buf_ioend(
+ struct xfs_buf *bp)
Compile failure here due to STATIC.
Post by Dave Chinner
{
- struct xfs_buf *bp =
- container_of(work, xfs_buf_t, b_iodone_work);
- bool read = !!(bp->b_flags & XBF_READ);
+ bool read = !!(bp->b_flags & XBF_READ);
+
+ trace_xfs_buf_iodone(bp, _RET_IP_);
bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
- /* only validate buffers that were read without errors */
- if (read && bp->b_ops && !bp->b_error && (bp->b_flags & XBF_DONE))
- bp->b_ops->verify_read(bp);
+ if (!bp->b_error) {
+ bp->b_flags |= XBF_DONE;
+
+ /* only validate buffers that were read without errors */
+ if (read && bp->b_ops)
+ bp->b_ops->verify_read(bp);
+ }
Probably not a cause of errors, but this code is now executed twice for
I/O with b_iodone callbacks.
reads don't have b_iodone callbacks.
Ah, Ok.
Post by Dave Chinner
Post by Brian Foster
Once for the initial call from bio_end_io,
again from the callback via the b_iodone handler. The flags bits are
probably fine, but we don't want to be running the verifiers multiple
times unnecessarily.
Which we don't ;)
Good point, but that's still a landmine IMO. It looks like the previous
code would avoid it for sync I/O, but not for async. You could probably
avoid it generally via a new flag or just by going off of XBF_DONE. The
latter seems logical to me. A comment wouldn't hurt either.
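
Something like this in xfs_buf_ioend(), for instance (a rough sketch
of the XBF_DONE idea, untested):

	if (!bp->b_error) {
		/* only run read verifiers on the first completion pass */
		if (read && bp->b_ops && !(bp->b_flags & XBF_DONE))
			bp->b_ops->verify_read(bp);
		if (!bp->b_error)
			bp->b_flags |= XBF_DONE;
	}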
Post by Dave Chinner
Post by Brian Foster
Post by Dave Chinner
@@ -1425,10 +1412,12 @@ xfs_buf_iorequest(
* waiting, and in the synchronous IO case it avoids unnecessary context
* switches and latency for high-performance devices.
*/
- if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
- _xfs_buf_ioend(bp, 0);
- else
- _xfs_buf_ioend(bp, 1);
+ if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
+ if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
+ xfs_buf_ioend(bp);
+ else
+ xfs_buf_ioend_async(bp);
+ }
This looks cleaner, but the comment is out of whack at this point.
The code is functionally identical, so the comment didn't get
changed. As it is, the behaviour that exists in this patch goes away
in later patches, so it's mostly irrelevant that a comment is
absolutely correct at an intermediate point within the patch set.
This was just a minor point that the comment refers to _xfs_buf_ioend().
That obviously no longer exists but the comment is still around at the
end of the series.

Brian
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Christoph Hellwig
2014-08-29 00:22:09 UTC
Post by Dave Chinner
STATIC void
needs the static removed as pointed out by Brian.
Post by Dave Chinner
- bool read = !!(bp->b_flags & XBF_READ);
+ bool read = !!(bp->b_flags & XBF_READ);
We don't really need the double negation here.

Otherwise this looks reasonable to me.
Dave Chinner
2014-08-29 00:55:43 UTC
Post by Christoph Hellwig
Post by Dave Chinner
STATIC void
needs the static removed as pointed out by Brian.
Post by Dave Chinner
- bool read = !!(bp->b_flags & XBF_READ);
+ bool read = !!(bp->b_flags & XBF_READ);
We don't really need the double negation here.
Good point. I'll fix it.
--
Dave Chinner
***@fromorbit.com
Dave Chinner
2014-08-15 06:39:07 UTC
From: Dave Chinner <***@redhat.com>

xfs_buf_read_uncached() has two failure modes. It can either return
NULL or bp->b_error != 0 depending on the type of failure, and not
all callers check for both. Fix it up.
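
For example, a caller that only checks for a NULL return silently
ignores IO and verifier errors (illustrative pattern, not a hunk from
this patch):

	bp = xfs_buf_read_uncached(mp->m_ddev_targp, daddr, numblks, 0, NULL);
	if (!bp)
		return -EIO;	/* catches allocation failure only... */
	/*
	 * ...but a read or verifier error leaves bp non-NULL with
	 * bp->b_error set, and is missed here.
	 */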

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_mount.c | 55 ++++++++++++++++++++++++++++++------------------------
1 file changed, 31 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d0ef01e..10a365a 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -304,13 +304,8 @@ xfs_readsb(
reread:
bp = xfs_buf_read_uncached(mp->m_ddev_targp, XFS_SB_DADDR,
BTOBB(sector_size), 0, buf_ops);
- if (!bp) {
- if (loud)
- xfs_warn(mp, "SB buffer read failed");
- return -EIO;
- }
- if (bp->b_error) {
- error = bp->b_error;
+ if (!bp || bp->b_error) {
+ error = bp ? bp->b_error : -ENOMEM;
if (loud)
xfs_warn(mp, "SB validate failed with error %d.", error);
/* bad CRC means corrupted metadata */
@@ -368,7 +363,8 @@ reread:
return 0;

release_buf:
- xfs_buf_relse(bp);
+ if (bp)
+ xfs_buf_relse(bp);
return error;
}

@@ -546,10 +542,12 @@ xfs_set_inoalignment(xfs_mount_t *mp)
* Check that the data (and log if separate) is an ok size.
*/
STATIC int
-xfs_check_sizes(xfs_mount_t *mp)
+xfs_check_sizes(
+ struct xfs_mount *mp)
{
- xfs_buf_t *bp;
+ struct xfs_buf *bp;
xfs_daddr_t d;
+ int error;

d = (xfs_daddr_t)XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
if (XFS_BB_TO_FSB(mp, d) != mp->m_sb.sb_dblocks) {
@@ -559,28 +557,37 @@ xfs_check_sizes(xfs_mount_t *mp)
bp = xfs_buf_read_uncached(mp->m_ddev_targp,
d - XFS_FSS_TO_BB(mp, 1),
XFS_FSS_TO_BB(mp, 1), 0, NULL);
- if (!bp) {
+ if (!bp || bp->b_error) {
xfs_warn(mp, "last sector read failed");
- return -EIO;
+ goto out_fail;
}
xfs_buf_relse(bp);

- if (mp->m_logdev_targp != mp->m_ddev_targp) {
- d = (xfs_daddr_t)XFS_FSB_TO_BB(mp, mp->m_sb.sb_logblocks);
- if (XFS_BB_TO_FSB(mp, d) != mp->m_sb.sb_logblocks) {
- xfs_warn(mp, "log size mismatch detected");
- return -EFBIG;
- }
- bp = xfs_buf_read_uncached(mp->m_logdev_targp,
+ if (mp->m_logdev_targp == mp->m_ddev_targp)
+ return 0;
+
+ d = (xfs_daddr_t)XFS_FSB_TO_BB(mp, mp->m_sb.sb_logblocks);
+ if (XFS_BB_TO_FSB(mp, d) != mp->m_sb.sb_logblocks) {
+ xfs_warn(mp, "log size mismatch detected");
+ return -EFBIG;
+ }
+ bp = xfs_buf_read_uncached(mp->m_logdev_targp,
d - XFS_FSB_TO_BB(mp, 1),
XFS_FSB_TO_BB(mp, 1), 0, NULL);
- if (!bp) {
- xfs_warn(mp, "log device read failed");
- return -EIO;
- }
- xfs_buf_relse(bp);
+ if (!bp || bp->b_error) {
+ xfs_warn(mp, "log device read failed");
+ goto out_fail;
}
+ xfs_buf_relse(bp);
return 0;
+
+out_fail:
+ if (!bp)
+ return -EIO;
+ error = bp->b_error;
+ xfs_buf_relse(bp);
+ return error;
+
}

/*
--
2.0.0
Christoph Hellwig
2014-08-15 12:56:08 UTC
Post by Dave Chinner
xfs_buf_read_uncached() has two failure modes. It can either return
NULL or bp->b_error != 0 depending on the type of failure, and not
all callers check for both. Fix it up.
I'd rather get rid of these annoying calling conventions. Make it
return an errno, and return the bp in a pointer argument, with the bp
always NULL in case of error.
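
Something like this, say (a sketch of the suggested convention, not an
actual patch; the argument order is illustrative):

	int
	xfs_buf_read_uncached(
		struct xfs_buftarg	*target,
		xfs_daddr_t		daddr,
		size_t			numblks,
		int			flags,
		struct xfs_buf		**bpp,
		const struct xfs_buf_ops *ops);

	error = xfs_buf_read_uncached(mp->m_ddev_targp, XFS_SB_DADDR,
				      BTOBB(sector_size), 0, &bp, buf_ops);
	if (error)
		return error;	/* bp is guaranteed to be NULL here */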
Dave Chinner
2014-08-15 23:58:43 UTC
Post by Christoph Hellwig
Post by Dave Chinner
xfs_buf_read_uncached() has two failure modes. It can either return
NULL or bp->b_error != 0 depending on the type of failure, and not
all callers check for both. Fix it up.
I'd rather get rid of these annoying calling conventions. Make it
return an errno, and return the bp in a pointer argument, with the bp
always NULL in case of error.
Ok. I considered that, then just did the simple thing.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Christoph Hellwig
2014-08-29 00:37:56 UTC
Post by Dave Chinner
Post by Christoph Hellwig
I'd rather get rid of these annoying calling conventions. Make it
return an errno, and return the bp in a pointer argument, with the bp
always NULL in case of error.
Ok. I considered that, then just did the simple thing.
There's only about a dozen callers of all xfs_buf_read* variants, so
I think switching them over to return an errno shouldn't be much of a
problem.
Dave Chinner
2014-08-15 06:38:59 UTC
From: Dave Chinner <***@redhat.com>

When synchronous IO runs IO completion work, it does so without an
IO reference or a hold reference on the buffer. The IO "hold
reference" is owned by the submitter, and released when the
submission is complete. The IO reference is released when both the
submitter and the bio end_io processing have run, and so if the IO
completion work is run from IO completion context, it is run without
an IO reference.

Hence we can get the situation where the submitter can submit the
IO, see an error on the buffer and unlock and free the buffer while
there is still IO in progress. This leads to use-after-free and
memory corruption.

Fix this by taking a "sync IO hold" reference that is owned by the
IO and not released until after the buffer completion calls are run
to wake up synchronous waiters. This means that the buffer will not
be freed in any circumstance until all IO processing is completed.

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_buf.c | 46 ++++++++++++++++++++++++++++++++++++++--------
1 file changed, 38 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..5d86bbd 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1019,6 +1019,7 @@ xfs_buf_iodone_work(
else {
ASSERT(read && bp->b_ops);
complete(&bp->b_iowait);
+ xfs_buf_rele(bp);
}
}

@@ -1044,6 +1045,7 @@ xfs_buf_ioend(
} else {
bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
complete(&bp->b_iowait);
+ xfs_buf_rele(bp);
}
}

@@ -1383,22 +1385,50 @@ xfs_buf_iorequest(

if (bp->b_flags & XBF_WRITE)
xfs_buf_wait_unpin(bp);
+
+ /*
+ * Take references to the buffer. For XBF_ASYNC buffers, holding a
+ * reference for as long as submission takes is all that is necessary
+ * here. The IO inherits the lock and hold count from the submitter,
+ * and these are released during IO completion processing. Taking a hold
+ * over submission ensures that the buffer is not freed until we have
+ * completed all processing, regardless of when IO errors occur or are
+ * reported.
+ *
+ * However, for synchronous IO, the IO does not inherit the submitter's
+ * reference count, nor the buffer lock. Hence we need to take an extra
+ * reference to the buffer for the IO context so that we can
+ * guarantee the buffer is not freed until all IO completion processing
+ * is done. Otherwise the caller can drop their reference while the IO
+ * is still in progress and hence trigger a use-after-free situation.
+ */
xfs_buf_hold(bp);
+ if (!(bp->b_flags & XBF_ASYNC))
+ xfs_buf_hold(bp);
+

/*
- * Set the count to 1 initially, this will stop an I/O
- * completion callout which happens before we have started
- * all the I/O from calling xfs_buf_ioend too early.
+ * Set the count to 1 initially, this will stop an I/O completion
+ * callout which happens before we have started all the I/O from calling
+ * xfs_buf_ioend too early.
*/
atomic_set(&bp->b_io_remaining, 1);
_xfs_buf_ioapply(bp);
+
/*
- * If _xfs_buf_ioapply failed, we'll get back here with
- * only the reference we took above. _xfs_buf_ioend will
- * drop it to zero, so we'd better not queue it for later,
- * or we'll free it before it's done.
+ * If _xfs_buf_ioapply failed or we are doing synchronous IO that
+ * completes extremely quickly, we can get back here with only the IO
+ * reference we took above. _xfs_buf_ioend will drop it to zero, so
+ * we'd better run completion processing synchronously so that we
+ * don't return to the caller with completion still pending. In the
+ * error case, this allows the caller to check b_error safely without
+ * waiting, and in the synchronous IO case it avoids unnecessary context
+ * switches and latency for high-performance devices.
*/
- _xfs_buf_ioend(bp, bp->b_error ? 0 : 1);
+ if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
+ _xfs_buf_ioend(bp, 0);
+ else
+ _xfs_buf_ioend(bp, 1);

xfs_buf_rele(bp);
}
--
2.0.0
Brian Foster
2014-08-15 13:18:04 UTC
Post by Dave Chinner
When synchronous IO runs IO completion work, it does so without an
IO reference or a hold reference on the buffer. The IO "hold
reference" is owned by the submitter, and released when the
submission is complete. The IO reference is released when both the
submitter and the bio end_io processing have run, and so if the IO
completion work is run from IO completion context, it is run without
an IO reference.
Hence we can get the situation where the submitter can submit the
IO, see an error on the buffer and unlock and free the buffer while
there is still IO in progress. This leads to use-after-free and
memory corruption.
Fix this by taking a "sync IO hold" reference that is owned by the
IO and not released until after the buffer completion calls are run
to wake up synchronous waiters. This means that the buffer will not
be freed in any circumstance until all IO processing is completed.
---
fs/xfs/xfs_buf.c | 46 ++++++++++++++++++++++++++++++++++++++--------
1 file changed, 38 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..5d86bbd 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1019,6 +1019,7 @@ xfs_buf_iodone_work(
else {
ASSERT(read && bp->b_ops);
complete(&bp->b_iowait);
+ xfs_buf_rele(bp);
/* !XBF_ASYNC ref */
Post by Dave Chinner
}
}
@@ -1044,6 +1045,7 @@ xfs_buf_ioend(
} else {
bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
complete(&bp->b_iowait);
+ xfs_buf_rele(bp);
/* !XBF_ASYNC ref */
Post by Dave Chinner
}
}
@@ -1383,22 +1385,50 @@ xfs_buf_iorequest(
if (bp->b_flags & XBF_WRITE)
xfs_buf_wait_unpin(bp);
+
+ /*
+ * Take references to the buffer. For XBF_ASYNC buffers, holding a
+ * reference for as long as submission takes is all that is necessary
+ * here. The IO inherits the lock and hold count from the submitter,
+ * and these are released during IO completion processing. Taking a hold
+ * over submission ensures that the buffer is not freed until we have
+ * completed all processing, regardless of when IO errors occur or are
+ * reported.
+ *
+ * However, for synchronous IO, the IO does not inherit the submitter's
+ * reference count, nor the buffer lock. Hence we need to take an extra
+ * reference to the buffer for the IO context so that we can
+ * guarantee the buffer is not freed until all IO completion processing
+ * is done. Otherwise the caller can drop their reference while the IO
+ * is still in progress and hence trigger a use-after-free situation.
+ */
xfs_buf_hold(bp);
+ if (!(bp->b_flags & XBF_ASYNC))
+ xfs_buf_hold(bp);
+
/*
- * Set the count to 1 initially, this will stop an I/O
- * completion callout which happens before we have started
- * all the I/O from calling xfs_buf_ioend too early.
+ * Set the count to 1 initially, this will stop an I/O completion
+ * callout which happens before we have started all the I/O from calling
+ * xfs_buf_ioend too early.
*/
atomic_set(&bp->b_io_remaining, 1);
_xfs_buf_ioapply(bp);
+
/*
- * If _xfs_buf_ioapply failed, we'll get back here with
- * only the reference we took above. _xfs_buf_ioend will
- * drop it to zero, so we'd better not queue it for later,
- * or we'll free it before it's done.
+ * If _xfs_buf_ioapply failed or we are doing synchronous IO that
+ * completes extremely quickly, we can get back here with only the IO
+ * reference we took above. _xfs_buf_ioend will drop it to zero, so
+ * we'd better run completion processing synchronously so that we
+ * don't return to the caller with completion still pending. In the
+ * error case, this allows the caller to check b_error safely without
+ * waiting, and in the synchronous IO case it avoids unnecessary context
+ * switches and latency for high-performance devices.
*/
AFAICT there is no real wait if the buf has completed at this point. The
wait just decrements the completion counter. So what's the benefit of
"not waiting?" Where is the potential context switch? Are you referring
to the case where error is set but I/O is not complete? Are you saying
the advantage to the caller is it doesn't have to care about the state
of further I/O once it has been determined at least one error has
occurred? (If so, who cares about latency given that some operation that
depends on this I/O is already doomed to fail?).

The code looks fine, but I'm trying to understand the reasoning better
(and I suspect we can clarify the comment).
Post by Dave Chinner
- _xfs_buf_ioend(bp, bp->b_error ? 0 : 1);
+ if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
+ _xfs_buf_ioend(bp, 0);
+ else
+ _xfs_buf_ioend(bp, 1);
Not related to this patch, but it seems like the problem this code tries
to address is still possible. Perhaps this papers over a particular
instance. Consider the case where an I/O fails immediately after this
call completes, but not before. We have an extra reference now for
completion, but we can still return to the caller with completion
pending. I suppose its fine if we consider the "problem" to be that the
reference goes away underneath the completion, as opposed to the caller
caring about the status of completion.

Brian
Post by Dave Chinner
xfs_buf_rele(bp);
}
--
2.0.0
Dave Chinner
2014-08-15 23:17:36 UTC
....
Post by Brian Foster
Post by Dave Chinner
if (bp->b_flags & XBF_WRITE)
xfs_buf_wait_unpin(bp);
+
+ /*
+ * Take references to the buffer. For XBF_ASYNC buffers, holding a
+ * reference for as long as submission takes is all that is necessary
+ * here. The IO inherits the lock and hold count from the submitter,
+ * and these are released during IO completion processing. Taking a hold
+ * over submission ensures that the buffer is not freed until we have
+ * completed all processing, regardless of when IO errors occur or are
+ * reported.
+ *
+ * However, for synchronous IO, the IO does not inherit the submitter's
+ * reference count, nor the buffer lock. Hence we need to take an extra
+ * reference to the buffer for the IO context so that we can
+ * guarantee the buffer is not freed until all IO completion processing
+ * is done. Otherwise the caller can drop their reference while the IO
+ * is still in progress and hence trigger a use-after-free situation.
+ */
xfs_buf_hold(bp);
+ if (!(bp->b_flags & XBF_ASYNC))
+ xfs_buf_hold(bp);
+
/*
- * Set the count to 1 initially, this will stop an I/O
- * completion callout which happens before we have started
- * all the I/O from calling xfs_buf_ioend too early.
+ * Set the count to 1 initially, this will stop an I/O completion
+ * callout which happens before we have started all the I/O from calling
+ * xfs_buf_ioend too early.
*/
atomic_set(&bp->b_io_remaining, 1);
_xfs_buf_ioapply(bp);
+
/*
- * If _xfs_buf_ioapply failed, we'll get back here with
- * only the reference we took above. _xfs_buf_ioend will
- * drop it to zero, so we'd better not queue it for later,
- * or we'll free it before it's done.
+ * If _xfs_buf_ioapply failed or we are doing synchronous IO that
+ * completes extremely quickly, we can get back here with only the IO
+ * reference we took above. _xfs_buf_ioend will drop it to zero, so
+ * we'd better run completion processing synchronously so that we
+ * don't return to the caller with completion still pending. In the
+ * error case, this allows the caller to check b_error safely without
+ * waiting, and in the synchronous IO case it avoids unnecessary context
+ * switches and latency for high-performance devices.
*/
AFAICT there is no real wait if the buf has completed at this point. The
wait just decrements the completion counter.
If the IO has completed, then we run the completion code.
Post by Brian Foster
So what's the benefit of
"not waiting?" Where is the potential context switch?
async work for completion processing on synchronous IO means we queue
the work, then sleep in xfs_buf_iowait(). Two context switches, plus
a work queue execution
Post by Brian Foster
Are you referring
to the case where error is set but I/O is not complete? Are you saying
the advantage to the caller is it doesn't have to care about the state
of further I/O once it has been determined at least one error has
occurred? (If so, who cares about latency given that some operation that
depends on this I/O is already doomed to fail?).
No, you're reading *way* too much into this. For sync IO, it's
always best to process completion inline. For async, it doesn't
matter, but if there's a submission error it is *more efficient* to
process it in the current context.
Post by Brian Foster
The code looks fine, but I'm trying to understand the reasoning better
(and I suspect we can clarify the comment).
Post by Dave Chinner
- _xfs_buf_ioend(bp, bp->b_error ? 0 : 1);
+ if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
+ _xfs_buf_ioend(bp, 0);
+ else
+ _xfs_buf_ioend(bp, 1);
Not related to this patch, but it seems like the problem this code tries
to address is still possible.
The race condition is still possible - it just won't result in a
use-after-free. The race condition is not fixed until patch 8,
but as a backportable fix, this patch is much, much simpler.
Post by Brian Foster
Perhaps this papers over a particular
instance. Consider the case where an I/O fails immediately after this
call completes, but not before. We have an extra reference now for
completion, but we can still return to the caller with completion
pending. I suppose it's fine if we consider the "problem" to be that the
reference goes away underneath the completion, as opposed to the caller
caring about the status of completion.
Precisely.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-08-18 14:15:26 UTC
Permalink
Post by Dave Chinner
....
Post by Brian Foster
Post by Dave Chinner
if (bp->b_flags & XBF_WRITE)
xfs_buf_wait_unpin(bp);
+
+ /*
+ * Take references to the buffer. For XBF_ASYNC buffers, holding a
+ * reference for as long as submission takes is all that is necessary
+ * here. The IO inherits the lock and hold count from the submitter,
+ * and these are released during IO completion processing. Taking a hold
+ * over submission ensures that the buffer is not freed until we have
+ * completed all processing, regardless of when IO errors occur or are
+ * reported.
+ *
+ * However, for synchronous IO, the IO does not inherit the submitter's
+ * reference count, nor the buffer lock. Hence we need to take an extra
+ * reference to the buffer for the IO context so that we can
+ * guarantee the buffer is not freed until all IO completion processing
+ * is done. Otherwise the caller can drop their reference while the IO
+ * is still in progress and hence trigger a use-after-free situation.
+ */
xfs_buf_hold(bp);
+ if (!(bp->b_flags & XBF_ASYNC))
+ xfs_buf_hold(bp);
+
/*
- * Set the count to 1 initially, this will stop an I/O
- * completion callout which happens before we have started
- * all the I/O from calling xfs_buf_ioend too early.
+ * Set the count to 1 initially, this will stop an I/O completion
+ * callout which happens before we have started all the I/O from calling
+ * xfs_buf_ioend too early.
*/
atomic_set(&bp->b_io_remaining, 1);
_xfs_buf_ioapply(bp);
+
/*
- * If _xfs_buf_ioapply failed, we'll get back here with
- * only the reference we took above. _xfs_buf_ioend will
- * drop it to zero, so we'd better not queue it for later,
- * or we'll free it before it's done.
+ * If _xfs_buf_ioapply failed or we are doing synchronous IO that
+ * completes extremely quickly, we can get back here with only the IO
+ * reference we took above. _xfs_buf_ioend will drop it to zero, so
+ * we'd better run completion processing synchronously so that we
+ * don't return to the caller with completion still pending. In the
+ * error case, this allows the caller to check b_error safely without
+ * waiting, and in the synchronous IO case it avoids unnecessary context
+ * switches and latency for high-performance devices.
*/
AFAICT there is no real wait if the buf has completed at this point. The
wait just decrements the completion counter.
If the IO has completed, then we run the completion code.
Post by Brian Foster
So what's the benefit of
"not waiting?" Where is the potential context switch?
Async work for completion processing on synchronous IO means we queue
the work, then sleep in xfs_buf_iowait(): two context switches, plus
a workqueue execution.
Right...
Post by Dave Chinner
Post by Brian Foster
Are you referring
to the case where error is set but I/O is not complete? Are you saying
the advantage to the caller is it doesn't have to care about the state
of further I/O once it has been determined at least one error has
occurred? (If so, who cares about latency given that some operation that
depends on this I/O is already doomed to fail?).
No, you're reading *way* too much into this. For sync IO, it's
always best to process completion inline. For async, it doesn't
matter, but if there's a submission error it is *more efficient* to
process it in the current context.
Heh. Sure, that makes sense. Perhaps it's just the way I read it,
implying that how we process I/O completion affects what the calling
code should look like. Simple case of the comment being a bit more
confusing than the code. ;) FWIW, the following is clearer to me:

/*
* If _xfs_buf_ioapply failed or we are doing synchronous IO that
* completes extremely quickly, we can get back here with only the IO
* reference we took above. _xfs_buf_ioend will drop it to zero. Run
* completion processing synchronously so that we don't return to the
* caller with completion still pending. This avoids unnecessary context
* switches associated with the end_io workqueue.
*/

Thanks for the explanation.

Brian
Post by Dave Chinner
Post by Brian Foster
The code looks fine, but I'm trying to understand the reasoning better
(and I suspect we can clarify the comment).
Post by Dave Chinner
- _xfs_buf_ioend(bp, bp->b_error ? 0 : 1);
+ if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
+ _xfs_buf_ioend(bp, 0);
+ else
+ _xfs_buf_ioend(bp, 1);
Not related to this patch, but it seems like the problem this code tries
to address is still possible.
The race condition is still possible - it just won't result in a
use-after-free. The race condition is not fixed until patch 8,
but as a backportable fix, this patch is much, much simpler.
Post by Brian Foster
Perhaps this papers over a particular
instance. Consider the case where an I/O fails immediately after this
call completes, but not before. We have an extra reference now for
completion, but we can still return to the caller with completion
pending. I suppose its fine if we consider the "problem" to be that the
reference goes away underneath the completion, as opposed to the caller
caring about the status of completion.
Precisely.
Cheers,
Dave.
--
Dave Chinner
Christoph Hellwig
2014-08-29 00:18:53 UTC
Permalink
Looks good,

Reviewed-by: Christoph Hellwig <***@lst.de>
Dave Chinner
2014-08-15 06:39:05 UTC
Permalink
From: Dave Chinner <***@redhat.com>

The error handling is a mess of inconsistent spaghetti. Clean it up
like Christoph's comment from long ago said we should. While there,
get rid of the "xfs_do_error" error injection debug code. If we need
to do error injection, we can do it on demand via kprobes rather
than needing to run the kernel under a debugger and poke magic
variables.
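
For illustration, a minimal sketch of what such on-demand injection
could look like (hypothetical out-of-tree module, not part of this
patch; it assumes struct xfs_buf is visible to the module and x86-64
argument passing in %rdi):

	#include <linux/module.h>
	#include <linux/kprobes.h>

	static int nth = 33;	/* fail every nth buffer, like xfs_error_mod */
	module_param(nth, int, 0644);
	static atomic_t seen = ATOMIC_INIT(0);

	/* fires on entry to xfs_buf_ioend(); arg0 (bp) is in %rdi */
	static int ioend_pre(struct kprobe *p, struct pt_regs *regs)
	{
		struct xfs_buf *bp = (struct xfs_buf *)regs->di;

		if (atomic_inc_return(&seen) % nth == 0)
			bp->b_error = -EIO;
		return 0;
	}

	static struct kprobe kp = {
		.symbol_name	= "xfs_buf_ioend",
		.pre_handler	= ioend_pre,
	};

	static int __init inject_init(void) { return register_kprobe(&kp); }
	static void __exit inject_exit(void) { unregister_kprobe(&kp); }
	module_init(inject_init);
	module_exit(inject_exit);
	MODULE_LICENSE("GPL");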

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_trans_buf.c | 187 ++++++++++++++++++-------------------------------
1 file changed, 69 insertions(+), 118 deletions(-)

diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 758c07d..9c3e610 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -229,13 +229,6 @@ xfs_trans_getsb(xfs_trans_t *tp,
return bp;
}

-#ifdef DEBUG
-xfs_buftarg_t *xfs_error_target;
-int xfs_do_error;
-int xfs_req_num;
-int xfs_error_mod = 33;
-#endif
-
/*
* Get and lock the buffer for the caller if it is not already
* locked within the given transaction. If it has not yet been
@@ -257,165 +250,123 @@ xfs_trans_read_buf_map(
struct xfs_buf **bpp,
const struct xfs_buf_ops *ops)
{
- xfs_buf_t *bp;
- xfs_buf_log_item_t *bip;
+ struct xfs_buf *bp = NULL;
int error;
+ bool release = true;

*bpp = NULL;
- if (!tp) {
- bp = xfs_buf_read_map(target, map, nmaps, flags, ops);
- if (!bp)
- return (flags & XBF_TRYLOCK) ?
- -EAGAIN : -ENOMEM;
-
- if (bp->b_error) {
- error = bp->b_error;
- xfs_buf_ioerror_alert(bp, __func__);
- XFS_BUF_UNDONE(bp);
- xfs_buf_stale(bp);
- xfs_buf_relse(bp);
-
- /* bad CRC means corrupted metadata */
- if (error == -EFSBADCRC)
- error = -EFSCORRUPTED;
- return error;
- }
-#ifdef DEBUG
- if (xfs_do_error) {
- if (xfs_error_target == target) {
- if (((xfs_req_num++) % xfs_error_mod) == 0) {
- xfs_buf_relse(bp);
- xfs_debug(mp, "Returning error!");
- return -EIO;
- }
- }
- }
-#endif
- if (XFS_FORCED_SHUTDOWN(mp))
- goto shutdown_abort;
- *bpp = bp;
- return 0;
- }

/*
- * If we find the buffer in the cache with this transaction
- * pointer in its b_fsprivate2 field, then we know we already
- * have it locked. If it is already read in we just increment
- * the lock recursion count and return the buffer to the caller.
- * If the buffer is not yet read in, then we read it in, increment
- * the lock recursion count, and return it to the caller.
+ * If we find the buffer in this transaction's item list, then we know
+ * we already have it locked. If it is already read in we just
+ * increment the lock recursion count and return the buffer to the
+ * caller.
*/
- bp = xfs_trans_buf_item_match(tp, target, map, nmaps);
- if (bp != NULL) {
+ if (tp)
+ bp = xfs_trans_buf_item_match(tp, target, map, nmaps);
+
+ if (bp) {
+ struct xfs_buf_log_item *bip;
+
+ /*
+ * We don't own this buffer ourselves, so we shouldn't release
+ * it if we come across any errors in processing it.
+ */
+ release = false;
+
ASSERT(xfs_buf_islocked(bp));
ASSERT(bp->b_transp == tp);
ASSERT(bp->b_fspriv != NULL);
ASSERT(!bp->b_error);
- if (!(XFS_BUF_ISDONE(bp))) {
+ if (!(bp->b_flags & XBF_DONE)) {
trace_xfs_trans_read_buf_io(bp, _RET_IP_);
- ASSERT(!XFS_BUF_ISASYNC(bp));
+ ASSERT(!(bp->b_flags & XBF_ASYNC));
ASSERT(bp->b_iodone == NULL);
- XFS_BUF_READ(bp);
+
+ bp->b_flags |= XBF_READ;
bp->b_ops = ops;

- /*
- * XXX(hch): clean up the error handling here to be less
- * of a mess..
- */
if (XFS_FORCED_SHUTDOWN(mp)) {
trace_xfs_bdstrat_shut(bp, _RET_IP_);
- bp->b_flags &= ~(XBF_READ | XBF_DONE);
- xfs_buf_ioerror(bp, -EIO);
- xfs_buf_stale(bp);
- return -EIO;
+ error = -EIO;
+ goto out_stale;
}

xfs_buf_iorequest(bp);
error = xfs_buf_iowait(bp);
if (error) {
xfs_buf_ioerror_alert(bp, __func__);
- xfs_buf_relse(bp);
- /*
- * We can gracefully recover from most read
- * errors. Ones we can't are those that happen
- * after the transaction's already dirty.
- */
- if (tp->t_flags & XFS_TRANS_DIRTY)
- xfs_force_shutdown(tp->t_mountp,
- SHUTDOWN_META_IO_ERROR);
- /* bad CRC means corrupted metadata */
- if (error == -EFSBADCRC)
- error = -EFSCORRUPTED;
- return error;
+ goto out_do_shutdown;
+
}
}
- /*
- * We never locked this buf ourselves, so we shouldn't
- * brelse it either. Just get out.
- */
+
if (XFS_FORCED_SHUTDOWN(mp)) {
trace_xfs_trans_read_buf_shut(bp, _RET_IP_);
- *bpp = NULL;
return -EIO;
}

-
+ /* good buffer! */
bip = bp->b_fspriv;
bip->bli_recur++;
-
ASSERT(atomic_read(&bip->bli_refcount) > 0);
trace_xfs_trans_read_buf_recur(bip);
*bpp = bp;
return 0;
}

+ /*
+ * Read the buffer from disk as there is no transaction context or we
+ * didn't find a matching buffer in the transaction item list.
+ */
bp = xfs_buf_read_map(target, map, nmaps, flags, ops);
- if (bp == NULL) {
- *bpp = NULL;
- return (flags & XBF_TRYLOCK) ?
- 0 : -ENOMEM;
+ if (!bp) {
+ /* XXX(dgc): should always return -EAGAIN on trylock failure */
+ if (!(flags & XBF_TRYLOCK))
+ return -ENOMEM;
+ if (!tp)
+ return -EAGAIN;
+ return 0;
}
if (bp->b_error) {
- error = bp->b_error;
- xfs_buf_stale(bp);
- XFS_BUF_DONE(bp);
xfs_buf_ioerror_alert(bp, __func__);
- if (tp->t_flags & XFS_TRANS_DIRTY)
- xfs_force_shutdown(tp->t_mountp, SHUTDOWN_META_IO_ERROR);
- xfs_buf_relse(bp);
-
- /* bad CRC means corrupted metadata */
- if (error == -EFSBADCRC)
- error = -EFSCORRUPTED;
- return error;
+ error = bp->b_error;
+ goto out_do_shutdown;
}
-#ifdef DEBUG
- if (xfs_do_error && !(tp->t_flags & XFS_TRANS_DIRTY)) {
- if (xfs_error_target == target) {
- if (((xfs_req_num++) % xfs_error_mod) == 0) {
- xfs_force_shutdown(tp->t_mountp,
- SHUTDOWN_META_IO_ERROR);
- xfs_buf_relse(bp);
- xfs_debug(mp, "Returning trans error!");
- return -EIO;
- }
- }
+
+ if (XFS_FORCED_SHUTDOWN(mp)) {
+ trace_xfs_trans_read_buf_shut(bp, _RET_IP_);
+ error = -EIO;
+ goto out_stale;
}
-#endif
- if (XFS_FORCED_SHUTDOWN(mp))
- goto shutdown_abort;

- _xfs_trans_bjoin(tp, bp, 1);
- trace_xfs_trans_read_buf(bp->b_fspriv);
+ if (tp) {
+ _xfs_trans_bjoin(tp, bp, 1);
+ trace_xfs_trans_read_buf(bp->b_fspriv);
+ }

*bpp = bp;
return 0;

-shutdown_abort:
- trace_xfs_trans_read_buf_shut(bp, _RET_IP_);
- xfs_buf_relse(bp);
- *bpp = NULL;
- return -EIO;
+
+out_do_shutdown:
+ /*
+ * We can gracefully recover from most read errors. Ones we can't are
+ * those that happen after the transaction's already dirty.
+ */
+ if (tp && (tp->t_flags & XFS_TRANS_DIRTY))
+ xfs_force_shutdown(tp->t_mountp, SHUTDOWN_META_IO_ERROR);
+out_stale:
+ bp->b_flags &= ~(XBF_READ | XBF_DONE);
+ xfs_buf_ioerror(bp, error);
+ xfs_buf_stale(bp);
+ if (release)
+ xfs_buf_relse(bp);
+
+ /* bad CRC means corrupted metadata */
+ if (error == -EFSBADCRC)
+ error = -EFSCORRUPTED;
+ return error;
}

/*
--
2.0.0
Dave Chinner
2014-08-15 06:39:02 UTC
Permalink
From: Dave Chinner <***@redhat.com>

Only has two callers, and is just a shutdown check and error handler
around xfs_buf_iorequest. However, the error handling is a mess of
read and write semantics, and both internal callers only call it for
writes. Hence kill the wrapper, and follow up with a patch to
sanitise the error handling.

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_buf.c | 51 ++++++++++++++++++++++++++++-----------------------
1 file changed, 28 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 58ae34c..98fcf5a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1140,27 +1140,6 @@ xfs_bioerror_relse(
return -EIO;
}

-STATIC int
-xfs_bdstrat_cb(
- struct xfs_buf *bp)
-{
- if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
- trace_xfs_bdstrat_shut(bp, _RET_IP_);
- /*
- * Metadata write that didn't get logged but
- * written delayed anyway. These aren't associated
- * with a transaction, and can be ignored.
- */
- if (!bp->b_iodone && !XFS_BUF_ISREAD(bp))
- return xfs_bioerror_relse(bp);
- else
- return xfs_bioerror(bp);
- }
-
- xfs_buf_iorequest(bp);
- return 0;
-}
-
int
xfs_bwrite(
struct xfs_buf *bp)
@@ -1172,7 +1151,20 @@ xfs_bwrite(
bp->b_flags |= XBF_WRITE;
bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q | XBF_WRITE_FAIL);

- xfs_bdstrat_cb(bp);
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ trace_xfs_bdstrat_shut(bp, _RET_IP_);
+
+ /*
+ * Metadata write that didn't get logged but written anyway.
+ * These aren't associated with a transaction, and can be
+ * ignored.
+ */
+ if (!bp->b_iodone)
+ return xfs_bioerror_relse(bp);
+ return xfs_bioerror(bp);
+ }
+
+ xfs_buf_iorequest(bp);

error = xfs_buf_iowait(bp);
if (error) {
@@ -1852,7 +1844,20 @@ __xfs_buf_delwri_submit(
bp->b_flags |= XBF_ASYNC;
list_del_init(&bp->b_list);
}
- xfs_bdstrat_cb(bp);
+
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ trace_xfs_bdstrat_shut(bp, _RET_IP_);
+
+ /*
+ * Metadata write that didn't get logged but written anyway.
+ * These aren't associated with a transaction, and can be
+ * ignored.
+ */
+ if (!bp->b_iodone)
+ return xfs_bioerror_relse(bp);
+ return xfs_bioerror(bp);
+ }
+ xfs_buf_iorequest(bp);
}
blk_finish_plug(&plug);
--
2.0.0
Christoph Hellwig
2014-08-29 00:24:36 UTC
Permalink
Post by Dave Chinner
Only has two callers, and is just a shutdown check and error handler
around xfs_buf_iorequest. However, the error handling is a mess of
read and write semantics, and both internal callers only call it for
writes. Hence kill the wrapper, and follow up with a patch to
sanitise the error handling.
Looks good,

Reviewed-by: Christoph Hellwig <***@lst.de>
Dave Chinner
2014-08-15 06:39:06 UTC
Permalink
From: Dave Chinner <***@redhat.com>

There is a lot of cookie-cutter code that looks like:

if (shutdown)
handle buffer error
xfs_buf_iorequest(bp)
error = xfs_buf_iowait(bp)
if (error)
handle buffer error

spread through XFS. There's significant complexity now in
xfs_buf_iorequest() to specifically handle this sort of synchronous
IO pattern, but there's all sorts of nasty surprises in different
error handling code dependent on who owns the buffer references and
the locks.

Pull this pattern into a single helper, where we can hide all the
synchronous IO warts and hence make the error handling for all the
callers much saner. This removes the need for a special extra
reference to protect IO completion processing, as we can now hold a
single reference across dispatch and waiting, simplifying the sync
IO semantics and error handling.

In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
make it explicitly handle only asynchronous IO. This forces all users
to be switched specifically to one interface or the other and
removes any ambiguity about how the interfaces are to be used. It
also means that xfs_buf_iowait() goes away.

For the special case of delwri buffer submission and waiting, we
don't need to issue IO synchronously at all. The second pass to call
xfs_buf_iowait() can now just block on xfs_buf_lock() - the buffer
will be unlocked when the async IO is complete. This formalises a
sane method of waiting for async IO - take an extra reference,
submit the IO, call xfs_buf_lock() when you want to wait for IO
completion. i.e.:

bp = xfs_buf_find();
xfs_buf_hold(bp);
bp->b_flags |= XBF_ASYNC;
xfs_buf_submit(bp);
xfs_buf_lock(bp)
error = bp->b_error;
....
xfs_buf_relse(bp);
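
For comparison, the synchronous side of the new API collapses the old
shutdown-check/iorequest/iowait pattern into a single call. A sketch of
the resulting calling convention, distilled from the hunks below:

	error = xfs_buf_submit_wait(bp);
	if (error)
		xfs_buf_ioerror_alert(bp, __func__);
	xfs_buf_relse(bp);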

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_bmap_util.c | 14 +---
fs/xfs/xfs_buf.c | 186 ++++++++++++++++++++++++++---------------------
fs/xfs/xfs_buf.h | 4 +-
fs/xfs/xfs_buf_item.c | 4 +-
fs/xfs/xfs_log.c | 2 +-
fs/xfs/xfs_log_recover.c | 22 ++----
fs/xfs/xfs_trans_buf.c | 17 ++---
7 files changed, 122 insertions(+), 127 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 2f1e30d..c53cc03 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1157,12 +1157,7 @@ xfs_zero_remaining_bytes(
XFS_BUF_READ(bp);
XFS_BUF_SET_ADDR(bp, xfs_fsb_to_db(ip, imap.br_startblock));

- if (XFS_FORCED_SHUTDOWN(mp)) {
- error = -EIO;
- break;
- }
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
xfs_buf_ioerror_alert(bp,
"xfs_zero_remaining_bytes(read)");
@@ -1175,12 +1170,7 @@ xfs_zero_remaining_bytes(
XFS_BUF_UNREAD(bp);
XFS_BUF_WRITE(bp);

- if (XFS_FORCED_SHUTDOWN(mp)) {
- error = -EIO;
- break;
- }
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
xfs_buf_ioerror_alert(bp,
"xfs_zero_remaining_bytes(write)");
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 352e9219..a2599f9 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -623,10 +623,11 @@ _xfs_buf_read(
bp->b_flags &= ~(XBF_WRITE | XBF_ASYNC | XBF_READ_AHEAD);
bp->b_flags |= flags & (XBF_READ | XBF_ASYNC | XBF_READ_AHEAD);

- xfs_buf_iorequest(bp);
- if (flags & XBF_ASYNC)
+ if (flags & XBF_ASYNC) {
+ xfs_buf_submit(bp);
return 0;
- return xfs_buf_iowait(bp);
+ }
+ return xfs_buf_submit_wait(bp);
}

xfs_buf_t *
@@ -708,12 +709,7 @@ xfs_buf_read_uncached(
bp->b_flags |= XBF_READ;
bp->b_ops = ops;

- if (XFS_FORCED_SHUTDOWN(target->bt_mount)) {
- xfs_buf_relse(bp);
- return NULL;
- }
- xfs_buf_iorequest(bp);
- xfs_buf_iowait(bp);
+ xfs_buf_submit_wait(bp);
return bp;
}

@@ -1027,10 +1023,8 @@ xfs_buf_ioend(
(*(bp->b_iodone))(bp);
else if (bp->b_flags & XBF_ASYNC)
xfs_buf_relse(bp);
- else {
+ else
complete(&bp->b_iowait);
- xfs_buf_rele(bp);
- }
}

static void
@@ -1083,21 +1077,7 @@ xfs_bwrite(
bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q |
XBF_WRITE_FAIL | XBF_DONE);

- if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
- trace_xfs_bdstrat_shut(bp, _RET_IP_);
-
- xfs_buf_ioerror(bp, -EIO);
- xfs_buf_stale(bp);
-
- /* sync IO, xfs_buf_ioend is going to remove a ref here */
- xfs_buf_hold(bp);
- xfs_buf_ioend(bp);
- return -EIO;
- }
-
- xfs_buf_iorequest(bp);
-
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
xfs_force_shutdown(bp->b_target->bt_mount,
SHUTDOWN_META_IO_ERROR);
@@ -1206,7 +1186,7 @@ next_chunk:
} else {
/*
* This is guaranteed not to be the last io reference count
- * because the caller (xfs_buf_iorequest) holds a count itself.
+ * because the caller (xfs_buf_submit) holds a count itself.
*/
atomic_dec(&bp->b_io_remaining);
xfs_buf_ioerror(bp, -EIO);
@@ -1296,13 +1276,29 @@ _xfs_buf_ioapply(
blk_finish_plug(&plug);
}

+/*
+ * Asynchronous IO submission path. This transfers the buffer lock ownership and
+ * the current reference to the IO. It is not safe to reference the buffer after
+ * a call to this function unless the caller holds an additional reference
+ * itself.
+ */
void
-xfs_buf_iorequest(
- xfs_buf_t *bp)
+xfs_buf_submit(
+ struct xfs_buf *bp)
{
trace_xfs_buf_iorequest(bp, _RET_IP_);

ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
+ ASSERT(bp->b_flags & XBF_ASYNC);
+
+ /* on shutdown we stale and complete the buffer immediately */
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ xfs_buf_ioerror(bp, -EIO);
+ bp->b_flags &= ~XBF_DONE;
+ xfs_buf_stale(bp);
+ xfs_buf_ioend(bp);
+ return;
+ }

if (bp->b_flags & XBF_WRITE)
xfs_buf_wait_unpin(bp);
@@ -1311,25 +1307,10 @@ xfs_buf_iorequest(
bp->b_io_error = 0;

/*
- * Take references to the buffer. For XBF_ASYNC buffers, holding a
- * reference for as long as submission takes is all that is necessary
- * here. The IO inherits the lock and hold count from the submitter,
- * and these are released during IO completion processing. Taking a hold
- * over submission ensures that the buffer is not freed until we have
- * completed all processing, regardless of when IO errors occur or are
- * reported.
- *
- * However, for synchronous IO, the IO does not inherit the submitter's
- * reference count, nor the buffer lock. Hence we need to take an extra
- * reference to the buffer for the IO context so that we can
- * guarantee the buffer is not freed until all IO completion processing
- * is done. Otherwise the caller can drop their reference while the IO
- * is still in progress and hence trigger a use-after-free situation.
+ * Take an extra reference so that we don't have to guess when it's no
+ * longer safe to reference the buffer that was passed to us.
*/
xfs_buf_hold(bp);
- if (!(bp->b_flags & XBF_ASYNC))
- xfs_buf_hold(bp);
-

/*
* Set the count to 1 initially, this will stop an I/O completion
@@ -1340,14 +1321,13 @@ xfs_buf_iorequest(
_xfs_buf_ioapply(bp);

/*
- * If _xfs_buf_ioapply failed or we are doing synchronous IO that
- * completes extremely quickly, we can get back here with only the IO
+ * If _xfs_buf_ioapply failed, we can get back here with only the IO
* reference we took above. _xfs_buf_ioend will drop it to zero, so
* we'd better run completion processing synchronously so that we
- * don't return to the caller with completion still pending. In the
- * error case, this allows the caller to check b_error safely without
- * waiting, and in the synchronous IO case it avoids unnecessary context
- * switches and latency for high-performance devices.
+ * don't return to the caller with completion still pending. This
+ * allows the caller to check b_error safely without waiting.
*/
if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
@@ -1357,25 +1337,70 @@ xfs_buf_iorequest(
}

xfs_buf_rele(bp);
+ /* Note: it is not safe to reference bp now we've dropped our ref */
}

/*
- * Waits for I/O to complete on the buffer supplied. It returns immediately if
- * no I/O is pending or there is already a pending error on the buffer, in which
- * case nothing will ever complete. It returns the I/O error code, if any, or
- * 0 if there was no error.
+ * Synchronous buffer IO submission path, read or write.
*/
int
-xfs_buf_iowait(
- xfs_buf_t *bp)
+xfs_buf_submit_wait(
+ struct xfs_buf *bp)
{
- trace_xfs_buf_iowait(bp, _RET_IP_);
+ int error;
+
+ trace_xfs_buf_iorequest(bp, _RET_IP_);

- if (!bp->b_error)
- wait_for_completion(&bp->b_iowait);
+ ASSERT(!(bp->b_flags & (_XBF_DELWRI_Q | XBF_ASYNC)));
+
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+ bp->b_flags &= ~XBF_DONE;
+ return -EIO;
+ }

+ if (bp->b_flags & XBF_WRITE)
+ xfs_buf_wait_unpin(bp);
+
+ /* clear the internal error state to avoid spurious errors */
+ bp->b_io_error = 0;
+
+ /*
+ * For synchronous IO, the IO does not inherit the submitter's reference
+ * count, nor the buffer lock. Hence we cannot release the reference we
+ * are about to take until we've waited for all IO completion to occur,
+ * including any xfs_buf_ioend_async() work that may be pending.
+ */
+ xfs_buf_hold(bp);
+
+ /*
+ * Set the count to 1 initially, this will stop an I/O completion
+ * callout which happens before we have started all the I/O from calling
+ * xfs_buf_ioend too early.
+ */
+ atomic_set(&bp->b_io_remaining, 1);
+ _xfs_buf_ioapply(bp);
+
+ /*
+ * make sure we run completion synchronously if it raced with us and is
+ * already complete.
+ */
+ if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
+ xfs_buf_ioend(bp);
+
+ /* wait for completion before gathering the error from the buffer */
+ trace_xfs_buf_iowait(bp, _RET_IP_);
+ wait_for_completion(&bp->b_iowait);
trace_xfs_buf_iowait_done(bp, _RET_IP_);
- return bp->b_error;
+ error = bp->b_error;
+
+ /*
+ * all done now, we can release the hold that keeps the buffer
+ * referenced for the entire IO.
+ */
+ xfs_buf_rele(bp);
+ return error;
}

xfs_caddr_t
@@ -1769,29 +1794,19 @@ __xfs_buf_delwri_submit(
blk_start_plug(&plug);
list_for_each_entry_safe(bp, n, io_list, b_list) {
bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_ASYNC | XBF_WRITE_FAIL);
- bp->b_flags |= XBF_WRITE;
+ bp->b_flags |= XBF_WRITE | XBF_ASYNC;

- if (!wait) {
- bp->b_flags |= XBF_ASYNC;
+ /*
+ * We do all IO submission async. This means if we need to wait
+ * for IO completion we need to take an extra reference so the
+ * buffer is still valid on the other side.
+ */
+ if (!wait)
list_del_init(&bp->b_list);
- }
-
- if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
- trace_xfs_bdstrat_shut(bp, _RET_IP_);
-
- xfs_buf_ioerror(bp, -EIO);
- xfs_buf_stale(bp);
+ else
+ xfs_buf_hold(bp);

- /*
- * if sync IO, xfs_buf_ioend is going to remove a
- * ref here. We need to add that so we can collect
- * all the buffer errors in the wait loop.
- */
- if (wait)
- xfs_buf_hold(bp);
- xfs_buf_ioend(bp);
- } else
- xfs_buf_iorequest(bp);
+ xfs_buf_submit(bp);
}
blk_finish_plug(&plug);

@@ -1838,7 +1853,10 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);

list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+
+ /* locking the buffer will wait for async IO completion. */
+ xfs_buf_lock(bp);
+ error2 = bp->b_error;
xfs_buf_relse(bp);
if (!error)
error = error2;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index d8f57f6..0acfc30 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -290,8 +290,8 @@ extern int xfs_bwrite(struct xfs_buf *bp);
extern void xfs_buf_ioend(struct xfs_buf *bp);
extern void xfs_buf_ioerror(xfs_buf_t *, int);
extern void xfs_buf_ioerror_alert(struct xfs_buf *, const char *func);
-extern void xfs_buf_iorequest(xfs_buf_t *);
-extern int xfs_buf_iowait(xfs_buf_t *);
+extern void xfs_buf_submit(struct xfs_buf *bp);
+extern int xfs_buf_submit_wait(struct xfs_buf *bp);
extern void xfs_buf_iomove(xfs_buf_t *, size_t, size_t, void *,
xfs_buf_rw_t);
#define xfs_buf_zero(bp, off, len) \
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 4fd41b5..cbea909 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1081,7 +1081,7 @@ xfs_buf_iodone_callbacks(
* a way to shut the filesystem down if the writes keep failing.
*
* In practice we'll shut the filesystem down soon as non-transient
- * erorrs tend to affect the whole device and a failing log write
+ * errors tend to affect the whole device and a failing log write
* will make us give up. But we really ought to do better here.
*/
if (XFS_BUF_ISASYNC(bp)) {
@@ -1094,7 +1094,7 @@ xfs_buf_iodone_callbacks(
if (!(bp->b_flags & (XBF_STALE|XBF_WRITE_FAIL))) {
bp->b_flags |= XBF_WRITE | XBF_ASYNC |
XBF_DONE | XBF_WRITE_FAIL;
- xfs_buf_iorequest(bp);
+ xfs_buf_submit(bp);
} else {
xfs_buf_relse(bp);
}
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index e4665db..c4d7f79 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1699,7 +1699,7 @@ xlog_bdstrat(
return 0;
}

- xfs_buf_iorequest(bp);
+ xfs_buf_submit(bp);
return 0;
}

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 4ba19bf..1e14452 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -193,12 +193,8 @@ xlog_bread_noalign(
bp->b_io_length = nbblks;
bp->b_error = 0;

- if (XFS_FORCED_SHUTDOWN(log->l_mp))
- return -EIO;
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
- if (error)
+ error = xfs_buf_submit_wait(bp);
+ if (error && !XFS_FORCED_SHUTDOWN(log->l_mp))
xfs_buf_ioerror_alert(bp, __func__);
return error;
}
@@ -4427,16 +4423,12 @@ xlog_do_recover(
XFS_BUF_UNASYNC(bp);
bp->b_ops = &xfs_sb_buf_ops;

- if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
- xfs_buf_relse(bp);
- return -EIO;
- }
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
- xfs_buf_ioerror_alert(bp, __func__);
- ASSERT(0);
+ if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
+ xfs_buf_ioerror_alert(bp, __func__);
+ ASSERT(0);
+ }
xfs_buf_relse(bp);
return error;
}
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 9c3e610..4bdf02c 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -286,18 +286,13 @@ xfs_trans_read_buf_map(
bp->b_flags |= XBF_READ;
bp->b_ops = ops;

- if (XFS_FORCED_SHUTDOWN(mp)) {
- trace_xfs_bdstrat_shut(bp, _RET_IP_);
- error = -EIO;
- goto out_stale;
- }
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
- xfs_buf_ioerror_alert(bp, __func__);
- goto out_do_shutdown;
-
+ if (!XFS_FORCED_SHUTDOWN(mp)) {
+ xfs_buf_ioerror_alert(bp, __func__);
+ goto out_do_shutdown;
+ }
+ goto out_stale;
}
}
--
2.0.0
Christoph Hellwig
2014-08-15 13:10:20 UTC
Permalink
Post by Dave Chinner
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 2f1e30d..c53cc03 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1157,12 +1157,7 @@ xfs_zero_remaining_bytes(
xfs_zero_remaining_bytes really should be switched to
xfs_buf_read_uncached + xfs_bwrite instead of all this mess just to
micro-optimize a few memory allocations away that don't even happen for
the common case of blocksize == PAGE_SIZE. I'm not even sure we
should be using the buffer cache at all for something inside a regular
file, but that's a discussion for another time.
Post by Dave Chinner
void
+xfs_buf_submit(
+ struct xfs_buf *bp)
{
trace_xfs_buf_iorequest(bp, _RET_IP_);
I suspect these two should have properly named and separate trace
points now?

It also seems like a lot of the guts of the two functions are
still the same, so factoring that into a __xfs_buf_submit helper
would be useful.
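For what it's worth, a rough sketch of that helper built only from the
lines the two submission paths share (hypothetical, not part of the
posted series):

	/* hypothetical common guts for xfs_buf_submit[_wait]() */
	static void
	__xfs_buf_submit(
		struct xfs_buf	*bp)
	{
		if (bp->b_flags & XBF_WRITE)
			xfs_buf_wait_unpin(bp);

		/* clear the internal error state to avoid spurious errors */
		bp->b_io_error = 0;

		/* hold across submission; callers decide when to drop it */
		xfs_buf_hold(bp);

		/*
		 * Set the count to 1 initially to stop an I/O completion
		 * callout that happens before we have started all the I/O
		 * from calling xfs_buf_ioend too early.
		 */
		atomic_set(&bp->b_io_remaining, 1);
		_xfs_buf_ioapply(bp);
	}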
Post by Dave Chinner
+ * If _xfs_buf_ioapply failed, we can get back here with only the IO
* reference we took above. _xfs_buf_ioend will drop it to zero, so
* we'd better run completion processing synchronously so that we
+ * don't return to the caller with completion still pending. This
+ * allows the caller to check b_error safely without waiting.
*/
if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
I don't think the !ASYNC case can happen here, can it?
Post by Dave Chinner
+ if (!wait)
list_del_init(&bp->b_list);
+ else
+ xfs_buf_hold(bp);
Maybe switch this around to avoid the negated condition in the if else?

Also, might this be worth a change of its own?
Post by Dave Chinner
+++ b/fs/xfs/xfs_trans_buf.c
@@ -286,18 +286,13 @@ xfs_trans_read_buf_map(
bp->b_flags |= XBF_READ;
bp->b_ops = ops;
+ error = xfs_buf_submit_wait(bp);
if (error) {
+ if (!XFS_FORCED_SHUTDOWN(mp)) {
+ xfs_buf_ioerror_alert(bp, __func__);
+ goto out_do_shutdown;
+ }
+ goto out_stale;
Again maybe reverse the conditions here for a cleaner flow?
Dave Chinner
2014-08-15 23:37:43 UTC
Permalink
Post by Christoph Hellwig
Post by Dave Chinner
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 2f1e30d..c53cc03 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1157,12 +1157,7 @@ xfs_zero_remaining_bytes(
xfs_zero_remaining_bytes really should be switched to
xfs_buf_read_uncached + xfs_bwrite instead of all this mess just to
micro-optimize a few memory allocation away that don't even happen for
the common case of blocksize == PAGE_SIZE. I'm not even sure we
should be using the buffer cache at all for something inside a regular
file, but that's a discussion for another time.
xfs_zero_remaining_bytes uses an uncached buffer, so we're not
using the buffer cache at all for the blocks being zeroed. That is
why it does the flag twiddling dance it does. However, consolidating
all the different block zeroing functions we have is an exercise for
a different day....
Post by Christoph Hellwig
Post by Dave Chinner
void
+xfs_buf_submit(
+ struct xfs_buf *bp)
{
trace_xfs_buf_iorequest(bp, _RET_IP_);
I suspect these two should have properly name and separate trace
points now?
Yes. Just haven't got to it.
Post by Christoph Hellwig
It also seems like a lot of the guts of the two functions are
still the same, so factoring that into a __xfs_buf_submit helper
would be useful.
Possibly, if we can ensure that the helper is never called directly
by any other code. Otherwise we end up back in the mess we are
currently in.
Post by Christoph Hellwig
Post by Dave Chinner
+ * If _xfs_buf_ioapply failed, we can get back here with only the IO
* reference we took above. _xfs_buf_ioend will drop it to zero, so
* we'd better run completion processing synchronously so that we
+ * don't return to the caller with completion still pending. This
+ * allows the caller to check b_error safely without waiting.
*/
if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
I don't think the !ASYNC case can happen here, can it?
Right, I forgot to clean that part up when I was splitting up the
functions.
Post by Christoph Hellwig
Post by Dave Chinner
+ if (!wait)
list_del_init(&bp->b_list);
+ else
+ xfs_buf_hold(bp);
Maybe switch this around to avoid the negated condition in the if else?
Also might this be worth a change of it's own?
Yeah, that's probably a good idea - the algorithm change is not
directly related to the interface change.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Christoph Hellwig
2014-08-16 04:55:31 UTC
Permalink
Post by Dave Chinner
xfs_zero_remaining_bytes is uses an uncached buffer, so we're not
using the buffer cache at all for the blocks being zeroed. That is
why it does the flag twiddling dance it does. However, consolidation
all the different block zeroing functions we have is an exercise for
a different day....
Well, we're using buffers. Anyway, below is what I think it should
look like when using buffers. Although I wonder how either the old
or new variant passes the verifier check in _xfs_buf_ioapply for v5
filesystems given that we don't pass any ops in.

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 2f1e30d..c495dce 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1122,14 +1122,6 @@ xfs_zero_remaining_bytes(
if (endoff > XFS_ISIZE(ip))
endoff = XFS_ISIZE(ip);

- bp = xfs_buf_get_uncached(XFS_IS_REALTIME_INODE(ip) ?
- mp->m_rtdev_targp : mp->m_ddev_targp,
- BTOBB(mp->m_sb.sb_blocksize), 0);
- if (!bp)
- return -ENOMEM;
-
- xfs_buf_unlock(bp);
-
for (offset = startoff; offset <= endoff; offset = lastoffset + 1) {
uint lock_mode;

@@ -1152,42 +1144,26 @@ xfs_zero_remaining_bytes(
ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
if (imap.br_state == XFS_EXT_UNWRITTEN)
continue;
- XFS_BUF_UNDONE(bp);
- XFS_BUF_UNWRITE(bp);
- XFS_BUF_READ(bp);
- XFS_BUF_SET_ADDR(bp, xfs_fsb_to_db(ip, imap.br_startblock));

- if (XFS_FORCED_SHUTDOWN(mp)) {
- error = -EIO;
- break;
- }
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
- if (error) {
- xfs_buf_ioerror_alert(bp,
- "xfs_zero_remaining_bytes(read)");
- break;
+ bp = xfs_buf_read_uncached(XFS_IS_REALTIME_INODE(ip) ?
+ mp->m_rtdev_targp : mp->m_ddev_targp,
+ xfs_fsb_to_db(ip, imap.br_startblock),
+ BTOBB(mp->m_sb.sb_blocksize),
+ 0, NULL);
+ if (!bp)
+ return -ENOMEM;
+ if (bp->b_error) {
+ error = bp->b_error;
+ xfs_buf_relse(bp);
+ return error;
}
+
memset(bp->b_addr +
(offset - XFS_FSB_TO_B(mp, imap.br_startoff)),
0, lastoffset - offset + 1);
- XFS_BUF_UNDONE(bp);
- XFS_BUF_UNREAD(bp);
- XFS_BUF_WRITE(bp);

- if (XFS_FORCED_SHUTDOWN(mp)) {
- error = -EIO;
- break;
- }
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
- if (error) {
- xfs_buf_ioerror_alert(bp,
- "xfs_zero_remaining_bytes(write)");
- break;
- }
+		error = xfs_bwrite(bp);
+		xfs_buf_relse(bp);
+		if (error)
+			break;
}
- xfs_buf_free(bp);
return error;
}
Brian Foster
2014-08-15 14:35:58 UTC
Permalink
Post by Dave Chinner
if (shutdown)
handle buffer error
xfs_buf_iorequest(bp)
error = xfs_buf_iowait(bp)
if (error)
handle buffer error
spread through XFS. There's significant complexity now in
xfs_buf_iorequest() to specifically handle this sort of synchronous
IO pattern, but there's all sorts of nasty surprises in different
error handling code dependent on who owns the buffer references and
the locks.
Pull this pattern into a single helper, where we can hide all the
synchronous IO warts and hence make the error handling for all the
callers much saner. This removes the need for a special extra
reference to protect IO completion processing, as we can now hold a
single reference across dispatch and waiting, simplifying the sync
IO semantics and error handling.
In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
make it explicitly handle only asynchronous IO. This forces all users
to be switched specifically to one interface or the other and
removes any ambiguity about how the interfaces are to be used. It
also means that xfs_buf_iowait() goes away.
For the special case of delwri buffer submission and waiting, we
don't need to issue IO synchronously at all. The second pass to call
xfs_buf_iowait() can now just block on xfs_buf_lock() - the buffer
will be unlocked when the async IO is complete. This formalises a
sane method of waiting for async IO - take an extra reference,
submit the IO, call xfs_buf_lock() when you want to wait for IO
completion. i.e.:
bp = xfs_buf_find();
xfs_buf_hold(bp);
bp->b_flags |= XBF_ASYNC;
xfs_buf_submit(bp);
xfs_buf_lock(bp)
error = bp->b_error;
....
xfs_buf_relse(bp);
---
On a quick look at submit_wait this looks pretty good. It actually
implements the general model I've been looking for for sync I/O: send
the I/O, wait on synchronization, then check for errors. In other
words, a pure synchronous mechanism. The refactoring and new helpers
and whatnot are an additional bonus and abstract it nicely.

I still have to take a closer look to review the actual code, but since
we go and remove the additional sync I/O reference counting business,
why do we even add that stuff early on? Can't we get from where the
current code is to here in a more direct manner?

Brian
Post by Dave Chinner
fs/xfs/xfs_bmap_util.c | 14 +---
fs/xfs/xfs_buf.c | 186 ++++++++++++++++++++++++++---------------------
fs/xfs/xfs_buf.h | 4 +-
fs/xfs/xfs_buf_item.c | 4 +-
fs/xfs/xfs_log.c | 2 +-
fs/xfs/xfs_log_recover.c | 22 ++----
fs/xfs/xfs_trans_buf.c | 17 ++---
7 files changed, 122 insertions(+), 127 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 2f1e30d..c53cc03 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1157,12 +1157,7 @@ xfs_zero_remaining_bytes(
XFS_BUF_READ(bp);
XFS_BUF_SET_ADDR(bp, xfs_fsb_to_db(ip, imap.br_startblock));
- if (XFS_FORCED_SHUTDOWN(mp)) {
- error = -EIO;
- break;
- }
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
xfs_buf_ioerror_alert(bp,
"xfs_zero_remaining_bytes(read)");
@@ -1175,12 +1170,7 @@ xfs_zero_remaining_bytes(
XFS_BUF_UNREAD(bp);
XFS_BUF_WRITE(bp);
- if (XFS_FORCED_SHUTDOWN(mp)) {
- error = -EIO;
- break;
- }
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
xfs_buf_ioerror_alert(bp,
"xfs_zero_remaining_bytes(write)");
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 352e9219..a2599f9 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -623,10 +623,11 @@ _xfs_buf_read(
bp->b_flags &= ~(XBF_WRITE | XBF_ASYNC | XBF_READ_AHEAD);
bp->b_flags |= flags & (XBF_READ | XBF_ASYNC | XBF_READ_AHEAD);
- xfs_buf_iorequest(bp);
- if (flags & XBF_ASYNC)
+ if (flags & XBF_ASYNC) {
+ xfs_buf_submit(bp);
return 0;
- return xfs_buf_iowait(bp);
+ }
+ return xfs_buf_submit_wait(bp);
}
xfs_buf_t *
@@ -708,12 +709,7 @@ xfs_buf_read_uncached(
bp->b_flags |= XBF_READ;
bp->b_ops = ops;
- if (XFS_FORCED_SHUTDOWN(target->bt_mount)) {
- xfs_buf_relse(bp);
- return NULL;
- }
- xfs_buf_iorequest(bp);
- xfs_buf_iowait(bp);
+ xfs_buf_submit_wait(bp);
return bp;
}
@@ -1027,10 +1023,8 @@ xfs_buf_ioend(
(*(bp->b_iodone))(bp);
else if (bp->b_flags & XBF_ASYNC)
xfs_buf_relse(bp);
- else {
+ else
complete(&bp->b_iowait);
- xfs_buf_rele(bp);
- }
}
static void
@@ -1083,21 +1077,7 @@ xfs_bwrite(
bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q |
XBF_WRITE_FAIL | XBF_DONE);
- if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
- trace_xfs_bdstrat_shut(bp, _RET_IP_);
-
- xfs_buf_ioerror(bp, -EIO);
- xfs_buf_stale(bp);
-
- /* sync IO, xfs_buf_ioend is going to remove a ref here */
- xfs_buf_hold(bp);
- xfs_buf_ioend(bp);
- return -EIO;
- }
-
- xfs_buf_iorequest(bp);
-
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
xfs_force_shutdown(bp->b_target->bt_mount,
SHUTDOWN_META_IO_ERROR);
} else {
/*
* This is guaranteed not to be the last io reference count
- * because the caller (xfs_buf_iorequest) holds a count itself.
+ * because the caller (xfs_buf_submit) holds a count itself.
*/
atomic_dec(&bp->b_io_remaining);
xfs_buf_ioerror(bp, -EIO);
@@ -1296,13 +1276,29 @@ _xfs_buf_ioapply(
blk_finish_plug(&plug);
}
+/*
+ * Asynchronous IO submission path. This transfers the buffer lock ownership and
+ * the current reference to the IO. It is not safe to reference the buffer after
+ * a call to this function unless the caller holds an additional reference
+ * itself.
+ */
void
-xfs_buf_iorequest(
- xfs_buf_t *bp)
+xfs_buf_submit(
+ struct xfs_buf *bp)
{
trace_xfs_buf_iorequest(bp, _RET_IP_);
ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
+ ASSERT(bp->b_flags & XBF_ASYNC);
+
+ /* on shutdown we stale and complete the buffer immediately */
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ xfs_buf_ioerror(bp, -EIO);
+ bp->b_flags &= ~XBF_DONE;
+ xfs_buf_stale(bp);
+ xfs_buf_ioend(bp);
+ return;
+ }
if (bp->b_flags & XBF_WRITE)
xfs_buf_wait_unpin(bp);
@@ -1311,25 +1307,10 @@ xfs_buf_iorequest(
bp->b_io_error = 0;
/*
- * Take references to the buffer. For XBF_ASYNC buffers, holding a
- * reference for as long as submission takes is all that is necessary
- * here. The IO inherits the lock and hold count from the submitter,
- * and these are released during IO completion processing. Taking a hold
- * over submission ensures that the buffer is not freed until we have
- * completed all processing, regardless of when IO errors occur or are
- * reported.
- *
- * However, for synchronous IO, the IO does not inherit the submitter's
- * reference count, nor the buffer lock. Hence we need to take an extra
- * reference to the buffer for the IO context so that we can
- * guarantee the buffer is not freed until all IO completion processing
- * is done. Otherwise the caller can drop their reference while the IO
- * is still in progress and hence trigger a use-after-free situation.
+ * Take an extra reference so that we don't have to guess when it's no
+ * longer safe to reference the buffer that was passed to us.
*/
xfs_buf_hold(bp);
- if (!(bp->b_flags & XBF_ASYNC))
- xfs_buf_hold(bp);
-
/*
* Set the count to 1 initially, this will stop an I/O completion
@@ -1340,14 +1321,13 @@ xfs_buf_iorequest(
_xfs_buf_ioapply(bp);
/*
- * If _xfs_buf_ioapply failed or we are doing synchronous IO that
- * completes extremely quickly, we can get back here with only the IO
+ * If _xfs_buf_ioapply failed, we can get back here with only the IO
* reference we took above. _xfs_buf_ioend will drop it to zero, so
* we'd better run completion processing synchronously so that we
- * don't return to the caller with completion still pending. In the
- * error case, this allows the caller to check b_error safely without
- * waiting, and in the synchronous IO case it avoids unnecessary context
- * switches and latency for high-performance devices.
+ * don't return to the caller with completion still pending. This
+ * allows the caller to check b_error safely without waiting.
*/
if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
@@ -1357,25 +1337,70 @@ xfs_buf_iorequest(
}
xfs_buf_rele(bp);
+ /* Note: it is not safe to reference bp now we've dropped our ref */
}
/*
- * Waits for I/O to complete on the buffer supplied. It returns immediately if
- * no I/O is pending or there is already a pending error on the buffer, in which
- * case nothing will ever complete. It returns the I/O error code, if any, or
- * 0 if there was no error.
+ * Synchronous buffer IO submission path, read or write.
*/
int
-xfs_buf_iowait(
- xfs_buf_t *bp)
+xfs_buf_submit_wait(
+ struct xfs_buf *bp)
{
- trace_xfs_buf_iowait(bp, _RET_IP_);
+ int error;
+
+ trace_xfs_buf_iorequest(bp, _RET_IP_);
- if (!bp->b_error)
- wait_for_completion(&bp->b_iowait);
+ ASSERT(!(bp->b_flags & (_XBF_DELWRI_Q | XBF_ASYNC)));
+
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+ bp->b_flags &= ~XBF_DONE;
+ return -EIO;
+ }
+ if (bp->b_flags & XBF_WRITE)
+ xfs_buf_wait_unpin(bp);
+
+ /* clear the internal error state to avoid spurious errors */
+ bp->b_io_error = 0;
+
+ /*
+ * For synchronous IO, the IO does not inherit the submitter's reference
+ * count, nor the buffer lock. Hence we cannot release the reference we
+ * are about to take until we've waited for all IO completion to occur,
+ * including any xfs_buf_ioend_async() work that may be pending.
+ */
+ xfs_buf_hold(bp);
+
+ /*
+ * Set the count to 1 initially, this will stop an I/O completion
+ * callout which happens before we have started all the I/O from calling
+ * xfs_buf_ioend too early.
+ */
+ atomic_set(&bp->b_io_remaining, 1);
+ _xfs_buf_ioapply(bp);
+
+ /*
+ * make sure we run completion synchronously if it raced with us and is
+ * already complete.
+ */
+ if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
+ xfs_buf_ioend(bp);
+
+ /* wait for completion before gathering the error from the buffer */
+ trace_xfs_buf_iowait(bp, _RET_IP_);
+ wait_for_completion(&bp->b_iowait);
trace_xfs_buf_iowait_done(bp, _RET_IP_);
- return bp->b_error;
+ error = bp->b_error;
+
+ /*
+ * all done now, we can release the hold that keeps the buffer
+ * referenced for the entire IO.
+ */
+ xfs_buf_rele(bp);
+ return error;
}
xfs_caddr_t
@@ -1769,29 +1794,19 @@ __xfs_buf_delwri_submit(
blk_start_plug(&plug);
list_for_each_entry_safe(bp, n, io_list, b_list) {
bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_ASYNC | XBF_WRITE_FAIL);
- bp->b_flags |= XBF_WRITE;
+ bp->b_flags |= XBF_WRITE | XBF_ASYNC;
- if (!wait) {
- bp->b_flags |= XBF_ASYNC;
+ /*
+ * We do all IO submission async. This means if we need to wait
+ * for IO completion we need to take an extra reference so the
+ * buffer is still valid on the other side.
+ */
+ if (!wait)
list_del_init(&bp->b_list);
- }
-
- if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
- trace_xfs_bdstrat_shut(bp, _RET_IP_);
-
- xfs_buf_ioerror(bp, -EIO);
- xfs_buf_stale(bp);
+ else
+ xfs_buf_hold(bp);
- /*
- * if sync IO, xfs_buf_ioend is going to remove a
- * ref here. We need to add that so we can collect
- * all the buffer errors in the wait loop.
- */
- if (wait)
- xfs_buf_hold(bp);
- xfs_buf_ioend(bp);
- } else
- xfs_buf_iorequest(bp);
+ xfs_buf_submit(bp);
}
blk_finish_plug(&plug);
@@ -1838,7 +1853,10 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+
+ /* locking the buffer will wait for async IO completion. */
+ xfs_buf_lock(bp);
+ error2 = bp->b_error;
xfs_buf_relse(bp);
if (!error)
error = error2;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index d8f57f6..0acfc30 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -290,8 +290,8 @@ extern int xfs_bwrite(struct xfs_buf *bp);
extern void xfs_buf_ioend(struct xfs_buf *bp);
extern void xfs_buf_ioerror(xfs_buf_t *, int);
extern void xfs_buf_ioerror_alert(struct xfs_buf *, const char *func);
-extern void xfs_buf_iorequest(xfs_buf_t *);
-extern int xfs_buf_iowait(xfs_buf_t *);
+extern void xfs_buf_submit(struct xfs_buf *bp);
+extern int xfs_buf_submit_wait(struct xfs_buf *bp);
extern void xfs_buf_iomove(xfs_buf_t *, size_t, size_t, void *,
xfs_buf_rw_t);
#define xfs_buf_zero(bp, off, len) \
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 4fd41b5..cbea909 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1081,7 +1081,7 @@ xfs_buf_iodone_callbacks(
* a way to shut the filesystem down if the writes keep failing.
*
* In practice we'll shut the filesystem down soon as non-transient
- * erorrs tend to affect the whole device and a failing log write
+ * errors tend to affect the whole device and a failing log write
* will make us give up. But we really ought to do better here.
*/
if (XFS_BUF_ISASYNC(bp)) {
@@ -1094,7 +1094,7 @@ xfs_buf_iodone_callbacks(
if (!(bp->b_flags & (XBF_STALE|XBF_WRITE_FAIL))) {
bp->b_flags |= XBF_WRITE | XBF_ASYNC |
XBF_DONE | XBF_WRITE_FAIL;
- xfs_buf_iorequest(bp);
+ xfs_buf_submit(bp);
} else {
xfs_buf_relse(bp);
}
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index e4665db..c4d7f79 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1699,7 +1699,7 @@ xlog_bdstrat(
return 0;
}
- xfs_buf_iorequest(bp);
+ xfs_buf_submit(bp);
return 0;
}
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 4ba19bf..1e14452 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -193,12 +193,8 @@ xlog_bread_noalign(
bp->b_io_length = nbblks;
bp->b_error = 0;
- if (XFS_FORCED_SHUTDOWN(log->l_mp))
- return -EIO;
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
- if (error)
+ error = xfs_buf_submit_wait(bp);
+ if (error && !XFS_FORCED_SHUTDOWN(log->l_mp))
xfs_buf_ioerror_alert(bp, __func__);
return error;
}
@@ -4427,16 +4423,12 @@ xlog_do_recover(
XFS_BUF_UNASYNC(bp);
bp->b_ops = &xfs_sb_buf_ops;
- if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
- xfs_buf_relse(bp);
- return -EIO;
- }
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
- xfs_buf_ioerror_alert(bp, __func__);
- ASSERT(0);
+ if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
+ xfs_buf_ioerror_alert(bp, __func__);
+ ASSERT(0);
+ }
xfs_buf_relse(bp);
return error;
}
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 9c3e610..4bdf02c 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -286,18 +286,13 @@ xfs_trans_read_buf_map(
bp->b_flags |= XBF_READ;
bp->b_ops = ops;
- if (XFS_FORCED_SHUTDOWN(mp)) {
- trace_xfs_bdstrat_shut(bp, _RET_IP_);
- error = -EIO;
- goto out_stale;
- }
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
- xfs_buf_ioerror_alert(bp, __func__);
- goto out_do_shutdown;
-
+ if (!XFS_FORCED_SHUTDOWN(mp)) {
+ xfs_buf_ioerror_alert(bp, __func__);
+ goto out_do_shutdown;
+ }
+ goto out_stale;
}
}
--
2.0.0
Dave Chinner
2014-08-15 23:39:35 UTC
Permalink
Post by Brian Foster
Post by Dave Chinner
if (shutdown)
handle buffer error
xfs_buf_iorequest(bp)
error = xfs_buf_iowait(bp)
if (error)
handle buffer error
spread through XFS. There's significant complexity now in
xfs_buf_iorequest() to specifically handle this sort of synchronous
IO pattern, but there's all sorts of nasty surprises in different
error handling code dependent on who owns the buffer references and
the locks.
Pull this pattern into a single helper, where we can hide all the
synchronous IO warts and hence make the error handling for all the
callers much saner. This removes the need for a special extra
reference to protect IO completion processing, as we can now hold a
single reference across dispatch and waiting, simplifying the sync
IO semantics and error handling.
In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
make it explicitly handle only asynchronous IO. This forces all users
to be switched specifically to one interface or the other and
removes any ambiguity about how the interfaces are to be used. It
also means that xfs_buf_iowait() goes away.
For the special case of delwri buffer submission and waiting, we
don't need to issue IO synchronously at all. The second pass to call
xfs_buf_iowait() can now just block on xfs_buf_lock() - the buffer
will be unlocked when the async IO is complete. This formalises a
sane method of waiting for async IO - take an extra reference,
submit the IO, call xfs_buf_lock() when you want to wait for IO
completion. i.e.:
bp = xfs_buf_find();
xfs_buf_hold(bp);
bp->b_flags |= XBF_ASYNC;
xfs_buf_submit(bp);
xfs_buf_lock(bp)
error = bp->b_error;
....
xfs_buf_relse(bp);
---
On a quick look at submit_wait this looks pretty good. It actually
implements the general model I've been looking for for sync I/O. E.g.,
send the I/O, wait on synchronization, then check for errors. In other
words, a pure synchronous mechanism. The refactoring and new helpers and
whatnot are additional bonus and abstract it nicely.
I still have to take a closer look to review the actual code, but since
we go and remove the additional sync I/O reference counting business,
why do we even add that stuff early on? Can't we get from where the
current code is to here in a more direct manner?
Simply because we need a fix that we can backport, and that fix is a
simple addition that does not significantly affect the rest of the
patchset...

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-08-18 14:16:09 UTC
Permalink
Post by Dave Chinner
Post by Brian Foster
Post by Dave Chinner
if (shutdown)
handle buffer error
xfs_buf_iorequest(bp)
error = xfs_buf_iowait(bp)
if (error)
handle buffer error
spread through XFS. There's significant complexity now in
xfs_buf_iorequest() to specifically handle this sort of synchronous
IO pattern, but there's all sorts of nasty surprises in different
error handling code dependent on who owns the buffer references and
the locks.
Pull this pattern into a single helper, where we can hide all the
synchronous IO warts and hence make the error handling for all the
callers much saner. This removes the need for a special extra
reference to protect IO completion processing, as we can now hold a
single reference across dispatch and waiting, simplifying the sync
IO semantics and error handling.
In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
make it explicitly handle only asynchronous IO. This forces all users
to be switched specifically to one interface or the other and
removes any ambiguity about how the interfaces are to be used. It
also means that xfs_buf_iowait() goes away.
For the special case of delwri buffer submission and waiting, we
don't need to issue IO synchronously at all. The second pass to call
xfs_buf_iowait() can now just block on xfs_buf_lock() - the buffer
will be unlocked when the async IO is complete. This formalises a
sane method of waiting for async IO - take an extra reference,
submit the IO, call xfs_buf_lock() when you want to wait for IO
completion. i.e.:
bp = xfs_buf_find();
xfs_buf_hold(bp);
bp->b_flags |= XBF_ASYNC;
xfs_buf_submit(bp);
xfs_buf_lock(bp)
error = bp->b_error;
....
xfs_buf_relse(bp);
---
On a quick look at submit_wait this looks pretty good. It actually
implements the general model I've been looking for for sync I/O. E.g.,
send the I/O, wait on synchronization, then check for errors. In other
words, a pure synchronous mechanism. The refactoring and new helpers and
whatnot are additional bonus and abstract it nicely.
I still have to take a closer look to review the actual code, but since
we go and remove the additional sync I/O reference counting business,
why do we even add that stuff early on? Can't we get from where the
current code is to here in a more direct manner?
Simply because we need a fix that we can backport, and that fix is a
simple addition that does not significantly affect the rest of the
patchset...
Ok, it wasn't clear that multiple fixes were the intent. I guess it's
unfortunate this won't see much testing, but we can find other ways
to make sure it's tested. ;)

Brian
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Brian Foster
2014-08-15 16:13:20 UTC
Permalink
Post by Dave Chinner
if (shutdown)
handle buffer error
xfs_buf_iorequest(bp)
error = xfs_buf_iowait(bp)
if (error)
handle buffer error
spread through XFS. There's significant complexity now in
xfs_buf_iorequest() to specifically handle this sort of synchronous
IO pattern, but there's all sorts of nasty surprises in different
error handling code dependent on who owns the buffer references and
the locks.
Pull this pattern into a single helper, where we can hide all the
synchronous IO warts and hence make the error handling for all the
callers much saner. This removes the need for a special extra
reference to protect IO completion processing, as we can now hold a
single reference across dispatch and waiting, simplifying the sync
IO semantics and error handling.
In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
make it explicitly handle only asynchronous IO. This forces all users
to be switched specifically to one interface or the other and
removes any ambiguity between how the interfaces are to be used. It
also means that xfs_buf_iowait() goes away.
For the special case of delwri buffer submission and waiting, we
don't need to issue IO synchronously at all. The second pass to call
xfs_buf_iowait() can now just block on xfs_buf_lock() - the buffer
will be unlocked when the async IO is complete. This formalises a
sane method of waiting for async IO - take an extra reference,
submit the IO, and call xfs_buf_lock() when you want to wait for IO
completion:
bp = xfs_buf_find();
xfs_buf_hold(bp);
bp->b_flags |= XBF_ASYNC;
xfs_buf_iosubmit(bp);
xfs_buf_lock(bp);
error = bp->b_error;
....
xfs_buf_relse(bp);
---
fs/xfs/xfs_bmap_util.c | 14 +---
fs/xfs/xfs_buf.c | 186 ++++++++++++++++++++++++++---------------------
fs/xfs/xfs_buf.h | 4 +-
fs/xfs/xfs_buf_item.c | 4 +-
fs/xfs/xfs_log.c | 2 +-
fs/xfs/xfs_log_recover.c | 22 ++----
fs/xfs/xfs_trans_buf.c | 17 ++---
7 files changed, 122 insertions(+), 127 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 2f1e30d..c53cc03 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1157,12 +1157,7 @@ xfs_zero_remaining_bytes(
XFS_BUF_READ(bp);
XFS_BUF_SET_ADDR(bp, xfs_fsb_to_db(ip, imap.br_startblock));
- if (XFS_FORCED_SHUTDOWN(mp)) {
- error = -EIO;
- break;
- }
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
xfs_buf_ioerror_alert(bp,
"xfs_zero_remaining_bytes(read)");
@@ -1175,12 +1170,7 @@ xfs_zero_remaining_bytes(
XFS_BUF_UNREAD(bp);
XFS_BUF_WRITE(bp);
- if (XFS_FORCED_SHUTDOWN(mp)) {
- error = -EIO;
- break;
- }
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
xfs_buf_ioerror_alert(bp,
"xfs_zero_remaining_bytes(write)");
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 352e9219..a2599f9 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -623,10 +623,11 @@ _xfs_buf_read(
bp->b_flags &= ~(XBF_WRITE | XBF_ASYNC | XBF_READ_AHEAD);
bp->b_flags |= flags & (XBF_READ | XBF_ASYNC | XBF_READ_AHEAD);
- xfs_buf_iorequest(bp);
- if (flags & XBF_ASYNC)
+ if (flags & XBF_ASYNC) {
+ xfs_buf_submit(bp);
return 0;
- return xfs_buf_iowait(bp);
+ }
+ return xfs_buf_submit_wait(bp);
}
xfs_buf_t *
@@ -708,12 +709,7 @@ xfs_buf_read_uncached(
bp->b_flags |= XBF_READ;
bp->b_ops = ops;
- if (XFS_FORCED_SHUTDOWN(target->bt_mount)) {
- xfs_buf_relse(bp);
- return NULL;
- }
- xfs_buf_iorequest(bp);
- xfs_buf_iowait(bp);
+ xfs_buf_submit_wait(bp);
return bp;
}
@@ -1027,10 +1023,8 @@ xfs_buf_ioend(
(*(bp->b_iodone))(bp);
else if (bp->b_flags & XBF_ASYNC)
xfs_buf_relse(bp);
- else {
+ else
complete(&bp->b_iowait);
- xfs_buf_rele(bp);
- }
}
static void
@@ -1083,21 +1077,7 @@ xfs_bwrite(
bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q |
XBF_WRITE_FAIL | XBF_DONE);
- if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
- trace_xfs_bdstrat_shut(bp, _RET_IP_);
-
- xfs_buf_ioerror(bp, -EIO);
- xfs_buf_stale(bp);
-
- /* sync IO, xfs_buf_ioend is going to remove a ref here */
- xfs_buf_hold(bp);
- xfs_buf_ioend(bp);
- return -EIO;
- }
-
- xfs_buf_iorequest(bp);
-
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
xfs_force_shutdown(bp->b_target->bt_mount,
SHUTDOWN_META_IO_ERROR);
} else {
/*
* This is guaranteed not to be the last io reference count
- * because the caller (xfs_buf_iorequest) holds a count itself.
+ * because the caller (xfs_buf_submit) holds a count itself.
*/
atomic_dec(&bp->b_io_remaining);
xfs_buf_ioerror(bp, -EIO);
@@ -1296,13 +1276,29 @@ _xfs_buf_ioapply(
blk_finish_plug(&plug);
}
+/*
+ * Asynchronous IO submission path. This transfers the buffer lock ownership and
+ * the current reference to the IO. It is not safe to reference the buffer after
+ * a call to this function unless the caller holds an additional reference
+ * itself.
+ */
void
-xfs_buf_iorequest(
- xfs_buf_t *bp)
+xfs_buf_submit(
+ struct xfs_buf *bp)
{
trace_xfs_buf_iorequest(bp, _RET_IP_);
ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
+ ASSERT(bp->b_flags & XBF_ASYNC);
+
+ /* on shutdown we stale and complete the buffer immediately */
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ xfs_buf_ioerror(bp, -EIO);
+ bp->b_flags &= ~XBF_DONE;
+ xfs_buf_stale(bp);
+ xfs_buf_ioend(bp);
+ return;
+ }
if (bp->b_flags & XBF_WRITE)
xfs_buf_wait_unpin(bp);
@@ -1311,25 +1307,10 @@ xfs_buf_iorequest(
bp->b_io_error = 0;
I know this is from the previous patch, but I wonder if it's cleaner to
reset b_io_error when the error is transferred to b_error. That seems
slightly more future proof at least.
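Roughly, something like this - an illustrative sketch of that
suggestion only, not what the posted patch does:

	/* in xfs_buf_ioend(): transfer and consume the IO error together */
	if (!bp->b_error && bp->b_io_error) {
		xfs_buf_ioerror(bp, bp->b_io_error);
		bp->b_io_error = 0;	/* consumed; can't leak into a later IO */
	}

That would arguably make the explicit clearing in the submission path
redundant.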
Post by Dave Chinner
/*
- * Take references to the buffer. For XBF_ASYNC buffers, holding a
- * reference for as long as submission takes is all that is necessary
- * here. The IO inherits the lock and hold count from the submitter,
- * and these are release during IO completion processing. Taking a hold
- * over submission ensures that the buffer is not freed until we have
- * completed all processing, regardless of when IO errors occur or are
- * reported.
- *
- * However, for synchronous IO, the IO does not inherit the submitters
- * reference count, nor the buffer lock. Hence we need to take an extra
- * reference to the buffer for the for the IO context so that we can
- * guarantee the buffer is not freed until all IO completion processing
- * is done. Otherwise the caller can drop their reference while the IO
- * is still in progress and hence trigger a use-after-free situation.
+ * Take an extra reference so that we don't have to guess when it's no
+ * longer safe to reference the buffer that was passed to us.
*/
Assuming my understanding is correct:

/*
* The caller's reference is released by buffer I/O completion. Technically
* this should not occur until we release the last b_io_remaining reference.
* Take a direct reference across the I/O submission anyways to be sure it's
* safe to access the buffer until we release it.
*/
Post by Dave Chinner
xfs_buf_hold(bp);
- if (!(bp->b_flags & XBF_ASYNC))
- xfs_buf_hold(bp);
-
/*
* Set the count to 1 initially, this will stop an I/O completion
@@ -1340,14 +1321,13 @@ xfs_buf_iorequest(
_xfs_buf_ioapply(bp);
/*
- * If _xfs_buf_ioapply failed or we are doing synchronous IO that
- * completes extremely quickly, we can get back here with only the IO
+ * If _xfs_buf_ioapply failed,
+ * we can get back here with only the IO
* reference we took above. _xfs_buf_ioend will drop it to zero, so
* we'd better run completion processing synchronously so that the we
- * don't return to the caller with completion still pending. In the
- * error case, this allows the caller to check b_error safely without
- * waiting, and in the synchronous IO case it avoids unnecessary context
- * switches an latency for high-peformance devices.
+ * don't return to the caller with completion still pending.
+ * this allows the caller to check b_error safely without
+ * waiting
*/
if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
if (bp->b_error || !(bp->b_flags & XBF_ASYNC))
@@ -1357,25 +1337,70 @@ xfs_buf_iorequest(
}
xfs_buf_rele(bp);
+ /* Note: it is not safe to reference bp now we've dropped our ref */
}
/*
- * Waits for I/O to complete on the buffer supplied. It returns immediately if
- * no I/O is pending or there is already a pending error on the buffer, in which
- * case nothing will ever complete. It returns the I/O error code, if any, or
- * 0 if there was no error.
+ * Synchronous buffer IO submission path, read or write.
*/
int
-xfs_buf_iowait(
- xfs_buf_t *bp)
+xfs_buf_submit_wait(
+ struct xfs_buf *bp)
{
- trace_xfs_buf_iowait(bp, _RET_IP_);
+ int error;
+
+ trace_xfs_buf_iorequest(bp, _RET_IP_);
- if (!bp->b_error)
- wait_for_completion(&bp->b_iowait);
+ ASSERT(!(bp->b_flags & (_XBF_DELWRI_Q | XBF_ASYNC)));
+
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+ bp->b_flags &= ~XBF_DONE;
+ return -EIO;
+ }
+ if (bp->b_flags & XBF_WRITE)
+ xfs_buf_wait_unpin(bp);
+
+ /* clear the internal error state to avoid spurious errors */
+ bp->b_io_error = 0;
+
+ /*
+ * For synchronous IO, the IO does not inherit the submitters reference
+ * count, nor the buffer lock. Hence we cannot release the reference we
+ * are about to take until we've waited for all IO completion to occur,
+ * including any xfs_buf_ioend_async() work that may be pending.
+ */
+ xfs_buf_hold(bp);
+
Harmless, but why is this necessary? The caller should have the
reference, the I/O completion won't release it and we wait on b_iowait
before we return. Isn't the caller's reference sufficient?
Post by Dave Chinner
+ /*
+ * Set the count to 1 initially, this will stop an I/O completion
+ * callout which happens before we have started all the I/O from calling
+ * xfs_buf_ioend too early.
+ */
+ atomic_set(&bp->b_io_remaining, 1);
+ _xfs_buf_ioapply(bp);
+
+ /*
+ * make sure we run completion synchronously if it raced with us and is
+ * already complete.
+ */
+ if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
+ xfs_buf_ioend(bp);
+
+ /* wait for completion before gathering the error from the buffer */
+ trace_xfs_buf_iowait(bp, _RET_IP_);
+ wait_for_completion(&bp->b_iowait);
trace_xfs_buf_iowait_done(bp, _RET_IP_);
- return bp->b_error;
+ error = bp->b_error;
+
+ /*
+ * all done now, we can release the hold that keeps the buffer
+ * referenced for the entire IO.
+ */
+ xfs_buf_rele(bp);
+ return error;
}
xfs_caddr_t
@@ -1769,29 +1794,19 @@ __xfs_buf_delwri_submit(
blk_start_plug(&plug);
list_for_each_entry_safe(bp, n, io_list, b_list) {
bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_ASYNC | XBF_WRITE_FAIL);
- bp->b_flags |= XBF_WRITE;
+ bp->b_flags |= XBF_WRITE | XBF_ASYNC;
- if (!wait) {
- bp->b_flags |= XBF_ASYNC;
+ /*
+ * we do all Io submission async. This means if we need to wait
+ * for IO completion we need to take an extra reference so the
+ * buffer is still valid on the other side.
+ */
+ if (!wait)
list_del_init(&bp->b_list);
- }
-
- if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
- trace_xfs_bdstrat_shut(bp, _RET_IP_);
-
- xfs_buf_ioerror(bp, -EIO);
- xfs_buf_stale(bp);
+ else
+ xfs_buf_hold(bp);
- /*
- * if sync IO, xfs_buf_ioend is going to remove a
- * ref here. We need to add that so we can collect
- * all the buffer errors in the wait loop.
- */
- if (wait)
- xfs_buf_hold(bp);
- xfs_buf_ioend(bp);
- } else
- xfs_buf_iorequest(bp);
+ xfs_buf_submit(bp);
}
blk_finish_plug(&plug);
@@ -1838,7 +1853,10 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+
+ /* locking the buffer will wait for async IO completion. */
+ xfs_buf_lock(bp);
+ error2 = bp->b_error;
Interesting delwri cleanup overall. The lock here works for
synchronization (blocking), but it doesn't look safe for error
processing. Once the buffer lock is dropped, who says b_error is from
our write (and not from a subsequent one, for example) and thus this caller's
responsibility to handle the error? I suspect this is a reason the lock
is typically held across a sync I/O, so the error is valid after the
I/O.
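For example, a hypothetical interleaving where the error we read back
belongs to somebody else's I/O:

	delwri waiter                         another context
	-------------                         ---------------
	xfs_buf_submit(bp);    /* async */
	  ...IO completes, buffer unlocked...
	                                      xfs_buf_lock(bp);
	                                      /* resubmits; IO fails and
	                                       * sets b_error */
	                                      xfs_buf_unlock(bp);
	xfs_buf_lock(bp);
	error2 = bp->b_error;  /* not from our write */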

Also, it looks like blocking on async I/O as such opens up the
possibility of blocking indefinitely on failing writes if the right
combination of delwri and b_iodone handler is used (see
xfs_buf_iodone_callbacks()). I'm not sure if that's a real problem
today, though.
Post by Dave Chinner
xfs_buf_relse(bp);
if (!error)
error = error2;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index d8f57f6..0acfc30 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -290,8 +290,8 @@ extern int xfs_bwrite(struct xfs_buf *bp);
extern void xfs_buf_ioend(struct xfs_buf *bp);
extern void xfs_buf_ioerror(xfs_buf_t *, int);
extern void xfs_buf_ioerror_alert(struct xfs_buf *, const char *func);
-extern void xfs_buf_iorequest(xfs_buf_t *);
-extern int xfs_buf_iowait(xfs_buf_t *);
+extern void xfs_buf_submit(struct xfs_buf *bp);
+extern int xfs_buf_submit_wait(struct xfs_buf *bp);
extern void xfs_buf_iomove(xfs_buf_t *, size_t, size_t, void *,
xfs_buf_rw_t);
#define xfs_buf_zero(bp, off, len) \
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 4fd41b5..cbea909 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1081,7 +1081,7 @@ xfs_buf_iodone_callbacks(
* a way to shut the filesystem down if the writes keep failing.
*
* In practice we'll shut the filesystem down soon as non-transient
- * erorrs tend to affect the whole device and a failing log write
+ * errors tend to affect the whole device and a failing log write
* will make us give up. But we really ought to do better here.
*/
if (XFS_BUF_ISASYNC(bp)) {
@@ -1094,7 +1094,7 @@ xfs_buf_iodone_callbacks(
if (!(bp->b_flags & (XBF_STALE|XBF_WRITE_FAIL))) {
bp->b_flags |= XBF_WRITE | XBF_ASYNC |
XBF_DONE | XBF_WRITE_FAIL;
- xfs_buf_iorequest(bp);
+ xfs_buf_submit(bp);
} else {
xfs_buf_relse(bp);
}
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index e4665db..c4d7f79 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1699,7 +1699,7 @@ xlog_bdstrat(
return 0;
}
- xfs_buf_iorequest(bp);
+ xfs_buf_submit(bp);
return 0;
}
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 4ba19bf..1e14452 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -193,12 +193,8 @@ xlog_bread_noalign(
bp->b_io_length = nbblks;
bp->b_error = 0;
- if (XFS_FORCED_SHUTDOWN(log->l_mp))
- return -EIO;
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
- if (error)
+ error = xfs_buf_submit_wait(bp);
+ if (error && !XFS_FORCED_SHUTDOWN(log->l_mp))
xfs_buf_ioerror_alert(bp, __func__);
return error;
}
@@ -4427,16 +4423,12 @@ xlog_do_recover(
XFS_BUF_UNASYNC(bp);
bp->b_ops = &xfs_sb_buf_ops;
- if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
- xfs_buf_relse(bp);
- return -EIO;
- }
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
- xfs_buf_ioerror_alert(bp, __func__);
- ASSERT(0);
+ if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
Should this be !XFS_FORCED_SHUTDOWN()?

Brian
Post by Dave Chinner
+ xfs_buf_ioerror_alert(bp, __func__);
+ ASSERT(0);
+ }
xfs_buf_relse(bp);
return error;
}
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 9c3e610..4bdf02c 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -286,18 +286,13 @@ xfs_trans_read_buf_map(
bp->b_flags |= XBF_READ;
bp->b_ops = ops;
- if (XFS_FORCED_SHUTDOWN(mp)) {
- trace_xfs_bdstrat_shut(bp, _RET_IP_);
- error = -EIO;
- goto out_stale;
- }
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
- xfs_buf_ioerror_alert(bp, __func__);
- goto out_do_shutdown;
-
+ if (!XFS_FORCED_SHUTDOWN(mp)) {
+ xfs_buf_ioerror_alert(bp, __func__);
+ goto out_do_shutdown;
+ }
+ goto out_stale;
}
}
--
2.0.0
Dave Chinner
2014-08-15 23:58:04 UTC
Permalink
Post by Brian Foster
Post by Dave Chinner
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ xfs_buf_ioerror(bp, -EIO);
+ bp->b_flags &= ~XBF_DONE;
+ xfs_buf_stale(bp);
+ xfs_buf_ioend(bp);
+ return;
+ }
if (bp->b_flags & XBF_WRITE)
xfs_buf_wait_unpin(bp);
@@ -1311,25 +1307,10 @@ xfs_buf_iorequest(
bp->b_io_error = 0;
I know this is from the previous patch, but I wonder if it's cleaner to
reset b_io_error when the error is transferred to b_error. That seems
slightly more future proof at least.
I much prefer zeroing just before the variable needs to be zero,
simply to indicate the context in which we care that the value is
correct. Outside of actively submitted IO, the value of b_io_error
is irrelevant, so its value really doesn't matter. The other
advantage of leaving it untouched is for debug purposes - the caller
might clear b_error, but we still know what the state of the last IO
that was completed on the buffer was...
Post by Brian Foster
Post by Dave Chinner
/*
- * Take references to the buffer. For XBF_ASYNC buffers, holding a
- * reference for as long as submission takes is all that is necessary
- * here. The IO inherits the lock and hold count from the submitter,
- * and these are release during IO completion processing. Taking a hold
- * over submission ensures that the buffer is not freed until we have
- * completed all processing, regardless of when IO errors occur or are
- * reported.
- *
- * However, for synchronous IO, the IO does not inherit the submitters
- * reference count, nor the buffer lock. Hence we need to take an extra
- * reference to the buffer for the for the IO context so that we can
- * guarantee the buffer is not freed until all IO completion processing
- * is done. Otherwise the caller can drop their reference while the IO
- * is still in progress and hence trigger a use-after-free situation.
+ * Take an extra reference so that we don't have to guess when it's no
+ * longer safe to reference the buffer that was passed to us.
*/
/*
* The caller's reference is released by buffer I/O completion. Technically
* this should not occur until we release the last b_io_remaining reference.
That's not quite right. The caller's reference is released some time
*after* b_io_remaining goes to zero. That's the reason we need to
hold a reference - after we drop our b_io_remaining count, we have
to have some other method of ensuring the buffer doesn't go away
until we have finished with the buffer.
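For example (a hypothetical timing diagram of the async case):

	xfs_buf_submit()                      IO completion
	----------------                      -------------
	_xfs_buf_ioapply(bp);
	                                      bio completes,
	                                      xfs_buf_ioend(bp);
	                                      xfs_buf_relse(bp); /* drops the
	                                       * ref the IO inherited */
	atomic_dec(&bp->b_io_remaining);
	/* without our own xfs_buf_hold(), bp
	 * may already be freed here */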

I'll rewrite the comment.
Post by Brian Foster
Post by Dave Chinner
+xfs_buf_submit_wait(
+ struct xfs_buf *bp)
{
- trace_xfs_buf_iowait(bp, _RET_IP_);
+ int error;
+
+ trace_xfs_buf_iorequest(bp, _RET_IP_);
- if (!bp->b_error)
- wait_for_completion(&bp->b_iowait);
+ ASSERT(!(bp->b_flags & (_XBF_DELWRI_Q | XBF_ASYNC)));
+
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+ bp->b_flags &= ~XBF_DONE;
+ return -EIO;
+ }
+ if (bp->b_flags & XBF_WRITE)
+ xfs_buf_wait_unpin(bp);
+
+ /* clear the internal error state to avoid spurious errors */
+ bp->b_io_error = 0;
+
+ /*
+ * For synchronous IO, the IO does not inherit the submitters reference
+ * count, nor the buffer lock. Hence we cannot release the reference we
+ * are about to take until we've waited for all IO completion to occur,
+ * including any xfs_buf_ioend_async() work that may be pending.
+ */
+ xfs_buf_hold(bp);
+
Harmless, but why is this necessary? The caller should have the
reference, the I/O completion won't release it and we wait on b_iowait
before we return. Isn't the caller's reference sufficient?
Consistency - I'd prefer that all IO has the same reference counting
behaviour. i.e. that the IO holds its own reference to ensure,
regardless of anything else that happens, that the buffer has a
valid reference count the entire time the IO subsystem is processing
the buffer.
Post by Brian Foster
Post by Dave Chinner
@@ -1838,7 +1853,10 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+
+ /* locking the buffer will wait for async IO completion. */
+ xfs_buf_lock(bp);
+ error2 = bp->b_error;
Interesting delwri cleanup overall. The lock here works for
synchronization (blocking), but it doesn't look safe for error
processing. Once the buffer lock is dropped, who says b_error is from
our write (and not a subsequent, for example) and thus this caller's
responsibility to handle the error? I suspect this is a reason the lock
is typically held across a sync I/O, so the error is valid after the
I/O.
It's fine because there is only a limited set of blocking callers,
all of which are special cases. The only callers that use this
blocking xfs_buf_delwri_submit() interface are:

1. log recovery: running single threaded, so isn't going
to be racing with anything else that can modify the error

2. quotacheck: also running single threaded

3. quota shrinker, which has nowhere to report an error to,
so the error status simply doesn't matter.
Post by Brian Foster
Also, it looks like blocking on async I/O as such opens up the
possibility of blocking indefinitely on failing writes if the right
combination of delwri and b_iodone handler is used (see
xfs_buf_iodone_callbacks()). I'm not sure if that's a real problem
today, though.
I don't think it is - we don't permanently rewrite metadata writes
from IO completion anymore. We retry once, but then require higher
layers to restart the IO again (e.g. the xfsaild). Hence we can't
get stuck forever on async write errors.
Post by Brian Foster
Post by Dave Chinner
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 4ba19bf..1e14452 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -193,12 +193,8 @@ xlog_bread_noalign(
bp->b_io_length = nbblks;
bp->b_error = 0;
- if (XFS_FORCED_SHUTDOWN(log->l_mp))
- return -EIO;
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
- if (error)
+ error = xfs_buf_submit_wait(bp);
+ if (error && !XFS_FORCED_SHUTDOWN(log->l_mp))
xfs_buf_ioerror_alert(bp, __func__);
return error;
}
@@ -4427,16 +4423,12 @@ xlog_do_recover(
XFS_BUF_UNASYNC(bp);
bp->b_ops = &xfs_sb_buf_ops;
- if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
- xfs_buf_relse(bp);
- return -EIO;
- }
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
- xfs_buf_ioerror_alert(bp, __func__);
- ASSERT(0);
+ if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
Should this be !XFS_FORCED_SHUTDOWN()?
Right, good catch, especially after reading all that code ;)

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-08-18 14:26:45 UTC
Permalink
Post by Dave Chinner
Post by Brian Foster
Post by Dave Chinner
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ xfs_buf_ioerror(bp, -EIO);
+ bp->b_flags &= ~XBF_DONE;
+ xfs_buf_stale(bp);
+ xfs_buf_ioend(bp);
+ return;
+ }
if (bp->b_flags & XBF_WRITE)
xfs_buf_wait_unpin(bp);
@@ -1311,25 +1307,10 @@ xfs_buf_iorequest(
bp->b_io_error = 0;
I know this is from the previous patch, but I wonder if it's cleaner to
reset b_io_error when the error is transferred to b_error. That seems
slightly more future proof at least.
I much prefer zeroing just before the variable needs to be zero,
simply to indicate the context in which we care that the value is
correct. Outside of actively submitted IO, the value of b_io_error
is irrelevant, so its value really doesn't matter. The other
advantage of leaving it untouched is for debug purposes - the caller
might clear b_error, but we still know what the state of the last IO
that was completed on the buffer was...
Sounds good, I don't really have a strong preference.
Post by Dave Chinner
Post by Brian Foster
Post by Dave Chinner
/*
- * Take references to the buffer. For XBF_ASYNC buffers, holding a
- * reference for as long as submission takes is all that is necessary
- * here. The IO inherits the lock and hold count from the submitter,
- * and these are release during IO completion processing. Taking a hold
- * over submission ensures that the buffer is not freed until we have
- * completed all processing, regardless of when IO errors occur or are
- * reported.
- *
- * However, for synchronous IO, the IO does not inherit the submitters
- * reference count, nor the buffer lock. Hence we need to take an extra
- * reference to the buffer for the for the IO context so that we can
- * guarantee the buffer is not freed until all IO completion processing
- * is done. Otherwise the caller can drop their reference while the IO
- * is still in progress and hence trigger a use-after-free situation.
+ * Take an extra reference so that we don't have to guess when it's no
+ * longer safe to reference the buffer that was passed to us.
*/
/*
* The caller's reference is released by buffer I/O completion. Technically
* this should not occur until we release the last b_io_remaining reference.
That's not quite right. The caller's reference is released some time
*after* b_io_remaining goes to zero. That's the reason we need to
hold a reference - after we drop our b_io_remaining count, we have
to have some other method of ensuring the buffer doesn't go away
until we have finished with the buffer.
Right, that's what I meant by the fact that we have to release the last
b_io_remaining ref one way or another first...
Post by Dave Chinner
I'll rewrite the comment.
Ok, the rest of the comment I wrote isn't really clear when I re-read it
back either.
Post by Dave Chinner
Post by Brian Foster
Post by Dave Chinner
+xfs_buf_submit_wait(
+ struct xfs_buf *bp)
{
- trace_xfs_buf_iowait(bp, _RET_IP_);
+ int error;
+
+ trace_xfs_buf_iorequest(bp, _RET_IP_);
- if (!bp->b_error)
- wait_for_completion(&bp->b_iowait);
+ ASSERT(!(bp->b_flags & (_XBF_DELWRI_Q | XBF_ASYNC)));
+
+ if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
+ xfs_buf_ioerror(bp, -EIO);
+ xfs_buf_stale(bp);
+ bp->b_flags &= ~XBF_DONE;
+ return -EIO;
+ }
+ if (bp->b_flags & XBF_WRITE)
+ xfs_buf_wait_unpin(bp);
+
+ /* clear the internal error state to avoid spurious errors */
+ bp->b_io_error = 0;
+
+ /*
+ * For synchronous IO, the IO does not inherit the submitters reference
+ * count, nor the buffer lock. Hence we cannot release the reference we
+ * are about to take until we've waited for all IO completion to occur,
+ * including any xfs_buf_ioend_async() work that may be pending.
+ */
+ xfs_buf_hold(bp);
+
Harmless, but why is this necessary? The caller should have the
reference, the I/O completion won't release it and we wait on b_iowait
before we return. Isn't the caller's reference sufficient?
Consistency - I'd prefer that all IO has the same reference counting
behaviour. i.e. that the IO holds its own reference to ensure,
regardless of anything else that happens, that the buffer has a
valid reference count the entire time the IO subsystem is processing
the buffer.
Sounds good.
Post by Dave Chinner
Post by Brian Foster
Post by Dave Chinner
@@ -1838,7 +1853,10 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+
+ /* locking the buffer will wait for async IO completion. */
+ xfs_buf_lock(bp);
+ error2 = bp->b_error;
Interesting delwri cleanup overall. The lock here works for
synchronization (blocking), but it doesn't look safe for error
processing. Once the buffer lock is dropped, who says b_error is from
our write (and not a subsequent, for example) and thus this caller's
responsibility to handle the error? I suspect this is a reason the lock
is typically held across a sync I/O, so the error is valid after the
I/O.
It's fine because there is only a limited set of blocking callers,
all of which are special cases. The only callers that use this
1. log recovery: running single threaded, so isn't going
to be racing with anything else that can modify the error
2. quotacheck: also running single threaded
3. quota shrinker, which has nowhere to report an error to,
so the error status simply doesn't matter.
Sure, to be honest I wasn't really expecting there to be a scenario
where this is currently possible. I figured the log recovery context was
obviously not a concern because that occurs on mount. I hadn't gone to
look at the other contexts where we use delwri. quotacheck makes sense
for the same reason. We also use delwri for AIL pushing, but not the
synchronous version.

My comment was more from the perspective that this code will be around for
a long time and this little characteristic of the mechanism is not at
all explicit or obvious. So it won't ever be a problem until somebody
uses sync delwri in a context where racing is possible. Then it won't
ever reproduce until some user drives said mechanism and hits an error
in just the right manner to lead to some undefined error recovery
behavior. Maybe that doesn't happen for 20 years, but whenever it does,
it's guaranteed to not be fun to debug. ;) So my point is not to suggest
there's a race somewhere. It's just that the mechanism is racy and I'd
prefer we eliminate the possibility of having to debug the flaw that
this leaves behind. :) Another way to look at it is that we can't ever
use this delwri mechanism from anywhere but such special, isolated
contexts going forward. If we ever find that we want to, that punts the
problem to a prereq for whatever work that happens to be.

It's not clear to me if there's a way to deal with that without
fundamentally changing this mechanism. Anything explicit probably just
adds ugliness that's dedicated specifically to the fact that this is
racy (e.g., delwri specific lock/counter), so I don't think we want to
go there.

It seems like the more general problem is that we have two I/O models
(sync and async) that are fundamentally incompatible, but here we try to
build a batch sync I/O mechanism on top of the async submission
mechanism. So the definition of the async model is no longer sufficient
for our use, since we clearly have a use case to wait on an async I/O.
Perhaps we should think more about making the async/sync mechanisms more
generic/compatible..? Thinking out loud, making the lock context
dynamically transferrable to the I/O might be an interesting experiment
to start to decouple things (much like we do for joining items to
transactions, for example).
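As a purely hypothetical sketch - none of these helpers exist - lock
ownership could be handed to the I/O at submission and reclaimed by
the waiter:

	xfs_buf_lock(bp);
	xfs_buf_io_join(bp);		/* lock ownership moves to the I/O */
	xfs_buf_submit(bp);
	....
	error = xfs_buf_io_wait(bp);	/* returns with the lock owned by
					 * the caller again, so b_error is
					 * still from this I/O */
	xfs_buf_relse(bp);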

FWIW, I'm Ok with deferring that or adding it to my own todo list based
on the fact that there are currently no racing contexts, so long as the
"async is a subset of sync I/O" model is generally acceptable.
Post by Dave Chinner
Post by Brian Foster
Also, it looks like blocking on async I/O as such opens up the
possibility of blocking indefinitely on failing writes if the right
combination of delwri and b_iodone handler is used (see
xfs_buf_iodone_callbacks()). I'm not sure if that's a real problem
today, though.
I don't think it is - we don't permanently rewrite metadata writes
from IO completion anymore. We retry once, but then require higher
layers to restart the IO again (e.g. the xfsaild). Hence we can't
get stuck forever on async write errors.
Ok, we have the XBF_WRITE_FAIL thing now. Forgot about that.

Brian
Post by Dave Chinner
Post by Brian Foster
Post by Dave Chinner
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 4ba19bf..1e14452 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -193,12 +193,8 @@ xlog_bread_noalign(
bp->b_io_length = nbblks;
bp->b_error = 0;
- if (XFS_FORCED_SHUTDOWN(log->l_mp))
- return -EIO;
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
- if (error)
+ error = xfs_buf_submit_wait(bp);
+ if (error && !XFS_FORCED_SHUTDOWN(log->l_mp))
xfs_buf_ioerror_alert(bp, __func__);
return error;
}
@@ -4427,16 +4423,12 @@ xlog_do_recover(
XFS_BUF_UNASYNC(bp);
bp->b_ops = &xfs_sb_buf_ops;
- if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
- xfs_buf_relse(bp);
- return -EIO;
- }
-
- xfs_buf_iorequest(bp);
- error = xfs_buf_iowait(bp);
+ error = xfs_buf_submit_wait(bp);
if (error) {
- xfs_buf_ioerror_alert(bp, __func__);
- ASSERT(0);
+ if (XFS_FORCED_SHUTDOWN(log->l_mp)) {
Should this be !XFS_FORCED_SHUTDOWN()?
Right, good catch, especially after reading all that code ;)
Cheers,
Dave.
--
Dave Chinner
Brian Foster
2014-08-15 13:18:37 UTC
Permalink
Post by Dave Chinner
Currently the report of a bio error from completion
immediately marks the buffer with an error. The issue is that this
is racy w.r.t. synchronous IO - the submitter can see b_error being
set before the IO is complete, and hence we cannot differentiate
between submission failures and completion failures.
Add an internal b_io_error field protected by the b_lock to catch IO
completion errors, and only propagate that to the buffer during
final IO completion handling. Hence we can tell in xfs_buf_iorequest
if we've had a submission failure by checking bp->b_error before
dropping our b_io_remaining reference - that reference will prevent
b_io_error values from being propagated to b_error in the event that
completion races with submission.
In doing so, allow xfs_buf_iorequest to return an error. That way,
the caller can check for submission errors safely if required, and
easily distinguish them from completion errors that come from
xfs_buf_iowait().
I don't see any changes to xfs_buf_iorequest() at all in this one. That
aside, it looks fine. It's not how I was thinking about it, but a clear
separation of submission errors regardless.

Brian
Post by Dave Chinner
---
fs/xfs/xfs_buf.c | 18 ++++++++++++++++--
fs/xfs/xfs_buf.h | 1 +
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1b7f0bc..58ae34c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1008,6 +1008,13 @@ xfs_buf_ioend(
bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
+ /*
+ * Pull in IO completion errors now. We are guaranteed to be running
+ * single threaded, so we don't need the lock to read b_io_error.
+ */
+ if (!bp->b_error && bp->b_io_error)
+ xfs_buf_ioerror(bp, bp->b_io_error);
+
if (!bp->b_error) {
bp->b_flags |= XBF_DONE;
@@ -1186,8 +1193,12 @@ xfs_buf_bio_end_io(
* don't overwrite existing errors - otherwise we can lose errors on
* buffers that require multiple bios to complete.
*/
- if (!bp->b_error)
- xfs_buf_ioerror(bp, error);
+ if (error) {
+ spin_lock(&bp->b_lock);
+ if (!bp->b_io_error)
+ bp->b_io_error = error;
+ spin_unlock(&bp->b_lock);
+ }
if (!bp->b_error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
@@ -1373,6 +1384,9 @@ xfs_buf_iorequest(
if (bp->b_flags & XBF_WRITE)
xfs_buf_wait_unpin(bp);
+ /* clear the internal error state to avoid spurious errors */
+ bp->b_io_error = 0;
+
/*
* Take references to the buffer. For XBF_ASYNC buffers, holding a
* reference for as long as submission takes is all that is necessary
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 4585c15..44db8cd 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -158,6 +158,7 @@ typedef struct xfs_buf {
struct list_head b_lru; /* lru list */
spinlock_t b_lock; /* internal state lock */
unsigned int b_state; /* internal state flags */
+ int b_io_error; /* internal IO error state */
wait_queue_head_t b_waiters; /* unpin waiters */
struct list_head b_list;
struct xfs_perag *b_pag; /* contains rbtree root */
--
2.0.0
Dave Chinner
2014-08-15 23:25:59 UTC
Permalink
Post by Brian Foster
Post by Dave Chinner
Currently the report of a bio error from completion
immediately marks the buffer with an error. The issue is that this
is racy w.r.t. synchronous IO - the submitter can see b_error being
set before the IO is complete, and hence we cannot differentiate
between submission failures and completion failures.
Add an internal b_io_error field protected by the b_lock to catch IO
completion errors, and only propagate that to the buffer during
final IO completion handling. Hence we can tell in xfs_buf_iorequest
if we've had a submission failure by checking bp->b_error before
dropping our b_io_remaining reference - that reference will prevent
b_io_error values from being propagated to b_error in the event that
completion races with submission.
In doing so, allow xfs_buf_iorequest to return an error. That way,
the caller can check for submission errors safely if required, and
easily distinguish them from completion errors that come from
xfs_buf_iowait().
I don't see any changes to xfs_buf_iorequest() at all in this one. That
aside, it looks fine. It's not how I was thinking about it, but a clear
separation of submission error regardless.
Apart from adding a zeroing of b_io_error, no, it doesn't change.
What I meant to describe was the new capability this adds. I'll
rewrite the paragraph to say:

In doing so, xfs_buf_iorequest now has the capability to
distinguish between submission and completion errors, and hence if
we need to we can return submission errors directly and allow
callers to gather completion errors from xfs_buf_iowait().

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Christoph Hellwig
2014-08-29 00:23:56 UTC
Permalink
Post by Dave Chinner
In doing so, allow xfs_buf_iorequest to return an error. That way,
the caller can check for submission errors safely if required, and
easily distinguish them from completion errors that come from
xfs_buf_iowait().
While this looks correct to me it also looks confusing. Why can't
we simply propagate the submission errors through error returns
and use ->b_error for completion errors instead?
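I.e., a rough sketch of that split, assuming the submit path returned
its errors directly and never set ->b_error itself:

	error = xfs_buf_iorequest(bp);	/* submission errors only */
	if (error)
		return error;		/* ->b_error untouched */
	error = xfs_buf_iowait(bp);	/* completion errors via ->b_error */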