Brian Foster
2014-08-21 19:18:12 UTC
XFS log recovery builds up an xlog_recover object as it passes through
the log operations on the physical log. These structures are managed via
a hash table and are allocated when a new transaction is encountered and
freed once a commit operation for the transaction is encountered.
This state machine for active transactions is implemented by a
combination of xlog_do_recovery_pass(), which walks through the log
buffers, and xlog_recover_process_data(), which processes log operations
within each buffer. The latter function decides whether to allocate a
new xlog_recover, add to it, or commit and ultimately free it. If an
error occurs at any point during the lifecycle of a particular
xlog_recover, xlog_recover_process_data() frees the object and returns
an error.
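The lifecycle described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: struct demo_recover, demo_hash, demo_process_op() and the DEMO_* operation types are hypothetical names, and the hash table is a plain array of singly linked chains keyed by transaction id.

```c
#include <stdlib.h>

#define DEMO_HASH_SIZE 16

/* Hypothetical operation types; the real log op flags differ. */
enum demo_op { DEMO_START, DEMO_DATA, DEMO_COMMIT };

/* Hypothetical stand-in for struct xlog_recover. */
struct demo_recover {
	unsigned int tid;
	int nitems;
	struct demo_recover *next;
};

static struct demo_recover *demo_hash[DEMO_HASH_SIZE];
static int demo_live;	/* number of transactions currently tracked */

/* Return the link that points at tid's entry, or at the chain's NULL tail. */
static struct demo_recover **demo_find(unsigned int tid)
{
	struct demo_recover **pp = &demo_hash[tid % DEMO_HASH_SIZE];

	while (*pp && (*pp)->tid != tid)
		pp = &(*pp)->next;
	return pp;
}

/* Unlink and free the entry that *pp refers to. */
static void demo_drop(struct demo_recover **pp)
{
	struct demo_recover *trans = *pp;

	*pp = trans->next;
	free(trans);
	demo_live--;
}

/* Allocate on start, add on data, free on commit or on error. */
static int demo_process_op(unsigned int tid, enum demo_op op)
{
	struct demo_recover **pp = demo_find(tid);

	if (!*pp) {
		struct demo_recover *trans;

		if (op != DEMO_START)
			return -1;	/* operation for an unknown transaction */
		trans = calloc(1, sizeof(*trans));
		if (!trans)
			return -1;
		trans->tid = tid;
		*pp = trans;		/* append at the end of the chain */
		demo_live++;
		return 0;
	}

	switch (op) {
	case DEMO_DATA:
		(*pp)->nitems++;
		return 0;
	case DEMO_COMMIT:
		demo_drop(pp);		/* commit ends the lifecycle */
		return 0;
	default:
		demo_drop(pp);		/* error path: free and report */
		return -1;
	}
}
```

In this sketch a single function owns every free of a tracked object, which is the property the recovery code depends on.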
xlog_recover_commit_trans() handles the final processing of the
transaction. It submits whatever I/O is required for the transaction and
frees the xlog_recover object along with the transaction items it tracks.
If an error occurs at the final stages of the commit operation, such as
I/O failure, both xlog_recover_commit_trans() and
xlog_recover_process_data() attempt to free the trans object, resulting
in a double free.
Modify xlog_recover_commit_trans() to free the trans object only when
the commit completes successfully, that is, with no item processing
error and no I/O error during buffer submission. On error, the caller
frees the object.
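The resulting ownership rule can be modelled in a few lines of userspace C. This is a sketch only: struct demo_trans, demo_commit_trans() and demo_process_commit() are hypothetical stand-ins for the kernel structures and functions, and the two error arguments simply simulate the item processing result and the buffer submission result.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct xlog_recover; not the kernel type. */
struct demo_trans {
	int nitems;
};

static int demo_frees;	/* counts frees; a double free would show up as 2 */

static void demo_free_trans(struct demo_trans *trans)
{
	demo_frees++;
	free(trans);
}

/*
 * Models the patched commit path. "error" simulates the item processing
 * result and "error2" the result of submitting the delayed write buffers.
 */
static int demo_commit_trans(struct demo_trans *trans, int error, int error2)
{
	if (!error)
		error = error2;
	/* caller frees trans on error */
	if (!error)
		demo_free_trans(trans);
	return error;
}

/* Models the caller: it frees trans only when the commit reports an error. */
static int demo_process_commit(int error, int error2)
{
	struct demo_trans *trans = calloc(1, sizeof(*trans));
	int ret;

	if (!trans)
		return -1;
	ret = demo_commit_trans(trans, error, error2);
	if (ret)
		demo_free_trans(trans);
	return ret;
}
```

Whichever error is simulated, the object is freed exactly once: by the commit function on success and by the caller otherwise.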
Signed-off-by: Brian Foster <***@redhat.com>
---
Hi all,
I found that the recent buffer I/O rework fixes didn't address the crash
reproduced by the dm-flakey/log recovery test case I posted recently. I
tracked the crash down to this, which allows the test to pass. This
addresses the crash I saw when running the reproducer manually with the
metadump that Alex posted as well.
FWIW, I also went back and tested the xfs_buf_iowait() experiment in
both scenarios (Alex's metadump and the xfstests test) and both
reproduce the same crash for me. I think that either I'm still not
reproducing the original problem, something else might have contaminated
the original xfs_buf_iowait() test to give a false positive, or
something else entirely is going on.
Alex,
If you have a chance, I think it might be interesting to see whether you
reproduce any problems with this patch. It looks like this is a
regression introduced by:
2a84108f xfs: free the list of recovery items on error
... but I have no idea if that's in whatever kernel you're running.
Brian
fs/xfs/xfs_log_recover.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 176c4b3..daca9a6 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3528,10 +3528,15 @@ out:
 	if (!list_empty(&done_list))
 		list_splice_init(&done_list, &trans->r_itemq);

-	xlog_recover_free_trans(trans);
-
 	error2 = xfs_buf_delwri_submit(&buffer_list);
-	return error ? error : error2;
+
+	if (!error)
+		error = error2;
+	/* caller frees trans on error */
+	if (!error)
+		xlog_recover_free_trans(trans);
+
+	return error;
 }

 STATIC int
--
1.8.3.1