Discussion:
Questions about XFS discard and xfs_free_extent() code (newbie)
Alex Lyakas
2013-12-18 18:37:29 UTC
Permalink
Greetings XFS developers & community,

I am studying the XFS code, primarily focusing now on the free-space
allocation and deallocation parts.

I learned that freeing an extent happens like this:
- xfs_free_extent() calls xfs_free_ag_extent(), which attempts to merge the
freed extent with its left and right neighbours in the by-bno btree. Then the
by-size btree is updated accordingly.
- xfs_free_extent marks the original (un-merged) extent as "busy" by
xfs_extent_busy_insert(). This prevents this original extent from being
allocated. (Except that for metadata allocations such an extent, or part of
it, can be "unbusied" while it is not yet marked for discard with
XFS_EXTENT_BUSY_DISCARDED.)
- Once the appropriate part of the log is committed, xlog_cil_committed
calls xfs_discard_extents. This discards the extents using the synchronous
blkdev_issue_discard() API, and only then "unbusies" the extents. This makes
sense, because we cannot allow allocating these extents until the discard has
completed.

WRT this flow, I have some questions:

- xfs_free_extent first inserts the extent into the free-space btrees, and
only then marks it as busy. How come there is no race window here? Can
somebody allocate the freed extent before it is marked as busy? Or are the
free-space btrees somehow locked at this point? The code says "validate
the extent size is legal now we have the agf locked". I more or less see
that xfs_alloc_fix_freelist() locks *something*, but I don't see
xfs_free_extent() unlocking anything.

- If xfs_extent_busy_insert() fails to allocate an xfs_extent_busy structure,
such an extent cannot be discarded, correct?

- xfs_discard_extents() doesn't check the discard granularity of the
underlying block device, like xfs_ioc_trim() does. So it may send a small
discard request, which cannot be handled. Had it checked the
granularity, it could have avoided sending small requests. But the thing is
that the busy extent might have been merged in the free-space btree into a
larger extent, which is now suitable for discard.

I want to attempt the following logic in xfs_discard_extents():
# search the "by-bno" free-space btree for a larger extent that fully
encapsulates the busy extent (which we want to discard)
# if found, check whether some other part of the larger extent is still busy
(except for the current busy extent we want to discard)
# if not, send the discard for the larger extent
Does this make sense? And I think that we need to hold the larger extent
locked somehow until the discard completes, to prevent allocation from the
discarded range.

Can anybody please comment on these questions?

Thanks!
Alex.
Dave Chinner
2013-12-18 23:06:15 UTC
Permalink
Post by Alex Lyakas
Greetings XFS developers & community,
I am studying the XFS code, primarily focusing now at the free-space
allocation and deallocation parts.
- xfs_free_extent() calls xfs_free_ag_extent(), which attempts to merge the
freed extents from left and from right in the by-bno btree. Then the by-size
btree is updated accordingly.
- xfs_free_extent marks the original (un-merged) extent as "busy" by
xfs_extent_busy_insert(). This prevents this original extent from being
allocated. (Except that for metadata allocations such extent or part of it
can be "unbusied", while it is still not marked for discard with
XFS_EXTENT_BUSY_DISCARDED).
- Once the appropriate part of the log is committed, xlog_cil_committed
calls xfs_discard_extents. This discards the extents using the synchronous
blkdev_issue_discard() API, and only then "unbusies" the extents. This makes
sense, because we cannot allow allocating these extents until discarding
completed.
- xfs_free_extent first inserts the extent into the free-space btrees, and
only then marks it as busy. How come there is no race window here?
Because the AGF is locked exclusively at this point, meaning only
one process can be modifying the free space tree at this point in
time.
Post by Alex Lyakas
Can
somebody allocate the freed extent before it is marked as busy? Or the
free-space btrees somehow are locked at this point? The code says "validate
the extent size is legal now we have the agf locked". I more or less see
that xfs_alloc_fix_freelist() locks *something*, but I don't see
xfs_free_extent() unlocking anything.
The AGF remains locked until the transaction is committed. The
transaction commit code unlocks items modified in the transaction
via the ->iop_unlock log item callback....
Post by Alex Lyakas
- If xfs_extent_busy_insert() fails to alloc a xfs_extent_busy structure,
such extent cannot be discarded, correct?
Correct.
Post by Alex Lyakas
- xfs_discard_extents() doesn't check the discard granularity of the
underlying block device, like xfs_ioc_trim() does. So it may send a small
discard request, which cannot be handled.
Discard is an "advisory" operation - it is never guaranteed to do
anything.
Post by Alex Lyakas
If it would have checked the
granularity, it could have avoided sending small requests. But the thing is
that the busy extent might have been merged in the free-space btree into a
larger extent, which is now suitable for discard.
Sure, but the busy extent tree tracks extents across multiple
transaction contexts, and we cannot merge extents that are in
different contexts.
Post by Alex Lyakas
# search the "by-bno" free-space btree for a larger extent that fully
encapsulates the busy extent (which we want to discard)
# if found, check whether some other part of the larger extent is still busy
(except for the current busy extent we want to discard)
# if no, send discard for the larger extent
Does this make sense? And I think that we need to hold the larger
extent locked somehow until the
discard completes, to prevent allocation from the discarded range.
You can't search the freespace btrees in log IO completion context -
that will cause deadlocks because we can be holding the locks
searching the freespace trees when we issue a log force and block
waiting for log IO completion to occur. e.g. in
xfs_extent_busy_reuse()....

Also, walking the free space btrees can be an IO bound operation,
overhead/latency we absolutely do not want to add to log IO completion.

Further, walking the free space btrees can be a memory intensive
operation (buffers are demand paged from disk) and log IO completion
may be necessary for memory reclaim to make progress in low memory
situations. So adding unbound memory demand to log IO completion
will cause low memory deadlocks, too.

IOWs, adding freespace tree processing to xfs_discard_extents() just
won't work.

What we really need is a smarter block layer implementation of the
discard operation - it needs to be asynchronous, and it needs to
support merging of adjacent discard requests. Now that SATA 3.1
devices are appearing on the market, queued trim operations are now
possible. Dispatching discard operations as synchronous operations
prevents us from taking advantage of these capabilities. Further,
because it's synchronous, the block layer can't merge adjacent
discards, nor batch multiple discard ranges up into a single TRIM
command.

IOWs, what we really need is for the block layer discard code to be
brought up to the capabilities of the hardware on the market first.
Then we will be in a position to be able to optimise the XFS code to
use async dispatch and new IO completion handlers to finish the log
IO completion processing, and at that point we shouldn't need to
care anymore. Note that XFS already dispatches discards in ascending
block order, so if we issue adjacent discards the block layer will
be able to merge them appropriately. Hence we don't need to add that
complexity to XFS....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Alex Lyakas
2013-12-19 09:24:15 UTC
Permalink
Hi Dave,
Thank you for your comments.
I realize now that what I proposed cannot be done; I need to understand more
deeply how XFS transactions work (unfortunately, the awesome "XFS Filesystem
Structure" doc has a TODO in the "Journaling Log" section).

Can you please comment on one more question:
Let's say we had such fully asynchronous "fire-and-forget" discard operation
(I can implement one myself for my block-device via a custom IOCTL). What is
wrong if we trigger such operation in xfs_free_ag_extent(), right after we
have merged the freed extent into a bigger one? I understand that the
extent-free-intent is not yet committed to the log at this point. But from
the user's point of view, the extent has been deleted, no? So if the
underlying block device discards the merged extent right away, before
committing to the log, what issues can this cause?

Thanks,
Alex.


Dave Chinner
2013-12-19 10:55:13 UTC
Permalink
Post by Alex Lyakas
Hi Dave,
Thank you for your comments.
I realize now that what I proposed cannot be done; I need to
understand deeper how XFS transactions work (unfortunately, the
awesome "XFS Filesystem Structure" doc has a TODO in the "Journaling
Log" section).
Let's say we had such fully asynchronous "fire-and-forget" discard
operation (I can implement one myself for my block-device via a
custom IOCTL). What is wrong if we trigger such operation in
xfs_free_ag_extent(), right after we have merged the freed extent
into a bigger one? I understand that the extent-free-intent is not
yet committed to the log at this point. But from the user's point of
view, the extent has been deleted, no? So if the underlying block
device discards the merged extent right away, before committing to
the log, what issues this can cause?
Think of what happens when a crash occurs immediately after the
discard completes. The freeing of the extent never made it to the
log, so after recovery, the file still exists and the user can
access it. Except that its contents are now all different from
before the crash occurred.

IOWs, issuing the discard before the transaction that frees the
extent is on stable storage means we are discarding user data or
metadata before we've guaranteed that the extent free transaction
is permanent and that means we violate certain guarantees with
respect to crash recovery...

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Alex Lyakas
2013-12-19 19:24:40 UTC
Permalink
Hi Dave,
It makes sense. I agree it might break some guarantees. Although if the user
deleted some blocks in the file, or the whole file, maybe it's ok not to have
a clear promise about what he sees after the crash. But I agree, the
semantics are not clean.

Thanks for the comments,
Alex.


Chris Murphy
2013-12-21 17:03:47 UTC
Permalink
Post by Alex Lyakas
Hi Dave,
It makes sense. I agree it might break some guarantees. Although if the user deleted some blocks in the file or the whole file, maybe it's ok to not have a clear promise what he sees after the crash.
User perspective: I disagree. Sounds like a possible zombie file invasion, with no clear way of reversion. The file either needs to be gone, as in not accessible in user space, or it needs to be present and intact. There isn't a reasonable expectation for a file to be resurrected from the dead that's also corrupted.

If the file name isn't also corrupted, the problem is worse. It looks like a legitimate file, yet it's useless. The zombie files will be subject to backup and restore, just like their valid predecessors. All I need is to stumble upon a handful of these files, which I won't necessarily remember were deleted files, to start assuming I have some sort of weird file system corruption, at which point at best I'll become really confused, not knowing what to do next. At worst, I may start throwing hammers that end up causing worse problems.


Chris Murphy
Alex Lyakas
2013-12-24 18:21:50 UTC
Permalink
Hi Dave,
Reading through the code some more, I see that the extent that is freed
through xfs_free_extent() can be an XFS metadata extent as well.
For example, xfs_inobt_free_block() frees a block of the AG's free-inode
btree. Also, xfs_bmbt_free_block() frees a generic btree block by putting it
onto the cursor's "to-be-freed" list, which will be dropped into the
free-space btree (by xfs_free_extent) in xfs_bmap_finish(). If we discard
such metadata block before the transaction is committed to the log and we
crash, we might not be able to properly mount after reboot, is that right? I
mean it's not that some file's data block will show 0s to the user instead
of before-delete data, but some XFS btree node (for example) will be wiped
in such case. Can this happen?

Thanks,
Alex.


Dave Chinner
2013-12-26 23:00:18 UTC
Permalink
Post by Alex Lyakas
Hi Dave,
Reading through the code some more, I see that the extent that is
freed through xfs_free_extent() can be an XFS metadata extent as
well.
For example, xfs_inobt_free_block() frees a block of the AG's
free-inode btree. Also, xfs_bmbt_free_block() frees a generic btree
block by putting it onto the cursor's "to-be-freed" list, which will
be dropped into the free-space btree (by xfs_free_extent) in
xfs_bmap_finish(). If we discard such metadata block before the
transaction is committed to the log and we crash, we might not be
able to properly mount after reboot, is that right?
Yes. The log stores a delta of the transactional changes, and so
requires the previous version of the block to be intact for recovery
to take place.
Post by Alex Lyakas
I mean it's not
that some file's data block will show 0s to the user instead of
before-delete data, but some XFS btree node (for example) will be
wiped in such case. Can this happen?
Yes, it could. That's what I meant by:

[snip]
Post by Alex Lyakas
Post by Dave Chinner
IOWs, issuing the discard before the transaction that frees the
extent is on stable storage means we are discarding user data or
^^
Post by Alex Lyakas
Post by Dave Chinner
metadata before we've guaranteed that the extent free transaction
^^^^^^^^
Post by Alex Lyakas
Post by Dave Chinner
is permanent and that means we violate certain guarantees with
respect to crash recovery...
The "or metadata" part of the above sentence.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Alex Lyakas
2014-07-01 15:06:38 UTC
Permalink
Greetings,

It appears that if xfs_growfs_data_private fails during the "new AG headers"
loop, it does not free all the per-AG structures for the new AGs. When XFS
is unmounted later, they are not freed either, because
xfs_growfs_data_private did not update the "sb_agcount" field, so
xfs_free_perag will not free them. This happens on 3.8.13, but looking at
the latest master branch, it seems to have the same issue.

Code like [1] in xfs_growfs_data seems to fix the issue.

A follow-up question: if xfs_growfs_data_private fails during the loop that
updates all the secondary superblocks, what is the consequence? (I am aware
that in the latest master branch, the loop is not broken on the first error,
but attempts to initialize whatever possible.) When will these secondary
superblocks get updated? Is there a way to force-update them? Otherwise, what
can be the consequence of leaving them not updated?

Thanks,
Alex.

[1]
/*
 * If we had an error, we might have allocated
 * PAGs, which are >= sb_agcount. We need to free
 * those, because they will not get freed in
 * xfs_free_perag().
 */
if (error) {
	unsigned int	n_pags = 0;
	xfs_perag_t	*pags[16] = {0};
	xfs_agnumber_t	start_agno = mp->m_sb.sb_agcount;

	do {
		unsigned int	pag_idx = 0;

		spin_lock(&mp->m_perag_lock);
		n_pags = radix_tree_gang_lookup(&mp->m_perag_tree,
				(void **)pags, start_agno, ARRAY_SIZE(pags));
		for (pag_idx = 0; pag_idx < n_pags; ++pag_idx) {
			xfs_perag_t	*deleted = NULL;

			/* for next lookup */
			start_agno = pags[pag_idx]->pag_agno + 1;

			/* nobody should really be touching these AGs... */
			if (WARN_ON(atomic_read(&pags[pag_idx]->pag_ref) > 0)) {
				pags[pag_idx] = NULL;
				continue;
			}

			deleted = radix_tree_delete(&mp->m_perag_tree,
					pags[pag_idx]->pag_agno);
			ASSERT(deleted == pags[pag_idx]);
		}
		spin_unlock(&mp->m_perag_lock);

		/* now delete all those still marked for deletion */
		for (pag_idx = 0; pag_idx < n_pags; ++pag_idx) {
			if (pags[pag_idx])
				call_rcu(&pags[pag_idx]->rcu_head,
					 xfs_free_perag_rcu_cb);
		}
	} while (n_pags > 0);
}

xfs_free_perag_rcu_cb is similar to __xfs_free_perag, but can be called from
other files.
Dave Chinner
2014-07-01 21:56:26 UTC
Permalink
Post by Alex Lyakas
Greetings,
It appears that if xfs_growfs_data_private fails during the "new AG
headers" loop, it does not free all the per-AG structures for the
new AGs. When XFS is unmounted later, they are not freed as well,
because xfs_growfs_data_private did not update the "sb_agcount"
field, so xfs_free_perag will not free them. This happens on 3.8.13,
but looking at the latest master branch, it seems to have the same
issue.
Code like [1] in xfs_growfs_data, seems to fix the issue.
Why not just do this in the appropriate error stack, like is
done inside xfs_initialize_perag() on error?

	for (i = oagcount; i < nagcount; i++) {
		pag = radix_tree_delete(&mp->m_perag_tree, i);
		kmem_free(pag);
	}

(though it might need RCU freeing)

When you have a fix, can you send a proper patch with a sign-off on
it?
Post by Alex Lyakas
A follow-up question: if xfs_growfs_data_private fails during the
loop that updates all the secondary superblocks, what is the
consequence? (I am aware that in the latest master branch, the loop
is not broken on the first error, but attempts to initialize whatever
possible.) When will these secondary superblocks get updated? Is
there a way to force-update them? Otherwise, what can be the
consequence of leaving them not updated?
The consequence is documented in the mainline tree - if we don't update
them all, then repair will do the wrong thing. Repair requires a
majority of identical secondaries to determine if the primary is
correct or out of date. The old behaviour of not updating after the
first error meant that the majority were old superblocks and so at
some time in the future repair could decide your filesystem is
smaller than it really is and hence truncate away the grown section
of the filesystem. i.e. trigger catastrophic, unrecoverable data
loss.

Hence it's far better to write every secondary we can than to leave
a majority in a bad state....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Alex Lyakas
2014-07-02 12:27:25 UTC
Permalink
Hi Dave,
Thank you for your comments.

I realize that secondary superblocks are needed mostly for repairing a
broken filesystem. However, I don't see that they get updated regularly,
i.e., during normal operation they don't seem to get updated at all. I put a
print in xfs_sb_write_verify, and it gets called only with:
bp->b_bn==XFS_SB_DADDR.

So do I understand correctly (also from comments in
xfs_growfs_data_private), that it is safe to operate a filesystem while
having broken secondary superblocks? For me, it appears to mount properly,
and all the data seems to be there, but xfs_check complains like:
bad sb magic # 0xc2a4baf2 in ag 6144
bad sb version # 0x4b5d in ag 6144
blocks 6144/65536..2192631388 out of range
blocks 6144/65536..2192631388 claimed by block 6144/0
bad sb magic # 0xb20f3079 in ag 6145
bad sb version # 0x6505 in ag 6145
blocks 6145/65536..3530010017 out of range
blocks 6145/65536..3530010017 claimed by block 6145/0
...

Also, if secondary superblocks do not get updated regularly, and there is no
way to ask an operational XFS to update them, then during repair we may not
find a good secondary superblock.

As for the patch, I cannot post a patch against the upstream kernel, because
I am running an older kernel. Unfortunately, I cannot qualify an upstream
patch properly in a reasonable time. Is there a value in posting a patch
against 3.8.13? Otherwise, it's fine by me if somebody else posts it and
takes the credit.

Thanks,
Alex.



Eric Sandeen
2014-08-04 18:15:00 UTC
Permalink
Post by Alex Lyakas
Hi Dave,
Thank you for your comments.
I realize that secondary superblocks are needed mostly for repairing
s/mostly/only/
Post by Alex Lyakas
a broken filesystem. However, I don't see that they get updated
regularly, i.e., during normal operation they don't seem to get
updated at all. I put a print in xfs_sb_write_verify, and it gets
called only with: bp->b_bn==XFS_SB_DADDR.
See the comments above verify_sb(): not everything is validated, so
not everything needs to be constantly updated. It's just the basic
fs geometry, not counters, etc.

/*
* verify a superblock -- does not verify root inode #
* can only check that geometry info is internally
* consistent. because of growfs, that's no guarantee
* of correctness (e.g. geometry may have changed)
*
* fields verified or consistency checked:
*
* sb_magicnum
*
* sb_versionnum
*
* sb_inprogress
*
* sb_blocksize (as a group)
* sb_blocklog
*
* geometry info - sb_dblocks (as a group)
* sb_agcount
* sb_agblocks
* sb_agblklog
*
* inode info - sb_inodesize (x-checked with geo info)
* sb_inopblock
*
* sector size info -
* sb_sectsize
* sb_sectlog
* sb_logsectsize
* sb_logsectlog
*
* not checked here -
* sb_rootino
* sb_fname
* sb_fpack
* sb_logstart
* sb_uuid
*
* ALL real-time fields
* final 4 summary counters
*/
Post by Alex Lyakas
So do I understand correctly (also from comments in
xfs_growfs_data_private), that it is safe to operate a filesystem
while having broken secondary superblocks? For me, it appears to
mount properly, and all the data seems to be there, but xfs_check complains:
bad sb magic # 0xc2a4baf2 in ag 6144
bad sb version # 0x4b5d in ag 6144
blocks 6144/65536..2192631388 out of range
blocks 6144/65536..2192631388 claimed by block 6144/0
bad sb magic # 0xb20f3079 in ag 6145
bad sb version # 0x6505 in ag 6145
blocks 6145/65536..3530010017 out of range
blocks 6145/65536..3530010017 claimed by block 6145/0
...
some of that looks more serious than "just" bad backup sb's.
But the bad secondaries shouldn't cause runtime problems AFAIK.
Post by Alex Lyakas
Also, if secondary superblocks do not get updated regularly, and
there is no way to ask an operational XFS to update them, then during
repair we may not find a good secondary superblock.
You seem to have 6144 (!) allocation groups; one would hope that a
majority of those supers would be "good" and the others will be
properly corrected by an xfs_repair.
Post by Alex Lyakas
As for the patch, I cannot post a patch against the upstream kernel,
because I am running an older kernel. Unfortunately, I cannot qualify
an upstream patch properly in a reasonable time. Is there a value in
posting a patch against 3.8.13? Otherwise, it's fine by me if
somebody else posts it and takes the credit.
If the patch applies cleanly to both kernels, probably fine to
go ahead and post it, with that caveat.

-Eric
Post by Alex Lyakas
Thanks,
Alex.
-----Original Message----- From: Dave Chinner
Sent: 02 July, 2014 12:56 AM
To: Alex Lyakas
Subject: Re: xfs_growfs_data_private memory leak
Post by Alex Lyakas
Greetings,
It appears that if xfs_growfs_data_private fails during the "new AG
headers" loop, it does not free all the per-AG structures for the
new AGs. When XFS is unmounted later, they are not freed as well,
because xfs_growfs_data_private did not update the "sb_agcount"
field, so xfs_free_perag will not free them. This happens on 3.8.13,
but looking at the latest master branch, it seems to have the same
issue.
Code like [1] in xfs_growfs_data, seems to fix the issue.
Why not just do this in the appropriate error stack, like is
done inside xfs_initialize_perag() on error?
	for (i = oagcount; i < nagcount; i++) {
		pag = radix_tree_delete(&mp->m_perag_tree, i);
		kmem_free(pag);
	}
(though it might need RCU freeing)
When you have a fix, can you send a proper patch with a sign-off on
it?
Post by Alex Lyakas
A follow-up question: if xfs_growfs_data_private fails during the
loop that updates all the secondary superblocks, what is the
consequence? (I am aware that in the latest master branch, the loop
is not broken on first error, but attempts to initialize whatever
possible). When will these secondary superblocks get updated? Is
there a way to force-update them? Otherwise, what can be the
consequence of leaving them not updated?
The consequence is documented in mainline tree - if we don't update
them all, then repair will do the wrong thing. Repair requires a
majority of identical secondaries to determine if the primary is
correct or out of date. The old behaviour of not updating after the
first error meant that the majority were old superblocks and so at
some time in the future repair could decide your filesystem is
smaller than it really is and hence truncate away the grown section
of the filesystem. i.e. trigger catastrophic, unrecoverable data
loss.
Hence it's far better to write every secondary we can than to leave
a majority in a bad state....
Cheers,
Dave.
Alex Lyakas
2014-08-06 08:56:25 UTC
Permalink
Hi Eric,
What I was trying to achieve is to allow aborting XFS resize in the middle.
I am not sure whether a typical user needs this feature, but for my use case
it is needed.

So XFS resize is performed in three steps:
1) Initialize the newly-added AGs (AGFL, BNO btree, CNT btree etc)
2) Commit this by updating the primary superblock
3) Update all the existing secondary superblocks, and initialize all the
newly-added secondary superblocks

So I added a custom IOCTL, which can tell this sequence to abort gracefully
(just checking a flag on each loop iteration).

Aborting during step 1 works perfectly (apart from the memory leak that I
discovered and fixed). Since the primary superblock has not been updated,
XFS doesn't know about the newly-added AGs, so it doesn't care what's
written there. Resize can be restarted, and XFS will redo this step from the
beginning.
Aborting during step 3 can leave some of the secondary superblocks totally
uninitialized. However, from XFS's standpoint, the resize has already been
committed. That's where my question came from. I realize that the kernel
can crash during step 3, leading to the same results as with my code that
aborts this step.

Anyways, since I was not sure about these secondary superblocks, I did not
add an exit point at this step. (Again, I realize that we can crash during
this step anyway.)

Thanks,
Alex.



Alex Lyakas
2014-08-04 11:00:05 UTC
Permalink
Greetings,

we had a log replay failure due to some errors that the underlying block
device returned:
[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
("xlog_recover_iodone") error 28 numblks 16
[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
[49133.802644] XFS (dm-95): log mount failed

and then kernel panicked [1].

Looking at the code, when xfs_mountfs() fails, xfs_fs_fill_super() goes and
cleans up and eventually frees the "xfs_mount" structure. But then
xfs_buf_iodone_work() can still be delivered through "xfslogd_workqueue",
which is static and not per-XFS. But this callback has a pointer to
"xfs_mount", and may try to access it as in [1]. Does this analysis sound
correct? Kernel is 3.8.13, but looking at the XFS master branch, it might
have the same issue.

Should we flush this static workqueue before unmounting?

Thanks,
Alex.


[1]
[49133.804546] general protection fault: 0000 [#1] SMP
[49133.808033] xcbc rmd160 crypto_null af_key xfrm_algo scsi_dh cirrus
psmouse ttm drm_kms_helper serio_raw drm i2c_piix4 sysimgblt virtio_balloon
sysfillrect syscopyarea nfsd(OF) kvm nfs_acl auth_rpcgss nfs fscache
microcode lockd mac_hid sunrpc lp parport floppy ixgbevf(OF)
[49133.808033] CPU 2
[49133.808033] Pid: 2907, comm: kworker/2:1H Tainted: GF W O
3.8.13-030813-generic #201305111843 Bochs Bochs
[49133.808033] RIP: 0010:[<ffffffff813582fb>] [<ffffffff813582fb>]
strnlen+0xb/0x30
[49133.808033] RSP: 0018:ffff8801e31c5b08 EFLAGS: 00010086
[49133.808033] RAX: 0000000000000000 RBX: ffffffff81e4e527 RCX:
0000000000000000
[49133.808033] RDX: 640000450008cf9d RSI: ffffffffffffffff RDI:
640000450008cf9d
[49133.808033] RBP: ffff8801e31c5b08 R08: 000000000000ffff R09:
000000000000ffff
[49133.808033] R10: 0000000000000000 R11: 0000000000000ffe R12:
640000450008cf9d
[49133.808033] R13: ffffffff81e4e900 R14: 0000000000000000 R15:
000000000000ffff
[49133.808033] FS: 0000000000000000(0000) GS:ffff88021fd00000(0000)
knlGS:0000000000000000
[49133.808033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[49133.808033] CR2: 00007fa4a91abd80 CR3: 000000020e783000 CR4:
00000000000006e0
[49133.808033] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[49133.808033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[49133.808033] Process kworker/2:1H (pid: 2907, threadinfo ffff8801e31c4000,
task ffff8802124f45c0)
[49133.808033] Stack:
[49133.808033] ffff8801e31c5b48 ffffffff81359d8e ffff8801e31c5b28
ffffffff81e4e527
[49133.808033] ffffffffa0636a1e ffff8801e31c5c80 ffffffffa0636a1e
ffffffff81e4e900
[49133.808033] ffff8801e31c5bc8 ffffffff8135af89 ffff8801e31c5bc8
ffffffff8105a4e7
[49133.808033] Call Trace:
[49133.808033] [<ffffffff81359d8e>] string.isra.4+0x3e/0xd0
[49133.808033] [<ffffffff8135af89>] vsnprintf+0x219/0x640
[49133.808033] [<ffffffff8105a4e7>] ? msg_print_text+0xb7/0x1b0
[49133.808033] [<ffffffff8135b471>] vscnprintf+0x11/0x30
[49133.808033] [<ffffffff8105b3b1>] vprintk_emit+0xc1/0x490
[49133.808033] [<ffffffff8105b460>] ? vprintk_emit+0x170/0x490
[49133.808033] [<ffffffff816d5848>] printk+0x61/0x63
[49133.808033] [<ffffffffa05ba261>] __xfs_printk+0x31/0x50 [xfs]
[49133.808033] [<ffffffffa05ba4b3>] xfs_notice+0x53/0x60 [xfs]
[49133.808033] [<ffffffffa05b14a5>] xfs_do_force_shutdown+0xf5/0x180 [xfs]
[49133.808033] [<ffffffffa05f6c38>] ? xlog_recover_iodone+0x48/0x70 [xfs]
[49133.808033] [<ffffffffa05f6c38>] xlog_recover_iodone+0x48/0x70 [xfs]
[49133.808033] [<ffffffffa05a749d>] xfs_buf_iodone_work+0x4d/0xa0 [xfs]
[49133.808033] [<ffffffff81078b81>] process_one_work+0x141/0x490
[49133.808033] [<ffffffff81079b48>] worker_thread+0x168/0x400
[49133.808033] [<ffffffff810799e0>] ? manage_workers+0x120/0x120
[49133.808033] [<ffffffff8107f050>] kthread+0xc0/0xd0
[49133.808033] [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
[49133.808033] [<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0
[49133.808033] [<ffffffff8107ef90>] ? flush_kthread_worker+0xb0/0xb0
Brian Foster
2014-08-04 14:12:57 UTC
Permalink
Post by Alex Lyakas
Greetings,
we had a log replay failure due to some errors that the underlying block
[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
("xlog_recover_iodone") error 28 numblks 16
[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
[49133.802644] XFS (dm-95): log mount failed
and then kernel panicked [1].
Looking at the code, when xfs_mountfs() fails, xfs_fs_fill_super() goes and
cleans up and eventually frees the "xfs_mount" structure. But then
xfs_buf_iodone_work() can still be delivered through "xfslogd_workqueue",
which is static and not per-XFS. But this callback has a pointer to
"xfs_mount", and may try to access it as in [1]. Does this analysis sound
correct? Kernel is 3.8.13, but looking at the XFS master branch, it might
have the same issue.
Seems possible... we call xfs_buf_delwri_submit() via log recovery which
does an xfs_buf_iowait() on each buffer for synchronous I/O.
xfs_buf_iowait() doesn't wait on b_iowait if b_error is set, however. In
the callback side, xfs_buf_ioerror() is called before the
_xfs_buf_ioend() sequence that calls into the workqueue. Perhaps the
error can be detected by the iowait before the xfslogd_workqueue job can
run..?
Post by Alex Lyakas
Should we flush this static workqueue before unmounting?
It's not totally clear that fixes the problem since we set b_error on
the buffer before we even schedule on the workqueue. Perhaps it depends
on the nature of the race. Another option could be to not use the wq on
error, but that may or may not be ideal for other contexts besides log
recovery.

It looks like the bio code always returns error via the callback and the
xlog_recover_iodone() handler does call back into ioend, presumably to
do the io completion. It might not be appropriate to remove the b_error
bypass in xfs_buf_iowait() as we have ioerror() callers on the buffer
that can occur before I/O submission, but I wonder if it could be made
conditional for this particular case. That would be an interesting
experiment at least to see if it fixes this problem.

Brian
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs
Dave Chinner
2014-08-04 23:07:21 UTC
Permalink
Post by Alex Lyakas
Greetings,
we had a log replay failure due to some errors that the underlying
[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
("xlog_recover_iodone") error 28 numblks 16
[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
[49133.802644] XFS (dm-95): log mount failed
#define ENOSPC 28 /* No space left on device */

You're getting an ENOSPC as a metadata IO error during log recovery?
Thin provisioning problem, perhaps, and the error is occurring on
submission rather than completion? If so:

8d6c121 xfs: fix buffer use after free on IO error

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Alex Lyakas
2014-08-06 10:05:34 UTC
Permalink
Hi Dave,
Post by Dave Chinner
Post by Alex Lyakas
Greetings,
we had a log replay failure due to some errors that the underlying
[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
("xlog_recover_iodone") error 28 numblks 16
[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
[49133.802644] XFS (dm-95): log mount failed
#define ENOSPC 28 /* No space left on device */
You're getting an ENOSPC as a metadata IO error during log recovery?
Thin provisioning problem, perhaps,
Yes, it is a thin provisioning problem (which I already know the cause for).
Post by Dave Chinner
and the error is occurring on
8d6c121 xfs: fix buffer use after free on IO error
I am not sure what you mean by "submission rather than completion".
Do you mean that xfs_buf_ioapply_map() returns without submitting any
bios? In that case, no, bios are submitted to the block device, and it
fails them through a different context with ENOSPC error. I will still
try the patch you mentioned, because it also looks relevant to another
question I addressed to you earlier in:
http://oss.sgi.com/archives/xfs/2013-11/msg00648.html

Thanks,
Alex.
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2014-08-06 12:32:15 UTC
Permalink
Post by Alex Lyakas
Hi Dave,
Post by Dave Chinner
Post by Alex Lyakas
Greetings,
we had a log replay failure due to some errors that the underlying
[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
("xlog_recover_iodone") error 28 numblks 16
[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
[49133.802644] XFS (dm-95): log mount failed
#define ENOSPC 28 /* No space left on device */
You're getting an ENOSPC as a metadata IO error during log recovery?
Thin provisioning problem, perhaps,
Yes, it is a thin provisioning problem (which I already know the cause for).
Post by Dave Chinner
and the error is occurring on
8d6c121 xfs: fix buffer use after free on IO error
I am not sure what you mean by "submission rather than completion".
Do you mean that xfs_buf_ioapply_map() returns without submitting any
bios?
No, that the bio submission results in immediate failure (e.g. the
device goes away, so submission results in ENODEV). Hence when
_xfs_buf_ioapply() releases its IO reference it is the only
remaining reference to the buffer and so completion processing is
run immediately. i.e. inline from the submission path.

Normally IO errors are reported through the bio in IO completion
interrupt context. i.e the IO is completed by the hardware and the
error status is attached to bio, which is then completed and we get
into XFS that way. The IO submission context is long gone at this
point....
Post by Alex Lyakas
In that case, no, bios are submitted to the block device, and it
fails them through a different context with ENOSPC error. I will still
try the patch you mentioned, because it also looks relevant to another
http://oss.sgi.com/archives/xfs/2013-11/msg00648.html
No, that's a different problem.

9c23ecc xfs: unmount does not wait for shutdown during unmount

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Alex Lyakas
2014-08-06 14:43:02 UTC
Permalink
Hello Dave,
I applied this patch manually to 3.8.13, but I am still hitting the exact
same issue with my reproduction.
I still see xlog_recover_iodone callbacks delivered through
xfs_buf_bio_end_io scheduling them onto xfslogd_workqueue.

Some of the callbacks arrive before mount completes:
Aug 6 17:31:33 dev kernel: [ 3258.774970] XFS (dm-19): metadata I/O error:
block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1
Aug 6 17:31:33 dev kernel: [ 3258.776687] XFS (dm-19):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0743848
Aug 6 17:31:33 dev kernel: [ 3258.777262] XFS (dm-19): I/O Error Detected.
Shutting down filesystem
Aug 6 17:31:33 dev kernel: [ 3258.778369] XFS (dm-19): Please umount the
filesystem and rectify the problem(s)
Aug 6 17:31:33 dev kernel: [ 3258.779634] XFS (dm-19): metadata I/O error:
block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
Aug 6 17:31:33 dev kernel: [ 3258.781929] XFS (dm-19):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0743848
Aug 6 17:31:33 dev kernel: [ 3258.781939] XFS (dm-19): metadata I/O error:
block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
Aug 6 17:31:33 dev kernel: [ 3258.784235] XFS (dm-19):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0743848
Aug 6 17:31:33 dev kernel: [ 3258.784260] XFS (dm-19): metadata I/O error:
block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8
Aug 6 17:31:33 dev kernel: [ 3258.784389] XFS (dm-19): log mount/recovery
failed: error 28
Aug 6 17:31:33 dev kernel: [ 3258.784549] XFS (dm-19): log mount failed

And some arrive afterwards, and print garbage.
Aug 6 17:31:33 dev kernel: [ 3258.786398] XFS (ˆm&_):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0743848
Aug 6 17:31:33 dev kernel: [ 3258.786404] XFS (ˆm&_): metadata I/O error:
block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
Aug 6 17:31:33 dev kernel: [ 3258.788575] XFS (ˆm&_):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0743848
Aug 6 17:31:33 dev kernel: [ 3258.788614] XFS (ˆm&_): metadata I/O error:
block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16
Aug 6 17:31:33 dev kernel: [ 3258.790849] XFS (ˆm&_):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0743848

Maybe some additional patch is needed for my kernel?

Thanks,
Alex.


Alex Lyakas
2014-08-10 16:26:24 UTC
Permalink
Hello Dave,
Post by Dave Chinner
No, that's a different problem.
9c23ecc xfs: unmount does not wait for shutdown during unmount
Yes, this patch appears to fix the problem that I reported in the
past. XFS survives the unmount and kmemleak is also happy. Thanks! Is
this patch safe to apply to 3.8.13?

Thanks,
Alex.
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Alex Lyakas
2014-08-06 12:52:03 UTC
Permalink
Hello Dave and Brian,

Dave, I tried the patch you suggested, but it does not fix the issue. I did
some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be
called from xfs_buf_iorequest(), which the patch fixes, but also from
xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty
easily. The flow that I have is like this:
- xlog_recover() calls xlog_find_tail(). This works alright.
- Now I add a small sleep before calling xlog_do_recover(), and meanwhile I
instruct my block device to return ENOSPC for any WRITE from now on.

What seems to happen is that several WRITE bios are submitted and they all
fail. When they do, they reach xfs_buf_ioend() through a stack like this:

Aug 6 15:23:07 dev kernel: [ 304.410528] [56]xfs*[xfs_buf_ioend:1056]
XFS(dm-19): Scheduling xfs_buf_iodone_work on error
Aug 6 15:23:07 dev kernel: [ 304.410534] Pid: 56, comm: kworker/u:1
Tainted: G W O 3.8.13-557-generic #1382000791
Aug 6 15:23:07 dev kernel: [ 304.410537] Call Trace:
Aug 6 15:23:07 dev kernel: [ 304.410587] [<ffffffffa04d6654>]
xfs_buf_ioend+0x1a4/0x1b0 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.410621] [<ffffffffa04d6685>]
_xfs_buf_ioend+0x25/0x30 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.410643] [<ffffffffa04d6b3d>]
xfs_buf_bio_end_io+0x3d/0x50 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.410652] [<ffffffff811c3d8d>]
bio_endio+0x1d/0x40
...

At this point, they are scheduled to run xlog_recover_iodone through
xfslogd_workqueue.
The first callback that runs calls xfs_do_force_shutdown in a stack like
this:

Aug 6 15:23:07 dev kernel: [ 304.411791] XFS (dm-19): metadata I/O error:
block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1
Aug 6 15:23:07 dev kernel: [ 304.413493] XFS (dm-19):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
Aug 6 15:23:07 dev kernel: [ 304.413837] [<ffffffffa04e0b60>]
xfs_do_force_shutdown+0x40/0x180 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413870] [<ffffffffa0526848>] ?
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413902] [<ffffffffa0526848>]
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413923] [<ffffffffa04d645d>]
xfs_buf_iodone_work+0x4d/0xa0 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413930] [<ffffffff81077a11>]
process_one_work+0x141/0x4a0
Aug 6 15:23:07 dev kernel: [ 304.413937] [<ffffffff810789e8>]
worker_thread+0x168/0x410
Aug 6 15:23:07 dev kernel: [ 304.413943] [<ffffffff81078880>] ?
manage_workers+0x120/0x120
Aug 6 15:23:07 dev kernel: [ 304.413949] [<ffffffff8107df10>]
kthread+0xc0/0xd0
Aug 6 15:23:07 dev kernel: [ 304.413954] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413976] [<ffffffff816ab86c>]
ret_from_fork+0x7c/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413986] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413990] ---[ end trace
988d698520e1fa81 ]---
Aug 6 15:23:07 dev kernel: [ 304.414012] XFS (dm-19): I/O Error Detected.
Shutting down filesystem
Aug 6 15:23:07 dev kernel: [ 304.415936] XFS (dm-19): Please umount the
filesystem and rectify the problem(s)

But the rest of the callbacks also arrive:
Aug 6 15:23:07 dev kernel: [ 304.417812] XFS (dm-19): metadata I/O error:
block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
Aug 6 15:23:07 dev kernel: [ 304.420420] XFS (dm-19):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
Aug 6 15:23:07 dev kernel: [ 304.420427] XFS (dm-19): metadata I/O error:
block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
Aug 6 15:23:07 dev kernel: [ 304.422708] XFS (dm-19):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
Aug 6 15:23:07 dev kernel: [ 304.422738] XFS (dm-19): metadata I/O error:
block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8

The mount sequence fails and goes back to the caller:
Aug 6 15:23:07 dev kernel: [ 304.423438] XFS (dm-19): log mount/recovery
failed: error 28
Aug 6 15:23:07 dev kernel: [ 304.423757] XFS (dm-19): log mount failed

But there are still additional callbacks to deliver, which the mount
sequence did not wait for!
Aug 6 15:23:07 dev kernel: [ 304.425717] XFS (@dR):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
Aug 6 15:23:07 dev kernel: [ 304.425723] XFS (@dR): metadata I/O error:
block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
Aug 6 15:23:07 dev kernel: [ 304.428239] XFS (@dR):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
Aug 6 15:23:07 dev kernel: [ 304.428246] XFS (@dR): metadata I/O error:
block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16

Notice the junk device name they print ("@dR" instead of "dm-19")!
Naturally, because the xfs_mount structure has already been kfreed.

Finally the kernel crashes (instead of merely printing junk): the xfs_mount
structure is gone, but the callback still tries to access it to print the
device name:

Aug 6 15:23:07 dev kernel: [ 304.430796] general protection fault: 0000
[#1] SMP
Aug 6 15:23:07 dev kernel: [ 304.432035] Modules linked in: xfrm_user
xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O)
scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c
raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
twofish_common camellia_generic serpent_generic blowfish_generic
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc
rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat
fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw
parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache
lockd sunrpc lp parport floppy
Aug 6 15:23:07 dev kernel: [ 304.432035] CPU 1
Aug 6 15:23:07 dev kernel: [ 304.432035] Pid: 133, comm: kworker/1:1H
Tainted: G W O 3.8.13-557-generic #1382000791 Bochs Bochs
Aug 6 15:23:07 dev kernel: [ 304.432035] RIP: 0010:[<ffffffff8133c2cb>]
[<ffffffff8133c2cb>] strnlen+0xb/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] RSP: 0018:ffff880035461b08
EFLAGS: 00010086
Aug 6 15:23:07 dev kernel: [ 304.432035] RAX: 0000000000000000 RBX:
ffffffff81e6a4e7 RCX: 0000000000000000
Aug 6 15:23:07 dev kernel: [ 304.432035] RDX: e4e8390a265c0000 RSI:
ffffffffffffffff RDI: e4e8390a265c0000
Aug 6 15:23:07 dev kernel: [ 304.432035] RBP: ffff880035461b08 R08:
000000000000ffff R09: 000000000000ffff
Aug 6 15:23:07 dev kernel: [ 304.432035] R10: 0000000000000000 R11:
00000000000004cd R12: e4e8390a265c0000
Aug 6 15:23:07 dev kernel: [ 304.432035] R13: ffffffff81e6a8c0 R14:
0000000000000000 R15: 000000000000ffff
Aug 6 15:23:07 dev kernel: [ 304.432035] FS: 0000000000000000(0000)
GS:ffff88007fc80000(0000) knlGS:0000000000000000
Aug 6 15:23:07 dev kernel: [ 304.432035] CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Aug 6 15:23:07 dev kernel: [ 304.432035] CR2: 00007fc902ffbfd8 CR3:
000000007702a000 CR4: 00000000000006e0
Aug 6 15:23:07 dev kernel: [ 304.432035] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Aug 6 15:23:07 dev kernel: [ 304.432035] DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Aug 6 15:23:07 dev kernel: [ 304.432035] Process kworker/1:1H (pid: 133,
threadinfo ffff880035460000, task ffff880035412e00)
Aug 6 15:23:07 dev kernel: [ 304.432035] Stack:
Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461b48
ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7
Aug 6 15:23:07 dev kernel: [ 304.432035] ffffffffa0566cba
ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0
Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461bc8
ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040
Aug 6 15:23:07 dev kernel: [ 304.432035] Call Trace:
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133dd5e>]
string.isra.4+0x3e/0xd0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133ef59>]
vsnprintf+0x219/0x640
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133f441>]
vscnprintf+0x11/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105a971>]
vprintk_emit+0xc1/0x490
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105aa20>] ?
vprintk_emit+0x170/0x490
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8168b992>]
printk+0x61/0x63
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9bf1>]
__xfs_printk+0x31/0x50 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9e43>]
xfs_notice+0x53/0x60 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e0c15>]
xfs_do_force_shutdown+0xf5/0x180 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] ?
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>]
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04d645d>]
xfs_buf_iodone_work+0x4d/0xa0 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81077a11>]
process_one_work+0x141/0x4a0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff810789e8>]
worker_thread+0x168/0x410
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81078880>] ?
manage_workers+0x120/0x120
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107df10>]
kthread+0xc0/0xd0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff816ab86c>]
ret_from_fork+0x7c/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] Code: 31 c0 80 3f 00 55 48 89 e5
74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31
c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee
01 80 38 00 74
Aug 6 15:23:07 dev kernel: [ 304.432035] RIP [<ffffffff8133c2cb>]
strnlen+0xb/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] RSP <ffff880035461b08>


So previously you said: "So, something is corrupting memory and stamping all
over the XFS structures." and also "given you have a bunch of out of tree
modules loaded (and some which are experimental) suggests that you have a
problem with your storage...".

But I believe my analysis shows that during the mount sequence XFS does not
properly wait for all the bios to complete before failing the mount sequence
back to the caller.

Thanks,
Alex.



-----Original Message-----
From: Dave Chinner
Sent: 05 August, 2014 2:07 AM
To: Alex Lyakas
Cc: ***@oss.sgi.com
Subject: Re: use-after-free on log replay failure
Post by Alex Lyakas
Greetings,
we had a log replay failure due to some errors that the underlying
[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
("xlog_recover_iodone") error 28 numblks 16
[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
[49133.802644] XFS (dm-95): log mount failed
#define ENOSPC 28 /* No space left on device */

You're getting an ENOSPC as a metadata IO error during log recovery?
Thin provisioning problem, perhaps, and the error is occurring on
submission rather than completion? If so:

8d6c121 xfs: fix buffer use after free on IO error

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-08-06 15:20:42 UTC
Permalink
Post by Alex Lyakas
Hello Dave and Brian,
Dave, I tried the patch you suggested, but it does not fix the issue. I did
some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be
called from xfs_buf_iorequest(), which the patch fixes, but also from
xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty
- xlog_recover() calls xlog_find_tail(). This works alright.
What's the purpose of a sleep here?
Post by Alex Lyakas
- Now I add a small sleep before calling xlog_do_recover(), and meanwhile I
instruct my block device to return ENOSPC for any WRITE from now on.
What seems to happen is that several WRITE bios are submitted and they all
Aug 6 15:23:07 dev kernel: [ 304.410528] [56]xfs*[xfs_buf_ioend:1056]
XFS(dm-19): Scheduling xfs_buf_iodone_work on error
Aug 6 15:23:07 dev kernel: [ 304.410534] Pid: 56, comm: kworker/u:1
Tainted: G W O 3.8.13-557-generic #1382000791
Aug 6 15:23:07 dev kernel: [ 304.410587] [<ffffffffa04d6654>]
xfs_buf_ioend+0x1a4/0x1b0 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.410621] [<ffffffffa04d6685>]
_xfs_buf_ioend+0x25/0x30 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.410643] [<ffffffffa04d6b3d>]
xfs_buf_bio_end_io+0x3d/0x50 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.410652] [<ffffffff811c3d8d>]
bio_endio+0x1d/0x40
...
At this point, they are scheduled to run xlog_recover_iodone through
xfslogd_workqueue.
The first callback that gets called, calls xfs_do_force_shutdown in stack
block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
Aug 6 15:23:07 dev kernel: [ 304.413837] [<ffffffffa04e0b60>]
xfs_do_force_shutdown+0x40/0x180 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413870] [<ffffffffa0526848>] ?
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413902] [<ffffffffa0526848>]
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413923] [<ffffffffa04d645d>]
xfs_buf_iodone_work+0x4d/0xa0 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413930] [<ffffffff81077a11>]
process_one_work+0x141/0x4a0
Aug 6 15:23:07 dev kernel: [ 304.413937] [<ffffffff810789e8>]
worker_thread+0x168/0x410
Aug 6 15:23:07 dev kernel: [ 304.413943] [<ffffffff81078880>] ?
manage_workers+0x120/0x120
Aug 6 15:23:07 dev kernel: [ 304.413949] [<ffffffff8107df10>]
kthread+0xc0/0xd0
Aug 6 15:23:07 dev kernel: [ 304.413954] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413976] [<ffffffff816ab86c>]
ret_from_fork+0x7c/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413986] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413990] ---[ end trace 988d698520e1fa81
]---
Aug 6 15:23:07 dev kernel: [ 304.414012] XFS (dm-19): I/O Error Detected.
Shutting down filesystem
Aug 6 15:23:07 dev kernel: [ 304.415936] XFS (dm-19): Please umount the
filesystem and rectify the problem(s)
block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8
Aug 6 15:23:07 dev kernel: [ 304.423438] XFS (dm-19): log mount/recovery
failed: error 28
Aug 6 15:23:07 dev kernel: [ 304.423757] XFS (dm-19): log mount failed
But there are still additional callbacks to deliver, which the mount
sequence did not wait for!
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16
Notice the junk that they are printing! Naturally, because xfs_mount
structure has been kfreed.
Finally the kernel crashes (instead of printing junk), because the xfs_mount
Aug 6 15:23:07 dev kernel: [ 304.430796] general protection fault: 0000
[#1] SMP
Aug 6 15:23:07 dev kernel: [ 304.432035] Modules linked in: xfrm_user
xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O)
scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c
raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
twofish_common camellia_generic serpent_generic blowfish_generic
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc
rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat
fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw
parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache
lockd sunrpc lp parport floppy
Aug 6 15:23:07 dev kernel: [ 304.432035] CPU 1
Aug 6 15:23:07 dev kernel: [ 304.432035] Pid: 133, comm: kworker/1:1H
Tainted: G W O 3.8.13-557-generic #1382000791 Bochs Bochs
Aug 6 15:23:07 dev kernel: [ 304.432035] RIP: 0010:[<ffffffff8133c2cb>]
[<ffffffff8133c2cb>] strnlen+0xb/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] RSP: 0018:ffff880035461b08
EFLAGS: 00010086
ffffffff81e6a4e7 RCX: 0000000000000000
ffffffffffffffff RDI: e4e8390a265c0000
000000000000ffff R09: 000000000000ffff
00000000000004cd R12: e4e8390a265c0000
0000000000000000 R15: 000000000000ffff
Aug 6 15:23:07 dev kernel: [ 304.432035] FS: 0000000000000000(0000)
GS:ffff88007fc80000(0000) knlGS:0000000000000000
000000008005003b
000000007702a000 CR4: 00000000000006e0
0000000000000000 DR2: 0000000000000000
00000000ffff0ff0 DR7: 0000000000000400
Aug 6 15:23:07 dev kernel: [ 304.432035] Process kworker/1:1H (pid: 133,
threadinfo ffff880035460000, task ffff880035412e00)
Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461b48
ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7
Aug 6 15:23:07 dev kernel: [ 304.432035] ffffffffa0566cba
ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0
Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461bc8
ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133dd5e>]
string.isra.4+0x3e/0xd0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133ef59>]
vsnprintf+0x219/0x640
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133f441>]
vscnprintf+0x11/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105a971>]
vprintk_emit+0xc1/0x490
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105aa20>] ?
vprintk_emit+0x170/0x490
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8168b992>]
printk+0x61/0x63
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9bf1>]
__xfs_printk+0x31/0x50 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9e43>]
xfs_notice+0x53/0x60 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e0c15>]
xfs_do_force_shutdown+0xf5/0x180 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] ?
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>]
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04d645d>]
xfs_buf_iodone_work+0x4d/0xa0 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81077a11>]
process_one_work+0x141/0x4a0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff810789e8>]
worker_thread+0x168/0x410
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81078880>] ?
manage_workers+0x120/0x120
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107df10>]
kthread+0xc0/0xd0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff816ab86c>]
ret_from_fork+0x7c/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] Code: 31 c0 80 3f 00 55 48 89 e5
74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31
c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee
01 80 38 00 74
Aug 6 15:23:07 dev kernel: [ 304.432035] RIP [<ffffffff8133c2cb>]
strnlen+0xb/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] RSP <ffff880035461b08>
So previously you said: "So, something is corrupting memory and stamping all
over the XFS structures." and also "given you have a bunch of out of tree
modules loaded (and some which are experiemental) suggests that you have a
problem with your storage...".
But I believe, my analysis shows that during the mount sequence XFS does not
wait properly for all the bios to complete, before failing the mount
sequence back to the caller.
As an experiment, what about the following? Compile tested only and not
safe for general use.

What might help more is to see if you can create a reproducer on a
recent, clean kernel. Perhaps a metadump of your reproducer fs combined
with whatever block device ENOSPC hack you're using would do it.

Brian

---8<---

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..fbcf524 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
* case nothing will ever complete. It returns the I/O error code, if any, or
* 0 if there was no error.
*/
-int
-xfs_buf_iowait(
- xfs_buf_t *bp)
+static int
+__xfs_buf_iowait(
+ struct xfs_buf *bp,
+ bool skip_error)
{
trace_xfs_buf_iowait(bp, _RET_IP_);

- if (!bp->b_error)
+ if (skip_error || !bp->b_error)
wait_for_completion(&bp->b_iowait);

trace_xfs_buf_iowait_done(bp, _RET_IP_);
return bp->b_error;
}

+int
+xfs_buf_iowait(
+ struct xfs_buf *bp)
+{
+ return __xfs_buf_iowait(bp, false);
+}
+
xfs_caddr_t
xfs_buf_offset(
xfs_buf_t *bp,
@@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);

list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+ error2 = __xfs_buf_iowait(bp, true);
xfs_buf_relse(bp);
if (!error)
error = error2;

---
Alex Lyakas
2014-08-06 15:28:34 UTC
Permalink
Hi Brian,
The purpose of the sleep is this: I added a module parameter to my block
device that says "from now on, fail all WRITEs with ENOSPC". If I set this
parameter before mounting, then xlog_find_tail() fails, which does not
reproduce my problem. So I want xlog_find_tail() to succeed, and only then
start returning errors. The sleep gives me a window to set the parameter
before actual log recovery is attempted. :)

Will try what you suggest.

Thanks,
Alex.


-----Original Message-----
From: Brian Foster
Sent: 06 August, 2014 6:20 PM
To: Alex Lyakas
Cc: Dave Chinner ; ***@oss.sgi.com
Subject: Re: use-after-free on log replay failure
[... Brian's full message quoted here, snipped; identical to the message
above ...]
Alex Lyakas
2014-08-10 12:20:50 UTC
Permalink
Hi Brian,
Post by Brian Foster
[... quoted kernel logs snipped; identical to the traces above ...]
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413976] [<ffffffff816ab86c>]
ret_from_fork+0x7c/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413986] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413990] ---[ end trace 988d698520e1fa81
]---
Aug 6 15:23:07 dev kernel: [ 304.414012] XFS (dm-19): I/O Error Detected.
Shutting down filesystem
Aug 6 15:23:07 dev kernel: [ 304.415936] XFS (dm-19): Please umount the
filesystem and rectify the problem(s)
block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8
Aug 6 15:23:07 dev kernel: [ 304.423438] XFS (dm-19): log mount/recovery
failed: error 28
Aug 6 15:23:07 dev kernel: [ 304.423757] XFS (dm-19): log mount failed
But there are still additional callbacks to deliver, which the mount
sequence did not wait for!
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16
Notice the junk that they are printing! Naturally, because xfs_mount
structure has been kfreed.
Finally the kernel crashes (instead of printing junk), because the xfs_mount
Aug 6 15:23:07 dev kernel: [ 304.430796] general protection fault: 0000
[#1] SMP
Aug 6 15:23:07 dev kernel: [ 304.432035] Modules linked in: xfrm_user
xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O)
scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c
raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
twofish_common camellia_generic serpent_generic blowfish_generic
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc
rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat
fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw
parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache
lockd sunrpc lp parport floppy
Aug 6 15:23:07 dev kernel: [ 304.432035] CPU 1
Aug 6 15:23:07 dev kernel: [ 304.432035] Pid: 133, comm: kworker/1:1H
Tainted: G W O 3.8.13-557-generic #1382000791 Bochs Bochs
Aug 6 15:23:07 dev kernel: [ 304.432035] RIP: 0010:[<ffffffff8133c2cb>]
[<ffffffff8133c2cb>] strnlen+0xb/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] RSP: 0018:ffff880035461b08
EFLAGS: 00010086
ffffffff81e6a4e7 RCX: 0000000000000000
ffffffffffffffff RDI: e4e8390a265c0000
000000000000ffff R09: 000000000000ffff
00000000000004cd R12: e4e8390a265c0000
0000000000000000 R15: 000000000000ffff
Aug 6 15:23:07 dev kernel: [ 304.432035] FS: 0000000000000000(0000)
GS:ffff88007fc80000(0000) knlGS:0000000000000000
000000008005003b
000000007702a000 CR4: 00000000000006e0
0000000000000000 DR2: 0000000000000000
00000000ffff0ff0 DR7: 0000000000000400
Aug 6 15:23:07 dev kernel: [ 304.432035] Process kworker/1:1H (pid: 133,
threadinfo ffff880035460000, task ffff880035412e00)
Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461b48
ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7
Aug 6 15:23:07 dev kernel: [ 304.432035] ffffffffa0566cba
ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0
Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461bc8
ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133dd5e>]
string.isra.4+0x3e/0xd0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133ef59>]
vsnprintf+0x219/0x640
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133f441>]
vscnprintf+0x11/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105a971>]
vprintk_emit+0xc1/0x490
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105aa20>] ?
vprintk_emit+0x170/0x490
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8168b992>]
printk+0x61/0x63
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9bf1>]
__xfs_printk+0x31/0x50 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9e43>]
xfs_notice+0x53/0x60 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e0c15>]
xfs_do_force_shutdown+0xf5/0x180 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] ?
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>]
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04d645d>]
xfs_buf_iodone_work+0x4d/0xa0 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81077a11>]
process_one_work+0x141/0x4a0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff810789e8>]
worker_thread+0x168/0x410
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81078880>] ?
manage_workers+0x120/0x120
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107df10>]
kthread+0xc0/0xd0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff816ab86c>]
ret_from_fork+0x7c/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] Code: 31 c0 80 3f 00 55 48 89 e5
74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31
c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee
01 80 38 00 74
Aug 6 15:23:07 dev kernel: [ 304.432035] RIP [<ffffffff8133c2cb>]
strnlen+0xb/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] RSP <ffff880035461b08>
So previously you said: "So, something is corrupting memory and stamping all
over the XFS structures." and also "given you have a bunch of out of tree
modules loaded (and some which are experiemental) suggests that you have a
problem with your storage...".
But I believe my analysis shows that during the mount sequence XFS does not
properly wait for all the bios to complete before failing the mount
sequence back to the caller.
As an experiment, what about the following? Compile tested only and not
safe for general use.
What might help more is to see if you can create a reproducer on a
recent, clean kernel. Perhaps a metadump of your reproducer fs combined
with whatever block device ENOSPC hack you're using would do it.
Brian
---8<---
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..fbcf524 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
* case nothing will ever complete. It returns the I/O error code, if any, or
* 0 if there was no error.
*/
-int
-xfs_buf_iowait(
- xfs_buf_t *bp)
+static int
+__xfs_buf_iowait(
+ struct xfs_buf *bp,
+ bool skip_error)
{
trace_xfs_buf_iowait(bp, _RET_IP_);
- if (!bp->b_error)
+ if (skip_error || !bp->b_error)
wait_for_completion(&bp->b_iowait);
trace_xfs_buf_iowait_done(bp, _RET_IP_);
return bp->b_error;
}
+int
+xfs_buf_iowait(
+ struct xfs_buf *bp)
+{
+ return __xfs_buf_iowait(bp, false);
+}
+
xfs_caddr_t
xfs_buf_offset(
xfs_buf_t *bp,
@@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+ error2 = __xfs_buf_iowait(bp, true);
xfs_buf_relse(bp);
if (!error)
error = error2;
---
I think that this patch fixes the problem. I tried reproducing it like
30 times, and it doesn't happen with this patch. Dropping this patch
reproduces the problem within 1 or 2 tries. Thanks!
What are the next steps? How do we make it "safe for general use"?

Thanks,
Alex.
Post by Brian Foster
Post by Alex Lyakas
Thanks,
Alex.
-----Original Message----- From: Dave Chinner
Sent: 05 August, 2014 2:07 AM
To: Alex Lyakas
Subject: Re: use-after-free on log replay failure
Post by Alex Lyakas
Greetings,
we had a log replay failure due to some errors that the underlying
[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
("xlog_recover_iodone") error 28 numblks 16
[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
[49133.802644] XFS (dm-95): log mount failed
#define ENOSPC 28 /* No space left on device */
You're getting an ENOSPC as a metadata IO error during log recovery?
Thin provisioning problem, perhaps, and the error is occurring on
8d6c121 xfs: fix buffer use after free on IO error
Cheers,
Dave.
--
Dave Chinner
Brian Foster
2014-08-11 13:20:57 UTC
Post by Alex Lyakas
Hi Brian,
Post by Brian Foster
Post by Alex Lyakas
Hello Dave and Brian,
Dave, I tried the patch you suggested, but it does not fix the issue. I did
some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be
called from xfs_buf_iorequest(), which the patch fixes, but also from
xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty
- xlog_recover() calls xlog_find_tail(). This works alright.
What's the purpose of a sleep here?
Post by Alex Lyakas
- Now I add a small sleep before calling xlog_do_recover(), and meanwhile I
instruct my block device to return ENOSPC for any WRITE from now on.
What seems to happen is that several WRITE bios are submitted and they all
Aug 6 15:23:07 dev kernel: [ 304.410528] [56]xfs*[xfs_buf_ioend:1056]
XFS(dm-19): Scheduling xfs_buf_iodone_work on error
Aug 6 15:23:07 dev kernel: [ 304.410534] Pid: 56, comm: kworker/u:1
Tainted: G W O 3.8.13-557-generic #1382000791
Aug 6 15:23:07 dev kernel: [ 304.410587] [<ffffffffa04d6654>]
xfs_buf_ioend+0x1a4/0x1b0 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.410621] [<ffffffffa04d6685>]
_xfs_buf_ioend+0x25/0x30 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.410643] [<ffffffffa04d6b3d>]
xfs_buf_bio_end_io+0x3d/0x50 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.410652] [<ffffffff811c3d8d>]
bio_endio+0x1d/0x40
...
At this point, they are scheduled to run xlog_recover_iodone through
xfslogd_workqueue.
The first callback that gets called, calls xfs_do_force_shutdown in stack
block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
Aug 6 15:23:07 dev kernel: [ 304.413837] [<ffffffffa04e0b60>]
xfs_do_force_shutdown+0x40/0x180 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413870] [<ffffffffa0526848>] ?
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413902] [<ffffffffa0526848>]
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413923] [<ffffffffa04d645d>]
xfs_buf_iodone_work+0x4d/0xa0 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.413930] [<ffffffff81077a11>]
process_one_work+0x141/0x4a0
Aug 6 15:23:07 dev kernel: [ 304.413937] [<ffffffff810789e8>]
worker_thread+0x168/0x410
Aug 6 15:23:07 dev kernel: [ 304.413943] [<ffffffff81078880>] ?
manage_workers+0x120/0x120
Aug 6 15:23:07 dev kernel: [ 304.413949] [<ffffffff8107df10>]
kthread+0xc0/0xd0
Aug 6 15:23:07 dev kernel: [ 304.413954] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413976] [<ffffffff816ab86c>]
ret_from_fork+0x7c/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413986] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.413990] ---[ end trace 988d698520e1fa81
]---
Aug 6 15:23:07 dev kernel: [ 304.414012] XFS (dm-19): I/O Error Detected.
Shutting down filesystem
Aug 6 15:23:07 dev kernel: [ 304.415936] XFS (dm-19): Please umount the
filesystem and rectify the problem(s)
block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8
Aug 6 15:23:07 dev kernel: [ 304.423438] XFS (dm-19): log mount/recovery
failed: error 28
Aug 6 15:23:07 dev kernel: [ 304.423757] XFS (dm-19): log mount failed
But there are still additional callbacks to deliver, which the mount
sequence did not wait for!
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa0526848
block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16
Notice the junk that they are printing! Naturally, because xfs_mount
structure has been kfreed.
Finally the kernel crashes (instead of printing junk), because the xfs_mount
Aug 6 15:23:07 dev kernel: [ 304.430796] general protection fault: 0000
[#1] SMP
Aug 6 15:23:07 dev kernel: [ 304.432035] Modules linked in: xfrm_user
xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O)
scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c
raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
twofish_common camellia_generic serpent_generic blowfish_generic
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc
rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat
fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw
parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache
lockd sunrpc lp parport floppy
Aug 6 15:23:07 dev kernel: [ 304.432035] CPU 1
Aug 6 15:23:07 dev kernel: [ 304.432035] Pid: 133, comm: kworker/1:1H
Tainted: G W O 3.8.13-557-generic #1382000791 Bochs Bochs
Aug 6 15:23:07 dev kernel: [ 304.432035] RIP: 0010:[<ffffffff8133c2cb>]
[<ffffffff8133c2cb>] strnlen+0xb/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] RSP: 0018:ffff880035461b08
EFLAGS: 00010086
ffffffff81e6a4e7 RCX: 0000000000000000
ffffffffffffffff RDI: e4e8390a265c0000
000000000000ffff R09: 000000000000ffff
00000000000004cd R12: e4e8390a265c0000
0000000000000000 R15: 000000000000ffff
Aug 6 15:23:07 dev kernel: [ 304.432035] FS: 0000000000000000(0000)
GS:ffff88007fc80000(0000) knlGS:0000000000000000
000000008005003b
000000007702a000 CR4: 00000000000006e0
0000000000000000 DR2: 0000000000000000
00000000ffff0ff0 DR7: 0000000000000400
Aug 6 15:23:07 dev kernel: [ 304.432035] Process kworker/1:1H (pid: 133,
threadinfo ffff880035460000, task ffff880035412e00)
Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461b48
ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7
Aug 6 15:23:07 dev kernel: [ 304.432035] ffffffffa0566cba
ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0
Aug 6 15:23:07 dev kernel: [ 304.432035] ffff880035461bc8
ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133dd5e>]
string.isra.4+0x3e/0xd0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133ef59>]
vsnprintf+0x219/0x640
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8133f441>]
vscnprintf+0x11/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105a971>]
vprintk_emit+0xc1/0x490
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8105aa20>] ?
vprintk_emit+0x170/0x490
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8168b992>]
printk+0x61/0x63
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9bf1>]
__xfs_printk+0x31/0x50 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e9e43>]
xfs_notice+0x53/0x60 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04e0c15>]
xfs_do_force_shutdown+0xf5/0x180 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>] ?
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa0526848>]
xlog_recover_iodone+0x48/0x70 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffffa04d645d>]
xfs_buf_iodone_work+0x4d/0xa0 [xfs]
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81077a11>]
process_one_work+0x141/0x4a0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff810789e8>]
worker_thread+0x168/0x410
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff81078880>] ?
manage_workers+0x120/0x120
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107df10>]
kthread+0xc0/0xd0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff816ab86c>]
ret_from_fork+0x7c/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] [<ffffffff8107de50>] ?
flush_kthread_worker+0xb0/0xb0
Aug 6 15:23:07 dev kernel: [ 304.432035] Code: 31 c0 80 3f 00 55 48 89 e5
74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31
c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee
01 80 38 00 74
Aug 6 15:23:07 dev kernel: [ 304.432035] RIP [<ffffffff8133c2cb>]
strnlen+0xb/0x30
Aug 6 15:23:07 dev kernel: [ 304.432035] RSP <ffff880035461b08>
So previously you said: "So, something is corrupting memory and stamping all
over the XFS structures." and also "given you have a bunch of out of tree
modules loaded (and some which are experiemental) suggests that you have a
problem with your storage...".
But I believe, my analysis shows that during the mount sequence XFS does not
wait properly for all the bios to complete, before failing the mount
sequence back to the caller.
As an experiment, what about the following? Compile tested only and not
safe for general use.
What might help more is to see if you can create a reproducer on a
recent, clean kernel. Perhaps a metadump of your reproducer fs combined
with whatever block device ENOSPC hack you're using would do it.
Brian
---8<---
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..fbcf524 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
* case nothing will ever complete. It returns the I/O error code, if any, or
* 0 if there was no error.
*/
-int
-xfs_buf_iowait(
- xfs_buf_t *bp)
+static int
+__xfs_buf_iowait(
+ struct xfs_buf *bp,
+ bool skip_error)
{
trace_xfs_buf_iowait(bp, _RET_IP_);
- if (!bp->b_error)
+ if (skip_error || !bp->b_error)
wait_for_completion(&bp->b_iowait);
trace_xfs_buf_iowait_done(bp, _RET_IP_);
return bp->b_error;
}
+int
+xfs_buf_iowait(
+ struct xfs_buf *bp)
+{
+ return __xfs_buf_iowait(bp, false);
+}
+
xfs_caddr_t
xfs_buf_offset(
xfs_buf_t *bp,
@@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+ error2 = __xfs_buf_iowait(bp, true);
xfs_buf_relse(bp);
if (!error)
error = error2;
---
I think that this patch fixes the problem. I tried reproducing it like
30 times, and it doesn't happen with this patch. Dropping this patch
reproduces the problem within 1 or 2 tries. Thanks!
What are next steps? How to make it "safe for general use"?
Ok, thanks for testing. I think that implicates the caller bypassing the
expected blocking with the right sequence of log recovery I/Os and
device failure. TBH, I'd still like to see the specifics, if possible.
Could you come up with a generic reproducer for this? I think a metadump
of the fs with the dirty log plus whatever device failure simulation
hack you're using would suffice.

The ideal fix is not yet clear to me. Technically, we could always find
a way to customize this particular path to rely on b_iowait since that
appears safe, but that could just be a band aid over a larger problem.
I'll need to step back and stare at this code some more to try and
understand the layering better, then follow up with something when
things are more clear.

Brian
Post by Alex Lyakas
Thanks,
Alex.
Post by Brian Foster
Post by Alex Lyakas
Thanks,
Alex.
-----Original Message----- From: Dave Chinner
Sent: 05 August, 2014 2:07 AM
To: Alex Lyakas
Subject: Re: use-after-free on log replay failure
Post by Alex Lyakas
Greetings,
we had a log replay failure due to some errors that the underlying
[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
("xlog_recover_iodone") error 28 numblks 16
[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
[49133.802644] XFS (dm-95): log mount failed
#define ENOSPC 28 /* No space left on device */
You're getting an ENOSPC as a metadata IO error during log recovery?
Thin provisioning problem, perhaps, and the error is occurring on
8d6c121 xfs: fix buffer use after free on IO error
Cheers,
Dave.
--
Dave Chinner
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs
Dave Chinner
2014-08-11 21:52:07 UTC
.....
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
Post by Alex Lyakas
But I believe, my analysis shows that during the mount sequence XFS does not
wait properly for all the bios to complete, before failing the mount
sequence back to the caller.
As an experiment, what about the following? Compile tested only and not
safe for general use.
...
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..fbcf524 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
* case nothing will ever complete. It returns the I/O error code, if any, or
* 0 if there was no error.
*/
-int
-xfs_buf_iowait(
- xfs_buf_t *bp)
+static int
+__xfs_buf_iowait(
+ struct xfs_buf *bp,
+ bool skip_error)
{
trace_xfs_buf_iowait(bp, _RET_IP_);
- if (!bp->b_error)
+ if (skip_error || !bp->b_error)
wait_for_completion(&bp->b_iowait);
trace_xfs_buf_iowait_done(bp, _RET_IP_);
return bp->b_error;
}
+int
+xfs_buf_iowait(
+ struct xfs_buf *bp)
+{
+ return __xfs_buf_iowait(bp, false);
+}
+
xfs_caddr_t
xfs_buf_offset(
xfs_buf_t *bp,
@@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+ error2 = __xfs_buf_iowait(bp, true);
xfs_buf_relse(bp);
if (!error)
error = error2;
Not waiting here on buffer error should not matter. Any buffer that
is under IO and requires completion should be referenced, and that
means it should be caught and waited on by xfs_wait_buftarg() in the
mount failure path after log recovery fails.
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
---
I think that this patch fixes the problem. I tried reproducing it like
30 times, and it doesn't happen with this patch. Dropping this patch
reproduces the problem within 1 or 2 tries. Thanks!
What are next steps? How to make it "safe for general use"?
Ok, thanks for testing. I think that implicates the caller bypassing the
expected blocking with the right sequence of log recovery I/Os and
device failure. TBH, I'd still like to see the specifics, if possible.
Could you come up with a generic reproducer for this? I think a metadump
of the fs with the dirty log plus whatever device failure simulation
hack you're using would suffice.
The real issue is we don't know exactly what code is being tested
(it's 3.8 + random bug fix backports + custom code). Even if we have
a reproducer there's no guarantee it will reproduce on a current
kernel. IOWs, we are stumbling around in the dark bashing our heads
against everything in the room, and that just wastes everyone's
time.

We need a reproducer that triggers on a current, unmodified
kernel release. You can use dm-faulty to error out all writes just
like you are doing with your custom code. See
xfstests::tests/generic/321 and common/dmflakey for how to do this.
Ideally the reproducer is in a form that xfstests can use....
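[For readers unfamiliar with the dm-flakey setup Dave refers to, here is a
sketch of what such a table looks like. This is illustrative only: the device
name and size are assumptions (sized to match the 20 GB device used later in
this thread), and the table syntax follows Documentation/device-mapper/
dm-flakey.txt. The drop_writes feature silently drops writes during the down
interval; newer kernels also offer error_writes, which fails them instead,
which is closer to the ENOSPC behaviour discussed here.]

```shell
# Table: <start> <len> flakey <dev> <offset> <up_interval> <down_interval> \
#        [<num_features> <feature>]
# With up_interval=0 and down_interval=180, writes are dropped for the
# whole 180-second cycle while reads pass through.
DEV=/dev/vde                                  # assumed device
SECTORS=$((20 * 1024 * 1024 * 1024 / 512))    # 20 GiB in 512-byte sectors
TABLE="0 $SECTORS flakey $DEV 0 0 180 1 drop_writes"
echo "$TABLE"
# dmsetup create FLAKEY --table "$TABLE"      # needs root and a real device
# mount -o noatime,sync /dev/mapper/FLAKEY /mnt/xfs
```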

If you can't reproduce it on an upstream kernel, then git bisect is
your friend. It will find the commit that fixed the problem you are
seeing....
Post by Brian Foster
The ideal fix is not yet clear to me.
We are not even that far along - the root cause of the bug is not at
all clear to me. :/

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-08-12 12:03:41 UTC
Post by Dave Chinner
.....
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
Post by Alex Lyakas
But I believe, my analysis shows that during the mount sequence XFS does not
wait properly for all the bios to complete, before failing the mount
sequence back to the caller.
As an experiment, what about the following? Compile tested only and not
safe for general use.
...
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..fbcf524 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
* case nothing will ever complete. It returns the I/O error code, if any, or
* 0 if there was no error.
*/
-int
-xfs_buf_iowait(
- xfs_buf_t *bp)
+static int
+__xfs_buf_iowait(
+ struct xfs_buf *bp,
+ bool skip_error)
{
trace_xfs_buf_iowait(bp, _RET_IP_);
- if (!bp->b_error)
+ if (skip_error || !bp->b_error)
wait_for_completion(&bp->b_iowait);
trace_xfs_buf_iowait_done(bp, _RET_IP_);
return bp->b_error;
}
+int
+xfs_buf_iowait(
+ struct xfs_buf *bp)
+{
+ return __xfs_buf_iowait(bp, false);
+}
+
xfs_caddr_t
xfs_buf_offset(
xfs_buf_t *bp,
@@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+ error2 = __xfs_buf_iowait(bp, true);
xfs_buf_relse(bp);
if (!error)
error = error2;
Not waiting here on buffer error should not matter. Any buffer that
is under IO and requires completion should be referenced, and that
means it should be caught and waited on by xfs_wait_buftarg() in the
mount failure path after log recovery fails.
I think that assumes the I/O is successful. Looking through
xlog_recover_buffer_pass2() as an example, we read the buffer which
should return with b_hold == 1. The delwri queue increments the hold and
we xfs_buf_relse() in the return path (i.e., buffer is now held by the
delwri queue awaiting submission).

Sometime later we delwri_submit()... xfs_buf_iorequest() does an
xfs_buf_hold() and xfs_buf_rele() within that single function. The
delwri_submit() releases its hold after xfs_buf_iowait(), which I guess
at that point bp should go onto the lru (b_hold back to 1 in
xfs_buf_rele(). Indeed, the caller has lost scope of the buffer at this
point.

So unless I miss something or got the lifecycle wrong here, which is
easily possible ;), this all hinges on xfs_buf_iowait(). That's where
the last hold forcing the buffer to stay around goes away.
xfs_buftarg_wait_rele() will dispose the buffer if b_hold == 1. If
xfs_buf_iowait() is racy in the event of I/O errors via the bio
callback, I think this path is susceptible just the same.
Post by Dave Chinner
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
---
I think that this patch fixes the problem. I tried reproducing it like
30 times, and it doesn't happen with this patch. Dropping this patch
reproduces the problem within 1 or 2 tries. Thanks!
What are next steps? How to make it "safe for general use"?
Ok, thanks for testing. I think that implicates the caller bypassing the
expected blocking with the right sequence of log recovery I/Os and
device failure. TBH, I'd still like to see the specifics, if possible.
Could you come up with a generic reproducer for this? I think a metadump
of the fs with the dirty log plus whatever device failure simulation
hack you're using would suffice.
The real issue is we don't know exactly what code is being tested
(it's 3.8 + random bug fix backports + custom code). Even if we have
a reproducer there's no guarantee it will reproduce on a current
kernel. IOWs, we are stumbling around in the dark bashing our heads
against everything in the room, and that just wastes everyone's
time.
We need a reproducer that triggers on a current, unmodified
kernel release. You can use dm-faulty to error out all writes just
like you are doing with your custom code. See
xfstests::tests/generic/321 and common/dmflakey for how to do this.
Ideally the reproducer is in a form that xfstests can use....
If you can't reproduce it on an upstream kernel, then git bisect is
your friend. It will find the commit that fixed the problem you are
seeing....
Ugh, yeah. The fact that this was customized as such apparently went
over my head. I agree completely. This needs to be genericized to a
pristine, preferably current kernel. The experiment patch could be
papering over something completely different.
Post by Dave Chinner
Post by Brian Foster
The ideal fix is not yet clear to me.
We are not even that far along - the root cause of the bug is not at
all clear to me. :/
Yeah.. the above was just the theory that motivated the experiment in
the previously posted patch. It of course remains a theory until we can
see the race in action. I was referring to the potential fix for the
raciness of xfs_buf_iowait() with regard to bio errors and the wq iodone
handling, while still asking for a reproducer to confirm the actual
problem. FWIW, I'm not too high on changes in the buf management code,
even a smallish behavior change, without a real trace of some sort that
documents the problem and justifies the change.

Brian
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Alex Lyakas
2014-08-12 12:39:02 UTC
Hello Dave, Brian,
I will describe a generic reproduction that you ask for.

It was performed on pristine XFS code from 3.8.13, taken from here:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git

top commit being:
commit dbf932a9b316d5b29b3e220e5a30e7a165ad2992
Author: Greg Kroah-Hartman <***@linuxfoundation.org>
Date: Sat May 11 13:57:46 2013 -0700

Linux 3.8.13


I made a single (I swear!) code change in XFS:

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 96fcbb8..d756bf6 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3736,6 +3736,9 @@ xlog_recover(
if ((error = xlog_find_tail(log, &head_blk, &tail_blk)))
return error;

+ xfs_notice(log->l_mp, "Sleep 10s before xlog_do_recover");
+ msleep(10000);
+
if (tail_blk != head_blk) {
/* There used to be a comment here:
*

Fresh XFS was formatted on a 20 GB block device within a VM, using:
mkfs.xfs -f -K /dev/vde -p /etc/zadara/xfs.protofile
and:
***@vc-00-00-1383-dev:~# cat /etc/zadara/xfs.protofile
dummy : bootfilename, not used, backward compatibility
0 0 : numbers of blocks and inodes, not used,
backward compatibility
d--777 0 0 : set 777 perms for the root dir
$
$

I mounted XFS with the following options:
rw,sync,noatime,wsync,attr2,inode64,noquota 0 0

I started a couple of processes writing files sequentially onto this mount
point, and after a few seconds crashed the VM.
When the VM came up, I took the metadump file and placed it in:
https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing

Then I set up the following Device Mapper target onto /dev/vde:
dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
I am attaching the code (and Makefile) of the dm-linear-custom target. It is
an exact copy of dm-linear, except that it has a module parameter. With the
parameter set to 0, this is an identity mapping onto /dev/vde. If the
parameter is set to non-0, all WRITE bios are failed with ENOSPC. There is a
workqueue to fail them in a different context (not sure if really needed,
but that's what our "real" custom
block device does).
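(For reference, doing the same with the stock dm-flakey target, as Dave suggests later in the thread, could look roughly like the sketch below. This is an assumption-laden sketch only: the device name, size, and mount options are taken from the setup above, and flakey's "down" interval errors reads as well as writes, so it only approximates the custom target's fail-writes-with-ENOSPC behaviour. Newer kernels also have an error_writes feature that fails writes alone.)

```shell
# Pass-through mapping first: flakey params are <dev> <offset> <up_sec> <down_sec>,
# so up for 180s and never down behaves like dm-linear.
dmsetup create VDE --table "0 41943040 flakey /dev/vde 0 180 0"
mount -o noatime,sync /dev/mapper/VDE /mnt/xfs &

# While log recovery sits in the added 10s sleep, swap in a table that is
# "down" for 180s, which fails all I/O with EIO.  On kernels that support
# the error_writes feature, writes alone can be failed instead:
#   "0 41943040 flakey /dev/vde 0 0 180 1 error_writes"
sleep 2
dmsetup suspend --noflush VDE
dmsetup load VDE --table "0 41943040 flakey /dev/vde 0 0 180"
dmsetup resume VDE
wait
```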

Now I did:
mount -o noatime,sync /dev/mapper/VDE /mnt/xfs

The log recovery flow went into the sleep that I added, and then I did:
echo 1 > /sys/module/dm_linear_custom/parameters/fail_writes

Problem reproduced:
Aug 12 14:23:04 vc-00-00-1383-dev kernel: [ 175.000657] XFS (dm-0):
Mounting Filesystem
Aug 12 14:23:04 vc-00-00-1383-dev kernel: [ 175.026991] XFS (dm-0): Sleep
10s before xlog_do_recover
Aug 12 14:23:14 vc-00-00-1383-dev kernel: [ 185.028113] XFS (dm-0):
Starting recovery (logdev: internal)
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556622] XFS (dm-0):
metadata I/O error: block 0x2 ("xlog_recover_iodone") error 28 numblks 1
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556675] XFS (dm-0):
metadata I/O error: block 0x40 ("xlog_recover_iodone") error 28 numblks 16
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556680] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556683] XFS (dm-0): I/O
Error Detected. Shutting down filesystem
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556684] XFS (dm-0): Please
umount the filesystem and rectify the problem(s)
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556766] XFS (dm-0):
metadata I/O error: block 0xa00002 ("xlog_recover_iodone") error 5 numblks 1
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556769] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556771] XFS (dm-0):
metadata I/O error: block 0xa00008 ("xlog_recover_iodone") error 5 numblks 8
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556774] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556776] XFS (dm-0):
metadata I/O error: block 0xa00010 ("xlog_recover_iodone") error 5 numblks 8
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556779] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556781] XFS (dm-0):
metadata I/O error: block 0xa00018 ("xlog_recover_iodone") error 5 numblks 8
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556783] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556785] XFS (dm-0):
metadata I/O error: block 0xa00040 ("xlog_recover_iodone") error 5 numblks
16
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556788] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556790] XFS (dm-0):
metadata I/O error: block 0xa00050 ("xlog_recover_iodone") error 5 numblks
16
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556793] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556880] XFS (dm-0):
metadata I/O error: block 0xa00001 ("xlog_recover_iodone") error 28 numblks
1
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556884] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556977] XFS (dm-0): log
mount/recovery failed: error 28
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.557215] XFS (dm-0): log
mount failed
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.573194] XFS ():
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.573214] XFS (): metadata
I/O error: block 0x18 ("xlog_recover_iodone") error 28 numblks 8
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.574685] XFS ():
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68

As you see, after mount completes, IO callbacks are still arriving and
printing junk (an empty string in this case). Immediately after that, the
kernel dies.

Is this description generic enough?

Thanks,
Alex.




-----Original Message-----
From: Brian Foster
Sent: 12 August, 2014 3:03 PM
To: Dave Chinner
Cc: Alex Lyakas ; ***@oss.sgi.com
Subject: Re: use-after-free on log replay failure
Post by Dave Chinner
.....
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
Post by Alex Lyakas
But I believe, my analysis shows that during the mount sequence XFS does not
wait properly for all the bios to complete, before failing the mount
sequence back to the caller.
As an experiment, what about the following? Compile tested only and not
safe for general use.
...
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..fbcf524 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
* case nothing will ever complete. It returns the I/O error code, if any, or
* 0 if there was no error.
*/
-int
-xfs_buf_iowait(
- xfs_buf_t *bp)
+static int
+__xfs_buf_iowait(
+ struct xfs_buf *bp,
+ bool skip_error)
{
trace_xfs_buf_iowait(bp, _RET_IP_);
- if (!bp->b_error)
+ if (skip_error || !bp->b_error)
wait_for_completion(&bp->b_iowait);
trace_xfs_buf_iowait_done(bp, _RET_IP_);
return bp->b_error;
}
+int
+xfs_buf_iowait(
+ struct xfs_buf *bp)
+{
+ return __xfs_buf_iowait(bp, false);
+}
+
xfs_caddr_t
xfs_buf_offset(
xfs_buf_t *bp,
@@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+ error2 = __xfs_buf_iowait(bp, true);
xfs_buf_relse(bp);
if (!error)
error = error2;
Not waiting here on buffer error should not matter. Any buffer that
is under IO and requires completion should be referenced, and that
means it should be caught and waited on by xfs_wait_buftarg() in the
mount failure path after log recovery fails.
I think that assumes the I/O is successful. Looking through
xlog_recover_buffer_pass2() as an example, we read the buffer which
should return with b_hold == 1. The delwri queue increments the hold and
we xfs_buf_relse() in the return path (i.e., buffer is now held by the
delwri queue awaiting submission).

Sometime later we delwri_submit()... xfs_buf_iorequest() does an
xfs_buf_hold() and xfs_buf_rele() within that single function. The
delwri_submit() releases its hold after xfs_buf_iowait(), which I guess
at that point bp should go onto the lru (b_hold back to 1 in
xfs_buf_rele()). Indeed, the caller has lost scope of the buffer at this
point.

So unless I miss something or got the lifecycle wrong here, which is
easily possible ;), this all hinges on xfs_buf_iowait(). That's where
the last hold forcing the buffer to stay around goes away.
xfs_buftarg_wait_rele() will dispose the buffer if b_hold == 1. If
xfs_buf_iowait() is racy in the event of I/O errors via the bio
callback, I think this path is susceptible just the same.
Post by Dave Chinner
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
---
I think that this patch fixes the problem. I tried reproducing it like
30 times, and it doesn't happen with this patch. Dropping this patch
reproduces the problem within 1 or 2 tries. Thanks!
What are next steps? How to make it "safe for general use"?
Ok, thanks for testing. I think that implicates the caller bypassing the
expected blocking with the right sequence of log recovery I/Os and
device failure. TBH, I'd still like to see the specifics, if possible.
Could you come up with a generic reproducer for this? I think a metadump
of the fs with the dirty log plus whatever device failure simulation
hack you're using would suffice.
The real issue is we don't know exactly what code is being tested
(it's 3.8 + random bug fix backports + custom code). Even if we have
a reproducer there's no guarantee it will reproduce on a current
kernel. IOWs, we are stumbling around in the dark bashing our heads
against everything in the room, and that just wastes everyone's
time.
We need a reproducer that triggers on a current, unmodified
kernel release. You can use dm-faulty to error out all writes just
like you are doing with your custom code. See
xfstests::tests/generic/321 and common/dmflakey for how to do this.
Ideally the reproducer is in a form that xfstests can use....
If you can't reproduce it on an upstream kernel, then git bisect is
your friend. It will find the commit that fixed the problem you are
seeing....
Ugh, yeah. The fact that this was customized as such apparently went
over my head. I agree completely. This needs to be genericized to a
pristine, preferably current kernel. The experiment patch could be
papering over something completely different.
Post by Dave Chinner
Post by Brian Foster
The ideal fix is not yet clear to me.
We are not even that far along - the root cause of the bug is not at
all clear to me. :/
Yeah.. the above was just the theory that motivated the experiment in
the previously posted patch. It of course remains a theory until we can
see the race in action. I was referring to the potential fix for the
raciness of xfs_buf_iowait() with regard to bio errors and the wq iodone
handling, while still asking for a reproducer to confirm the actual
problem. FWIW, I'm not too high on changes in the buf management code,
even a smallish behavior change, without a real trace of some sort that
documents the problem and justifies the change.

Brian
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Brian Foster
2014-08-12 19:31:04 UTC
Permalink
Post by Alex Lyakas
Hello Dave, Brian,
I will describe a generic reproduction that you ask for.
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
This seems generic enough to me. Could you try on a more recent kernel?
Dave had mentioned there were fixes in this area of log recovery, so a
bisect might be all that is necessary to track down the patch you need.
Otherwise, we can pick up debugging from something more recent.

Brian
Post by Alex Lyakas
commit dbf932a9b316d5b29b3e220e5a30e7a165ad2992
Date: Sat May 11 13:57:46 2013 -0700
Linux 3.8.13
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 96fcbb8..d756bf6 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3736,6 +3736,9 @@ xlog_recover(
if ((error = xlog_find_tail(log, &head_blk, &tail_blk)))
return error;
+ xfs_notice(log->l_mp, "Sleep 10s before xlog_do_recover");
+ msleep(10000);
+
if (tail_blk != head_blk) {
*
mkfs.xfs -f -K /dev/vde -p /etc/zadara/xfs.protofile
dummy : bootfilename, not used, backward compatibility
0 0 : numbers of blocks and inodes, not used,
backward compatibility
d--777 0 0 : set 777 perms for the root dir
$
$
rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
I started a couple of processes writing files sequentially onto this mount
point, and after a few seconds crashed the VM.
https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
I am attaching the code (and Makefile) of the dm-linear-custom target. It is
an exact copy of dm-linear, except that it has a module parameter. With the
parameter set to 0, this is an identity mapping onto /dev/vde. If the
parameter is set to non-0, all WRITE bios are failed with ENOSPC. There is a
workqueue to fail them in a different context (not sure if really needed,
but that's what our "real" custom
block device does).
mount -o noatime,sync /dev/mapper/VDE /mnt/xfs
echo 1 > /sys/module/dm_linear_custom/parameters/fail_writes
Mounting Filesystem
Aug 12 14:23:04 vc-00-00-1383-dev kernel: [ 175.026991] XFS (dm-0): Sleep
10s before xlog_do_recover
Starting recovery (logdev: internal)
metadata I/O error: block 0x2 ("xlog_recover_iodone") error 28 numblks 1
metadata I/O error: block 0x40 ("xlog_recover_iodone") error 28 numblks 16
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556683] XFS (dm-0): I/O
Error Detected. Shutting down filesystem
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556684] XFS (dm-0): Please
umount the filesystem and rectify the problem(s)
metadata I/O error: block 0xa00002 ("xlog_recover_iodone") error 5 numblks 1
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
metadata I/O error: block 0xa00008 ("xlog_recover_iodone") error 5 numblks 8
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
metadata I/O error: block 0xa00010 ("xlog_recover_iodone") error 5 numblks 8
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
metadata I/O error: block 0xa00018 ("xlog_recover_iodone") error 5 numblks 8
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
metadata I/O error: block 0xa00040 ("xlog_recover_iodone") error 5 numblks
16
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
metadata I/O error: block 0xa00050 ("xlog_recover_iodone") error 5 numblks
16
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
metadata I/O error: block 0xa00001 ("xlog_recover_iodone") error 28 numblks
1
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.556977] XFS (dm-0): log
mount/recovery failed: error 28
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.557215] XFS (dm-0): log
mount failed
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
Aug 12 14:23:18 vc-00-00-1383-dev kernel: [ 188.573214] XFS (): metadata
I/O error: block 0x18 ("xlog_recover_iodone") error 28 numblks 8
xfs_do_force_shutdown(0x1) called from line 377 of file
/mnt/work/alex/linux-stable/source/fs/xfs/xfs_log_recover.c. Return address
= 0xffffffffa0349f68
As you see, after mount completes, IO callbacks are still arriving and
printing junk (an empty string in this case). Immediately after that, the
kernel dies.
Is this description generic enough?
Thanks,
Alex.
-----Original Message----- From: Brian Foster
Sent: 12 August, 2014 3:03 PM
To: Dave Chinner
Subject: Re: use-after-free on log replay failure
Post by Dave Chinner
.....
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
Post by Alex Lyakas
But I believe, my analysis shows that during the mount sequence XFS
does not wait properly for all the bios to complete, before failing the
mount sequence back to the caller.
As an experiment, what about the following? Compile tested only and not
safe for general use.
...
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index cd7b8ca..fbcf524 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
* case nothing will ever complete. It returns the I/O error code,
if any, or
* 0 if there was no error.
*/
-int
-xfs_buf_iowait(
- xfs_buf_t *bp)
+static int
+__xfs_buf_iowait(
+ struct xfs_buf *bp,
+ bool skip_error)
{
trace_xfs_buf_iowait(bp, _RET_IP_);
- if (!bp->b_error)
+ if (skip_error || !bp->b_error)
wait_for_completion(&bp->b_iowait);
trace_xfs_buf_iowait_done(bp, _RET_IP_);
return bp->b_error;
}
+int
+xfs_buf_iowait(
+ struct xfs_buf *bp)
+{
+ return __xfs_buf_iowait(bp, false);
+}
+
xfs_caddr_t
xfs_buf_offset(
xfs_buf_t *bp,
@@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
bp = list_first_entry(&io_list, struct xfs_buf, b_list);
list_del_init(&bp->b_list);
- error2 = xfs_buf_iowait(bp);
+ error2 = __xfs_buf_iowait(bp, true);
xfs_buf_relse(bp);
if (!error)
error = error2;
Not waiting here on buffer error should not matter. Any buffer that
is under IO and requires completion should be referenced, and that
means it should be caught and waited on by xfs_wait_buftarg() in the
mount failure path after log recovery fails.
I think that assumes the I/O is successful. Looking through
xlog_recover_buffer_pass2() as an example, we read the buffer which
should return with b_hold == 1. The delwri queue increments the hold and
we xfs_buf_relse() in the return path (i.e., buffer is now held by the
delwri queue awaiting submission).
Sometime later we delwri_submit()... xfs_buf_iorequest() does an
xfs_buf_hold() and xfs_buf_rele() within that single function. The
delwri_submit() releases its hold after xfs_buf_iowait(), which I guess
at that point bp should go onto the lru (b_hold back to 1 in
xfs_buf_rele()). Indeed, the caller has lost scope of the buffer at this
point.
So unless I miss something or got the lifecycle wrong here, which is
easily possible ;), this all hinges on xfs_buf_iowait(). That's where
the last hold forcing the buffer to stay around goes away.
xfs_buftarg_wait_rele() will dispose the buffer if b_hold == 1. If
xfs_buf_iowait() is racy in the event of I/O errors via the bio
callback, I think this path is susceptible just the same.
Post by Dave Chinner
Post by Brian Foster
Post by Alex Lyakas
Post by Brian Foster
---
I think that this patch fixes the problem. I tried reproducing it like
30 times, and it doesn't happen with this patch. Dropping this patch
reproduces the problem within 1 or 2 tries. Thanks!
What are next steps? How to make it "safe for general use"?
Ok, thanks for testing. I think that implicates the caller bypassing the
expected blocking with the right sequence of log recovery I/Os and
device failure. TBH, I'd still like to see the specifics, if possible.
Could you come up with a generic reproducer for this? I think a metadump
of the fs with the dirty log plus whatever device failure simulation
hack you're using would suffice.
The real issue is we don't know exactly what code is being tested
(it's 3.8 + random bug fix backports + custom code). Even if we have
a reproducer there's no guarantee it will reproduce on a current
kernel. IOWs, we are stumbling around in the dark bashing our heads
against everything in the room, and that just wastes everyone's
time.
We need a reproducer that triggers on a current, unmodified
kernel release. You can use dm-faulty to error out all writes just
like you are doing with your custom code. See
xfstests::tests/generic/321 and common/dmflakey for how to do this.
Ideally the reproducer is in a form that xfstests can use....
If you can't reproduce it on an upstream kernel, then git bisect is
your friend. It will find the commit that fixed the problem you are
seeing....
Ugh, yeah. The fact that this was customized as such apparently went
over my head. I agree completely. This needs to be genericized to a
pristine, preferably current kernel. The experiment patch could be
papering over something completely different.
Post by Dave Chinner
Post by Brian Foster
The ideal fix is not yet clear to me.
We are not even that far along - the root cause of the bug is not at
all clear to me. :/
Yeah.. the above was just the theory that motivated the experiment in
the previously posted patch. It of course remains a theory until we can
see the race in action. I was referring to the potential fix for the
raciness of xfs_buf_iowait() with regard to bio errors and the wq iodone
handling, while still asking for a reproducer to confirm the actual
problem. FWIW, I'm not too high on changes in the buf management code,
even a smallish behavior change, without a real trace of some sort that
documents the problem and justifies the change.
Brian
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs
Dave Chinner
2014-08-12 23:56:15 UTC
Permalink
Post by Alex Lyakas
Hello Dave, Brian,
I will describe a generic reproduction that you ask for.
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
....
Post by Alex Lyakas
rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
I started a couple of processes writing files sequentially onto this
mount point, and after a few seconds crashed the VM.
https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
I am attaching the code (and Makefile) of the dm-linear-custom target.
It is an exact copy of dm-linear, except that it has a module
parameter. With the parameter set to 0, this is an identity mapping
onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
failed with ENOSPC. There is a workqueue to fail them in a different
context (not sure if really needed, but that's what our "real"
custom
block device does).
Well, there you go. That explains it - an asynchronous dispatch error
happening fast enough to race with the synchronous XFS dispatch
processing.

dispatch thread                           device workqueue
xfs_buf_hold();
atomic_set(b_io_remaining, 1)
atomic_inc(b_io_remaining)
submit_bio(bio)
                                          queue_work(bio)
xfs_buf_ioend(bp, ....);
  atomic_dec(b_io_remaining)
xfs_buf_rele()
                                          bio error set to ENOSPC
                                          bio->end_io()
                                            xfs_buf_bio_endio()
                                              bp->b_error = ENOSPC
                                              _xfs_buf_ioend(bp, 1);
                                                atomic_dec(b_io_remaining)
                                                xfs_buf_ioend(bp, 1);
                                                  queue_work(bp)
xfs_buf_iowait()
  if (bp->b_error) return error;
if (error)
  xfs_buf_relse()
    xfs_buf_rele()
      xfs_buf_free()

And now we have a freed buffer that is queued on the io completion
queue. Basically, it requires the buffer error to be set
asynchronously *between* the dispatch decrementing it's I/O count
after dispatch, but before we wait on the IO.

Not sure what the right fix is yet - removing the bp->b_error check
from xfs_buf_iowait() doesn't solve the problem - it just prevents
this code path from being tripped over by the race condition.

But, just to validate this is the problem, you should be able to
reproduce this on a 3.16 kernel. Can you try that, Alex?

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-08-13 12:59:32 UTC
Permalink
Post by Dave Chinner
Post by Alex Lyakas
Hello Dave, Brian,
I will describe a generic reproduction that you ask for.
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
....
Post by Alex Lyakas
rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
I started a couple of processes writing files sequentially onto this
mount point, and after a few seconds crashed the VM.
https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
I am attaching the code (and Makefile) of the dm-linear-custom target.
It is an exact copy of dm-linear, except that it has a module
parameter. With the parameter set to 0, this is an identity mapping
onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
failed with ENOSPC. There is a workqueue to fail them in a different
context (not sure if really needed, but that's what our "real"
custom
block device does).
Well, there you go. That explains it - an asynchronous dispatch error
happening fast enough to race with the synchronous XFS dispatch
processing.
dispatch thread                           device workqueue
xfs_buf_hold();
atomic_set(b_io_remaining, 1)
atomic_inc(b_io_remaining)
submit_bio(bio)
                                          queue_work(bio)
xfs_buf_ioend(bp, ....);
  atomic_dec(b_io_remaining)
xfs_buf_rele()
                                          bio error set to ENOSPC
                                          bio->end_io()
                                            xfs_buf_bio_endio()
                                              bp->b_error = ENOSPC
                                              _xfs_buf_ioend(bp, 1);
                                                atomic_dec(b_io_remaining)
                                                xfs_buf_ioend(bp, 1);
                                                  queue_work(bp)
xfs_buf_iowait()
  if (bp->b_error) return error;
if (error)
  xfs_buf_relse()
    xfs_buf_rele()
      xfs_buf_free()
And now we have a freed buffer that is queued on the io completion
queue. Basically, it requires the buffer error to be set
asynchronously *between* the dispatch decrementing its I/O count
after dispatch, but before we wait on the IO.
That's basically the theory I wanted to test with the experimental
patch. E.g., the error check races with the iodone workqueue item.
Post by Dave Chinner
Not sure what the right fix is yet - removing the bp->b_error check
from xfs_buf_iowait() doesn't solve the problem - it just prevents
this code path from being tripped over by the race condition.
Perhaps I'm missing some context... I don't follow how removing the
error check doesn't solve the problem. It clearly closes the race and
perhaps there are other means of doing the same thing, but what part of
the problem does that leave unresolved? E.g., we provide a
synchronization mechanism for an async submission path and an object
(xfs_buf) that is involved with potentially multiple such async (I/O)
operations. The async callback side manages the counts of outstanding
bios etc. to set the state of the buf object correctly and fires a
completion when everything is done. The calling side simply waits on the
completion before it can analyze state of the object. Referring to
anything inside that object that happens to be managed by the buffer I/O
mechanism before the buffer is considered complete just seems generally
racy.

It looks like submit_bio() manages this by providing the error through
the callback (always). It also doesn't look like submission path is
guaranteed to be synchronous either (consider md, which appears to use
workqueues and kernel threads), so I'm not sure that '...;
xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
you're explicitly looking for a write verifier error or something and
do nothing further on the buf contingent on completion (e.g., freeing it
or something it depends on).

Brian
Post by Dave Chinner
But, just to validate this is the problem, you should be able to
reproduce this on a 3.16 kernel. Can you try that, Alex?
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2014-08-13 20:59:29 UTC
Permalink
Post by Brian Foster
Post by Dave Chinner
Post by Alex Lyakas
Hello Dave, Brian,
I will describe a generic reproduction that you ask for.
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
....
Post by Alex Lyakas
rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
I started a couple of processes writing files sequentially onto this
mount point, and after few seconds crashed the VM.
https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
I am attaching the code (and Makefile) of dm-linear-custom target.
It is exact copy of dm-linear, except that it has a module
parameter. With the parameter set to 0, this is an identity mapping
onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
failed with ENOSPC. There is a workqueue to fail them in a different
context (not sure if really needed, but that's what our "real"
custom
block device does).
Well, there you go. That explains it - an asynchronous dispatch error
happening fast enough to race with the synchronous XFS dispatch
processing.
dispatch thread                           device workqueue
xfs_buf_hold();
atomic_set(b_io_remaining, 1)
atomic_inc(b_io_remaining)
submit_bio(bio)
                                          queue_work(bio)
xfs_buf_ioend(bp, ....);
  atomic_dec(b_io_remaining)
xfs_buf_rele()
                                          bio error set to ENOSPC
                                          bio->end_io()
                                            xfs_buf_bio_endio()
                                              bp->b_error = ENOSPC
                                              _xfs_buf_ioend(bp, 1);
                                                atomic_dec(b_io_remaining)
                                                xfs_buf_ioend(bp, 1);
                                                  queue_work(bp)
xfs_buf_iowait()
  if (bp->b_error) return error;
if (error)
  xfs_buf_relse()
    xfs_buf_rele()
      xfs_buf_free()
And now we have a freed buffer that is queued on the io completion
queue. Basically, it requires the buffer error to be set
asynchronously *between* the dispatch decrementing its I/O count
after dispatch, but before we wait on the IO.
That's basically the theory I wanted to test with the experimental
patch. E.g., the error check races with the iodone workqueue item.
Post by Dave Chinner
Not sure what the right fix is yet - removing the bp->b_error check
from xfs_buf_iowait() doesn't solve the problem - it just prevents
this code path from being tripped over by the race condition.
Perhaps I'm missing some context... I don't follow how removing the
error check doesn't solve the problem. It clearly closes the race and
perhaps there are other means of doing the same thing, but what part of
the problem does that leave unresolved?
Anything that does:

xfs_buf_iorequest(bp);
if (bp->b_error)
xfs_buf_relse(bp);

is susceptible to the same race condition, based on bp->b_error
being set asynchronously and before the buffer IO completion
processing is complete.
Post by Brian Foster
E.g., we provide a
synchronization mechanism for an async submission path and an object
(xfs_buf) that is involved with potentially multiple such async (I/O)
operations. The async callback side manages the counts of outstanding
bios etc. to set the state of the buf object correctly and fires a
completion when everything is done. The calling side simply waits on the
completion before it can analyze state of the object. Referring to
anything inside that object that happens to be managed by the buffer I/O
mechanism before the buffer is considered complete just seems generally
racy.
The point is that the IO submitter holds the buffer lock and so has
"exclusive" access to the buffer, even after it is submitted. It is
allowed to check the internal state of the buffer at any time, and
it is expected to be sane, including while IO completion processing
is running.

The real issue is that workqueue based IO completion processing is
not protected by a reference count of any kind for synchronous IO.
It is done with only the reference count of lock holder held, and so
if the lock holder unlocks and frees the buffer, then that buffer
will be freed.

This issue doesn't exist with B_ASYNC IO submission, because the
B_ASYNC IO owns the reference and the buffer lock and drops them
from the workqueue when the IO completion processing actually
completes...
Post by Brian Foster
It looks like submit_bio() manages this by providing the error through
the callback (always). It also doesn't look like the submission path is
guaranteed to be synchronous either (consider md, which appears to use
workqueues and kernel threads), so I'm not sure that '...;
xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
you're explicitly looking for a write verifier error or something and
do nothing further on the buf contingent on completion (e.g., freeing it
or something it depends on).
My point remains that it *should be safe*, and the intent is that
the caller should be able to check for submission errors without
being exposed to a use after free situation. That's the bug we need
to fix, not say "you can't check for submission errors on
synchronous IO" to avoid the race condition.....

Cheers,

Dave
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-08-13 23:21:35 UTC
Post by Brian Foster
Post by Brian Foster
Post by Dave Chinner
Post by Alex Lyakas
Hello Dave, Brian,
I will describe a generic reproduction that you ask for.
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
....
Post by Alex Lyakas
rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
I started a couple of processes writing files sequentially onto this
mount point, and after few seconds crashed the VM.
https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
I am attaching the code (and Makefile) of dm-linear-custom target.
It is exact copy of dm-linear, except that it has a module
parameter. With the parameter set to 0, this is an identity mapping
onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
failed with ENOSPC. There is a workqueue to fail them in a different
context (not sure if really needed, but that's what our "real" custom
block device does).
Well, there you go. That explains it - an asynchronous dispatch error
happening fast enough to race with the synchronous XFS dispatch
processing.
dispatch thread                         device workqueue
xfs_buf_hold();
atomic_set(b_io_remaining, 1)
atomic_inc(b_io_remaining)
submit_bio(bio)
                                        queue_work(bio)
xfs_buf_ioend(bp, ....);
  atomic_dec(b_io_remaining)
xfs_buf_rele()
                                        bio error set to ENOSPC
                                        bio->end_io()
                                          xfs_buf_bio_endio()
                                            bp->b_error = ENOSPC
                                            _xfs_buf_ioend(bp, 1);
                                              atomic_dec(b_io_remaining)
                                              xfs_buf_ioend(bp, 1);
                                                queue_work(bp)
xfs_buf_iowait()
  if (bp->b_error) return error;
if (error)
  xfs_buf_relse()
    xfs_buf_rele()
      xfs_buf_free()
And now we have a freed buffer that is queued on the io completion
queue. Basically, it requires the buffer error to be set
asynchronously after the dispatch thread decrements its I/O count,
but before we wait on the IO.
That's basically the theory I wanted to test with the experimental
patch. E.g., the error check races with the iodone workqueue item.
Post by Dave Chinner
Not sure what the right fix is yet - removing the bp->b_error check
from xfs_buf_iowait() doesn't solve the problem - it just prevents
this code path from being tripped over by the race condition.
Perhaps I'm missing some context... I don't follow how removing the
error check doesn't solve the problem. It clearly closes the race and
perhaps there are other means of doing the same thing, but what part of
the problem does that leave unresolved?
xfs_buf_iorequest(bp);
if (bp->b_error)
xfs_buf_relse(bp);
is susceptible to the same race condition, based on bp->b_error
being set asynchronously and before the buffer IO completion
processing is complete.
Understood, by why would anything do that (as opposed to
xfs_buf_iowait())? I don't see that we do that anywhere today
(the check buried within xfs_buf_iowait() notwithstanding of course).
Post by Brian Foster
From what I can see, all it really guarantees is that the submission has
either passed/failed the write verifier, yes?
Post by Brian Foster
Post by Brian Foster
E.g., we provide a
synchronization mechanism for an async submission path and an object
(xfs_buf) that is involved with potentially multiple such async (I/O)
operations. The async callback side manages the counts of outstanding
bios etc. to set the state of the buf object correctly and fires a
completion when everything is done. The calling side simply waits on the
completion before it can analyze state of the object. Referring to
anything inside that object that happens to be managed by the buffer I/O
mechanism before the buffer is considered complete just seems generally
racy.
The point is that the IO submitter holds the buffer lock and so has
"exclusive" access to the buffer, even after it is submitted. It is
allowed to check the internal state of the buffer at any time, and
it is expected to be sane, including while IO completion processing
is running.
Fair enough, but if the mechanism is async the submitter clearly knows
that's not failsafe (i.e., passing the check above doesn't mean the I/O
will not fail).
Post by Brian Foster
The real issue is that workqueue based IO completion processing is
not protected by a reference count of any kind for synchronous IO.
It is done with only the reference count of lock holder held, and so
if the lock holder unlocks and frees the buffer, then that buffer
will be freed.
I see, the notion of the work item having a hold/refcount on the buffer
makes sense. But with a submission/wait mechanism that waits on actual
"completion" of the I/O (e.g., complete(bp->b_iowait)), that only offers
protection for the case where a sync I/O submitter doesn't wait. I
suppose that's possible, but also seems like a misuse of sync I/O. E.g.,
why wouldn't that caller do async I/O?
Post by Brian Foster
This issue doesn't exist with B_ASYNC IO submission, because the
B_ASYNC IO owns the reference and the buffer lock and drops them
from the workqueue when the IO completion processing actually
completes...
Indeed. I wasn't clear on the reference ownership nuance between the
sync/async variants of I/O. Thanks for the context. That said, we also
don't have submitters that check for errors on async I/O either.
Post by Brian Foster
Post by Brian Foster
It looks like submit_bio() manages this by providing the error through
the callback (always). It also doesn't look like the submission path is
guaranteed to be synchronous either (consider md, which appears to use
workqueues and kernel threads), so I'm not sure that '...;
xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
you're explicitly looking for a write verifier error or something and
do nothing further on the buf contingent on completion (e.g., freeing it
or something it depends on).
My point remains that it *should be safe*, and the intent is that
the caller should be able to check for submission errors without
being exposed to a use after free situation. That's the bug we need
to fix, not say "you can't check for submission errors on
synchronous IO" to avoid the race condition.....
Well, technically you can check for submission errors on sync I/O, just
use the code you posted above. :) What we can't currently do is find out
when the I/O subsystem is done with the buffer.

Perhaps the point here is around the semantics of xfs_buf_iowait(). With
a mechanism that is fundamentally async, the sync variant obviously
becomes the async mechanism + some kind of synchronization. I'd expect
that synchronization to not necessarily just tell me whether an error
occurred, but also tell me when the I/O subsystem is done with the
object I've passed (e.g., so I'm free to chuck it, scribble over it, put
it back where I got it, whatever).

My impression is that's the purpose of the b_iowait mechanism.
Otherwise, what's the point of the whole
bio_end_io->buf_ioend->b_iodone->buf_ioend round trip dance?

Brian
Post by Brian Foster
Cheers,
Dave
--
Dave Chinner
Dave Chinner
2014-08-14 06:14:44 UTC
Post by Brian Foster
Post by Brian Foster
Post by Brian Foster
Perhaps I'm missing some context... I don't follow how removing the
error check doesn't solve the problem. It clearly closes the race and
perhaps there are other means of doing the same thing, but what part of
the problem does that leave unresolved?
xfs_buf_iorequest(bp);
if (bp->b_error)
xfs_buf_relse(bp);
is susceptible to the same race condition, based on bp->b_error
being set asynchronously and before the buffer IO completion
processing is complete.
Understood, but why would anything do that (as opposed to
xfs_buf_iowait())? I don't see that we do that anywhere today
(the check buried within xfs_buf_iowait() notwithstanding of course).
"Why" is not important - the fact is the caller *owns* the buffer
and so the above fragment of code is valid behaviour. If there is
an error on the buffer after xfs_buf_iorequest() request returns on
a synchronous IO, then it's a bug if there is still IO in progress
on that buffer.

We can't run IO completion synchronously from xfs_buf_bio_end_io in
this async dispatch error case - we cannot detect it as any
different from IO completion in interrupt context - and so we need
to have some kind of reference protecting the buffer from being
freed from under the completion.

i.e. the bug is that a synchronous buffer has no active reference
while it is sitting on the completion workqueue - its references
are owned by other contexts that can drop them without regard to
the completion status of the buffer.

For async IO we transfer a reference and the lock to the IO context,
which gets dropped in xfs_buf_iodone_work when all the IO is
complete. Synchronous IO needs this protection, too.

As a proof of concept, adding this to the start of
xfs_buf_iorequest():

+	/*
+	 * synchronous IO needs its own reference count. async IO
+	 * inherits the submitter's reference count.
+	 */
+	if (!(bp->b_flags & XBF_ASYNC))
+		xfs_buf_hold(bp);

And this to the synchronous IO completion case for
xfs_buf_iodone_work():

	else {
		ASSERT(read && bp->b_ops);
		complete(&bp->b_iowait);
+		xfs_buf_rele(bp);
	}

Should ensure that all IO carries a reference count and the buffer
cannot be freed until all IO processing has been completed.

This means it does not matter what the buffer owner does after
xfs_buf_iorequest() - even unconditionally calling xfs_buf_relse()
will not result in use-after-free as the b_hold count will not go to
zero until the IO completion processing has been finalised.

Fixing the rest of the mess (i.e. determining how to deal with
submission/completion races) is going to require more effort and
thought. For the moment, though, correctly reference counting
buffers will solve the use-after-free without changing any
other behaviour.
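Dave's proof-of-concept can be modelled in a few lines of userspace C. This is only a sketch of the reference-counting idea (the struct and helper names below are invented, not the real xfs_buf API): the sync submitter takes its own hold, so even an unconditional release by the buffer owner cannot free the buffer before completion processing drops the last reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy userspace model of the idea, not the real xfs_buf structure. */
struct buf {
	atomic_int b_hold;	/* reference count, like bp->b_hold */
	bool b_async;		/* models the XBF_ASYNC flag */
	bool freed;		/* set when the last reference drops */
};

static void buf_hold(struct buf *bp)
{
	atomic_fetch_add(&bp->b_hold, 1);
}

static void buf_rele(struct buf *bp)
{
	/* stand-in for xfs_buf_rele()/xfs_buf_free() */
	if (atomic_fetch_sub(&bp->b_hold, 1) == 1)
		bp->freed = true;
}

/* Models xfs_buf_iorequest(): sync IO takes its own reference. */
static void buf_iorequest(struct buf *bp)
{
	if (!bp->b_async)
		buf_hold(bp);
	/* ...bios would be built and submitted here... */
}

/* Models the sync branch of xfs_buf_iodone_work(): completion
 * processing drops the IO reference only after it has finished
 * touching the buffer (after complete(&bp->b_iowait)). */
static void buf_iodone_work(struct buf *bp)
{
	buf_rele(bp);
}
```

With this in place, the owner can call its release unconditionally after submission; the buffer only goes away once the completion side has also dropped its hold.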
Post by Brian Foster
From what I can see, all it really guarantees is that the submission has
either passed/failed the write verifier, yes?
No. It can also mean it wasn't rejected by the lower layers as
they process the bio passed by submit_bio(). e.g. ENODEV, because
the underlying device has been hot-unplugged, EIO because the
buffer is beyond the end of the device, etc.
Post by Brian Foster
Post by Brian Foster
Post by Brian Foster
It looks like submit_bio() manages this by providing the error through
the callback (always). It also doesn't look like the submission path is
guaranteed to be synchronous either (consider md, which appears to use
workqueues and kernel threads), so I'm not sure that '...;
xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
you're explicitly looking for a write verifier error or something and
do nothing further on the buf contingent on completion (e.g., freeing it
or something it depends on).
My point remains that it *should be safe*, and the intent is that
the caller should be able to check for submission errors without
being exposed to a use after free situation. That's the bug we need
to fix, not say "you can't check for submission errors on
synchronous IO" to avoid the race condition.....
Well, technically you can check for submission errors on sync I/O, just
use the code you posted above. :) What we can't currently do is find out
when the I/O subsystem is done with the buffer.
By definition, a buffer marked with an error after submission has
completed processing. It should not need to be waited on, and
therein lies the bug.
Post by Brian Foster
Perhaps the point here is around the semantics of xfs_buf_iowait(). With
a mechanism that is fundamentally async, the sync variant obviously
becomes the async mechanism + some kind of synchronization. I'd expect
that synchronization to not necessarily just tell me whether an error
occurred, but also tell me when the I/O subsystem is done with the
object I've passed (e.g., so I'm free to chuck it, scribble over it, put
it back where I got it, whatever).
My impression is that's the purpose of the b_iowait mechanism.
Otherwise, what's the point of the whole
bio_end_io->buf_ioend->b_iodone->buf_ioend round trip dance?
Yes, that's exactly what xfs_buf_iorequest/xfs_buf_iowait() provides
and the b_error indication is an integral part of that
synchronisation mechanism. Unfortunately, that is also the part of
the mechanism that is racy and causing problems.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-08-14 19:05:22 UTC
Post by Dave Chinner
Post by Brian Foster
Post by Brian Foster
Post by Brian Foster
Perhaps I'm missing some context... I don't follow how removing the
error check doesn't solve the problem. It clearly closes the race and
perhaps there are other means of doing the same thing, but what part of
the problem does that leave unresolved?
xfs_buf_iorequest(bp);
if (bp->b_error)
xfs_buf_relse(bp);
is susceptible to the same race condition, based on bp->b_error
being set asynchronously and before the buffer IO completion
processing is complete.
Understood, but why would anything do that (as opposed to
xfs_buf_iowait())? I don't see that we do that anywhere today
(the check buried within xfs_buf_iowait() notwithstanding of course).
"Why" is not important - the fact is the caller *owns* the buffer
and so the above fragment of code is valid behaviour. If there is
an error on the buffer after xfs_buf_iorequest() request returns on
a synchronous IO, then it's a bug if there is still IO in progress
on that buffer.
A buffer can consist of multiple I/Os, yes? If so, then it seems quite
possible for one I/O to fail and set an error on the buffer while
another might still be in progress. I don't see how that can be
considered a bug in general.

Even if not, I'd expect to see a comment explaining why any code
fragment such as the above is not broken because that's not
deterministic at all, even with a single I/O. E.g., the error could very
well have been set by the callback where we clearly continue I/O
processing so-to-speak (though we could consider where we currently set
the error in the callback sequence a bug as well).
Post by Dave Chinner
We can't run IO completion synchronously from xfs_buf_bio_end_io in
this async dispatch error case - we cannot detect it as any
different from IO completion in interrupt context - and so we need
to have some kind of reference protecting the buffer from being
freed from under the completion.
Indeed.
Post by Dave Chinner
i.e. the bug is that a synchronous buffer has no active reference
while it is sitting on the completion workqueue - its references
are owned by other contexts that can drop them without regard to
the completion status of the buffer.
For async IO we transfer a reference and the lock to the IO context,
which gets dropped in xfs_buf_iodone_work when all the IO is
complete. Synchronous IO needs this protection, too.
As a proof of concept, adding this to the start of
+	/*
+	 * synchronous IO needs its own reference count. async IO
+	 * inherits the submitter's reference count.
+	 */
+	if (!(bp->b_flags & XBF_ASYNC))
+		xfs_buf_hold(bp);
And this to the synchronous IO completion case for
	else {
		ASSERT(read && bp->b_ops);
		complete(&bp->b_iowait);
+		xfs_buf_rele(bp);
	}
Should ensure that all IO carries a reference count and the buffer
cannot be freed until all IO processing has been completed.
This means it does not matter what the buffer owner does after
xfs_buf_iorequest() - even unconditionally calling xfs_buf_relse()
will not result in use-after-free as the b_hold count will not go to
zero until the IO completion processing has been finalised.
Makes sense, assuming we handle the possible error cases and whatnot
therein. Thinking some more, suppose we take this ref, submit one I/O
successfully and a subsequent fails. Then who is responsible for
releasing the reference?
Post by Dave Chinner
Fixing the rest of the mess (i.e. determining how to deal with
submission/completion races) is going to require more effort and
thought. For the moment, though, correctly reference counting
buffers will solve the use-after-free without changing any
other behaviour.
Post by Brian Foster
From what I can see, all it really guarantees is that the submission has
either passed/failed the write verifier, yes?
No. It can also mean it wasn't rejected by the lower layersi as
they process the bio passed by submit_bio(). e.g. ENODEV, because
the underlying device has been hot-unplugged, EIO because the
buffer is beyond the end of the device, etc.
Those are the errors that happen to be synchronously processed today.
That's an implementation detail. submit_bio() is an asynchronous
interface so I don't see any guarantee that will always be the case.
E.g., that's easily broken should somebody decide to defer early end_io
processing to a workqueue. We do a similar thing ourselves for the
reasons you've stated above.

I don't see anything in or around submit_bio() or generic_make_request()
that suggest the interface is anything but async. From
generic_make_request():

/**
...
* generic_make_request() does not return any status. The
* success/failure status of the request, along with notification of
* completion, is delivered asynchronously through the bio->bi_end_io
* function described (one day) else where.
*
...
*/
Post by Dave Chinner
Post by Brian Foster
Post by Brian Foster
Post by Brian Foster
It looks like submit_bio() manages this by providing the error through
the callback (always). It also doesn't look like the submission path is
guaranteed to be synchronous either (consider md, which appears to use
workqueues and kernel threads), so I'm not sure that '...;
xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
you're explicitly looking for a write verifier error or something and
do nothing further on the buf contingent on completion (e.g., freeing it
or something it depends on).
My point remains that it *should be safe*, and the intent is that
the caller should be able to check for submission errors without
being exposed to a use after free situation. That's the bug we need
to fix, not say "you can't check for submission errors on
synchronous IO" to avoid the race condition.....
Well, technically you can check for submission errors on sync I/O, just
use the code you posted above. :) What we can't currently do is find out
when the I/O subsystem is done with the buffer.
By definition, a buffer marked with an error after submission has
completed processing. It should not need to be waited on, and
therein lies the bug.
I suppose that implicates the error processing on the callback side. We
set the error and continue processing on the buffer. Another option
could be to shuffle that around on the callback side, but to me _that_
is an approach that narrowly avoids the race rather than closing it via
use of synchronization.
Post by Dave Chinner
Post by Brian Foster
Perhaps the point here is around the semantics of xfs_buf_iowait(). With
a mechanism that is fundamentally async, the sync variant obviously
becomes the async mechanism + some kind of synchronization. I'd expect
that synchronization to not necessarily just tell me whether an error
occurred, but also tell me when the I/O subsystem is done with the
object I've passed (e.g., so I'm free to chuck it, scribble over it, put
it back where I got it, whatever).
My impression is that's the purpose of the b_iowait mechanism.
Otherwise, what's the point of the whole
bio_end_io->buf_ioend->b_iodone->buf_ioend round trip dance?
Yes, that's exactly what xfs_buf_iorequest/xfs_buf_iowait() provides
and the b_error indication is an integral part of that
synchronisation mechanism. Unfortunately, that is also the part of
the mechanism that is racy and causing problems.
I don't see how b_iowait itself is racy. It completes when the I/O
completes. The problem is that we've overloaded these mechanisms to
where we attempt to use them for multiple things. b_error can be a
submission error or a deeper I/O error. Adding the b_error check to
xfs_buf_iowait() converts it to the same "has a submission error
occurred? or has any error occurred yet? or wait until all buffer
I/O is complete" non-deterministic semantics.

I agree that the reference count protection for sync I/O sounds useful
and closes a gap (need to think about that some more), but to check for
an error as part of the synchronization mechanism means that the
mechanism simply isn't synchronous.

I don't see any paths outside of I/O submission itself that care
significantly about one general form of error (submission) vs. another
(async I/O error) as opposed to simply whether an error has occurred or
not. The fact that an error _can_ occur at any time until all of the
outstanding bios are completed overrides the fact that some might occur
by the time submission completes, meaning we have to handle the former
for any callers that care about errors in general. Waiting on b_iowait
before checking b_error is an always-safe way to do that.
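The "wait, then check" discipline described above can be sketched with a pthread condition variable standing in for b_iowait (the names here are invented for the model, not the xfs_buf API): the submitter never inspects the error field until the outstanding-IO count has drained to zero.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Toy model: a pthread completion standing in for b_iowait. */
struct mbuf {
	pthread_mutex_t lock;
	pthread_cond_t done;
	int io_remaining;	/* like b_io_remaining */
	int error;		/* like b_error */
};

/* Models the iodone workqueue: fail the IO, then signal completion. */
static void *completion_worker(void *arg)
{
	struct mbuf *bp = arg;

	pthread_mutex_lock(&bp->lock);
	bp->error = ENOSPC;	/* asynchronous dispatch failure */
	if (--bp->io_remaining == 0)
		pthread_cond_signal(&bp->done);
	pthread_mutex_unlock(&bp->lock);
	return NULL;
}

/* Models xfs_buf_iowait(): block until all IO completion processing
 * has run, and only then report the error state. */
static int buf_iowait(struct mbuf *bp)
{
	int err;

	pthread_mutex_lock(&bp->lock);
	while (bp->io_remaining > 0)
		pthread_cond_wait(&bp->done, &bp->lock);
	err = bp->error;
	pthread_mutex_unlock(&bp->lock);
	return err;
}

/* Submit one "bio", wait, and return the error seen after waiting. */
int sync_io_demo(void)
{
	struct mbuf bp = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.done = PTHREAD_COND_INITIALIZER,
		.io_remaining = 1,
		.error = 0,
	};
	pthread_t wq;
	int err;

	pthread_create(&wq, NULL, completion_worker, &bp);
	err = buf_iowait(&bp);	/* never peeks at bp.error early */
	pthread_join(&wq, NULL);
	return err;
}
```

Because the error is only read under the lock after io_remaining has reached zero, the demo always observes the failure, with no window where the submitter acts on a half-set error.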

Brian
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2014-08-14 22:27:56 UTC
Post by Brian Foster
Post by Dave Chinner
Post by Brian Foster
Post by Brian Foster
Post by Brian Foster
Perhaps I'm missing some context... I don't follow how removing the
error check doesn't solve the problem. It clearly closes the race and
perhaps there are other means of doing the same thing, but what part of
the problem does that leave unresolved?
xfs_buf_iorequest(bp);
if (bp->b_error)
xfs_buf_relse(bp);
is susceptible to the same race condition, based on bp->b_error
being set asynchronously and before the buffer IO completion
processing is complete.
Understood, but why would anything do that (as opposed to
xfs_buf_iowait())? I don't see that we do that anywhere today
(the check buried within xfs_buf_iowait() notwithstanding of course).
"Why" is not important - the fact is the caller *owns* the buffer
and so the above fragment of code is valid behaviour. If there is
an error on the buffer after xfs_buf_iorequest() request returns on
a synchronous IO, then it's a bug if there is still IO in progress
on that buffer.
A buffer can consist of multiple I/Os, yes? If so, then it seems quite
possible for one I/O to fail and set an error on the buffer while
another might still be in progress. I don't see how that can be
considered a bug in general.
You miss my point. We have to mark the buffer with an error, but it
should not be visible to the submitter until all IO on the buffer
is done. i.e. setting bp->b_error from the completion path
needs to be deferred until xfs_buf_iodone_work() is run after all
submitted IOs on the buffer have completed.
Post by Brian Foster
Even if not, I'd expect to see a comment explaining why any code
fragment such as the above is not broken because that's not
deterministic at all, even with a single I/O. E.g., the error could very
well have been set by the callback where we clearly continue I/O
processing so-to-speak (though we could consider where we currently set
the error in the callback sequence a bug as well).
Chicken, meet Egg.

There's no comment saying the above code is OK or not because we
assume that if we hold the buffer lock it is safe to look at any
state on the buffer at any time. What this thread points out is that
synchronous IO changes the state of the buffer in a context that
does not hold the buffer lock, and so violates that assumption.
Post by Brian Foster
Post by Dave Chinner
This means it does not matter what the buffer owner does after
xfs_buf_iorequest() - even unconditionally calling xfs_buf_relse()
will not result in use-after-free as the b_hold count will not go to
zero until the IO completion processing has been finalised.
Makes sense, assuming we handle the possible error cases and whatnot
therein. Thinking some more, suppose we take this ref, submit one I/O
successfully and a subsequent fails. Then who is responsible for
releasing the reference?
xfs_buf_iodone_work() doesn't run until b_io_remaining goes to zero.
That's the context that releases the ref. It doesn't matter how many
are submitted, complete successfully or fail.
Post by Brian Foster
Post by Dave Chinner
Fixing the rest of the mess (i.e. determining how to deal with
submission/completion races) is going to require more effort and
thought. For the moment, though, correctly reference counting
buffers will solve the use-after-free without changing any
other behaviour.
Post by Brian Foster
From what I can see, all it really guarantees is that the submission has
either passed/failed the write verifier, yes?
No. It can also mean it wasn't rejected by the lower layers as
they process the bio passed by submit_bio(). e.g. ENODEV, because
the underlying device has been hot-unplugged, EIO because the
buffer is beyond the end of the device, etc.
Those are the errors that happen to be synchronously processed today.
That's an implementation detail. submit_bio() is an asynchronous
interface so I don't see any guarantee that will always be the case.
E.g., that's easily broken should somebody decide to defer early end_io
processing to a workqueue. We do a similar thing ourselves for the
reasons you've stated above.
I don't see anything in or around submit_bio() or generic_make_request()
that suggest the interface is anything but async. From
/**
...
* generic_make_request() does not return any status. The
* success/failure status of the request, along with notification of
* completion, is delivered asynchronously through the bio->bi_end_io
* function described (one day) else where.
*
...
*/
Delivery of the error is through bio->bi_end_io, but that is not
necessarily from an asynchronous context. e.g. the first thing that
generic_make_request() does is run generic_make_request_checks(),
where a failure runs bio_endio() and therefore bio->bi_end_io()
in the submitter's context. i.e. *synchronously*. This is exactly
what the b_io_remaining reference owned by xfs_buf_iorequest() is
being taken for - to prevent the *XFS endio processing* from being
run before bio submission has completed.
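A tiny single-threaded model of that extra submission-side count (invented names, not the kernel code): because the submitter holds its own b_io_remaining reference across the submit loop, a bio whose end_io fires synchronously mid-loop can never drive the count to zero and trigger completion processing early.

```c
#include <assert.h>

/* Toy single-threaded model of the extra submission count held
 * across the bio submit loop. */
struct iobuf {
	int b_io_remaining;
	int completions_run;	/* times "iodone work" would have fired */
};

static void iobuf_ioend(struct iobuf *bp)
{
	if (--bp->b_io_remaining == 0)
		bp->completions_run++;	/* queue_work(bp) in the real code */
}

/* A submit that is rejected immediately, running its end_io
 * synchronously in the caller's context, as
 * generic_make_request_checks() can. */
static void submit_bio_fails_sync(struct iobuf *bp)
{
	iobuf_ioend(bp);
}

static void iobuf_iorequest(struct iobuf *bp, int nbios)
{
	int i;

	bp->b_io_remaining = 1;		/* submission's own count */
	for (i = 0; i < nbios; i++) {
		bp->b_io_remaining++;
		submit_bio_fails_sync(bp);	/* end_io runs right here */
	}
	/* Drop the submission count: completion can fire only now.
	 * Without the initial count of 1, the first synchronous
	 * end_io above would have hit zero in mid-submission. */
	iobuf_ioend(bp);
}
```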
Post by Brian Foster
Post by Dave Chinner
Post by Brian Foster
Post by Brian Foster
Post by Brian Foster
It looks like submit_bio() manages this by providing the error through
the callback (always). It also doesn't look like the submission path is
guaranteed to be synchronous either (consider md, which appears to use
workqueues and kernel threads), so I'm not sure that '...;
xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
you're explicitly looking for a write verifier error or something and
do nothing further on the buf contingent on completion (e.g., freeing it
or something it depends on).
My point remains that it *should be safe*, and the intent is that
the caller should be able to check for submission errors without
being exposed to a use after free situation. That's the bug we need
to fix, not say "you can't check for submission errors on
synchronous IO" to avoid the race condition.....
Well, technically you can check for submission errors on sync I/O, just
use the code you posted above. :) What we can't currently do is find out
when the I/O subsystem is done with the buffer.
By definition, a buffer marked with an error after submission
processing is complete. It should not need to be waited on, and
there-in lies the bug.
I suppose that implicates the error processing on the callback side. We
set the error and continue processing on the buffer. Another option
could be to shuffle that around on the callback side, but to me _that_
is an approach that narrowly avoids the race rather than closing it via
use of synchronization.
We need to store the error from the callback somewhere, until
b_io_remaining falls to zero. Right now we are putting it in
b_error, which makes it immediately visible to the submitter without
any synchronisation. Basically, we can't put state into anything
that the submitter is expected to check. We have internal state
fields that are not to be used externally:

spinlock_t b_lock; /* internal state lock */
unsigned int b_state; /* internal state flags */

So what we probably need to do is add an internal:

int b_io_error; /* internal error state */

serialised by b_lock, where we stuff errors from xfs_buf_bio_end_io
and then propagate them to b_error in xfs_buf_iodone_work() when all
the IO is complete....
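The proposed b_io_error/b_error split can be sketched as follows (a userspace model with invented names; the real code would serialise b_io_error under b_lock): bio completions record errors internally, and only the xfs_buf_iodone_work() analogue makes them visible to the submitter.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of the proposed split: errors from bio completion land
 * in an internal field and only become externally visible once all
 * bios are done. */
struct ebuf {
	int b_error;		/* what the submitter may inspect */
	int b_io_error;		/* internal; would be under b_lock */
	int b_io_remaining;	/* outstanding bio count */
};

/* Models xfs_buf_bio_end_io(): record the error internally only. */
static void ebuf_bio_end_io(struct ebuf *bp, int error)
{
	if (error && !bp->b_io_error)
		bp->b_io_error = error;	/* first error wins */
	bp->b_io_remaining--;
}

/* Models xfs_buf_iodone_work(): once the last bio has completed,
 * propagate the internal error to the submitter-visible state. */
static void ebuf_iodone_work(struct ebuf *bp)
{
	if (bp->b_io_remaining == 0)
		bp->b_error = bp->b_io_error;
}
```

The submitter-visible field changes only in one place, after all IO has drained, so checking it after the iowait analogue is race-free.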

And with the extra reference count on the buffer, it doesn't matter
if the submitter detects a submission error and releases its
reference to the buffer while there is still other IO in progress.
Hence we solve all the issues without changing the current
submit/wait semantics, or needing to remove the submission error
check from xfs_buf_iowait.
Post by Brian Foster
I don't see any paths outside of I/O submission itself that care
significantly about one general form of error (submission) vs. another
(async I/O error) as opposed to simply whether an error has occurred or
not. The fact that an error _can_ occur at any time until all of the
outstanding bios are completed overrides the fact that some might occur
by the time submission completes, meaning we have to handle the former
for any callers that care about errors in general. Waiting on b_iowait
before checking b_error is an always-safe way to do that.
Assuming there are no other bugs, it's "always safe". And that's the
key thing - the fact that we actually have a check for an error
before waiting indicates that it hasn't always been safe to wait on
a buffer marked with a submission error.

This code is full of cruft and mess. It's full of history,
band-aids upon band-aids, partially completed cleanups, left over
complexity from times gone past and new complexity layered over the
top of structures not originally designed to support those uses.
Let's fix the use after free bug right away, and then clean up the
cruft and fix the underlying problems so we can guarantee that
"always-safe" behaviour and not have it blow up in our faces in
future....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Alex Lyakas
2014-08-13 17:07:58 UTC
Hi Dave,
I compiled kernel 3.16 today, top commit being:
commit 19583ca584d6f574384e17fe7613dfaeadcdc4a6
Author: Linus Torvalds <***@linux-foundation.org>
Date: Sun Aug 3 15:25:02 2014 -0700

Linux 3.16


(also had to use the updated copy of dm-linear from 3.16).

I would say that the problem reproduces, but it looks a little different.
Typical flow is:

Aug 13 19:45:48 vc-00-00-1383-dev kernel: [ 143.521383] XFS (dm-0):
Mounting V4 Filesystem
Aug 13 19:45:48 vc-00-00-1383-dev kernel: [ 143.558558] XFS (dm-0): Sleep
10s before log recovery
Aug 13 19:45:58 vc-00-00-1383-dev kernel: [ 153.560139] XFS (dm-0):
Starting recovery (logdev: internal)
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.085826] XFS (dm-0):
metadata I/O error: block 0x1 ("xlog_recover_iodone") error 28 numblks 1
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.087336] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 383 of file
/mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.087341] XFS (dm-0): I/O
Error Detected. Shutting down filesystem
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088455] XFS (dm-0): Please
umount the filesystem and rectify the problem(s)
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088601] XFS (dm-0):
metadata I/O error: block 0x8 ("xlog_recover_iodone") error 28 numblks 8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088605] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 383 of file
/mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088658] XFS (dm-0):
metadata I/O error: block 0x10 ("xlog_recover_iodone") error 28 numblks 8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088663] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 383 of file
/mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088702] XFS (dm-0):
metadata I/O error: block 0x40 ("xlog_recover_iodone") error 28 numblks 16
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088714] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 383 of file
/mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088719] XFS (dm-0):
metadata I/O error: block 0xa00060 ("xlog_recover_iodone") error 28 numblks
16
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.088731] XFS (dm-0):
xfs_do_force_shutdown(0x1) called from line 383 of file
/mnt/share/src/linux-mainline/fs/xfs/xfs_log_recover.c. Return address =
0xffffffffa030d5f8
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.096288] XFS (dm-0): log
mount/recovery failed: error 28
Aug 13 19:45:59 vc-00-00-1383-dev kernel: [ 154.096482] XFS (dm-0): log
mount failed

So there is no clear-cut failure like with my development kernel
(which also had KMEMLEAK enabled, which changes timings significantly).
However, after several seconds, the kernel panics sporadically with stacks
like [1] and [2], and sometimes KVM dies on me with messages like [3] and
[4]. So this test definitely corrupts some internal kernel structures, and
with KMEMLEAK enabled the corruption is more apparent.

I put the metadump from this experiment here:
https://drive.google.com/file/d/0ByBy89zr3kJNSGxoc3U4X1lOZ2s/edit?usp=sharing

Thanks,
Alex.


[1]
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.223389] BUG: unable to
handle kernel paging request at 00007f9e50678000
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] IP:
[<ffffffff811923a8>] kmem_cache_alloc+0x68/0x160
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] PGD 362b8067 PUD
7a9d9067 PMD 7849f067 PTE 0
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Oops: 0000 [#1] SMP
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Modules linked in:
xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
twofish_common camellia_generic serpent_generic blowfish_generic
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac
xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin nfsd
nfs_acl ppdev dm_multipath psmouse parport_pc serio_raw i2c_piix4 mac_hid lp
rpcsec_gss_krb5 parport auth_rpcgss oid_registry nfsv4 nfs fscache lockd
sunrpc floppy
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] CPU: 1 PID: 3302
Comm: cron Tainted: G O 3.16.0-999-generic #201408131143
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Hardware name:
Bochs Bochs, BIOS Bochs 01/01/2007
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] task:
ffff88007a9d48c0 ti: ffff88007aeb0000 task.ti: ffff88007aeb0000
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RIP:
0010:[<ffffffff811923a8>] [<ffffffff811923a8>] kmem_cache_alloc+0x68/0x160
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RSP:
0018:ffff88007aeb3cc0 EFLAGS: 00010286
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RAX:
0000000000000000 RBX: ffff88007a137048 RCX: 000000000001ff04
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RDX:
000000000001ff03 RSI: 00000000000000d0 RDI: 0000000000015d20
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RBP:
ffff88007aeb3d10 R08: ffff88007fc95d20 R09: 0000000000000003
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] R10:
0000000000000003 R11: ffff8800362b83d0 R12: ffff88007d001b00
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] R13:
00007f9e50678000 R14: ffffffff811767fd R15: 00000000000000d0
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] FS:
00007f9e5066b7c0(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] CS: 0010 DS: 0000
ES: 0000 CR0: 000000008005003b
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] CR2:
00007f9e50678000 CR3: 0000000036743000 CR4: 00000000000006e0
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Stack:
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] 00007f9e4e9d4fff
ffffffff811767d6 0000000000000040 0000000000000048
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] ffff88007aeb3d40
ffff88007a137048 ffff88007af7cb80 ffff880079814da8
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] ffff88007af7cc38
ffff88007af7cc48 ffff88007aeb3d40 ffffffff811767fd
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Call Trace:
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff811767d6>] ? anon_vma_fork+0x56/0x130
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff811767fd>] anon_vma_fork+0x7d/0x130
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff816d0426>] dup_mmap+0x1c5/0x380
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff816d069f>] dup_mm+0xbe/0x155
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff816d07fc>] copy_mm+0xc6/0xe9
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff81054174>] copy_process.part.35+0x694/0xe90
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff810549f0>] copy_process+0x80/0x90
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff81054b12>] do_fork+0x62/0x280
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff812ca6ee>] ? cap_task_setnice+0xe/0x10
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff812cbfd6>] ? security_task_setnice+0x16/0x20
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff8106a648>] ? set_one_prio+0x88/0xd0
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff81054db6>] SyS_clone+0x16/0x20
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff816e9b39>] stub_clone+0x69/0x90
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019]
[<ffffffff816e9812>] ? system_call_fastpath+0x16/0x1b
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] Code: 00 49 8b 50
08 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 e6 00 00 00 48 85 c0 0f 84 dd 00 00
00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f
c7 0f 0f 94 c0 84 c0 74 b5 49
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RIP
[<ffffffff811923a8>] kmem_cache_alloc+0x68/0x160
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] RSP
<ffff88007aeb3cc0>
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.224019] CR2:
00007f9e50678000
Aug 13 19:46:01 vc-00-00-1383-dev kernel: [ 156.276967] ---[ end trace
56118c807adeb6de ]---
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.136602] BUG: unable to
handle kernel paging request at 00007f9e50678000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.137877] IP:
[<ffffffff811917ac>] kmem_cache_alloc_trace+0x6c/0x160
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.139042] PGD 79732067 PUD 0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.139645] Oops: 0000 [#2] SMP
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] Modules linked in:
xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
twofish_common camellia_generic serpent_generic blowfish_generic
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac
xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin nfsd
nfs_acl ppdev dm_multipath psmouse parport_pc serio_raw i2c_piix4 mac_hid lp
rpcsec_gss_krb5 parport auth_rpcgss oid_registry nfsv4 nfs fscache lockd
sunrpc floppy
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] CPU: 1 PID: 1688
Comm: whoopsie Tainted: G D O 3.16.0-999-generic #201408131143
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] Hardware name:
Bochs Bochs, BIOS Bochs 01/01/2007
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] task:
ffff88007c261840 ti: ffff8800799a8000 task.ti: ffff8800799a8000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RIP:
0010:[<ffffffff811917ac>] [<ffffffff811917ac>]
kmem_cache_alloc_trace+0x6c/0x160
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RSP:
0018:ffff8800799abdd8 EFLAGS: 00010286
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RAX:
0000000000000000 RBX: ffff88007bd0e080 RCX: 000000000001ff04
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RDX:
000000000001ff03 RSI: 00000000000000d0 RDI: 0000000000015d20
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RBP:
ffff8800799abe28 R08: ffff88007fc95d20 R09: 0000000000000000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] R10:
0000000000000000 R11: 0000000000000246 R12: ffff88007d001b00
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] R13:
00007f9e50678000 R14: ffffffff815bbbdb R15: 00000000000000d0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] FS:
00007fa0248e37c0(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] CS: 0010 DS: 0000
ES: 0000 CR0: 0000000080050033
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] CR2:
00007f9e50678000 CR3: 00000000367c0000 CR4: 00000000000006e0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] Stack:
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] ffff88007c7ee780
ffffffff815bbbbd 0000000000000040 0000000000000280
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] ffff8800799abe08
ffff88007bd0e080 0000000000000000 0000000000000000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] 0000000000000000
0000000000000000 ffff8800799abe58 ffffffff815bbbdb
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] Call Trace:
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050]
[<ffffffff815bbbbd>] ? sock_alloc_inode+0x2d/0xd0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050]
[<ffffffff815bbbdb>] sock_alloc_inode+0x4b/0xd0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050]
[<ffffffff811bb326>] alloc_inode+0x26/0xa0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050]
[<ffffffff811bcce3>] new_inode_pseudo+0x13/0x60
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050]
[<ffffffff815bbe8e>] sock_alloc+0x1e/0x80
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050]
[<ffffffff815be2d5>] __sock_create+0x95/0x200
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050]
[<ffffffff815be4a0>] sock_create+0x30/0x40
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050]
[<ffffffff815bf5a6>] SyS_socket+0x36/0xb0
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050]
[<ffffffff816e9812>] system_call_fastpath+0x16/0x1b
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] Code: 00 49 8b 50
08 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 e2 00 00 00 48 85 c0 0f 84 d9 00 00
00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f
c7 0f 0f 94 c0 84 c0 74 b5 49
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RIP
[<ffffffff811917ac>] kmem_cache_alloc_trace+0x6c/0x160
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] RSP
<ffff8800799abdd8>
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.140050] CR2:
00007f9e50678000
Aug 13 19:46:12 vc-00-00-1383-dev kernel: [ 167.185653] ---[ end trace
56118c807adeb6df ]---


[2]
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.566506] BUG: unable to
handle kernel paging request at 00007f41d65a100c
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.567769] IP:
[<ffffffff81406a47>] vring_add_indirect+0x87/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.568859] PGD 7b4a7067 PUD
7b4a8067 PMD 7b4aa067 PTE 8000000076f5b065
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570055] Oops: 0003 [#1] SMP
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Modules linked in:
xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
twofish_common camellia_generic serpent_generic blowfish_generic
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac
xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo nfsd nfs_acl
dm_round_robin ppdev parport_pc mac_hid dm_multipath i2c_piix4 psmouse
serio_raw rpcsec_gss_krb5 auth_rpcgss oid_registry lp nfsv4 parport nfs
fscache lockd sunrpc floppy
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CPU: 3 PID: 3329
Comm: mount Tainted: G O 3.16.0-999-generic #201408131143
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Hardware name:
Bochs Bochs, BIOS Bochs 01/01/2007
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] task:
ffff88007b2ac8c0 ti: ffff88007a900000 task.ti: ffff88007a900000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RIP:
0010:[<ffffffff81406a47>] [<ffffffff81406a47>]
vring_add_indirect+0x87/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RSP:
0018:ffff88007a903688 EFLAGS: 00010006
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RAX:
ffff88007a9037b8 RBX: ffff88007a903788 RCX: 00007f41d65a1000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RDX:
000000000001da11 RSI: 0000160000000000 RDI: ffff88007a9037b8
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RBP:
ffff88007a9036d8 R08: ffff88007fd95d20 R09: ffffffff814069f6
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] R10:
0000000000000003 R11: ffff88007a903860 R12: ffffffff81406cb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] R13:
ffff88007c9ae000 R14: 0000000000000000 R15: 00007f41d65a1000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] FS:
00007f41d67b5800(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CS: 0010 DS: 0000
ES: 0000 CR0: 0000000080050033
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CR2:
00007f41d65a100c CR3: 000000007a943000 CR4: 00000000000006e0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Stack:
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ffff88007fff8700
0000000300000202 000000007a9037b8 0000000100000002
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ffff88007a903768
ffff88007c9ae000 ffff88007c9ae000 ffff88007a903788
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] 0000000000000001
0000000000000003 ffff88007a903768 ffffffff814072a2
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Call Trace:
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff814072a2>] virtqueue_add_sgs+0x2f2/0x340
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8147854a>] __virtblk_add_req+0xda/0x1b0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132f307>] ? __bt_get+0xc7/0x1e0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132f47a>] ? bt_get+0x5a/0x180
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132953c>] ? blk_rq_map_sg+0x3c/0x170
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8147871a>] virtio_queue_rq+0xfa/0x250
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132d39f>] __blk_mq_run_hw_queue+0xff/0x260
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132d97d>] blk_mq_run_hw_queue+0x7d/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132dd71>] blk_sq_make_request+0x171/0x2f0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81320605>] generic_make_request.part.75+0x75/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff813215a8>] generic_make_request+0x68/0x70
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81321625>] submit_bio+0x75/0x140
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d2773>] _submit_bh+0x113/0x160
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d27d0>] submit_bh+0x10/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d6135>] block_read_full_page+0x1f5/0x340
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d6d20>] ? I_BDEV+0x10/0x10
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8115c445>] ? __inc_zone_page_state+0x35/0x40
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8113eca4>] ? __add_to_page_cache_locked+0xa4/0x130
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d7798>] blkdev_readpage+0x18/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8114aee8>] read_pages+0xe8/0x100
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8114b1e3>] __do_page_cache_readahead+0x163/0x170
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8114b555>] force_page_cache_readahead+0x75/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8114b5d3>] page_cache_sync_readahead+0x43/0x50
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8113f91e>] do_generic_file_read+0x30e/0x490
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81140c44>] generic_file_read_iter+0xf4/0x150
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8116b4b8>] ? handle_mm_fault+0x48/0x80
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8116edc0>] ? find_vma+0x20/0x80
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d6ec7>] blkdev_read_iter+0x37/0x40
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811a10d8>] new_sync_read+0x78/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811a218b>] vfs_read+0xab/0x180
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811a240f>] SyS_read+0x4f/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff816e9812>] system_call_fastpath+0x16/0x1b
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Code: 00 00 00 00
48 63 45 c4 48 8b 04 c3 48 85 c0 74 59 0f 1f 00 49 63 ce 48 be 00 00 00 00
00 16 00 00 48 89 c7 48 c1 e1 04 4c 01 f9 <66> c7 41 0c 01 00 48 8b 10 48 83
e2 fc 48 01 f2 8b 70 08 48 c1
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RIP
[<ffffffff81406a47>] vring_add_indirect+0x87/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RSP
<ffff88007a903688>
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CR2:
00007f41d65a100c
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ---[ end trace
cc83f9989ae9e2af ]---
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] BUG: unable to
handle kernel paging request at 0000000079a0b008
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] IP:
[<ffffffff810d18b0>] acct_collect+0x60/0x1b0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] PGD 7b4ac067 PUD
7b58c067 PMD 7b4a3067 PTE 0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Oops: 0000 [#2] SMP
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Modules linked in:
xfs(O) libcrc32c dm_linear_custom(O) deflate ctr twofish_generic
twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
twofish_common camellia_generic serpent_generic blowfish_generic
blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cmac
xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo nfsd nfs_acl
dm_round_robin ppdev parport_pc mac_hid dm_multipath i2c_piix4 psmouse
serio_raw rpcsec_gss_krb5 auth_rpcgss oid_registry lp nfsv4 parport nfs
fscache lockd sunrpc floppy
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CPU: 3 PID: 3329
Comm: mount Tainted: G D O 3.16.0-999-generic #201408131143
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Hardware name:
Bochs Bochs, BIOS Bochs 01/01/2007
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] task:
ffff88007b2ac8c0 ti: ffff88007a900000 task.ti: ffff88007a900000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RIP:
0010:[<ffffffff810d18b0>] [<ffffffff810d18b0>] acct_collect+0x60/0x1b0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RSP:
0018:ffff88007a903328 EFLAGS: 00010006
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RAX:
0000000079a0b000 RBX: 0000000000000009 RCX: 0000000000000245
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RDX:
0000000000000275 RSI: 0000000000000001 RDI: ffff8800362fb860
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RBP:
ffff88007a903348 R08: 00007ffffffff000 R09: 0000000000000d01
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] R10:
0000000000000000 R11: 0000000000000000 R12: ffff88007afa4800
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] R13:
ffff88007b2ac8c0 R14: 0001000101133010 R15: 0000000000000046
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] FS:
00007f41d67b5800(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CS: 0010 DS: 0000
ES: 0000 CR0: 0000000080050033
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CR2:
0000000079a0b008 CR3: 000000007a943000 CR4: 00000000000006e0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Stack:
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ffff88007b2ac8c0
0000000000000009 ffff88007a9035d8 0000000000000001
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ffff88007a903398
ffffffff810582f7 0000000000000007 0000000000000006
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] 000000000000000a
0000000000000046 0000000000000009 ffff88007a9035d8
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Call Trace:
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff810582f7>] do_exit+0x367/0x470
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81006465>] oops_end+0xa5/0xf0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff816cf708>] no_context+0x1be/0x1cd
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff816cf8e9>] __bad_area_nosemaphore+0x1d2/0x1f1
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff816cf9ae>] bad_area_access_error+0x45/0x4e
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81044dfe>] __do_page_fault+0x2fe/0x550
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81210435>] ? kernfs_path+0x55/0x70
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8134a4e3>] ? cfq_find_alloc_queue+0x293/0x430
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81141635>] ? mempool_alloc_slab+0x15/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81141635>] ? mempool_alloc_slab+0x15/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81406cb0>] ? virtqueue_get_buf+0xe0/0xe0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8104516c>] do_page_fault+0xc/0x10
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff816eb222>] page_fault+0x22/0x30
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81406cb0>] ? virtqueue_get_buf+0xe0/0xe0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff814069f6>] ? vring_add_indirect+0x36/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81406a47>] ? vring_add_indirect+0x87/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff814069f6>] ? vring_add_indirect+0x36/0x210
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff814072a2>] virtqueue_add_sgs+0x2f2/0x340
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8147854a>] __virtblk_add_req+0xda/0x1b0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132f307>] ? __bt_get+0xc7/0x1e0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132f47a>] ? bt_get+0x5a/0x180
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132953c>] ? blk_rq_map_sg+0x3c/0x170
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8147871a>] virtio_queue_rq+0xfa/0x250
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132d39f>] __blk_mq_run_hw_queue+0xff/0x260
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132d97d>] blk_mq_run_hw_queue+0x7d/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8132dd71>] blk_sq_make_request+0x171/0x2f0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81320605>] generic_make_request.part.75+0x75/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff813215a8>] generic_make_request+0x68/0x70
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81321625>] submit_bio+0x75/0x140
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d2773>] _submit_bh+0x113/0x160
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d27d0>] submit_bh+0x10/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d6135>] block_read_full_page+0x1f5/0x340
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d6d20>] ? I_BDEV+0x10/0x10
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8115c445>] ? __inc_zone_page_state+0x35/0x40
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8113eca4>] ? __add_to_page_cache_locked+0xa4/0x130
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d7798>] blkdev_readpage+0x18/0x20
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8114aee8>] read_pages+0xe8/0x100
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8114b1e3>] __do_page_cache_readahead+0x163/0x170
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8114b555>] force_page_cache_readahead+0x75/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8114b5d3>] page_cache_sync_readahead+0x43/0x50
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8113f91e>] do_generic_file_read+0x30e/0x490
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff81140c44>] generic_file_read_iter+0xf4/0x150
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8116b4b8>] ? handle_mm_fault+0x48/0x80
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff8116edc0>] ? find_vma+0x20/0x80
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811d6ec7>] blkdev_read_iter+0x37/0x40
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811a10d8>] new_sync_read+0x78/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811a218b>] vfs_read+0xab/0x180
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff811a240f>] SyS_read+0x4f/0xb0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399]
[<ffffffff816e9812>] system_call_fastpath+0x16/0x1b
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Code: 00 74 55 49
8b bd a0 03 00 00 48 83 c7 60 e8 a8 65 61 00 49 8b 85 a0 03 00 00 48 8b 00
48 85 c0 74 1d 66 0f 1f 84 00 00 00 00 00 <4c> 03 70 08 4c 2b 30 48 8b 40 10
48 85 c0 75 f0 49 c1 ee 0a 65
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RIP
[<ffffffff810d18b0>] acct_collect+0x60/0x1b0
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] RSP
<ffff88007a903328>
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] CR2:
0000000079a0b008
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] ---[ end trace
cc83f9989ae9e2b0 ]---
Aug 13 19:49:06 vc-00-00-1383-dev kernel: [ 118.570399] Fixing recursive
fault but reboot is needed!

[3]
(qemu) KVM internal error. Suberror: 1
emulation failure
RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000
RDX=0000000000000623
RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000
RSP=0000000000000000
R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000
R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000
R15=0000000000000000
RIP=000000000000fff0 RFL=00010002 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00000000
CS =0033 0000000000000000 ffffffff 00a0fb00 DPL=3 CS64 [-RA]
SS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA]
DS =0000 0000000000000000 ffffffff 00000000
FS =0000 00007f0514f7c700 ffffffff 00000000
GS =0000 0000000000000000 ffffffff 00000000
LDT=0000 0000000000000000 ffffffff 00000000
TR =0040 ffff88007fc90380 00002087 00008b00 DPL=0 TSS64-busy
GDT= ffff88007fc84000 0000007f
IDT= ffffffffff57c000 00000fff
CR0=80050033 CR2=000000000061cd64 CR3=000000007ae6b000 CR4=000006e0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01

[4]
(qemu) qemu-system-x86_64: virtio: trying to map MMIO memory

-----Original Message-----
From: Dave Chinner
Sent: 13 August, 2014 2:56 AM
To: Alex Lyakas
Cc: Brian Foster ; ***@oss.sgi.com
Subject: Re: use-after-free on log replay failure
Post by Alex Lyakas
Hello Dave, Brian,
I will describe a generic reproduction that you ask for.
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
....
Post by Alex Lyakas
rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
I started a couple of processes writing files sequentially onto this
mount point, and after few seconds crashed the VM.
https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
I am attaching the code (and Makefile) of dm-linear-custom target.
It is an exact copy of dm-linear, except that it has a module
parameter. With the parameter set to 0, this is an identity mapping
onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
failed with ENOSPC. There is a workqueue to fail them in a different
context (not sure if really needed, but that's what our "real"
custom
block device does).
Well, there you go. That explains it - an asynchronous dispatch error
happening fast enough to race with the synchronous XFS dispatch
processing.

dispatch thread                            device workqueue
xfs_buf_hold();
atomic_set(b_io_remaining, 1)
atomic_inc(b_io_remaining)
submit_bio(bio)
                                           queue_work(bio)
xfs_buf_ioend(bp, ....);
  atomic_dec(b_io_remaining)
xfs_buf_rele()
                                           bio error set to ENOSPC
                                           bio->end_io()
                                             xfs_buf_bio_endio()
                                               bp->b_error = ENOSPC
                                               _xfs_buf_ioend(bp, 1);
                                                 atomic_dec(b_io_remaining)
                                                 xfs_buf_ioend(bp, 1);
                                                   queue_work(bp)
xfs_buf_iowait()
  if (bp->b_error) return error;
if (error)
  xfs_buf_relse()
    xfs_buf_rele()
      xfs_buf_free()

And now we have a freed buffer that is queued on the io completion
queue. Basically, it requires the buffer error to be set
asynchronously *between* the dispatch thread decrementing its I/O
count after submission and the subsequent wait on the I/O.

Not sure what the right fix is yet - removing the bp->b_error check
from xfs_buf_iowait() doesn't solve the problem - it just prevents
this code path from being tripped over by the race condition.

But, just to validate this is the problem, you should be able to
reproduce this on a 3.16 kernel. Can you try that, Alex?

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Dave Chinner
2014-08-13 00:03:12 UTC
Permalink
Post by Alex Lyakas
dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
I am attaching the code (and Makefile) of the dm-linear-custom target.
It is an exact copy of dm-linear, except that it has a module
parameter. With the parameter set to 0, this is an identity mapping
onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
failed with ENOSPC. There is a workqueue to fail them in a different
context (not sure if that is really needed, but that's what our
"real" custom block device does).
FWIW, now that I've looked at the dm module, this could easily be
added to the dm-flakey driver by adding a "queue_write_error" option
to it (i.e. similar to the current drop_writes and corrupt_bio_byte
options).

If we add the code there, then we could add a debug-only XFS sysfs
variable to trigger the log recovery sleep, and then use dm-flakey
to queue and error out writes. That gives us a reproducible xfstest
for this condition. Brian, does that sound like a reasonable plan to
you?

Thanks for describing the method you've been using to reproduce the
bug so clearly, Alex.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Brian Foster
2014-08-13 13:11:42 UTC
Permalink
Post by Dave Chinner
Post by Alex Lyakas
dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
I am attaching the code (and Makefile) of the dm-linear-custom target.
It is an exact copy of dm-linear, except that it has a module
parameter. With the parameter set to 0, this is an identity mapping
onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
failed with ENOSPC. There is a workqueue to fail them in a different
context (not sure if that is really needed, but that's what our
"real" custom block device does).
FWIW, now that I've looked at the dm module, this could easily be
added to the dm-flakey driver by adding a "queue_write_error" option
to it (i.e. similar to the current drop_writes and corrupt_bio_byte
options).
If we add the code there, then we could add a debug-only XFS sysfs
variable to trigger the log recovery sleep, and then use dm-flakey
to queue and error out writes. That gives us a reproducible xfstest
for this condition. Brian, does that sound like a reasonable plan to
you?
It would be nice if we could avoid this kind of timing hack, but I'll
have to look at the related code and see what we have for options. I'm
also assuming the oops/crash is a consistent behavior when we hit the
race, since that probably defines the failure mode of the test.

But yeah, seems like a reasonable plan in general. Added to the todo
list for once we have this sorted out...

Brian
Post by Dave Chinner
Thanks for describing the method you've been using to reproduce the
bug so clearly, Alex.
Cheers,
Dave.
--
Dave Chinner