Discussion:
[PATCH] Prevent extent btree block allocation failures
Lachlan McIlroy
2008-06-13 07:38:12 UTC
Permalink
When at ENOSPC conditions extent btree block allocations can fail and we
have no error handling to undo partial btree operations. Prior to extent
btree operations we reserve enough disk blocks somewhere in the filesystem
to satisfy the operation but in some conditions we require the blocks to
come from specific AGs and if those AGs are full the allocation fails.

This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed. In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.

Lachlan

--- fs/xfs/xfs_alloc.c_1.193 2008-06-03 11:28:55.000000000 +1000
+++ fs/xfs/xfs_alloc.c 2008-06-02 18:40:47.000000000 +1000
@@ -2376,19 +2376,9 @@ xfs_alloc_vextent(
if (args->agno == sagno &&
type == XFS_ALLOCTYPE_START_BNO)
args->type = XFS_ALLOCTYPE_THIS_AG;
- /*
- * For the first allocation, we can try any AG to get
- * space. However, if we already have allocated a
- * block, we don't want to try AGs whose number is below
- * sagno. Otherwise, we may end up with out-of-order
- * locking of AGF, which might cause deadlock.
- */
- if (++(args->agno) == mp->m_sb.sb_agcount) {
- if (args->firstblock != NULLFSBLOCK)
- args->agno = sagno;
- else
- args->agno = 0;
- }
+
+ if (++(args->agno) == mp->m_sb.sb_agcount)
+ args->agno = 0;
/*
* Reached the starting a.g., must either be done
* or switch to non-trylock mode.
--- fs/xfs/xfs_bmap.c_1.392 2008-06-03 12:20:14.000000000 +1000
+++ fs/xfs/xfs_bmap.c 2008-06-03 15:57:40.000000000 +1000
@@ -3445,16 +3452,10 @@ xfs_bmap_extents_to_btree(
args.tp = tp;
args.mp = mp;
args.firstblock = *firstblock;
- if (*firstblock == NULLFSBLOCK) {
- args.type = XFS_ALLOCTYPE_START_BNO;
+ args.fsbno = *firstblock;
+ if (*firstblock == NULLFSBLOCK)
args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
- } else if (flist->xbf_low) {
- args.type = XFS_ALLOCTYPE_START_BNO;
- args.fsbno = *firstblock;
- } else {
- args.type = XFS_ALLOCTYPE_NEAR_BNO;
- args.fsbno = *firstblock;
- }
+ args.type = XFS_ALLOCTYPE_START_BNO;
args.minlen = args.maxlen = args.prod = 1;
args.total = args.minleft = args.alignment = args.mod = args.isfl =
args.minalignslop = 0;
@@ -3585,13 +3586,10 @@ xfs_bmap_local_to_extents(
* Allocate a block. We know we need only one, since the
* file currently fits in an inode.
*/
- if (*firstblock == NULLFSBLOCK) {
+ args.fsbno = *firstblock;
+ if (*firstblock == NULLFSBLOCK)
args.fsbno = XFS_INO_TO_FSB(args.mp, ip->i_ino);
- args.type = XFS_ALLOCTYPE_START_BNO;
- } else {
- args.fsbno = *firstblock;
- args.type = XFS_ALLOCTYPE_NEAR_BNO;
- }
+ args.type = XFS_ALLOCTYPE_START_BNO;
args.total = total;
args.mod = args.minleft = args.alignment = args.wasdel =
args.isfl = args.minalignslop = 0;
--- fs/xfs/xfs_bmap_btree.c_1.169 2008-06-03 11:28:56.000000000 +1000
+++ fs/xfs/xfs_bmap_btree.c 2008-06-06 14:48:14.000000000 +1000
@@ -1493,11 +1493,9 @@ xfs_bmbt_split(
left = XFS_BUF_TO_BMBT_BLOCK(lbp);
args.fsbno = cur->bc_private.b.firstblock;
args.firstblock = args.fsbno;
- if (args.fsbno == NULLFSBLOCK) {
+ if (args.fsbno == NULLFSBLOCK)
args.fsbno = lbno;
- args.type = XFS_ALLOCTYPE_START_BNO;
- } else
- args.type = XFS_ALLOCTYPE_NEAR_BNO;
+ args.type = XFS_ALLOCTYPE_START_BNO;
args.mod = args.minleft = args.alignment = args.total = args.isfl =
args.userdata = args.minalignslop = 0;
args.minlen = args.maxlen = args.prod = 1;
@@ -2253,9 +2251,8 @@ xfs_bmbt_newroot(
}
#endif
args.fsbno = be64_to_cpu(*pp);
- args.type = XFS_ALLOCTYPE_START_BNO;
- } else
- args.type = XFS_ALLOCTYPE_NEAR_BNO;
+ }
+ args.type = XFS_ALLOCTYPE_START_BNO;
if ((error = xfs_alloc_vextent(&args))) {
XFS_BMBT_TRACE_CURSOR(cur, ERROR);
return error;
Christoph Hellwig
2008-06-13 13:44:18 UTC
Permalink
Post by Lachlan McIlroy
This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed.
This looks good and makes lot of sense to me. Please also add a comment
to enum xfs_alloctype in xfs_alloc.h about the danger of the allocation
types that never go out of the AG when used inside transactions.
Post by Lachlan McIlroy
In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.
This one also makes sense, but I have a very bad gut feeling about it.
There's nothing preventing the same deadlock scenario from coming back
when people modify the highlevel data allocator again. We really need
some sort of assert to trigger early in that case to not got a nasty
hard to trigger deadlock.
Post by Lachlan McIlroy
+ if (++(args->agno) == mp->m_sb.sb_agcount)
and while we're at it this should be

if (++args->agno == mp->m_sb.sb_agcount)
Lachlan McIlroy
2008-06-16 03:57:52 UTC
Permalink
Post by Christoph Hellwig
Post by Lachlan McIlroy
This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed.
This looks good and makes lot of sense to me. Please also add a comment
to enum xfs_alloctype in xfs_alloc.h about the danger of the allocation
types that never go out of the AG when used inside transactions.
The flags already have comments? Are you referring specifically to the
fact we have reserved space somewhere in the filesystem but expect to find
it in a single AG?
Post by Christoph Hellwig
Post by Lachlan McIlroy
In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.
This one also makes sense, but I have a very bad gut feeling about it.
There's nothing preventing the same deadlock scenario from coming back
when people modify the highlevel data allocator again. We really need
some sort of assert to trigger early in that case to not got a nasty
hard to trigger deadlock.
Yes I agree. I was thinking about making the allocation of a second data
extent in the same transaction a conditional try operation to avoid the
deadlock. If a caller to the allocator provides room to return more than
one extent then I don't think we have to return more than one - multiple
extents in one transaction is just an optimisation.
Post by Christoph Hellwig
Post by Lachlan McIlroy
+ if (++(args->agno) == mp->m_sb.sb_agcount)
and while we're at it this should be
if (++args->agno == mp->m_sb.sb_agcount)
Done, thanks.
Dave Chinner
2008-06-13 15:57:08 UTC
Permalink
Post by Lachlan McIlroy
When at ENOSPC conditions extent btree block allocations can fail and we
have no error handling to undo partial btree operations. Prior to extent
btree operations we reserve enough disk blocks somewhere in the filesystem
to satisfy the operation but in some conditions we require the blocks to
come from specific AGs and if those AGs are full the allocation fails.
This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed.
Sure, but we didn't reserve space for potential btree splits in a
second AG as a result of this. That needs to be reserved in the
transaction as well, which will blow out transaction reservations
substantially as we'll need to add another 2 full AGF btree splits to
every transaction that modifies the bmap btree.
Post by Lachlan McIlroy
In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.
You missed the fact that the AGF of modified AGs is already held
locked in the transaction, hence the locking order within the
transaction is wrong. Also, if we modify the free list in an AG
the fail an allocation (e.g. can't do an exact allocation), we'll
have multiple dirty and locked AGFs in the one allocation. Hence
we still can have locking order violations if you remove that check
and therefore deadlocks.

This is not the solution to the problem. As I suggested (back when
you first floated this idea as a fix for the problem several weeks
ago) I think the bug is that we are not taking into account the
number of blocks required for a bmbt split when selecting an AG to
allocate from. All we take into account is the blocks required for
the extent to be allocated and nothing else. If we take the blocks
for a bmbt split into account then we'll never try to allocate an
extent in an AG that we can't also allocate all the blocks for the
bmbt split in at the same time.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Lachlan McIlroy
2008-06-16 06:11:09 UTC
Permalink
Post by Dave Chinner
Post by Lachlan McIlroy
When at ENOSPC conditions extent btree block allocations can fail and we
have no error handling to undo partial btree operations. Prior to extent
btree operations we reserve enough disk blocks somewhere in the filesystem
to satisfy the operation but in some conditions we require the blocks to
come from specific AGs and if those AGs are full the allocation fails.
This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed.
Sure, but we didn't reserve space for potential btree splits in a
second AG as a result of this. That needs to be reserved in the
transaction as well, which will blow out transaction reservations
substantially as we'll need to add another 2 full AGF btree splits to
every transaction that modifies the bmap btree.
Right. And most of the time we wont need the space either so it's a
real waste.
Post by Dave Chinner
Post by Lachlan McIlroy
In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.
You missed the fact that the AGF of modified AGs is already held
locked in the transaction, hence the locking order within the
transaction is wrong. Also, if we modify the free list in an AG
the fail an allocation (e.g. can't do an exact allocation), we'll
have multiple dirty and locked AGFs in the one allocation. Hence
we still can have locking order violations if you remove that check
and therefore deadlocks.
I'm well aware of that particular deadlock involving the freelist - I
hit it while testing. If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic. This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.
Post by Dave Chinner
This is not the solution to the problem. As I suggested (back when
you first floated this idea as a fix for the problem several weeks
ago) I think the bug is that we are not taking into account the
number of blocks required for a bmbt split when selecting an AG to
allocate from. All we take into account is the blocks required for
the extent to be allocated and nothing else. If we take the blocks
for a bmbt split into account then we'll never try to allocate an
extent in an AG that we can't also allocate all the blocks for the
bmbt split in at the same time.
I considered that approach (using the minleft field in xfs_alloc_arg_t)
but it has it's problems too. When we reserve space for the btree
operations it is done on the global filesystem counters, not a
particular AG, so there is the possibility that not one AG has sufficent
space to perform the allocation even though there is enough free space
in the whole filesystem. Of course if we have enough space left in one
AG and the AG is locked then the space we reserved doesn't matter anymore
and it should all work.

I'm worried with this approach that we could have delayed allocations and
unwritten extents that need to be converted but we can't do it because we
don't have the space we might need (but probably don't).
Dave Chinner
2008-06-16 17:10:22 UTC
Permalink
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
When at ENOSPC conditions extent btree block allocations can fail and we
have no error handling to undo partial btree operations. Prior to extent
btree operations we reserve enough disk blocks somewhere in the filesystem
to satisfy the operation but in some conditions we require the blocks to
come from specific AGs and if those AGs are full the allocation fails.
This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed.
Sure, but we didn't reserve space for potential btree splits in a
second AG as a result of this. That needs to be reserved in the
transaction as well, which will blow out transaction reservations
substantially as we'll need to add another 2 full AGF btree splits to
every transaction that modifies the bmap btree.
Right. And most of the time we wont need the space either so it's a
real waste.
Waste, yes, but still needed otherwise transaction overruns and log space
deadlocks could occur....
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.
You missed the fact that the AGF of modified AGs is already held
locked in the transaction, hence the locking order within the
transaction is wrong. Also, if we modify the free list in an AG
the fail an allocation (e.g. can't do an exact allocation), we'll
have multiple dirty and locked AGFs in the one allocation. Hence
we still can have locking order violations if you remove that check
and therefore deadlocks.
I'm well aware of that particular deadlock involving the freelist - I
hit it while testing. If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic. This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.
Yes, which is why you need to prevent freelist modifications occurring
when you can't allocate anything out of the AG.
Post by Lachlan McIlroy
Post by Dave Chinner
This is not the solution to the problem. As I suggested (back when
you first floated this idea as a fix for the problem several weeks
ago) I think the bug is that we are not taking into account the
number of blocks required for a bmbt split when selecting an AG to
allocate from. All we take into account is the blocks required for
the extent to be allocated and nothing else. If we take the blocks
for a bmbt split into account then we'll never try to allocate an
extent in an AG that we can't also allocate all the blocks for the
bmbt split in at the same time.
I considered that approach (using the minleft field in xfs_alloc_arg_t)
but it has it's problems too. When we reserve space for the btree
operations it is done on the global filesystem counters, not a
particular AG, so there is the possibility that not one AG has sufficent
space to perform the allocation even though there is enough free space
in the whole filesystem.
Yes, we had that problem with the ENOSPC deadlock fixes in that we always
needed at least 4 blocks per AG available for a extent free to succeed.
Hence we have the XFS_ALLOC_SET_ASIDE() value for determining if the
filesystem is out of space, not a count of zero free blocks.

In this case, this macro can be extended to guarantee that our aggregate
block usage never goes below the threshold that would prevent each AG from
holding enough blocks for a worst case allocation to succeed.....
Post by Lachlan McIlroy
Of course if we have enough space left in one
AG and the AG is locked then the space we reserved doesn't matter anymore
and it should all work.
Yes.
Post by Lachlan McIlroy
I'm worried with this approach that we could have delayed allocations and
unwritten extents that need to be converted but we can't do it because we
don't have the space we might need (but probably don't).
Delayed allocation is the issue - unwritten extent conversion failure will
simply return an error and leave the extent unwritten.

With delayed allocation, we can oversubscribe a given AG but we can
always try a different AG. If we get to the situation that we don't have
enough space in any AG then we are screwed. However, by ensuring we can
sustain a worst-case split within any AG we can avoid this situation
completely.

Cheers,

Dave.
Lachlan McIlroy
2008-06-17 01:58:47 UTC
Permalink
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
When at ENOSPC conditions extent btree block allocations can fail and we
have