Discussion:
[PATCH] Prevent extent btree block allocation failures
(too old to reply)
Lachlan McIlroy
2008-06-13 07:38:12 UTC
Permalink
When at ENOSPC conditions extent btree block allocations can fail and we
have no error handling to undo partial btree operations. Prior to extent
btree operations we reserve enough disk blocks somewhere in the filesystem
to satisfy the operation but in some conditions we require the blocks to
come from specific AGs and if those AGs are full the allocation fails.

This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed. In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.

Lachlan

--- fs/xfs/xfs_alloc.c_1.193 2008-06-03 11:28:55.000000000 +1000
+++ fs/xfs/xfs_alloc.c 2008-06-02 18:40:47.000000000 +1000
@@ -2376,19 +2376,9 @@ xfs_alloc_vextent(
if (args->agno == sagno &&
type == XFS_ALLOCTYPE_START_BNO)
args->type = XFS_ALLOCTYPE_THIS_AG;
- /*
- * For the first allocation, we can try any AG to get
- * space. However, if we already have allocated a
- * block, we don't want to try AGs whose number is below
- * sagno. Otherwise, we may end up with out-of-order
- * locking of AGF, which might cause deadlock.
- */
- if (++(args->agno) == mp->m_sb.sb_agcount) {
- if (args->firstblock != NULLFSBLOCK)
- args->agno = sagno;
- else
- args->agno = 0;
- }
+
+ if (++(args->agno) == mp->m_sb.sb_agcount)
+ args->agno = 0;
/*
* Reached the starting a.g., must either be done
* or switch to non-trylock mode.
--- fs/xfs/xfs_bmap.c_1.392 2008-06-03 12:20:14.000000000 +1000
+++ fs/xfs/xfs_bmap.c 2008-06-03 15:57:40.000000000 +1000
@@ -3445,16 +3452,10 @@ xfs_bmap_extents_to_btree(
args.tp = tp;
args.mp = mp;
args.firstblock = *firstblock;
- if (*firstblock == NULLFSBLOCK) {
- args.type = XFS_ALLOCTYPE_START_BNO;
+ args.fsbno = *firstblock;
+ if (*firstblock == NULLFSBLOCK)
args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
- } else if (flist->xbf_low) {
- args.type = XFS_ALLOCTYPE_START_BNO;
- args.fsbno = *firstblock;
- } else {
- args.type = XFS_ALLOCTYPE_NEAR_BNO;
- args.fsbno = *firstblock;
- }
+ args.type = XFS_ALLOCTYPE_START_BNO;
args.minlen = args.maxlen = args.prod = 1;
args.total = args.minleft = args.alignment = args.mod = args.isfl =
args.minalignslop = 0;
@@ -3585,13 +3586,10 @@ xfs_bmap_local_to_extents(
* Allocate a block. We know we need only one, since the
* file currently fits in an inode.
*/
- if (*firstblock == NULLFSBLOCK) {
+ args.fsbno = *firstblock;
+ if (*firstblock == NULLFSBLOCK)
args.fsbno = XFS_INO_TO_FSB(args.mp, ip->i_ino);
- args.type = XFS_ALLOCTYPE_START_BNO;
- } else {
- args.fsbno = *firstblock;
- args.type = XFS_ALLOCTYPE_NEAR_BNO;
- }
+ args.type = XFS_ALLOCTYPE_START_BNO;
args.total = total;
args.mod = args.minleft = args.alignment = args.wasdel =
args.isfl = args.minalignslop = 0;
--- fs/xfs/xfs_bmap_btree.c_1.169 2008-06-03 11:28:56.000000000 +1000
+++ fs/xfs/xfs_bmap_btree.c 2008-06-06 14:48:14.000000000 +1000
@@ -1493,11 +1493,9 @@ xfs_bmbt_split(
left = XFS_BUF_TO_BMBT_BLOCK(lbp);
args.fsbno = cur->bc_private.b.firstblock;
args.firstblock = args.fsbno;
- if (args.fsbno == NULLFSBLOCK) {
+ if (args.fsbno == NULLFSBLOCK)
args.fsbno = lbno;
- args.type = XFS_ALLOCTYPE_START_BNO;
- } else
- args.type = XFS_ALLOCTYPE_NEAR_BNO;
+ args.type = XFS_ALLOCTYPE_START_BNO;
args.mod = args.minleft = args.alignment = args.total = args.isfl =
args.userdata = args.minalignslop = 0;
args.minlen = args.maxlen = args.prod = 1;
@@ -2253,9 +2251,8 @@ xfs_bmbt_newroot(
}
#endif
args.fsbno = be64_to_cpu(*pp);
- args.type = XFS_ALLOCTYPE_START_BNO;
- } else
- args.type = XFS_ALLOCTYPE_NEAR_BNO;
+ }
+ args.type = XFS_ALLOCTYPE_START_BNO;
if ((error = xfs_alloc_vextent(&args))) {
XFS_BMBT_TRACE_CURSOR(cur, ERROR);
return error;
Christoph Hellwig
2008-06-13 13:44:18 UTC
Permalink
Post by Lachlan McIlroy
This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed.
This looks good and makes lot of sense to me. Please also add a comment
to enum xfs_alloctype in xfs_alloc.h about the danger of the allocation
types that never go out of the AG when used inside transactions.
Post by Lachlan McIlroy
In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.
This one also makes sense, but I have a very bad gut feeling about it.
There's nothing preventing the same deadlock scenario from coming back
when people modify the highlevel data allocator again. We really need
some sort of assert to trigger early in that case to not got a nasty
hard to trigger deadlock.
Post by Lachlan McIlroy
+ if (++(args->agno) == mp->m_sb.sb_agcount)
and while we're at it this should be

if (++args->agno == mp->m_sb.sb_agcount)
Lachlan McIlroy
2008-06-16 03:57:52 UTC
Permalink
Post by Christoph Hellwig
Post by Lachlan McIlroy
This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed.
This looks good and makes lot of sense to me. Please also add a comment
to enum xfs_alloctype in xfs_alloc.h about the danger of the allocation
types that never go out of the AG when used inside transactions.
The flags already have comments? Are you referring specifically to the
fact we have reserved space somewhere in the filesystem but expect to find
it in a single AG?
Post by Christoph Hellwig
Post by Lachlan McIlroy
In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.
This one also makes sense, but I have a very bad gut feeling about it.
There's nothing preventing the same deadlock scenario from coming back
when people modify the highlevel data allocator again. We really need
some sort of assert to trigger early in that case to not got a nasty
hard to trigger deadlock.
Yes I agree. I was thinking about making the allocation of a second data
extent in the same transaction a conditional try operation to avoid the
deadlock. If a caller to the allocator provides room to return more than
one extent then I don't think we have to return more than one - multiple
extents in one transaction is just an optimisation.
Post by Christoph Hellwig
Post by Lachlan McIlroy
+ if (++(args->agno) == mp->m_sb.sb_agcount)
and while we're at it this should be
if (++args->agno == mp->m_sb.sb_agcount)
Done, thanks.
Dave Chinner
2008-06-13 15:57:08 UTC
Permalink
Post by Lachlan McIlroy
When at ENOSPC conditions extent btree block allocations can fail and we
have no error handling to undo partial btree operations. Prior to extent
btree operations we reserve enough disk blocks somewhere in the filesystem
to satisfy the operation but in some conditions we require the blocks to
come from specific AGs and if those AGs are full the allocation fails.
This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed.
Sure, but we didn't reserve space for potential btree splits in a
second AG as a result of this. That needs to be reserved in the
transaction as well, which will blow out transaction reservations
substantially as we'll need to add another 2 full AGF btree splits to
every transaction that modifies the bmap btree.
Post by Lachlan McIlroy
In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.
You missed the fact that the AGF of modified AGs is already held
locked in the transaction, hence the locking order within the
transaction is wrong. Also, if we modify the free list in an AG
the fail an allocation (e.g. can't do an exact allocation), we'll
have multiple dirty and locked AGFs in the one allocation. Hence
we still can have locking order violations if you remove that check
and therefore deadlocks.

This is not the solution to the problem. As I suggested (back when
you first floated this idea as a fix for the problem several weeks
ago) I think the bug is that we are not taking into account the
number of blocks required for a bmbt split when selecting an AG to
allocate from. All we take into account is the blocks required for
the extent to be allocated and nothing else. If we take the blocks
for a bmbt split into account then we'll never try to allocate an
extent in an AG that we can't also allocate all the blocks for the
bmbt split in at the same time.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Lachlan McIlroy
2008-06-16 06:11:09 UTC
Permalink
Post by Dave Chinner
Post by Lachlan McIlroy
When at ENOSPC conditions extent btree block allocations can fail and we
have no error handling to undo partial btree operations. Prior to extent
btree operations we reserve enough disk blocks somewhere in the filesystem
to satisfy the operation but in some conditions we require the blocks to
come from specific AGs and if those AGs are full the allocation fails.
This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed.
Sure, but we didn't reserve space for potential btree splits in a
second AG as a result of this. That needs to be reserved in the
transaction as well, which will blow out transaction reservations
substantially as we'll need to add another 2 full AGF btree splits to
every transaction that modifies the bmap btree.
Right. And most of the time we wont need the space either so it's a
real waste.
Post by Dave Chinner
Post by Lachlan McIlroy
In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.
You missed the fact that the AGF of modified AGs is already held
locked in the transaction, hence the locking order within the
transaction is wrong. Also, if we modify the free list in an AG
the fail an allocation (e.g. can't do an exact allocation), we'll
have multiple dirty and locked AGFs in the one allocation. Hence
we still can have locking order violations if you remove that check
and therefore deadlocks.
I'm well aware of that particular deadlock involving the freelist - I
hit it while testing. If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic. This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.
Post by Dave Chinner
This is not the solution to the problem. As I suggested (back when
you first floated this idea as a fix for the problem several weeks
ago) I think the bug is that we are not taking into account the
number of blocks required for a bmbt split when selecting an AG to
allocate from. All we take into account is the blocks required for
the extent to be allocated and nothing else. If we take the blocks
for a bmbt split into account then we'll never try to allocate an
extent in an AG that we can't also allocate all the blocks for the
bmbt split in at the same time.
I considered that approach (using the minleft field in xfs_alloc_arg_t)
but it has it's problems too. When we reserve space for the btree
operations it is done on the global filesystem counters, not a
particular AG, so there is the possibility that not one AG has sufficent
space to perform the allocation even though there is enough free space
in the whole filesystem. Of course if we have enough space left in one
AG and the AG is locked then the space we reserved doesn't matter anymore
and it should all work.

I'm worried with this approach that we could have delayed allocations and
unwritten extents that need to be converted but we can't do it because we
don't have the space we might need (but probably don't).
Dave Chinner
2008-06-16 17:10:22 UTC
Permalink
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
When at ENOSPC conditions extent btree block allocations can fail and we
have no error handling to undo partial btree operations. Prior to extent
btree operations we reserve enough disk blocks somewhere in the filesystem
to satisfy the operation but in some conditions we require the blocks to
come from specific AGs and if those AGs are full the allocation fails.
This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed.
Sure, but we didn't reserve space for potential btree splits in a
second AG as a result of this. That needs to be reserved in the
transaction as well, which will blow out transaction reservations
substantially as we'll need to add another 2 full AGF btree splits to
every transaction that modifies the bmap btree.
Right. And most of the time we wont need the space either so it's a
real waste.
Waste, yes, but still needed otherwise transaction overruns and log space
deadlocks could occur....
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.
You missed the fact that the AGF of modified AGs is already held
locked in the transaction, hence the locking order within the
transaction is wrong. Also, if we modify the free list in an AG
the fail an allocation (e.g. can't do an exact allocation), we'll
have multiple dirty and locked AGFs in the one allocation. Hence
we still can have locking order violations if you remove that check
and therefore deadlocks.
I'm well aware of that particular deadlock involving the freelist - I
hit it while testing. If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic. This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.
Yes, which is why you need to prevent freelist modifications occurring
when you can't allocate anything out of the AG.
Post by Lachlan McIlroy
Post by Dave Chinner
This is not the solution to the problem. As I suggested (back when
you first floated this idea as a fix for the problem several weeks
ago) I think the bug is that we are not taking into account the
number of blocks required for a bmbt split when selecting an AG to
allocate from. All we take into account is the blocks required for
the extent to be allocated and nothing else. If we take the blocks
for a bmbt split into account then we'll never try to allocate an
extent in an AG that we can't also allocate all the blocks for the
bmbt split in at the same time.
I considered that approach (using the minleft field in xfs_alloc_arg_t)
but it has it's problems too. When we reserve space for the btree
operations it is done on the global filesystem counters, not a
particular AG, so there is the possibility that not one AG has sufficent
space to perform the allocation even though there is enough free space
in the whole filesystem.
Yes, we had that problem with the ENOSPC deadlock fixes in that we always
needed at least 4 blocks per AG available for a extent free to succeed.
Hence we have the XFS_ALLOC_SET_ASIDE() value for determining if the
filesystem is out of space, not a count of zero free blocks.

In this case, this macro can be extended to guarantee that our aggregate
block usage never goes below the threshold that would prevent each AG from
holding enough blocks for a worst case allocation to succeed.....
Post by Lachlan McIlroy
Of course if we have enough space left in one
AG and the AG is locked then the space we reserved doesn't matter anymore
and it should all work.
Yes.
Post by Lachlan McIlroy
I'm worried with this approach that we could have delayed allocations and
unwritten extents that need to be converted but we can't do it because we
don't have the space we might need (but probably don't).
Delayed allocation is the issue - unwritten extent conversion failure will
simply return an error and leave the extent unwritten.

With delayed allocation, we can oversubscribe a given AG but we can
always try a different AG. If we get to the situation that we don't have
enough space in any AG then we are screwed. However, by ensuring we can
sustain a worst-case split within any AG we can avoid this situation
completely.

Cheers,

Dave.
Lachlan McIlroy
2008-06-17 01:58:47 UTC
Permalink
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
When at ENOSPC conditions extent btree block allocations can fail and we
have no error handling to undo partial btree operations. Prior to extent
btree operations we reserve enough disk blocks somewhere in the filesystem
to satisfy the operation but in some conditions we require the blocks to
come from specific AGs and if those AGs are full the allocation fails.
This change fixes xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents(),
xfs_bmbt_split() and xfs_bmbt_newroot() so that they can search other AGs
for the space needed. Since we have reserved the space these allocations
are now guaranteed to succeed.
Sure, but we didn't reserve space for potential btree splits in a
second AG as a result of this. That needs to be reserved in the
transaction as well, which will blow out transaction reservations
substantially as we'll need to add another 2 full AGF btree splits to
every transaction that modifies the bmap btree.
Right. And most of the time we wont need the space either so it's a
real waste.
Waste, yes, but still needed otherwise transaction overruns and log space
deadlocks could occur....
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
In order to search all AGs I had to revert
a change made to xfs_alloc_vextent() that prevented a search from looking
at AGs lower than the starting AG. This original change was made to prevent
out of order AG locking when allocating multiple extents on data writeout
but since we only allocate one extent at a time now this particular problem
can't happen.
You missed the fact that the AGF of modified AGs is already held
locked in the transaction, hence the locking order within the
transaction is wrong. Also, if we modify the free list in an AG
the fail an allocation (e.g. can't do an exact allocation), we'll
have multiple dirty and locked AGFs in the one allocation. Hence
we still can have locking order violations if you remove that check
and therefore deadlocks.
I'm well aware of that particular deadlock involving the freelist - I
hit it while testing. If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic. This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.
Yes, which is why you need to prevent freelist modifications occurring
when you can't allocate anything out of the AG.
That sounds reasonable but it isn't consistent with the deadlock I saw.
One of the threads that was deadlocked had tried to allocate a data extent
in AG3 but didn't find the space. It had modified, and hence locked, AG3
due to modifying the freelist but since it didn't get the space it needed
it had to go on to another AG. So before we even allocated the data extent
(and before we tried to modify the btree, etc) we had an AG locked. The
deadlock avoidance code in xfs_alloc_vextent() didn't know this because
it only checks for a previous allocation. It makes sense that we should
avoid modifying the freelist if there isn't enough space for the allocation
so the puzzle is why didn't the code do that?
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
This is not the solution to the problem. As I suggested (back when
you first floated this idea as a fix for the problem several weeks
ago) I think the bug is that we are not taking into account the
number of blocks required for a bmbt split when selecting an AG to
allocate from. All we take into account is the blocks required for
the extent to be allocated and nothing else. If we take the blocks
for a bmbt split into account then we'll never try to allocate an
extent in an AG that we can't also allocate all the blocks for the
bmbt split in at the same time.
I considered that approach (using the minleft field in xfs_alloc_arg_t)
but it has it's problems too. When we reserve space for the btree
operations it is done on the global filesystem counters, not a
particular AG, so there is the possibility that not one AG has sufficent
space to perform the allocation even though there is enough free space
in the whole filesystem.
Yes, we had that problem with the ENOSPC deadlock fixes in that we always
needed at least 4 blocks per AG available for a extent free to succeed.
Hence we have the XFS_ALLOC_SET_ASIDE() value for determining if the
filesystem is out of space, not a count of zero free blocks.
Those 4 blocks are for one extent free operation, right? What if we have
multiple threads all trying to do the same thing (in the same AG)?
Post by Dave Chinner
In this case, this macro can be extended to guarantee that our aggregate
block usage never goes below the threshold that would prevent each AG from
holding enough blocks for a worst case allocation to succeed.....
Post by Lachlan McIlroy
Of course if we have enough space left in one
AG and the AG is locked then the space we reserved doesn't matter anymore
and it should all work.
Yes.
Post by Lachlan McIlroy
I'm worried with this approach that we could have delayed allocations and
unwritten extents that need to be converted but we can't do it because we
don't have the space we might need (but probably don't).
Delayed allocation is the issue - unwritten extent conversion failure will
simply return an error and leave the extent unwritten.
That's still a problem though - if we can't convert unwritten extents then
we can't clean dirty pages and we wont be able to unmount the filesystem.
Post by Dave Chinner
With delayed allocation, we can oversubscribe a given AG but we can
always try a different AG. If we get to the situation that we don't have
enough space in any AG then we are screwed. However, by ensuring we can
sustain a worst-case split within any AG we can avoid this situation
completely.
Cheers,
Dave.
Dave Chinner
2008-06-17 07:39:49 UTC
Permalink
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I'm well aware of that particular deadlock involving the freelist - I
hit it while testing. If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic. This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.
Yes, which is why you need to prevent freelist modifications occurring
when you can't allocate anything out of the AG.
That sounds reasonable but it isn't consistent with the deadlock I saw.
One of the threads that was deadlocked had tried to allocate a data extent
in AG3 but didn't find the space. It had modified, and hence locked, AG3
due to modifying the freelist but since it didn't get the space it needed
it had to go on to another AG.
That sounds like an exact allocation failure - there is enough
space, a large enough extent but no free space at the exact block
required. This is exactly the case that occurred with the inode
allocation - and then allocation in the same AG failed because of
alignment that wasn't taken into account by the first exact
allocation attempt. Perhaps the minalignslop calculation in
xfs_bmap_btalloc() is incorrect...
Post by Lachlan McIlroy
So before we even allocated the data extent
(and before we tried to modify the btree, etc) we had an AG locked. The
deadlock avoidance code in xfs_alloc_vextent() didn't know this because
it only checks for a previous allocation. It makes sense that we should
avoid modifying the freelist if there isn't enough space for the allocation
so the puzzle is why didn't the code do that?
Good question, and I think that is one we need to answer.
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I considered that approach (using the minleft field in xfs_alloc_arg_t)
but it has it's problems too. When we reserve space for the btree
operations it is done on the global filesystem counters, not a
particular AG, so there is the possibility that not one AG has sufficent
space to perform the allocation even though there is enough free space
in the whole filesystem.
Yes, we had that problem with the ENOSPC deadlock fixes in that we always
needed at least 4 blocks per AG available for a extent free to succeed.
Hence we have the XFS_ALLOC_SET_ASIDE() value for determining if the
filesystem is out of space, not a count of zero free blocks.
Those 4 blocks are for one extent free operation, right? What if we have
multiple threads all trying to do the same thing (in the same AG)?
If we have empty AGF btrees we need to allocate two new root blocks,
and then we won't have to allocate any more btree blocks for the
next 100+ extent free operations...

For allocations, the first would succeed, the rest would have to
search for another AG...
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I'm worried with this approach that we could have delayed allocations and
unwritten extents that need to be converted but we can't do it because we
don't have the space we might need (but probably don't).
Delayed allocation is the issue - unwritten extent conversion failure will
simply return an error and leave the extent unwritten.
That's still a problem though - if we can't convert unwritten extents then
we can't clean dirty pages and we wont be able to unmount the filesystem.
There will be errors logged and the extents will remain unwritten.
The filesystem should still be able to be unmounted.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Lachlan McIlroy
2008-06-19 07:28:50 UTC
Permalink
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I'm well aware of that particular deadlock involving the freelist - I
hit it while testing. If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic. This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.
Yes, which is why you need to prevent freelist modifications occurring
when you can't allocate anything out of the AG.
That sounds reasonable but it isn't consistent with the deadlock I saw.
One of the threads that was deadlocked had tried to allocate a data extent
in AG3 but didn't find the space. It had modified, and hence locked, AG3
due to modifying the freelist but since it didn't get the space it needed
it had to go on to another AG.
That sounds like an exact allocation failure - there is enough
space, a large enough extent but no free space at the exact block
required. This is exactly the case that occurred with the inode
allocation - and then allocation in the same AG failed because of
alignment that wasn't taken into account by the first exact
allocation attempt. Perhaps the minalignslop calculation in
xfs_bmap_btalloc() is incorrect...
Okay I'll look into that.

There's something else that looks suspicious to me - this code in
xfs_bmap_btalloc() is setting minleft to 0. Doesn't this go against
what you were saying about setting minleft to be the space we might
need for the btree operations?

if (args.fsbno == NULLFSBLOCK && nullfb) {
args.fsbno = 0;
args.type = XFS_ALLOCTYPE_FIRST_AG;
args.total = ap->minlen;
args.minleft = 0;
if ((error = xfs_alloc_vextent(&args)))
return error;
ap->low = 1;
}

I see it sets a lowspace indicator which filters back up and into
some btree operations. It appears the purpose of this feature is to
allow allocations to search for space in other AGs as in this example
from xfs_bmap_extents_to_btree():

if (*firstblock == NULLFSBLOCK) {
args.type = XFS_ALLOCTYPE_START_BNO;
args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
} else if (flist->xbf_low) {
args.type = XFS_ALLOCTYPE_START_BNO;
args.fsbno = *firstblock;
} else {
args.type = XFS_ALLOCTYPE_NEAR_BNO;
args.fsbno = *firstblock;
}

This is sort of what I was trying to do with my patch but without the
special lowspace condition. This lowspace feature is probably broken
because there was a similar special case in xfs_bmbt_split() that got
removed with the changes that fixed the AG out-of-order locking problem.

@@ -1569,12 +1569,11 @@
lbno = XFS_DADDR_TO_FSB(args.mp, XFS_BUF_ADDR(lbp));
left = XFS_BUF_TO_BMBT_BLOCK(lbp);
args.fsbno = cur->bc_private.b.firstblock;
+ args.firstblock = args.fsbno;
if (args.fsbno == NULLFSBLOCK) {
args.fsbno = lbno;
args.type = XFS_ALLOCTYPE_START_BNO;
- } else if (cur->bc_private.b.flist->xbf_low)
- args.type = XFS_ALLOCTYPE_FIRST_AG;
- else
+ } else
args.type = XFS_ALLOCTYPE_NEAR_BNO;
args.mod = args.minleft = args.alignment = args.total = args.isfl =
args.userdata = args.minalignslop = 0;

This could be why we have allocations failing now. Maybe it should
have been left in and XFS_ALLOCTYPE_FIRST_AG changed to
XFS_ALLOCTYPE_START_BNO. But even then it could still fail because the
AG deadlock avoidance code may prevent us from searching the AG that has
the space we need.

Should we ditch this lowspace special condition (and the code in
xfs_bmap_btalloc()) and insist that all the space we need (using minleft)
should come from one AG?
Post by Dave Chinner
Post by Lachlan McIlroy
So before we even allocated the data extent
(and before we tried to modify the btree, etc) we had an AG locked. The
deadlock avoidance code in xfs_alloc_vextent() didn't know this because
it only checks for a previous allocation. It makes sense that we should
avoid modifying the freelist if there isn't enough space for the allocation
so the puzzle is why didn't the code do that?
Good question, and I think that is one we need to answer.
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I considered that approach (using the minleft field in xfs_alloc_arg_t)
but it has it's problems too. When we reserve space for the btree
operations it is done on the global filesystem counters, not a
particular AG, so there is the possibility that not one AG has sufficent
space to perform the allocation even though there is enough free space
in the whole filesystem.
Yes, we had that problem with the ENOSPC deadlock fixes in that we always
needed at least 4 blocks per AG available for a extent free to succeed.
Hence we have the XFS_ALLOC_SET_ASIDE() value for determining if the
filesystem is out of space, not a count of zero free blocks.
Those 4 blocks are for one extent free operation, right? What if we have
multiple threads all trying to do the same thing (in the same AG)?
If we have empty AGF btrees we need to allocate two new root blocks,
and then we won't have to allocate any more btree blocks for the
next 100+ extent free operations...
For allocations, the first would succeed, the rest would have to
search for another AG...
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I'm worried with this approach that we could have delayed allocations and
unwritten extents that need to be converted but we can't do it because we
don't have the space we might need (but probably don't).
Delayed allocation is the issue - unwritten extent conversion failure will
simply return an error and leave the extent unwritten.
That's still a problem though - if we can't convert unwritten extents then
we can't clean dirty pages and we wont be able to unmount the filesystem.
There will be errors logged and the extents will remain unwritten.
The filesystem should still be able to be unmounted.
So returning an error from unwritten extent conversion will not re-dirty the
pages? So we effectively throw the user's data away?
Dave Chinner
2008-06-20 05:21:20 UTC
Permalink
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I'm well aware of that particular deadlock involving the freelist - I
hit it while testing. If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic. This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.
Yes, which is why you need to prevent freelist modifications occurring
when you can't allocate anything out of the AG.
That sounds reasonable but it isn't consistent with the deadlock I saw.
One of the threads that was deadlocked had tried to allocate a data extent
in AG3 but didn't find the space. It had modified, and hence locked, AG3
due to modifying the freelist but since it didn't get the space it needed
it had to go on to another AG.
That sounds like an exact allocation failure - there is enough
space, a large enough extent but no free space at the exact block
required. This is exactly the case that occurred with the inode
allocation - and then allocation in the same AG failed because of
alignment that wasn't taken into account by the first exact
allocation attempt. Perhaps the minalignslop calculation in
xfs_bmap_btalloc() is incorrect...
Okay I'll look into that.
There's something else that looks suspicious to me - this code in
xfs_bmap_btalloc() is setting minleft to 0. Doesn't this go against
what you were saying about setting minleft to be the space we might
need for the btree operations?
if (args.fsbno == NULLFSBLOCK && nullfb) {
args.fsbno = 0;
args.type = XFS_ALLOCTYPE_FIRST_AG;
args.total = ap->minlen;
args.minleft = 0;
if ((error = xfs_alloc_vextent(&args)))
return error;
ap->low = 1;
}
Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
write and *firstblock == NULLFSBLOCK (which leads to nullfb being
set in the above code), we do:

if (wr && *firstblock == NULLFSBLOCK) {
if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
else
minleft = 1;
} else
minleft = 0;

If we are in btree format we set the minleft to the number of blocks needed
for a split. If we are in extent or local format, change to extent of btree
format requires one extra block.

The above code you point out definitely breaks this - we haven't done a
previous allocation so we can start from the first AG, but we sure as
hell still need minleft set to the number of blocks needed for a
format change or btree split.
Post by Lachlan McIlroy
I see it sets a lowspace indicator which filters back up and into
some btree operations. It appears the purpose of this feature is to
allow allocations to search for space in other AGs as in this example
if (*firstblock == NULLFSBLOCK) {
args.type = XFS_ALLOCTYPE_START_BNO;
args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
} else if (flist->xbf_low) {
args.type = XFS_ALLOCTYPE_START_BNO;
args.fsbno = *firstblock;
} else {
args.type = XFS_ALLOCTYPE_NEAR_BNO;
args.fsbno = *firstblock;
}
Hmmm - the only place xbf_low is used in the extent-to-btree conversion. I
don't have access to the revision history anymore, so i can't find out what
bug the xbf_low condition was added for. It certainly looks like it is
allowing the btree block to be allocated in a different AG to data block.
Post by Lachlan McIlroy
This is sort of what I was trying to do with my patch but without the
special lowspace condition. This lowspace feature is probably broken
because there was a similar special case in xfs_bmbt_split() that got
removed with the changes that fixed the AG out-of-order locking problem.
@@ -1569,12 +1569,11 @@
lbno = XFS_DADDR_TO_FSB(args.mp, XFS_BUF_ADDR(lbp));
left = XFS_BUF_TO_BMBT_BLOCK(lbp);
args.fsbno = cur->bc_private.b.firstblock;
+ args.firstblock = args.fsbno;
if (args.fsbno == NULLFSBLOCK) {
args.fsbno = lbno;
args.type = XFS_ALLOCTYPE_START_BNO;
- } else if (cur->bc_private.b.flist->xbf_low)
- args.type = XFS_ALLOCTYPE_FIRST_AG;
- else
+ } else
args.type = XFS_ALLOCTYPE_NEAR_BNO;
args.mod = args.minleft = args.alignment = args.total = args.isfl =
args.userdata = args.minalignslop = 0;
This could be why we have allocations failing now.
Hmmm - yes, could be. Well found, Lachlan. Was there an equivalent change
to the allocation of a new root block?
Post by Lachlan McIlroy
Maybe it should
have been left in and XFS_ALLOCTYPE_FIRST_AG changed to
XFS_ALLOCTYPE_START_BNO. But even then it could still fail because the
AG deadlock avoidance code may prevent us from searching the AG that has
the space we need.
Right. But it would definitely be more likely to find space than the current
code without re-introducing the deadlock.
Post by Lachlan McIlroy
Should we ditch this lowspace special condition (and the code in
xfs_bmap_btalloc()) and insist that all the space we need (using minleft)
should come from one AG?
Well, we could, but I suspect that one condition that it is used it
is safe to do so. That is, the logic goes like this:

- allocate the last extent in an AG. By definition, that has not
caused a AGF btree split as the trees are now empty.
- because we haven't split any AGF btrees, we still have an unused
transaction reservation for full AGF btree splits.
- seeing as we have a full reservation, we can safely allocate in a
different AG without overrunning a transaction reservation.

However, we still need to obey the AGF locking order.

Hmmmm - perhaps before allocating with minleft = 0 we need to
check if we can allocate the rest of the blocks from another AG and
lock both AGs in the correct order first, recheck we can allocate
from both of them after they are locked but before modifying anything
and only then proceed. If we can't find two AGs to allocate from then
we can safely ENOSPC without any problems. In this special case we'd then
be able to search the entire FS for space and hence only get an ENOSPC
if we are really at ENOSPC. Can you pick holes in this?
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I'm worried with this approach that we could have delayed allocations and
unwritten extents that need to be converted but we can't do it because we
don't have the space we might need (but probably don't).
Delayed allocation is the issue - unwritten extent conversion failure will
simply return an error and leave the extent unwritten.
That's still a problem though - if we can't convert unwritten extents then
we can't clean dirty pages and we wont be able to unmount the filesystem.
There will be errors logged and the extents will remain unwritten.
The filesystem should still be able to be unmounted.
So returning an error from unwritten extent conversion will not re-dirty the
pages? So we effectively throw the user's data away?
Yes, I think that can happen async writes. For anything that is sync the
error will be propagated back to application. Often there is nothing to
report errors back to on async writes - I'm not sure if the errors get
logged in that case, though...

I suspect that this is a holdover from before we had ENOSPC checking on
mmap writes - that could result in mmap oversubscribing space and the
data could not ever be written. In low memory conditions this could
deadlock the machine if we did not throw the pages away. We probably
need to reevisit this now that ->page_mkwrite() prevents
oversubscription from occurring....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Dave Chinner
2008-06-23 05:20:25 UTC
Permalink
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I'm well aware of that particular deadlock involving the freelist - I
hit it while testing. If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic. This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.
Yes, which is why you need to prevent freelist modifications occurring
when you can't allocate anything out of the AG.
That sounds reasonable but it isn't consistent with the deadlock I saw.
One of the threads that was deadlocked had tried to allocate a data extent
in AG3 but didn't find the space. It had modified, and hence locked, AG3
due to modifying the freelist but since it didn't get the space it needed
it had to go on to another AG.
That sounds like an exact allocation failure - there is enough
space, a large enough extent but no free space at the exact block
required. This is exactly the case that occurred with the inode
allocation - and then allocation in the same AG failed because of
alignment that wasn't taken into account by the first exact
allocation attempt. Perhaps the minalignslop calculation in
xfs_bmap_btalloc() is incorrect...
Okay I'll look into that.
There's something else that looks suspicious to me - this code in
xfs_bmap_btalloc() is setting minleft to 0. Doesn't this go against
what you were saying about setting minleft to be the space we might
need for the btree operations?
if (args.fsbno == NULLFSBLOCK && nullfb) {
args.fsbno = 0;
args.type = XFS_ALLOCTYPE_FIRST_AG;
args.total = ap->minlen;
args.minleft = 0;
if ((error = xfs_alloc_vextent(&args)))
return error;
ap->low = 1;
}
Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
write and *firstblock == NULLFSBLOCK (which leads to nullfb being
if (wr && *firstblock == NULLFSBLOCK) {
if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
else
minleft = 1;
} else
minleft = 0;
If we are in btree format we set the minleft to the number of blocks needed
for a split. If we are in extent or local format, change to extent of btree
format requires one extra block.
The above code you point out definitely breaks this - we haven't done a
previous allocation so we can start from the first AG, but we sure as
hell still need minleft set to the number of blocks needed for a
format change or btree split.
Just to point out yet another problem in this code (one that's just
been tripped over @ agami) is unwritten extent conversion.

Basically, we don't do an allocation here, so when we end up in
xfs_bmap_add_extent_unwritten_real() with a null firstblock. Hence
the cases where conversion can cause a split - case
MASK(LEFT_FILLING), MASK(RIGHT_FILLING) and 0 (convert the middle of
an extent) - we can select an AG that doesn't have enough space for
the entire split as we've ignored the number of blocks we might
need to allocate in the split (the minleft parameter) entirely.

I suspect that xfs_bmbt_split() needs to handle the null first block
case slightly differently - the minleft parameter passed to the
allocation should not be zero - it should be the number of levels
above the current level left in the tree. i.e:

minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;

If we've already got a firstblock set, then this should have already
been taken into account (i.e. we still need to fix the low space
case where it got ignored as we were discussing).

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Lachlan McIlroy
2008-06-23 05:57:22 UTC
Permalink
Post by Dave Chinner
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I'm well aware of that particular deadlock involving the freelist - I
hit it while testing. If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic. This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.
Yes, which is why you need to prevent freelist modifications occurring
when you can't allocate anything out of the AG.
That sounds reasonable but it isn't consistent with the deadlock I saw.
One of the threads that was deadlocked had tried to allocate a data extent
in AG3 but didn't find the space. It had modified, and hence locked, AG3
due to modifying the freelist but since it didn't get the space it needed
it had to go on to another AG.
That sounds like an exact allocation failure - there is enough
space, a large enough extent but no free space at the exact block
required. This is exactly the case that occurred with the inode
allocation - and then allocation in the same AG failed because of
alignment that wasn't taken into account by the first exact
allocation attempt. Perhaps the minalignslop calculation in
xfs_bmap_btalloc() is incorrect...
Okay I'll look into that.
There's something else that looks suspicious to me - this code in
xfs_bmap_btalloc() is setting minleft to 0. Doesn't this go against
what you were saying about setting minleft to be the space we might
need for the btree operations?
if (args.fsbno == NULLFSBLOCK && nullfb) {
args.fsbno = 0;
args.type = XFS_ALLOCTYPE_FIRST_AG;
args.total = ap->minlen;
args.minleft = 0;
if ((error = xfs_alloc_vextent(&args)))
return error;
ap->low = 1;
}
Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
write and *firstblock == NULLFSBLOCK (which leads to nullfb being
if (wr && *firstblock == NULLFSBLOCK) {
if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
else
minleft = 1;
} else
minleft = 0;
If we are in btree format we set the minleft to the number of blocks needed
for a split. If we are in extent or local format, change to extent of btree
format requires one extra block.
The above code you point out definitely breaks this - we haven't done a
previous allocation so we can start from the first AG, but we sure as
hell still need minleft set to the number of blocks needed for a
format change or btree split.
Just to point out yet another problem in this code (one that's just
Basically, we don't do an allocation here, so when we end up in
xfs_bmap_add_extent_unwritten_real() with a null firstblock. Hence
the cases where conversion can cause a split - case
MASK(LEFT_FILLING), MASK(RIGHT_FILLING) and 0 (convert the middle of
an extent) - we can select an AG that doesn't have enough space for
the entire split as we've ignored the number of blocks we might
need to allocate in the split (the minleft parameter) entirely.
I suspect that xfs_bmbt_split() needs to handle the null first block
case slightly differently - the minleft parameter passed to the
allocation should not be zero - it should be the number of levels
minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
If we've already got a firstblock set, then this should have already
been taken into account (i.e. we still need to fix the low space
case where it got ignored as we were discussing).
Funny. I tested the exact same change last week to try to fix the same
problem. Seemed to work okay.

In the case where we convert the middle of an existing unwritten extent
we need to insert two new extents. I might be paranoid here but I'll
assume the worst case scenario and that we'll need space for two complete
tree splits. The first allocation for the first insert will set minleft
correctly but what about the allocations for splits during the second
insert? We could run out of space in the chosen AG because minleft wasn't
enough.
Dave Chinner
2008-06-23 06:14:21 UTC
Permalink
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Dave Chinner
Post by Lachlan McIlroy
There's something else that looks suspicious to me - this code in
xfs_bmap_btalloc() is setting minleft to 0. Doesn't this go against
what you were saying about setting minleft to be the space we might
need for the btree operations?
if (args.fsbno == NULLFSBLOCK && nullfb) {
args.fsbno = 0;
args.type = XFS_ALLOCTYPE_FIRST_AG;
args.total = ap->minlen;
args.minleft = 0;
if ((error = xfs_alloc_vextent(&args)))
return error;
ap->low = 1;
}
Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
write and *firstblock == NULLFSBLOCK (which leads to nullfb being
if (wr && *firstblock == NULLFSBLOCK) {
if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
else
minleft = 1;
} else
minleft = 0;
If we are in btree format we set the minleft to the number of blocks needed
for a split. If we are in extent or local format, change to extent of btree
format requires one extra block.
The above code you point out definitely breaks this - we haven't done a
previous allocation so we can start from the first AG, but we sure as
hell still need minleft set to the number of blocks needed for a
format change or btree split.
Just to point out yet another problem in this code (one that's just
Basically, we don't do an allocation here, so when we end up in
xfs_bmap_add_extent_unwritten_real() with a null firstblock. Hence
the cases where conversion can cause a split - case
MASK(LEFT_FILLING), MASK(RIGHT_FILLING) and 0 (convert the middle of
an extent) - we can select an AG that doesn't have enough space for
the entire split as we've ignored the number of blocks we might
need to allocate in the split (the minleft parameter) entirely.
I suspect that xfs_bmbt_split() needs to handle the null first block
case slightly differently - the minleft parameter passed to the
allocation should not be zero - it should be the number of levels
minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
If we've already got a firstblock set, then this should have already
been taken into account (i.e. we still need to fix the low space
case where it got ignored as we were discussing).
Funny. I tested the exact same change last week to try to fix the same
problem. Seemed to work okay.
Cool. Got a patch for review?
Post by Lachlan McIlroy
In the case where we convert the middle of an existing unwritten extent
we need to insert two new extents. I might be paranoid here but I'll
assume the worst case scenario and that we'll need space for two complete
tree splits.
Yes, I think so. Certainly, if you look at the block reservation in
xfs_iomap_write_unwritten():

892 resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0) << 1;

#define XFS_DIOSTRAT_SPACE_RES(mp, v) \
(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + (v))

#define XFS_EXTENTADD_SPACE_RES(mp,w) (XFS_BM_MAXLEVELS(mp,w) - 1)

It reserves enough blocks for 2 bmbt splits so I think this is
definitely a possibility we need to handle.
Post by Lachlan McIlroy
The first allocation for the first insert will set minleft
correctly but what about the allocations for splits during the second
insert? We could run out of space in the chosen AG because minleft wasn't
enough.
Yeah, so we probably need pass a flag in the cursor to indicate
it's a double split case when doing the first allocation in
xfs_bmbt_split....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Lachlan McIlroy
2008-06-23 06:40:26 UTC
Permalink
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Dave Chinner
Post by Lachlan McIlroy
There's something else that looks suspicious to me - this code in
xfs_bmap_btalloc() is setting minleft to 0. Doesn't this go against
what you were saying about setting minleft to be the space we might
need for the btree operations?
if (args.fsbno == NULLFSBLOCK && nullfb) {
args.fsbno = 0;
args.type = XFS_ALLOCTYPE_FIRST_AG;
args.total = ap->minlen;
args.minleft = 0;
if ((error = xfs_alloc_vextent(&args)))
return error;
ap->low = 1;
}
Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
write and *firstblock == NULLFSBLOCK (which leads to nullfb being
if (wr && *firstblock == NULLFSBLOCK) {
if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
else
minleft = 1;
} else
minleft = 0;
If we are in btree format we set the minleft to the number of blocks needed
for a split. If we are in extent or local format, change to extent of btree
format requires one extra block.
The above code you point out definitely breaks this - we haven't done a
previous allocation so we can start from the first AG, but we sure as
hell still need minleft set to the number of blocks needed for a
format change or btree split.
Just to point out yet another problem in this code (one that's just
Basically, we don't do an allocation here, so when we end up in
xfs_bmap_add_extent_unwritten_real() with a null firstblock. Hence
the cases where conversion can cause a split - case
MASK(LEFT_FILLING), MASK(RIGHT_FILLING) and 0 (convert the middle of
an extent) - we can select an AG that doesn't have enough space for
the entire split as we've ignored the number of blocks we might
need to allocate in the split (the minleft parameter) entirely.
I suspect that xfs_bmbt_split() needs to handle the null first block
case slightly differently - the minleft parameter passed to the
allocation should not be zero - it should be the number of levels
minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
If we've already got a firstblock set, then this should have already
been taken into account (i.e. we still need to fix the low space
case where it got ignored as we were discussing).
Funny. I tested the exact same change last week to try to fix the same
problem. Seemed to work okay.
Cool. Got a patch for review?
I couldn't find the original patch that calculated minleft as above - instead
here's a variant that addresses the double insert problem by retrieving the
reservation amount from the transaction. It could very well be overkill though.

--- fs/xfs/xfs_bmap_btree.c_1.169 2008-06-16 17:25:10.000000000 +1000
+++ fs/xfs/xfs_bmap_btree.c 2008-06-16 18:32:45.000000000 +1000
@@ -1496,9 +1496,12 @@ xfs_bmbt_split(
if (args.fsbno == NULLFSBLOCK) {
args.fsbno = lbno;
args.type = XFS_ALLOCTYPE_START_BNO;
- } else
+ args.minleft = xfs_trans_get_block_res(args.tp);
+ } else {
args.type = XFS_ALLOCTYPE_NEAR_BNO;
- args.mod = args.minleft = args.alignment = args.total = args.isfl =
+ args.minleft = 0;
+ }
+ args.mod = args.alignment = args.total = args.isfl =
args.userdata = args.minalignslop = 0;
args.minlen = args.maxlen = args.prod = 1;
args.wasdel = cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL;
Post by Dave Chinner
Post by Lachlan McIlroy
In the case where we convert the middle of an existing unwritten extent
we need to insert two new extents. I might be paranoid here but I'll
assume the worst case scenario and that we'll need space for two complete
tree splits.
Yes, I think so. Certainly, if you look at the block reservation in
892 resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0) << 1;
#define XFS_DIOSTRAT_SPACE_RES(mp, v) \
(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + (v))
#define XFS_EXTENTADD_SPACE_RES(mp,w) (XFS_BM_MAXLEVELS(mp,w) - 1)
It reserves enough blocks for 2 bmbt splits so I think this is
definitely a possibility we need to handle.
Post by Lachlan McIlroy
The first allocation for the first insert will set minleft
correctly but what about the allocations for splits during the second
insert? We could run out of space in the chosen AG because minleft wasn't
enough.
Yeah, so we probably need pass a flag in the cursor to indicate
it's a double split case when doing the first allocation in
xfs_bmbt_split....
Cheers,
Dave.
Dave Chinner
2008-06-23 08:05:30 UTC
Permalink
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Dave Chinner
Post by Lachlan McIlroy
There's something else that looks suspicious to me - this code in
xfs_bmap_btalloc() is setting minleft to 0. Doesn't this go against
what you were saying about setting minleft to be the space we might
need for the btree operations?
if (args.fsbno == NULLFSBLOCK && nullfb) {
args.fsbno = 0;
args.type = XFS_ALLOCTYPE_FIRST_AG;
args.total = ap->minlen;
args.minleft = 0;
if ((error = xfs_alloc_vextent(&args)))
return error;
ap->low = 1;
}
Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
write and *firstblock == NULLFSBLOCK (which leads to nullfb being
if (wr && *firstblock == NULLFSBLOCK) {
if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
else
minleft = 1;
} else
minleft = 0;
If we are in btree format we set the minleft to the number of blocks needed
for a split. If we are in extent or local format, change to extent of btree
format requires one extra block.
The above code you point out definitely breaks this - we haven't done a
previous allocation so we can start from the first AG, but we sure as
hell still need minleft set to the number of blocks needed for a
format change or btree split.
Just to point out yet another problem in this code (one that's just
Basically, we don't do an allocation here, so when we end up in
xfs_bmap_add_extent_unwritten_real() with a null firstblock. Hence
the cases where conversion can cause a split - case
MASK(LEFT_FILLING), MASK(RIGHT_FILLING) and 0 (convert the middle of
an extent) - we can select an AG that doesn't have enough space for
the entire split as we've ignored the number of blocks we might
need to allocate in the split (the minleft parameter) entirely.
I suspect that xfs_bmbt_split() needs to handle the null first block
case slightly differently - the minleft parameter passed to the
allocation should not be zero - it should be the number of levels
minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
If we've already got a firstblock set, then this should have already
been taken into account (i.e. we still need to fix the low space
case where it got ignored as we were discussing).
Funny. I tested the exact same change last week to try to fix the same
problem. Seemed to work okay.
Cool. Got a patch for review?
I couldn't find the original patch that calculated minleft as above - instead
here's a variant that addresses the double insert problem by retrieving the
reservation amount from the transaction. It could very well be overkill though.
No, that seems valid; all allocations need to pass in a reservation
for a number of blocks needed for the transaction to proceed. I did
a quick check and everything appears to be reserving only what
is necessary for a bmbt split (or two)...
Post by Lachlan McIlroy
--- fs/xfs/xfs_bmap_btree.c_1.169 2008-06-16 17:25:10.000000000 +1000
+++ fs/xfs/xfs_bmap_btree.c 2008-06-16 18:32:45.000000000 +1000
@@ -1496,9 +1496,12 @@ xfs_bmbt_split(
if (args.fsbno == NULLFSBLOCK) {
args.fsbno = lbno;
args.type = XFS_ALLOCTYPE_START_BNO;
- } else
+ args.minleft = xfs_trans_get_block_res(args.tp);
+ } else {
Might be worth a comment ;)

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Lachlan McIlroy
2008-06-23 05:24:45 UTC
Permalink
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I'm well aware of that particular deadlock involving the freelist - I
hit it while testing. If you look closely at the code that deadlock
can occur with or without the AG locking avoidance logic. This is
because the rest of the transaction is unaware that an AG has been
locked due to a freelist operation.
Yes, which is why you need to prevent freelist modifications occurring
when you can't allocate anything out of the AG.
That sounds reasonable but it isn't consistent with the deadlock I saw.
One of the threads that was deadlocked had tried to allocate a data extent
in AG3 but didn't find the space. It had modified, and hence locked, AG3
due to modifying the freelist but since it didn't get the space it needed
it had to go on to another AG.
That sounds like an exact allocation failure - there is enough
space, a large enough extent but no free space at the exact block
required. This is exactly the case that occurred with the inode
allocation - and then allocation in the same AG failed because of
alignment that wasn't taken into account by the first exact
allocation attempt. Perhaps the minalignslop calculation in
xfs_bmap_btalloc() is incorrect...
Okay I'll look into that.
There's something else that looks suspicious to me - this code in
xfs_bmap_btalloc() is setting minleft to 0. Doesn't this go against
what you were saying about setting minleft to be the space we might
need for the btree operations?
if (args.fsbno == NULLFSBLOCK && nullfb) {
args.fsbno = 0;
args.type = XFS_ALLOCTYPE_FIRST_AG;
args.total = ap->minlen;
args.minleft = 0;
if ((error = xfs_alloc_vextent(&args)))
return error;
ap->low = 1;
}
Hmmm - that looks suspicious. In xfs_bmapi(), when we are doing a
write and *firstblock == NULLFSBLOCK (which leads to nullfb being
if (wr && *firstblock == NULLFSBLOCK) {
if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
else
minleft = 1;
} else
minleft = 0;
If we are in btree format we set the minleft to the number of blocks needed
for a split. If we are in extent or local format, change to extent of btree
format requires one extra block.
The above code you point out definitely breaks this - we haven't done a
previous allocation so we can start from the first AG, but we sure as
hell still need minleft set to the number of blocks needed for a
format change or btree split.
Post by Lachlan McIlroy
I see it sets a lowspace indicator which filters back up and into
some btree operations. It appears the purpose of this feature is to
allow allocations to search for space in other AGs as in this example
if (*firstblock == NULLFSBLOCK) {
args.type = XFS_ALLOCTYPE_START_BNO;
args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
} else if (flist->xbf_low) {
args.type = XFS_ALLOCTYPE_START_BNO;
args.fsbno = *firstblock;
} else {
args.type = XFS_ALLOCTYPE_NEAR_BNO;
args.fsbno = *firstblock;
}
Hmmm - the only place xbf_low is used in the extent-to-btree conversion. I
don't have access to the revision history anymore, so i can't find out what
bug the xbf_low condition was added for. It certainly looks like it is
allowing the btree block to be allocated in a different AG to data block.
The lowspace algorithm was added way back in the early '90s and has been
'tweaked' many times since.
Post by Dave Chinner
Post by Lachlan McIlroy
This is sort of what I was trying to do with my patch but without the
special lowspace condition. This lowspace feature is probably broken
because there was a similar special case in xfs_bmbt_split() that got
removed with the changes that fixed the AG out-of-order locking problem.
@@ -1569,12 +1569,11 @@
lbno = XFS_DADDR_TO_FSB(args.mp, XFS_BUF_ADDR(lbp));
left = XFS_BUF_TO_BMBT_BLOCK(lbp);
args.fsbno = cur->bc_private.b.firstblock;
+ args.firstblock = args.fsbno;
if (args.fsbno == NULLFSBLOCK) {
args.fsbno = lbno;
args.type = XFS_ALLOCTYPE_START_BNO;
- } else if (cur->bc_private.b.flist->xbf_low)
- args.type = XFS_ALLOCTYPE_FIRST_AG;
- else
+ } else
args.type = XFS_ALLOCTYPE_NEAR_BNO;
args.mod = args.minleft = args.alignment = args.total = args.isfl =
args.userdata = args.minalignslop = 0;
This could be why we have allocations failing now.
Hmmm - yes, could be. Well found, Lachlan. Was there an equivalent change
to the allocation of a new root block?
Yeah but it got dropped 14 years ago in a code cleanup change! Looks like
it was by mistake too. There used to be another special case for converting
delayed allocations that had the same semantics as this low space trick - it
used XFS_ALLOCTYPE_FIRST_AG instead - maybe to try a little harder to find
space for cases where it is too late to return an error to the user.
Post by Dave Chinner
Post by Lachlan McIlroy
Maybe it should
have been left in and XFS_ALLOCTYPE_FIRST_AG changed to
XFS_ALLOCTYPE_START_BNO. But even then it could still fail because the
AG deadlock avoidance code may prevent us from searching the AG that has
the space we need.
Right. But it would definitely be more likely to find space than the current
code without re-introducing the deadlock.
Post by Lachlan McIlroy
Should we ditch this lowspace special condition (and the code in
xfs_bmap_btalloc()) and insist that all the space we need (using minleft)
should come from one AG?
Well, we could, but I suspect that one condition that it is used it
- allocate the last extent in an AG. By definition, that has not
caused a AGF btree split as the trees are now empty.
- because we haven't split any AGF btrees, we still have an unused
transaction reservation for full AGF btree splits.
- seeing as we have a full reservation, we can safely allocate in a
different AG without overrunning a transaction reservation.
However, we still need to obey the AGF locking order.
Hmmmm - perhaps before allocating with minleft = 0 we need to
check if we can allocate the rest of the blocks from another AG and
lock both AGs in the correct order first, recheck we can allocate
from both of them after they are locked but before modifying anything
and only then proceed. If we can't find two AGs to allocate from then
we can safely ENOSPC without any problems. In this special case we'd then
be able to search the entire FS for space and hence only get an ENOSPC
if we are really at ENOSPC. Can you pick holes in this?
Sounds like it should work. We may need to lock more than two AGs at once to
find all the space we need. Since we can't lock AGs out of order then if we get
to the last AG and we still don't have enough space then we will need to try
again but start at an earlier AG (say AG 0 which should work).

So the code in xfs_alloc_vextent() that uses XFS_ALLOCTYPE_FIRST_AG and sets
minleft to 0 should work. If we start at AG 0 and we've done the proper reservations
then we should eventually find all the space we need - as long as everything that
needs to allocate space obeys the lowspace algorithm and we always kick off each
search for space from the AG we last allocated from.
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I'm worried with this approach that we could have delayed allocations and
unwritten extents that need to be converted but we can't do it because we
don't have the space we might need (but probably don't).
Delayed allocation is the issue - unwritten extent conversion failure will
simply return an error and leave the extent unwritten.
That's still a problem though - if we can't convert unwritten extents then
we can't clean dirty pages and we wont be able to unmount the filesystem.
There will be errors logged and the extents will remain unwritten.
The filesystem should still be able to be unmounted.
So returning an error from unwritten extent conversion will not re-dirty the
pages? So we effectively throw the user's data away?
Yes, I think that can happen async writes. For anything that is sync the
error will be propagated back to application. Often there is nothing to
report errors back to on async writes - I'm not sure if the errors get
logged in that case, though...
I suspect that this is a holdover from before we had ENOSPC checking on
mmap writes - that could result in mmap oversubscribing space and the
data could not ever be written. In low memory conditions this could
deadlock the machine if we did not throw the pages away. We probably
need to reevisit this now that ->page_mkwrite() prevents
oversubscription from occurring....
Cheers,
Dave.
Dave Chinner
2008-06-23 06:21:58 UTC
Permalink
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
I see it sets a lowspace indicator which filters back up and into
some btree operations. It appears the purpose of this feature is to
allow allocations to search for space in other AGs as in this example
if (*firstblock == NULLFSBLOCK) {
args.type = XFS_ALLOCTYPE_START_BNO;
args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
} else if (flist->xbf_low) {
args.type = XFS_ALLOCTYPE_START_BNO;
args.fsbno = *firstblock;
} else {
args.type = XFS_ALLOCTYPE_NEAR_BNO;
args.fsbno = *firstblock;
}
Hmmm - the only place xbf_low is used in the extent-to-btree conversion. I
don't have access to the revision history anymore, so i can't find out what
bug the xbf_low condition was added for. It certainly looks like it is
allowing the btree block to be allocated in a different AG to data block.
The lowspace algorithm was added way back in the early '90s and has been
'tweaked' many times since.
It's good to have a complete revision history around somewhere. ;)
Post by Lachlan McIlroy
Post by Dave Chinner
Post by Lachlan McIlroy
This is sort of what I was trying to do with my patch but without the
special lowspace condition. This lowspace feature is probably broken
because there was a similar special case in xfs_bmbt_split() that got
removed with the changes that fixed the AG out-of-order locking problem.
@@ -1569,12 +1569,11 @@
lbno = XFS_DADDR_TO_FSB(args.mp, XFS_BUF_ADDR(lbp));
left = XFS_BUF_TO_BMBT_BLOCK(lbp);
args.fsbno = cur->bc_private.b.firstblock;
+ args.firstblock = args.fsbno;
if (args.fsbno == NULLFSBLOCK) {
args.fsbno = lbno;
args.type = XFS_ALLOCTYPE_START_BNO;
- } else if (cur->bc_private.b.flist->xbf_low)
- args.type = XFS_ALLOCTYPE_FIRST_AG;
- else
+ } else
args.type = XFS_ALLOCTYPE_NEAR_BNO;
args.mod = args.minleft = args.alignment = args.total = args.isfl =
args.userdata = args.minalignslop = 0;
This could be why we have allocations failing now.
Hmmm - yes, could be. Well found, Lachlan. Was there an equivalent change
to the allocation of a new root block?
Yeah but it got dropped 14 years ago in a code cleanup change! Looks like
it was by mistake too.
Ouch.
Post by Lachlan McIlroy
There used to be another special case for converting
delayed allocations that had the same semantics as this low space trick - it
used XFS_ALLOCTYPE_FIRST_AG instead - maybe to try a little harder to find
space for cases where it is too late to return an error to the user.
Interesting. Any idea why that was removed? Or another accident?

[....]
Post by Lachlan McIlroy
Post by Dave Chinner
Hmmmm - perhaps before allocating with minleft = 0 we need to
check if we can allocate the rest of the blocks from another AG and
lock both AGs in the correct order first, recheck we can allocate
from both of them after they are locked but before modifying anything
and only then proceed. If we can't find two AGs to allocate from then
we can safely ENOSPC without any problems. In this special case we'd then
be able to search the entire FS for space and hence only get an ENOSPC
if we are really at ENOSPC. Can you pick holes in this?
Sounds like it should work. We may need to lock more than two AGs at once to
find all the space we need. Since we can't lock AGs out of order then if we get
to the last AG and we still don't have enough space then we will need to try
again but start at an earlier AG (say AG 0 which should work).
In this case, I'd just start from XFS_ALLOCTYPE_FIRST_AG rather than
doing multiple passes and having to undo between them....
Post by Lachlan McIlroy
So the code in xfs_alloc_vextent() that uses XFS_ALLOCTYPE_FIRST_AG and sets
minleft to 0 should work.
Yes, as long as the next selected AG has the original minleft blocks
available in it.
Post by Lachlan McIlroy
If we start at AG 0 and we've done the proper reservations
then we should eventually find all the space we need - as long as everything that
needs to allocate space obeys the lowspace algorithm and we always kick off each
search for space from the AG we last allocated from.
Yup. Seems sane to me.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Continue reading on narkive:
Loading...