Discussion:
[PATCH 0/3] xfs: allocation worker causes freelist buffer lock hang
t***@sgi.com
2012-09-19 16:31:33 UTC
Permalink
We have experienced file system hangs in IA64, X86_32 and X86_64
environments caused by a deadlock on AGF buffers.

The stack frames below are a simple example of the deadlock. This example
was taken from the current OSS Linux 3.6 sources on an x86_32 machine that
was suffering from a daemon storm at the time:

The holder of the lock:

PID: 11601 TASK: d29dadb0 CPU: 3 COMMAND: "kworker/3:3"
#0 [c8ea57bc] __schedule at c068a7e8
#1 [c8ea5844] schedule at c068abd9
#2 [c8ea584c] schedule_timeout at c0689808
#3 [c8ea58b0] wait_for_common at c068a436
#4 [c8ea58e8] wait_for_completion at c068a54d
#5 [c8ea58f0] xfs_alloc_vextent at e1ed8f63 [xfs]
#6 [c8ea590c] xfs_bmap_btalloc at e1ee4124 [xfs]
#7 [c8ea59dc] xfs_bmap_alloc at e1ee42f5 [xfs]
#8 [c8ea59e4] xfs_bmapi_allocate at e1ef0ae9 [xfs]
#9 [c8ea5a14] xfs_bmapi_write at e1ef13a7 [xfs]
#10 [c8ea5b18] xfs_iomap_write_allocate at e1ec7e93 [xfs]
#11 [c8ea5bac] xfs_map_blocks at e1eb795d [xfs]
#12 [c8ea5c08] xfs_vm_writepage at e1eb88d9 [xfs]
#13 [c8ea5c64] find_get_pages_tag at c02df2a2
#14 [c8ea5c9c] __writepage at c02e6769
#15 [c8ea5ca8] write_cache_pages at c02e6f8c
#16 [c8ea5d48] generic_writepages at c02e7212
#17 [c8ea5d78] xfs_vm_writepages at e1eb6c5a [xfs]
#18 [c8ea5d8c] do_writepages at c02e7f4a
#19 [c8ea5d98] __filemap_fdatawrite_range at c02e046a
#20 [c8ea5dc4] filemap_fdatawrite_range at c02e0fb1
#21 [c8ea5de0] xfs_flush_pages at e1ec32e9 [xfs]
#22 [c8ea5e0c] xfs_sync_inode_data at e1ecdd73 [xfs]
#23 [c8ea5e34] xfs_inode_ag_walk at e1ece0d6 [xfs]
#24 [c8ea5f00] xfs_inode_ag_iterator at e1ece23f [xfs]
#25 [c8ea5f28] xfs_sync_data at e1ece2ce [xfs]
#26 [c8ea5f38] xfs_flush_worker at e1ece31d [xfs]
#27 [c8ea5f44] process_one_work at c024e39d
#28 [c8ea5f88] process_scheduled_works at c024e6ad
#29 [c8ea5f98] worker_thread at c024f6cb
#30 [c8ea5fbc] kthread at c0253c1b
#31 [c8ea5fe8] kernel_thread_helper at c0692af4

The worker that is blocked waiting for the lock:

PID: 30223 TASK: dfb86cb0 CPU: 3 COMMAND: "xfsalloc"
#0 [d4a45b9c] __schedule at c068a7e8
#1 [d4a45c24] schedule at c068abd9
#2 [d4a45c2c] schedule_timeout at c0689808
#3 [d4a45c90] __down_common at c068a1d4
#4 [d4a45cc0] __down at c068a25e
#5 [d4a45cc8] down at c02598ff
#6 [d4a45cd8] xfs_buf_lock at e1ebaa6d [xfs]
#7 [d4a45cf8] _xfs_buf_find at e1ebad0e [xfs]
#8 [d4a45d2c] xfs_buf_get_map at e1ebaf00 [xfs]
#9 [d4a45d58] xfs_buf_read_map at e1ebbce3 [xfs]
#10 [d4a45d7c] xfs_trans_read_buf_map at e1f2974e [xfs]
#11 [d4a45dac] xfs_read_agf at e1ed7d2c [xfs]
#12 [d4a45dec] xfs_alloc_read_agf at e1ed7f2d [xfs]
#13 [d4a45e0c] xfs_alloc_fix_freelist at e1ed84f6 [xfs]
#14 [d4a45e4c] check_preempt_curr at c0261c47
#15 [d4a45e60] ttwu_do_wakeup at c0261c7e
#16 [d4a45e8c] radix_tree_lookup at c0465ab5
#17 [d4a45e94] xfs_perag_get at e1f19a36 [xfs]
#18 [d4a45ec8] __xfs_alloc_vextent at e1ed899d [xfs]
#19 [d4a45f1c] xfs_alloc_vextent_worker at e1ed8edb [xfs]
#20 [d4a45f30] process_one_work at c024e39d
#21 [d4a45f74] process_scheduled_works at c024e6ad
#22 [d4a45f84] rescuer_thread at c024e7f0
#23 [d4a45fbc] kthread at c0253c1b
#24 [d4a45fe8] kernel_thread_helper at c0692af

The AGF buffer can be locked across multiple calls to xfs_alloc_vextent().
This buffer must remain locked until the transaction is either committed or
canceled. The deadlock occurs when other callers of xfs_alloc_vextent() block
waiting for the AGF buffer lock and the workqueue manager cannot create another
allocation worker to service the allocation request for the process that holds
the lock. The default limit for the allocation workqueue is 256 workers, and
that limit can be raised to 512, but running out of resources before reaching
this limit is the real reason the workqueue manager cannot create another worker.

It is very easy to have more than one AG hung in this manner simultaneously.

The locked AGF buffer, and possibly the calling inode, can be in the AIL. Since
they are locked, these items cannot be pushed. This prevents the tail of the
log from moving; eventually the log space is depleted, causing the whole
filesystem to hang.

Tests 109 and 119 seem to trigger this condition the most easily. To recreate
the problem, eat up the spare RAM (the hang above was caused during a daemon
storm) or simply limit the number of allocation workers when running xfstests.

The solution to this problem is to move the allocation worker call so that the
loops over xfs_alloc_vextent() for a single transaction, on non-metadata
paths, happen in a single worker.

I traced the callers of xfs_alloc_vextent(), noting transaction creation,
commits and cancels; I noted loops in the callers and which were marked
as metadata/userdata. I can send this document to reviewers.

Patch 1: limits the allocation worker to the X86_64 architecture.

Patch 2: moves the allocation worker so that loops over xfs_alloc_vextent()
for a transaction are contained in one worker. I have a document
that lists the callers of xfs_alloc_vextent(), noting loops,
transaction allocations/commits and metadata callers.

Patch 3: zeros 2 uninitialized userdata variables.

--Mark.
t***@sgi.com
2012-09-19 16:31:36 UTC
Permalink
The allocation argument structure is created on the kernel stack. This
patch makes sure the userdata variable is always initialized.

The better solution is to always memset to 0 (or "= { 0 }") the xfs_alloc_arg
and xfs_bmalloca structures. I believe there will only be 4 places that need
this memset change.

Signed-off-by: Mark Tinguely <***@sgi.com>
---
fs/xfs/xfs_bmap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/fs/xfs/xfs_bmap.c
===================================================================
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -3096,7 +3096,7 @@ xfs_bmap_extents_to_btree(
}
args.minlen = args.maxlen = args.prod = 1;
args.total = args.minleft = args.alignment = args.mod = args.isfl =
- args.minalignslop = 0;
+ args.minalignslop = args.userdata = 0;
args.wasdel = wasdel;
*logflagsp = 0;
if ((error = xfs_alloc_vextent(&args))) {
@@ -3254,7 +3254,7 @@ xfs_bmap_local_to_extents(
}
args.total = total;
args.mod = args.minleft = args.alignment = args.wasdel =
- args.isfl = args.minalignslop = 0;
+ args.isfl = args.minalignslop = args.userdata = 0;
args.minlen = args.maxlen = args.prod = 1;
if ((error = xfs_alloc_vextent(&args)))
goto done;
Dave Chinner
2012-09-19 23:41:02 UTC
Permalink
Post by t***@sgi.com
The allocation argument structures is created on the kernel stack space. This
patch makes sure the userdata variable is always initialized.
The better solution is always memset to 0 or "= { 0 }" the xfs_alloc_arg
Yes, please do that and remove all the initialisations to zero
instead. It's much easier to maintain in future if new additions
to structures are automatically zeroed by the code that uses it...
Post by t***@sgi.com
and xfs_bmalloca structures. I believe there will only be 4 places that need
this memset change.
The only definition of xfs_bmalloca (in xfs_bmapi_write) is already
zeroed:

struct xfs_bmalloca bma = { 0 }; /* args for xfs_bmap_alloc */

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Mark Tinguely
2012-09-20 18:16:45 UTC
Permalink
Zero the kernel stack space that makes up the xfs_alloc_arg structures.

Signed-off-by: Mark Tinguely <***@sgi.com>

---
fs/xfs/xfs_alloc.c | 1 +
fs/xfs/xfs_bmap.c | 3 +++
fs/xfs/xfs_ialloc.c | 1 +
3 files changed, 5 insertions(+)

Index: b/fs/xfs/xfs_alloc.c
===================================================================
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -1866,6 +1866,7 @@ xfs_alloc_fix_freelist(
/*
* Initialize the args structure.
*/
+ memset(&targs, 0, sizeof(targs));
targs.tp = tp;
targs.mp = mp;
targs.agbp = agbp;
Index: b/fs/xfs/xfs_bmap.c
===================================================================
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -2437,6 +2437,7 @@ xfs_bmap_btalloc(
* Normal allocation, done through xfs_alloc_vextent.
*/
tryagain = isaligned = 0;
+ memset(&args, 0, sizeof(args));
args.tp = ap->tp;
args.mp = mp;
args.fsbno = ap->blkno;
@@ -3082,6 +3083,7 @@ xfs_bmap_extents_to_btree(
* Convert to a btree with two levels, one record in root.
*/
XFS_IFORK_FMT_SET(ip, whichfork, XFS_DINODE_FMT_BTREE);
+ memset(&args, 0, sizeof(args));
args.tp = tp;
args.mp = mp;
args.firstblock = *firstblock;
@@ -3237,6 +3239,7 @@ xfs_bmap_local_to_extents(
xfs_buf_t *bp; /* buffer for extent block */
xfs_bmbt_rec_host_t *ep;/* extent record pointer */

+ memset(&args, 0, sizeof(args));
args.tp = tp;
args.mp = ip->i_mount;
args.firstblock = *firstblock;
Index: b/fs/xfs/xfs_ialloc.c
===================================================================
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -250,6 +250,7 @@ xfs_ialloc_ag_alloc(
/* boundary */
struct xfs_perag *pag;

+ memset(&args, 0, sizeof(args));
args.tp = tp;
args.mp = tp->t_mountp;
Ben Myers
2012-09-25 20:20:54 UTC
Permalink
Post by Mark Tinguely
Zero the kernel stack space that makes up the xfs_alloc_arg structures.
I'll just double check that you got all usages of xfs_alloc_arg_t:

xfs_inobt_alloc_block... fixed in f5eb8e7c by HCH
xfs_ialloc_ag_alloc... fixed below
xfs_bmbt_alloc_block... also fixed in f5eb8e7c
xfs_bmap_local_to_extents... fixed below
xfs_bmap_extents_to_btree... fixed below
xfs_bmap_btalloc... fixed below
xfs_free_extent... fixed in 0e1edbd9 by Nathan Scott
xfs_alloc_fix_freelist... fixed below

Looks good!
Post by Mark Tinguely
---
fs/xfs/xfs_alloc.c | 1 +
fs/xfs/xfs_bmap.c | 3 +++
fs/xfs/xfs_ialloc.c | 1 +
3 files changed, 5 insertions(+)
Index: b/fs/xfs/xfs_alloc.c
===================================================================
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -1866,6 +1866,7 @@ xfs_alloc_fix_freelist(
/*
* Initialize the args structure.
*/
+ memset(&targs, 0, sizeof(targs));
targs.tp = tp;
targs.mp = mp;
targs.agbp = agbp;
Index: b/fs/xfs/xfs_bmap.c
===================================================================
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -2437,6 +2437,7 @@ xfs_bmap_btalloc(
* Normal allocation, done through xfs_alloc_vextent.
*/
tryagain = isaligned = 0;
+ memset(&args, 0, sizeof(args));
args.tp = ap->tp;
args.mp = mp;
args.fsbno = ap->blkno;
@@ -3082,6 +3083,7 @@ xfs_bmap_extents_to_btree(
* Convert to a btree with two levels, one record in root.
*/
XFS_IFORK_FMT_SET(ip, whichfork, XFS_DINODE_FMT_BTREE);
+ memset(&args, 0, sizeof(args));
args.tp = tp;
args.mp = mp;
args.firstblock = *firstblock;
@@ -3237,6 +3239,7 @@ xfs_bmap_local_to_extents(
xfs_buf_t *bp; /* buffer for extent block */
xfs_bmbt_rec_host_t *ep;/* extent record pointer */
+ memset(&args, 0, sizeof(args));
args.tp = tp;
args.mp = ip->i_mount;
args.firstblock = *firstblock;
Index: b/fs/xfs/xfs_ialloc.c
===================================================================
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -250,6 +250,7 @@ xfs_ialloc_ag_alloc(
/* boundary */
struct xfs_perag *pag;
+ memset(&args, 0, sizeof(args));
args.tp = tp;
args.mp = tp->t_mountp;
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs
Ben Myers
2012-10-18 22:52:26 UTC
Permalink
Mark,
Post by Mark Tinguely
Zero the kernel stack space that makes up the xfs_alloc_arg structures.
This has been committed to git://oss.sgi.com/xfs/xfs.git, master and for-next
branches.

Regards,
Ben

t***@sgi.com
2012-09-19 16:31:35 UTC
Permalink
Move the allocation worker call so that any loops on xfs_alloc_vextent()
calls for a particular transaction are contained within a single worker.
This prevents a filesystem hang that can occur because the holder of the
AGF buffer lock cannot allocate a worker.

Signed-off-by: Mark Tinguely <***@sgi.com>
---
fs/xfs/xfs_alloc.c | 53 -------------------------------------------
fs/xfs/xfs_alloc.h | 3 --
fs/xfs/xfs_bmap.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_bmap.h | 16 +++++++++++++
4 files changed, 80 insertions(+), 56 deletions(-)

Index: b/fs/xfs/xfs_alloc.c
===================================================================
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -2207,7 +2207,7 @@ xfs_alloc_read_agf(
* group or loop over the allocation groups to find the result.
*/
int /* error */
-__xfs_alloc_vextent(
+xfs_alloc_vextent(
xfs_alloc_arg_t *args) /* allocation argument structure */
{
xfs_agblock_t agsize; /* allocation group size */
@@ -2417,57 +2417,6 @@ error0:
return error;
}

-#if defined(CONFIG_X86_64)
-static void
-xfs_alloc_vextent_worker(
- struct work_struct *work)
-{
- struct xfs_alloc_arg *args = container_of(work,
- struct xfs_alloc_arg, work);
- unsigned long pflags;
-
- /* we are in a transaction context here */
- current_set_flags_nested(&pflags, PF_FSTRANS);
-
- args->result = __xfs_alloc_vextent(args);
- complete(args->done);
-
- current_restore_flags_nested(&pflags, PF_FSTRANS);
-}
-#endif
-
-/*
- * Data allocation requests often come in with little stack to work on. Push
- * them off to a worker thread so there is lots of stack to use. Metadata
- * requests, OTOH, are generally from low stack usage paths, so avoid the
- * context switch overhead here.
- */
-int
-xfs_alloc_vextent(
- struct xfs_alloc_arg *args)
-{
-#if defined(CONFIG_X86_64)
- DECLARE_COMPLETION_ONSTACK(done);
-
- if (!args->userdata)
- return __xfs_alloc_vextent(args);
-
-
- args->done = &done;
- INIT_WORK_ONSTACK(&args->work, xfs_alloc_vextent_worker);
- queue_work(xfs_alloc_wq, &args->work);
- wait_for_completion(&done);
- return args->result;
-#else
- /* The allocation worker is needed by the i386_64.
- * Do not use the worker for other platforms. This will
- * allow those platforms avoid the performance hit and
- * the potential AGF buffer deadlock issue.
- */
- return __xfs_alloc_vextent(args);
-#endif
-}
-
/*
* Free an extent.
* Just break up the extent address and hand off to xfs_free_ag_extent
Index: b/fs/xfs/xfs_alloc.h
===================================================================
--- a/fs/xfs/xfs_alloc.h
+++ b/fs/xfs/xfs_alloc.h
@@ -120,9 +120,6 @@ typedef struct xfs_alloc_arg {
char isfl; /* set if is freelist blocks - !acctg */
char userdata; /* set if this is user data */
xfs_fsblock_t firstblock; /* io first block allocated */
- struct completion *done;
- struct work_struct work;
- int result;
} xfs_alloc_arg_t;

/*
Index: b/fs/xfs/xfs_bmap.c
===================================================================
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -48,7 +48,6 @@
#include "xfs_vnodeops.h"
#include "xfs_trace.h"

-
kmem_zone_t *xfs_bmap_free_item_zone;

/*
@@ -4807,7 +4806,11 @@ xfs_bmapi_convert_unwritten(
* blocks then the call will fail (return NULLFSBLOCK in "firstblock").
*/
int
+#if defined(CONFIG_X86_64)
+__xfs_bmapi_write(
+#else
xfs_bmapi_write(
+#endif
struct xfs_trans *tp, /* transaction pointer */
struct xfs_inode *ip, /* incore inode */
xfs_fileoff_t bno, /* starting file offs. mapped */
@@ -5031,6 +5034,65 @@ error0:
return error;
}

+#if defined(CONFIG_X86_64)
+static void
+xfs_bmapi_write_worker(
+ struct work_struct *work)
+{
+ struct xfs_bmw_wkr *bw = container_of(work,
+ struct xfs_bmw_wkr, work);
+ unsigned long pflags;
+
+ /* we are in a transaction context here */
+ current_set_flags_nested(&pflags, PF_FSTRANS);
+
+ bw->result = __xfs_bmapi_write(bw->tp, bw->ip, bw->bno, bw->len,
+ bw->flags, bw->firstblock, bw->total,
+ bw->mval, bw->nmap, bw->flist);
+ complete(bw->done);
+
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
+}
+
+int
+xfs_bmapi_write(
+ struct xfs_trans *tp, /* transaction pointer */
+ struct xfs_inode *ip, /* incore inode */
+ xfs_fileoff_t bno, /* starting file offs. mapped */
+ xfs_filblks_t len, /* length to map in file */
+ int flags, /* XFS_BMAPI_... */
+ xfs_fsblock_t *firstblock, /* first allocated block
+ controls a.g. for allocs */
+ xfs_extlen_t total, /* total blocks needed */
+ struct xfs_bmbt_irec *mval, /* output: map values */
+ int *nmap, /* i/o: mval size/count */
+ struct xfs_bmap_free *flist) /* i/o: list extents to free */
+{
+ struct xfs_bmw_wkr bw;
+ DECLARE_COMPLETION_ONSTACK(done);
+
+ if (flags & XFS_BMAPI_METADATA)
+ return __xfs_bmapi_write(tp, ip, bno, len, flags, firstblock,
+ total, mval, nmap, flist);
+ /* initialize the worker argument list structure */
+ bw.tp = tp;
+ bw.ip = ip;
+ bw.bno = bno;
+ bw.len = len;
+ bw.flags = flags;
+ bw.firstblock = firstblock;
+ bw.total = total;
+ bw.mval = mval;
+ bw.nmap = nmap;
+ bw.flist = flist;
+ bw.done = &done;
+ INIT_WORK_ONSTACK(&bw.work, xfs_bmapi_write_worker);
+ queue_work(xfs_alloc_wq, &bw.work);
+ wait_for_completion(&done);
+ return bw.result;
+}
+#endif
+
/*
* Unmap (remove) blocks from a file.
* If nexts is nonzero then the number of extents to remove is limited to
Index: b/fs/xfs/xfs_bmap.h
===================================================================
--- a/fs/xfs/xfs_bmap.h
+++ b/fs/xfs/xfs_bmap.h
@@ -135,6 +135,22 @@ typedef struct xfs_bmalloca {
char conv; /* overwriting unwritten extents */
} xfs_bmalloca_t;

+struct xfs_bmw_wkr {
+ struct xfs_trans *tp; /* transaction pointer */
+ struct xfs_inode *ip; /* incore inode */
+ xfs_fileoff_t bno; /* starting file offs. mapped */
+ xfs_filblks_t len; /* length to map in file */
+ int flags; /* XFS_BMAPI_... */
+ xfs_fsblock_t *firstblock; /* first allocblock controls */
+ xfs_extlen_t total; /* total blocks needed */
+ struct xfs_bmbt_irec *mval; /* output: map values */
+ int *nmap; /* i/o: mval size/count */
+ struct xfs_bmap_free *flist; /* bmap freelist */
+ struct completion *done; /* worker completion ptr */
+ struct work_struct work; /* worker */
+ int result; /* worker function result */
+};
+
/*
* Flags for xfs_bmap_add_extent*.
*/
t***@sgi.com
2012-09-19 16:31:34 UTC
Permalink
Restrict the allocation worker to X86_64 machines. This will improve
performance on non-X86-64 machines and avoid the AGF buffer hang.

Signed-off-by: Mark Tinguely <***@sgi.com>
---
fs/xfs/xfs_alloc.c | 11 +++++++++++
1 file changed, 11 insertions(+)

Index: b/fs/xfs/xfs_alloc.c
===================================================================
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -2417,6 +2417,7 @@ error0:
return error;
}

+#if defined(CONFIG_X86_64)
static void
xfs_alloc_vextent_worker(
struct work_struct *work)
@@ -2433,6 +2434,7 @@ xfs_alloc_vextent_worker(

current_restore_flags_nested(&pflags, PF_FSTRANS);
}
+#endif

/*
* Data allocation requests often come in with little stack to work on. Push
@@ -2444,6 +2446,7 @@ int
xfs_alloc_vextent(
struct xfs_alloc_arg *args)
{
+#if defined(CONFIG_X86_64)
DECLARE_COMPLETION_ONSTACK(done);

if (!args->userdata)
@@ -2455,6 +2458,14 @@ xfs_alloc_vextent(
queue_work(xfs_alloc_wq, &args->work);
wait_for_completion(&done);
return args->result;
+#else
+ /* The allocation worker is needed by the i386_64.
+ * Do not use the worker for other platforms. This will
+ * allow those platforms avoid the performance hit and
+ * the potential AGF buffer deadlock issue.
+ */
+ return __xfs_alloc_vextent(args);
+#endif
}

/*
Dave Chinner
2012-09-19 21:54:05 UTC
Permalink
Post by t***@sgi.com
Restrict the allocation worker to X86_64 machines. This will improve
performance on non-X86-64 machines and avoid the AGF buffer hang.
NACK.

The stack overflow problems that this works around are not limited
to x86-64. In the past we've seen overflows on i686 (even with 8k
stacks), s390 and other platforms, so it's not an isolated issue.

It either works or it doesn't - let's not start down the rathole of
having different code paths and behaviours for different platforms.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Mark Tinguely
2012-09-20 17:37:46 UTC
Permalink
Post by Dave Chinner
Post by t***@sgi.com
Restrict the allocation worker to X86_64 machines. This will improve
performance on non-X86-64 machines and avoid the AGF buffer hang.
NACK.
The stack overflow problems that this works around are not limited
to x86-64. In the past we've seen overflows on i686 (even with 8k
stacks), s390 and other platforms, so it's not an isolated issue.
It either works or it doesn't - let's not start down the rathole of
having different code paths and behaviours for different platforms.
Cheers,
Dave.
Well, I was expecting a 4 letter word from Dave on this patch and "NACK"
was surprisingly mild.

When the allocation worker was placed into XFS, even Christoph wanted a
kernel configure switch to be able turn it off.

Dave has already placed a switch in the code that turns it off for over
half of the direct callers of xfs_alloc_vextent() because of a performance issue.

We are just finding places where it causes serious issues.

This worker is a "necessary evil" (I think those were Christoph's
review comments). We should limit the evil to where it is necessary.

--Mark.
Ben Myers
2012-09-24 17:37:23 UTC
Permalink
Hey,
Post by Mark Tinguely
Post by Dave Chinner
Post by t***@sgi.com
Restrict the allocation worker to X86_64 machines. This will improve
performance on non-X86-64 machines and avoid the AGF buffer hang.
NACK.
The stack overflow problems that this works around are not limited
to x86-64. In the past we've seen overflows on i686 (even with 8k
stacks), s390 and other platforms, so it's not an isolated issue.
It either works or it doesn't - let's not start down the rathole of
having different code paths and behaviours for different platforms.
Cheers,
Dave.
Well, I was expecting a 4 letter word from Dave on this patch and
"NACK" was surprisingly mild.
When the allocation worker was placed into XFS, even Christoph
wanted a kernel configure switch to be able turn it off.
Dave has already placed a switch in the code that turns it off for
over half of the direct callers xfs_alloc_vextent() because a
performance issue.
We are just finding places where it causes serious issues.
This is worker is an "necessary evil" (I think those were
Christoph's review comment). We should limit the evil to where it is
necessary.
I tend to agree that it is undesirable to have platform-specific behaviors in
XFS. Dave has a good point. Mark and Christoph also have valid points.

This is a platform specific problem so it's reasonable that the solution can be
platform specific too. If the list of platforms which are broken by having so
little stack available in the kernel is larger than we'd like... well that's
unfortunate. But it's not something we're the cause of, and it speaks to how
important it is to fix the more general problem.

We shouldn't penalize those who are using platforms which are not affected by
this problem for the limitations of the other platforms. OTOH, if we have
multiple behaviors our testing becomes more difficult. Nobody wins.

I think it is desirable to be able to turn this off so that users can choose
how they prefer to lose, and so that this hack continues to be easily removable
if the time ever comes when we can do so.

-Ben
Dave Chinner
2012-09-25 00:14:05 UTC
Permalink
Post by Ben Myers
Hey,
Post by Mark Tinguely
Post by Dave Chinner
Post by t***@sgi.com
Restrict the allocation worker to X86_64 machines. This will improve
performance on non-X86-64 machines and avoid the AGF buffer hang.
NACK.
The stack overflow problems that this works around are not limited
to x86-64. In the past we've seen overflows on i686 (even with 8k
stacks), s390 and other platforms, so it's not an isolated issue.
It either works or it doesn't - let's not start down the rathole of
having different code paths and behaviours for different platforms.
Cheers,
Dave.
Well, I was expecting a 4 letter word from Dave on this patch and
"NACK" was surprisingly mild.
When the allocation worker was placed into XFS, even Christoph
wanted a kernel configure switch to be able turn it off.
Dave has already placed a switch in the code that turns it off for
over half of the direct callers xfs_alloc_vextent() because a
performance issue.
We are just finding places where it causes serious issues.
This is worker is an "necessary evil" (I think those were
Christoph's review comment). We should limit the evil to where it is
necessary.
I tend to agree that it is undesireble to have platform specific behaviors in
XFS. Dave has a good point. Mark and Christoph also have valid points.
This is a platform specific problem so it's reasonable that the solution can be
platform specific too.
It's not really platform specific, it's just harder to hit on some
platforms than others. The stack usage of XFS on 32 bit platforms is
only about 25% less than on 64 bit platforms through the problematic
allocation path, as most of the on-stack structures are the same size
on 32 bit and 64 bit - only the size of pointers and registers
changes, and they are the minority of stack usage. Stuff like
xfs_bmalloc_args, xfs_alloc_args and so on only shrink by a few
bytes.

I've seen the x86-64 stack overrun by 3.5k - that's almost a 50%
overrun. That means even if the stack usage is reduced by 25% on a
32 bit platform, that same path will still smash an 8k stack. We just
haven't had anyone report it because ia32 users are in the minority.

Hence for all the 32 bit platforms with 8k stacks (i.e. all ARM,
MIPS, SH, etc) there is a still a significant risk of stack
overruns. Given the prevalence of XFS on these architectures for NAS
and DVR functionality, there's every chance that stack overflows
occur quite regularly that nobody ever reports or can report....

And even on other 64 bit platforms we know that s390 stack usage is
an issue because each function call on the stack has a
base register frame of 128 bytes, plus whatever stack is used by the
function. Hence when we are looking at a function call chain of
70-80 functions deep, they really only have half of their available
stack space available for the code usage. i.e. about 8k. Sparc has
the same issue.

Power and ia64 are probably the only arches that don't have problems
because of their default 64k page size.

IOWs, there are quite a few architectures where XFS is commonly used
that are at risk of stack overflows through the worst case XFS
writeback/allocation path (i.e. when you have a non-trivial
allocation like a double bmbt/abt split, bmbt split with readahead
in the abt, etc). Just because we don't have reports of them
occurring doesn't mean the problem doesn't exist. We simply get
reports from x86-64 users because they make up the vast majority of
XFS deployments and so rare problems are seen (correspondingly) more
frequently....
Post by Ben Myers
If the list of platforms which are broken by having so
little stack available in the kernel is larger than we'd like... well that's
unfortunate. But it's not something we're the cause of, and it speaks to how
important it is to fix the more general problem.
I've been trying to raise awareness of the problem - the wider
storage community understands that there is a problem, but we're
getting no traction at all with core kernel folk. As far as they are
concerned, it's our problem, not a general problem.
Post by Ben Myers
We shouldn't penalize those who are using platforms which are not affected by
this problem for the limitations of the other platforms. OTOH, if we have
multiple behaviors our testing becomes more difficult. Nobody wins.
You're forgetting that the performance problems were platform
specific! It only showed up on certain CPUs (i.e. AMD, not intel)
and only on low CPU count (1-4p) machines. So by your argument that
we should only enable this for platforms that don't see performance
issues, we should only do this stack switch for intel x86-64.

See where categorising issues by hardware platform leads? it's a
rat-hole.
Post by Ben Myers
I think it is desireable to be able to turn this off so that users
can choose how they prefer to lose, and so that this hack
continues to be easily removable if the time ever comes when we
can do so.
This is not an option that should be left to users - it is our
responsibility to provide a fit-for-purpose filesystem that doesn't
randomly crash, and one of those responsibilities is to take changes
that you don't like but are necessary for stability across the wider
user base.

I'm not going to change my opinion - if you do make this allocation
workqueue code platform specific, the first thing I'll be doing on
any kernel I'm responsible for is reverting the commit so I know
users are not going to have random outages caused by allocation
related stack overflows.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Dave Chinner
2012-09-19 23:34:35 UTC
Permalink
Post by t***@sgi.com
We have experience file system hangs on the IA64, X86_32 and X86_64
environments caused by a deadlock on AGF buffers.
The below stack frames is a simple example of the deadlock. This example
was taken from the current OSS Linux 3.6 sources on a x86_32 machine that
I hit it last night via test 113 on a RHEL6 backport that has the
old workqueue implementation and so only has one worker per CPU.
....
Post by t***@sgi.com
PID: 30223 TASK: dfb86cb0 CPU: 3 COMMAND: "xfsalloc"
#0 [d4a45b9c] __schedule at c068a7e8
#1 [d4a45c24] schedule at c068abd9
#2 [d4a45c2c] schedule_timeout at c0689808
#3 [d4a45c90] __down_common at c068a1d4
#4 [d4a45cc0] __down at c068a25e
#5 [d4a45cc8] down at c02598ff
#6 [d4a45cd8] xfs_buf_lock at e1ebaa6d [xfs]
#7 [d4a45cf8] _xfs_buf_find at e1ebad0e [xfs]
#8 [d4a45d2c] xfs_buf_get_map at e1ebaf00 [xfs]
#9 [d4a45d58] xfs_buf_read_map at e1ebbce3 [xfs]
#10 [d4a45d7c] xfs_trans_read_buf_map at e1f2974e [xfs]
#11 [d4a45dac] xfs_read_agf at e1ed7d2c [xfs]
#12 [d4a45dec] xfs_alloc_read_agf at e1ed7f2d [xfs]
#13 [d4a45e0c] xfs_alloc_fix_freelist at e1ed84f6 [xfs]
So this has to be the first time we've tried to use this AG for
allocation in the transaction, otherwise we would have found it
attached to the transaction. I'll come back to this.
Post by t***@sgi.com
#14 [d4a45e4c] check_preempt_curr at c0261c47
#15 [d4a45e60] ttwu_do_wakeup at c0261c7e
#16 [d4a45e8c] radix_tree_lookup at c0465ab5
#17 [d4a45e94] xfs_perag_get at e1f19a36 [xfs]
#18 [d4a45ec8] __xfs_alloc_vextent at e1ed899d [xfs]
#19 [d4a45f1c] xfs_alloc_vextent_worker at e1ed8edb [xfs]
#20 [d4a45f30] process_one_work at c024e39d
#21 [d4a45f74] process_scheduled_works at c024e6ad
#22 [d4a45f84] rescuer_thread at c024e7f0
#23 [d4a45fbc] kthread at c0253c1b
#24 [d4a45fe8] kernel_thread_helper at c0692af
The AGF buffer can be locked across multiple calls to xfs_alloc_vextent().
This buffer must remained locked until the transaction is either committed or
canceled. The deadlock occurs when other callers of xfs_alloc_vextent() block
waiting for the AGF buffer lock and the workqueue manager cannot create another
allocate worker to service the allocation request for the process that holds
the lock, The default limit for the allocation worker is 256 workers and that
limit can be increased to 512, but lack of resources before this limit is the
real reason the workqueue manager cannot create another worker.
IOWs, the problem is that a worker thread that holds a resource cannot
get scheduled to make progress.
Post by t***@sgi.com
The solution to this problem is to move the allocation worker so that the
loops over xfs_alloc_vextent() for a single transaction, on non-metadata
paths will happen in a single worker.
That just moves the problem. e.g. to unwritten extent conversion.
Say we only have 3 worker threads maximum:

kworker 1            kworker 2            kworker 3
end io               end io               end io
ilock_excl           ilock_excl           ilock_excl
(blocks)             (blocks)             (blocks)

<holder releases lock>

(gets lock)
bmapi_write(convert)
queue work
wait for work

So now we are stuck again.
Post by t***@sgi.com
I traced the callers of xfs_alloc_vextent(), noting transaction creation,
commits and cancels; I noted loops in the callers and which were marked
as metadata/userdata. I can send this document to reviewers.
No need, I can see the (generic) problem now that it has been
pointed out.

Basically, this is what the rescuer thread is kind of for - it works
around the inability to allocate a new worker thread due to an OOM
situation. This doesn't help us here because that rescuer would
simply block like the others. Hence a rescuer thread concept doesn't
solve the problem at all.

So let's go back to the case at hand. We block because this is the
first attempt to lock the AGF in an allocation, and the
XFS_ALLOC_FLAG_TRYLOCK flag is not set. This indicates we are asking
for allocation in a specific AG from __xfs_alloc_vextent(), and we
can't do that until some other allocation completes. Given that AGF
locking orders are strictly enforced, we know that the holder of the
AGF will not deadlock as long as it can be scheduled to run.

IOWs, it is not safe to block forever on a lock in a worker thread
when there are outside dependencies on that lock being released to
make progress. To me that's not really a deadlock in the traditional
sense - it's more akin to a priority inversion that prevents the
resource holder from making progress. i.e. blocked threads hold the
worker resources, which prevent the lock holder from being able to
run and release the lock.

Hmmm. I think there are other ways to trigger this as well.
Anything that holds the AGF locked for a period of time is going to
cause allocations to back up like this, regardless of whether we
queue at xfs_alloc_vextent() or xfs_bmapi_write(). For example,
FITRIM holds the AGF locked for as long as it takes to issue all the
discard operations. That could stall enough allocations to run us
out of worker threads and hence effectively stop data allocations
from being run for some period of time. It won't deadlock, but it
certainly is sub-optimal behaviour.

I suspect the only way to fix this is to re-use the old workqueue
method of avoiding blocking on the workqueue indefinitely. That is,
if we fail to get the lock in this case, we return with EAGAIN and
requeue the work. __xfs_alloc_vextent() and xfs_alloc_fix_freelist()
already have trylock support, so this should be fairly easy to do.
If I also convert the work to delayed work, I can easily add a
backoff that will prevent busy looping on the workqueue.

I'll have a quick look at this and see what falls out....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Mark Tinguely
2012-09-20 13:49:55 UTC
Permalink
Post by Dave Chinner
I suspect the only way to fix this is to re-use the old workqueue
method of avoiding blocking on the workqueue indefinitely. That is,
if we fail to get the lock in this case, we return with EAGAIN and
requeue the work. __xfs_alloc_vextent() and xfs_alloc_fix_freelist()
already have trylock support, so this should be fairly easy to do.
If I also convert the work to delayed work, I can easily add a
backoff that will prevent busy looping on the workqueue.
I'll have a quick look at this and see what falls out....
Okay.

Many of these paths are not using an allocator worker (userdata==0)
and/or the loops are done within a new transaction.

--Mark.
Dave Chinner
2012-09-24 13:25:19 UTC
Permalink
Post by Dave Chinner
Post by t***@sgi.com
We have experienced file system hangs on the IA64, X86_32 and X86_64
environments caused by a deadlock on AGF buffers.
The stack frames below are a simple example of the deadlock. This example
was taken from the current OSS Linux 3.6 sources on a x86_32 machine that
I hit it last night via test 113 on a RHEL6 backport that has the
old workqueue implementation and so only has one worker per CPU.
.....
Post by Dave Chinner
Post by t***@sgi.com
I traced the callers of xfs_alloc_vextent(), noting transaction creation,
commits and cancels; I noted loops in the callers and which were marked
as metadata/userdata. I can send this document to reviewers.
No need, I can see the (generic) problem now that it has been
pointed out.
Basically, this is what the rescuer thread is kind of for - it works
around the inability to allocate a new worker thread due to an OOM
situation. This doesn't help us here because that rescuer would
simply block like the others. Hence a rescuer thread concept doesn't
solve the problem at all.
So let's go back to the case at hand. We block because this is the
first attempt to lock the AGF in an allocation, and the
XFS_ALLOC_FLAG_TRYLOCK flag is not set. This indicates we are asking
for allocation in a specific AG from __xfs_alloc_vextent(), and we
can't do that until some other allocation completes. Given that AGF
locking orders are strictly enforced, we know that the holder of the
AGF will not deadlock as long as it can be scheduled to run.
IOWs, it is not safe to block forever on a lock in a worker thread
when there are outside dependencies on that lock being released to
make progress. To me that's not really a deadlock in the traditional
sense - it's more akin to a priority inversion that prevents the
resource holder from making progress. i.e. blocked threads hold the
worker resources, which prevent the lock holder from being able to
run and release the lock.
Hmmm. I think there are other ways to trigger this as well.
Anything that holds the AGF locked for a period of time is going to
cause allocations to back up like this, regardless of whether we
queue at xfs_alloc_vextent() or xfs_bmapi_write(). For example,
FITRIM holds the AGF locked for as long as it takes to issue all the
discard operations. That could stall enough allocations to run us
out of worker threads and hence effectively stop data allocations
from being run for some period of time. It won't deadlock, but it
certainly is sub-optimal behaviour.
I suspect the only way to fix this is to re-use the old workqueue
method of avoiding blocking on the workqueue indefinitely. That is,
if we fail to get the lock in this case, we return with EAGAIN and
requeue the work. __xfs_alloc_vextent() and xfs_alloc_fix_freelist()
already have trylock support, so this should be fairly easy to do.
If I also convert the work to delayed work, I can easily add a
backoff that will prevent busy looping on the workqueue.
I'll have a quick look at this and see what falls out....
Here's what fell out. I've smoke tested it on a couple of different
machines with xfstests, and all seems fine so far.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com

xfs: don't block xfs_alloc_wq on AGF locks

From: Dave Chinner <***@redhat.com>

When the workqueue runs out of worker threads because all the
allocations are blocked on the AGF lock, the AGF lock holder cannot
run to complete the allocation transaction that will unlock the AGF.
As a result, allocations stop making progress. Raising the
workqueue concurrency limit does not solve the problem, just raises
the number of allocations that need to block before we hit the
starvation issue.

To avoid workqueue starvation, instead of blocking on the AGF lock
only try to lock the AGF when called from worker thread context. If
we fail to get the lock on all the AGs we are allowed to try, return
an EAGAIN error and have the worker requeue the allocation work for
a short time in the future when progress has been made. Because we
no longer block worker threads, we avoid the issue of starving
allocation requests of service and hence will always make progress.

Because the trylock semantics are now slightly different, add a
"try-harder" flag to ensure that we avoid conflating the new
"trylock" with the existing "avoid allocating data in a metadata
preferred AG" semantics. The "try-harder" flag is now used to
indicate data allocation in metadata-preferred AGs is allowed.

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_alloc.c | 65 +++++++++++++++++++++++++++++++++++++---------------
fs/xfs/xfs_alloc.h | 3 ++-
2 files changed, 48 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index 4f33c32..37cd31d 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -1764,9 +1764,9 @@ xfs_alloc_fix_freelist(
xfs_trans_t *tp; /* transaction pointer */

mp = args->mp;
-
pag = args->pag;
tp = args->tp;
+
if (!pag->pagf_init) {
if ((error = xfs_alloc_read_agf(mp, tp, args->agno, flags,
&agbp)))
@@ -1775,7 +1775,7 @@ xfs_alloc_fix_freelist(
ASSERT(flags & XFS_ALLOC_FLAG_TRYLOCK);
ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
args->agbp = NULL;
- return 0;
+ return EAGAIN;
}
} else
agbp = NULL;
@@ -1786,7 +1786,7 @@ xfs_alloc_fix_freelist(
* try harder at this point
*/
if (pag->pagf_metadata && args->userdata &&
- (flags & XFS_ALLOC_FLAG_TRYLOCK)) {
+ (flags & XFS_ALLOC_FLAG_TRYHARD)) {
ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
args->agbp = NULL;
return 0;
@@ -1822,7 +1822,7 @@ xfs_alloc_fix_freelist(
ASSERT(flags & XFS_ALLOC_FLAG_TRYLOCK);
ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
args->agbp = NULL;
- return 0;
+ return EAGAIN;
}
}
/*
@@ -2208,7 +2208,8 @@ xfs_alloc_read_agf(
*/
int /* error */
__xfs_alloc_vextent(
- xfs_alloc_arg_t *args) /* allocation argument structure */
+ xfs_alloc_arg_t *args,
+ bool trylock)
{
xfs_agblock_t agsize; /* allocation group size */
int error;
@@ -2249,6 +2250,7 @@ __xfs_alloc_vextent(
}
minleft = args->minleft;

+ flags = trylock ? XFS_ALLOC_FLAG_TRYLOCK : 0;
switch (type) {
case XFS_ALLOCTYPE_THIS_AG:
case XFS_ALLOCTYPE_NEAR_BNO:
@@ -2259,7 +2261,7 @@ __xfs_alloc_vextent(
args->agno = XFS_FSB_TO_AGNO(mp, args->fsbno);
args->pag = xfs_perag_get(mp, args->agno);
args->minleft = 0;
- error = xfs_alloc_fix_freelist(args, 0);
+ error = xfs_alloc_fix_freelist(args, flags);
args->minleft = minleft;
if (error) {
trace_xfs_alloc_vextent_nofix(args);
@@ -2309,7 +2311,7 @@ __xfs_alloc_vextent(
args->agno = XFS_FSB_TO_AGNO(mp, args->fsbno);
args->type = XFS_ALLOCTYPE_THIS_AG;
sagno = 0;
- flags = 0;
+ flags |= XFS_ALLOC_FLAG_TRYHARD;
} else {
if (type == XFS_ALLOCTYPE_START_AG)
args->type = XFS_ALLOCTYPE_THIS_AG;
@@ -2328,7 +2330,7 @@ __xfs_alloc_vextent(
if (no_min) args->minleft = 0;
error = xfs_alloc_fix_freelist(args, flags);
args->minleft = minleft;
- if (error) {
+ if (error && error != EAGAIN) {
trace_xfs_alloc_vextent_nofix(args);
goto error0;
}
@@ -2374,8 +2376,16 @@ __xfs_alloc_vextent(
}
if (flags == 0) {
no_min = 1;
+ } else if (trylock &&
+ flags == XFS_ALLOC_FLAG_TRYLOCK) {
+ flags |= XFS_ALLOC_FLAG_TRYHARD;
+ } else if (trylock &&
+ (flags & XFS_ALLOC_FLAG_TRYHARD)) {
+ error = EAGAIN;
+ trace_xfs_alloc_vextent_noagbp(args);
+ goto error0;
} else {
- flags = 0;
+ flags = XFS_ALLOC_FLAG_TRYHARD;
if (type == XFS_ALLOCTYPE_START_BNO) {
args->agbno = XFS_FSB_TO_AGBNO(mp,
args->fsbno);
@@ -2417,28 +2427,45 @@ error0:
return error;
}

+/*
+ * The worker is not allowed to block indefinitely on AGF locks. This prevents
+ * worker thread starvation from preventing allocation progress from being made.
+ * This occurs when the transaction holding the AGF cannot be scheduled to run
+ * in a worker because all worker threads are blocked on the AGF lock and no
+ * more can be allocated.
+ */
static void
xfs_alloc_vextent_worker(
struct work_struct *work)
{
- struct xfs_alloc_arg *args = container_of(work,
+ struct delayed_work *dwork = container_of(work,
+ struct delayed_work, work);
+ struct xfs_alloc_arg *args = container_of(dwork,
struct xfs_alloc_arg, work);
unsigned long pflags;
+ int result;

/* we are in a transaction context here */
current_set_flags_nested(&pflags, PF_FSTRANS);

- args->result = __xfs_alloc_vextent(args);
- complete(args->done);
+ result = __xfs_alloc_vextent(args, true);
+ if (result == EAGAIN) {
+ /* requeue for a later retry */
+ queue_delayed_work(xfs_alloc_wq, &args->work,
+ msecs_to_jiffies(10));
+ } else {
+ args->result = result;
+ complete(args->done);
+ }

current_restore_flags_nested(&pflags, PF_FSTRANS);
}

/*
* Data allocation requests often come in with little stack to work on. Push
- * them off to a worker thread so there is lots of stack to use. Metadata
- * requests, OTOH, are generally from low stack usage paths, so avoid the
- * context switch overhead here.
+ * them off to a worker thread so there is lots of stack to use if we are not
+ * already in a worker thread context. Metadata requests, OTOH, are generally
+ * from low stack usage paths, so avoid the context switch overhead here.
*/
int
xfs_alloc_vextent(
@@ -2446,13 +2473,13 @@ xfs_alloc_vextent(
{
DECLARE_COMPLETION_ONSTACK(done);

- if (!args->userdata)
- return __xfs_alloc_vextent(args);
+ if (!args->userdata || (current->flags & PF_WQ_WORKER))
+ return __xfs_alloc_vextent(args, false);


args->done = &done;
- INIT_WORK_ONSTACK(&args->work, xfs_alloc_vextent_worker);
- queue_work(xfs_alloc_wq, &args->work);
+ INIT_DELAYED_WORK_ONSTACK(&args->work, xfs_alloc_vextent_worker);
+ queue_delayed_work(xfs_alloc_wq, &args->work, 0);
wait_for_completion(&done);
return args->result;
}
diff --git a/fs/xfs/xfs_alloc.h b/fs/xfs/xfs_alloc.h
index 93be4a6..047a239 100644
--- a/fs/xfs/xfs_alloc.h
+++ b/fs/xfs/xfs_alloc.h
@@ -54,6 +54,7 @@ typedef unsigned int xfs_alloctype_t;
*/
#define XFS_ALLOC_FLAG_TRYLOCK 0x00000001 /* use trylock for buffer locking */
#define XFS_ALLOC_FLAG_FREEING 0x00000002 /* indicate caller is freeing extents*/
+#define XFS_ALLOC_FLAG_TRYHARD 0x00000004 /* also use metadata AGs for data */

/*
* In order to avoid ENOSPC-related deadlock caused by
@@ -121,7 +122,7 @@ typedef struct xfs_alloc_arg {
char userdata; /* set if this is user data */
xfs_fsblock_t firstblock; /* io first block allocated */
struct completion *done;
- struct work_struct work;
+ struct delayed_work work;
int result;
} xfs_alloc_arg_t;
Ben Myers
2012-09-24 17:11:59 UTC
Permalink
Hi Mark,

On Wed, Sep 19, 2012 at 11:31:33AM -0500, ***@sgi.com wrote:
...
Post by t***@sgi.com
I traced the callers of xfs_alloc_vextent(), noting transaction creation,
commits and cancels; I noted loops in the callers and which were marked
as metadata/userdata. I can send this document to reviewers.
I'd like to see the doc of xfs_alloc_vextent callers and which of them loop.
Can you post your doc to the list?

Regards,
Ben
Mark Tinguely
2012-09-24 18:09:04 UTC
Permalink
Date: Mon, 24 Sep 2012 12:11:59 -0500
Subject: Re: [PATCH 0/3] xfs: allocation worker causes freelist buffer lock
hang
Hi Mark,
...
Post by t***@sgi.com
I traced the callers of xfs_alloc_vextent(), noting transaction creation,
commits and cancels; I noted loops in the callers and which were marked
as metadata/userdata. I can send this document to reviewers.
I'd like to see the doc of xfs_alloc_vextent callers and which of them loop.
Can you post your doc to the list?
Regards,
Ben
Here are the Linux 3.6.x callers of xfs_alloc_vextent() stopping at the
transaction commit/cancel level.

XFS_BMAPI_METADATA *not* being set implies user data.

Userdata being set is the community-designated indication that an allocation
worker is needed to prevent overflow of the kernel stack.

Calling xfs_alloc_vextent() several times within the same transaction can cause
a deadlock if a new allocation worker cannot be allocated. I noted where the
loops occur. xfs_alloc_vextent() can call itself; those calls must happen in the
same allocation worker.

As a bonus, consolidating the loops into one worker actually gives a slight
performance advantage.

Sorry, this is wider than 80 characters.
---
xfs_bmap_btalloc(xfs_bmalloca_t)
! xfs_bmap_alloc(xfs_bmalloca_t)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! xfs_bmapi_write(xfs_trans_t ...) loops over above
! ! ! ! xfs_attr_rmtval_set(xfs_da_args_t) loops over bmapi_write (xfs_attr_set_int) **XFS_BMAPI_METADATA**
! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! xfs_da_grow_inode_int(xfs_da_args_t) loops over bmapi_write **XFS_BMAPI_METADATA**
! ! ! ! ! xfs_dir2_grow_inode(struct xfs_da_args)
! ! ! ! ! ! xfs_dir2_sf_to_block(xfs_da_args_t)
! ! ! ! ! ! ! xfs_bmap_add_attrfork_local(xfs_trans_t ..)
! ! ! ! ! ! ! ! xfs_bmap_add_attrfork(...) trans alloc, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_block_addname(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_leaf_to_node(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t) <also above>
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_block_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_node_addname_int(xfs_da_args_t ...)
! ! ! ! ! ! ! xfs_dir2_node_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! xfs_da_grow_inode(xfs_da_args)
! ! ! ! ! ! xfs_attr_shortform_to_leaf(xfs_da_args_t)
! ! ! ! ! ! ! xfs_attr_set_int(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_attr_leaf_to_node(xfs_da_args_t)
! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_attr_leaf_split(xfs_da_state_t ...)
! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_node_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_da_root_split(xfs_da_state_t ...) <above>
! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_da_node_split(xfs_da_state_t ...)
! ! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_dir2_block_to_leaf(xfs_da_args_t ...)
! ! ! ! ! ! ! xfs_dir2_block_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_leafn_split(xfs_da_state_t...)
! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! xfs_qm_dqalloc(xfs_trans_t ...) does a xfs_bmap_finish/cancel **XFS_BMAPI_METADATA**
! ! ! ! ! xfs_qm_dqtobp(xfs_trans_t ...)
! ! ! ! ! ! xfs_qm_dqread(...) does the trans_alloc/commit/cancel
! ! ! ! xfs_iomap_write_direct(...) alloc trans, xfs_trans_commit/cancel
! ! ! ! xfs_iomap_write_allocate(...) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_iomap_write_unwritten(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_growfs_rt_alloc(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_symlink(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_alloc_file_space(...) alloc trans, xfs_trans_commit/cancel safe loop

xfs_bmap_extents_to_btree(xfs_trans_t ...) <- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_extents(xfs_trans_t ...)
! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_bmap_add_extent_delay_real(xfs_bmalloca)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! <see above>
! xfs_bmap_add_extent_unwritten_real(xfs_trans_t ...)
! ! xfs_bmapi_convert_unwritten(xfs_bmalloca ...)
! ! ! xfs_bmapi_write(xfs_trans ...) calls xfs_bmapi_convert_unwritten in loop XFS_BMAPI_METADATA
! ! ! ! ... <see above>
! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! xfs_attr_rmtval_remove(xfs_da_args_t) loops calling xfs_bunmapi **XFS_BMAPI_METADATA**
! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! xfs_attr_node_removename(xfs_da_args_t)
! ! ! ! ! xfs_attr_remove_int(...) allocates trans does a xfs_trans_commit/cancel
! ! ! xfs_dir2_shrink_inode(xfs_da_args_t) **XFS_BMAPI_METADATA**
! ! ! ! xfs_dir2_leaf_trim_data(xfs_da_args_t)
! ! ! ! ! xfs_dir2_leaf_to_block(xfs_da_args_t ...)
! ! ! ! ! ! xfs_dir2_leaf_removename(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_removename(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_dir2_leaf_trim_data(xfs_da_args_t)
! ! ! ! ! xfs_dir2_leaf_to_block(xfs_da_args_t ...)
! ! ! ! ! ! xfs_dir2_leaf_removename(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_removename(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! ! ! xfs_dir2_node_to_leaf(xfs_da_state_t)
! ! ! ! ! ! ! xfs_dir2_node_removename(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_removename(xfs_trans_t ...) creates the xfs_da_args_t
! ! ! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_dir2_node_to_leaf(xfs_da_state_t)
! ! ! ! ! xfs_dir2_node_removename(xfs_da_args_t)
! ! ! ! ! ! xfs_dir_removename(xfs_trans_t ...) creates the xfs_da_args_t
! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! xfs_bmap_punch_delalloc_range(...) loops calling xfs_bunmapi with NULL tp
! ! ! xfs_da_shrink_inode(xfs_da_args_t) loops calling xfs_bunmapi **XFS_BMAPI_METADATA**
! ! ! ! xfs_dir2_leaf_to_block(xfs_da_args_t ...)
! ! ! ! ! xfs_dir2_leaf_removename(xfs_da_args_t)
! ! ! ! ! ! xfs_dir_removename(xfs_trans ...)
! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! xfs_itruncate_extents(xfs_trans ...) loops calling xfs_bunmapi with new tp
! ! ! xfs_inactive_symlink_rmt(..., xfs_trans_t) does a trans_commit and trans_dup
! ! ! xfs_free_file_space(...) loops calling xfs_bmapi with new tp
! xfs_bmap_add_extent_hole_real(xfs_bmalloca ...)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ... <see above>
! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ... <see above>

xfs_bmap_local_to_extents(xfs_trans_t ...) <- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_local(xfs_trans_t ..)
! ! xfs_bmap_add_attrfork(...) trans alloc, bmap_finish trans_commit/cancel
! xfs_bmapi_write(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ... <see above>

xfs_ialloc_ag_alloc(xfs_trans_t ...) userdata == 0
! xfs_dialloc(xfs_trans ...) loops over the above
! ! xfs_ialloc(xfs_trans ...)
! ! ! xfs_dir_ialloc(xfs_trans ...)
! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel

xfs_bmbt_alloc_block(xfs_btree_cur, ...) userdata == 0
! xfs_btree_split(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_free_ag_extent(xfs_trans ...)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_bmap_add_extent_delay_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_add_extent_hole_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_add_extent_unwritten_real(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ... <see above>
! ! ! ! ! ! xfs_bmapi_convert_unwritten(xfs_bmalloca, ...)
! ! ! ! ! ! ! xfs_bmapi_write(xfs_trans_t ...) loop over the above
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_del_extent(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_ialloc_ag_alloc(xfs_trans, ...) loops over the above
! ! ! ! ! ! xfs_dialloc(xfs_trans ...) loops over the above
! ! ! ! ! ! ! xfs_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_dir_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! xfs_btree_new_iroot(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull(xfs_btree_cur, ...)
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! xfs_bmap_add_attrfork_btree(xfs_trans_t ...)
! ! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_btree_new_root(xfs_btree_cur, ...)
! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)

xfs_inobt_alloc_block(xfs_btree_cur, ...) userdata == 0
! xfs_btree_split(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_free_ag_extent(xfs_trans ...)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_bmap_add_extent_delay_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_add_extent_hole_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_add_extent_unwritten_real(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ... <see above>
! ! ! ! ! ! xfs_bmapi_convert_unwritten(xfs_bmalloca, ...)
! ! ! ! ! ! ! xfs_bmapi_write(xfs_trans_t ...) loop over the above
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_del_extent(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_ialloc_ag_alloc(xfs_trans, ...) loops over the above
! ! ! ! ! ! xfs_dialloc(xfs_trans ...) loops over the above
! ! ! ! ! ! ! xfs_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_dir_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! xfs_btree_new_iroot(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull(xfs_btree_cur, ...)
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! xfs_bmap_add_attrfork_btree(xfs_trans_t ...)
! ! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_btree_new_root(xfs_btree_cur, ...)
! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)

--Mark.
Dave Chinner
2012-09-25 00:56:32 UTC
Post by Mark Tinguely
Date: Mon, 24 Sep 2012 12:11:59 -0500
Subject: Re: [PATCH 0/3] xfs: allocation worker causes freelist buffer lock
hang
Hi Mark,
...
Post by t***@sgi.com
I traced the callers of xfs_alloc_vextent(), noting transaction creation,
commits and cancels; I noted loops in the callers and which were marked
as metadata/userdata. I can send this document to reviewers.
I'd like to see the doc of xfs_alloc_vextent callers and which of them loop.
Can you post your doc to the list?
Regards,
Ben
Here are the Linux 3.6.x callers of xfs_alloc_vextent() stopping at the
transaction commit/cancel level.
XFS_BMAPI_METADATA *not* being set implies user data.
Userdata being set is the community designated indication that an allocate
worker is needed to prevent the overflow of the kernel stack.
Calling xfs_alloc_vextent() several times with the same transaction can cause
a dead lock if a new allocation worker can not be allocated. I noted where the
loops occur. xfs_alloc_vextent() can call itself, those calls must be in the
same allocation worker.
As a bonus, consolidating the loops into one worker actually gives a slight
performance advantage.
Can you quantify it?
Post by Mark Tinguely
Sorry this is wider than 80 characters wide.
---
xfs_bmap_btalloc(xfs_bmalloca_t)
! xfs_bmap_alloc(xfs_bmalloca_t)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! xfs_bmapi_write(xfs_trans_t ...) loops over above
! ! ! ! xfs_attr_rmtval_set(xfs_da_args_t) loops over bmapi_write (xfs_attr_set_int) **XFS_BMAPI_METADATA**
! ! ! ! xfs_da_grow_inode_int(xfs_da_args_t) loops over bmapi_write **XFS_BMAPI_METADATA**
! ! ! ! xfs_qm_dqalloc(xfs_trans_t ...) does a xfs_bmap_finish/cancel **XFS_BMAPI_METADATA**
! ! ! ! xfs_iomap_write_direct(...) alloc trans, xfs_trans_commit/cancel
! ! ! ! xfs_iomap_write_allocate(...) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_iomap_write_unwritten(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_growfs_rt_alloc(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_symlink(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_alloc_file_space(...) alloc trans, xfs_trans_commit/cancel safe loop
So the only data path callers through here are
xfs_iomap_write_direct(), xfs_iomap_write_allocate(),
xfs_iomap_write_unwritten() and xfs_alloc_file_space(). Everything
else is metadata, so won't use the workqueue.
Post by Mark Tinguely
xfs_bmap_extents_to_btree(xfs_trans_t ...) <- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_extents(xfs_trans_t ...)
! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_bmap_add_extent_delay_real(fs_bmalloca)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! <see above>
! xfs_bmap_add_extent_unwritten_real(xfs_trans_t ...)
! ! xfs_bmapi_convert_unwritten(xfs_bmalloca ...)
! ! ! xfs_bmapi_write(xfs_trans ...) calls xfs_bmapi_convert_unwritten in loop XFS_BMAPI_METADATA
! ! ! ! ... <see above>
.....
Post by Mark Tinguely
! xfs_bmap_add_extent_hole_real(xfs_bmalloca ...)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ... <see above>
So it's bmbt modification that looks to be the problem here, right?
Post by Mark Tinguely
xfs_bmap_local_to_extents(xfs_trans_t ...) <- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_local(xfs_trans_t ..)
! ! xfs_bmap_add_attrfork(...) trans alloc, bmap_finish trans_commit/cancel
! xfs_bmapi_write(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ... <see above>
Same here.

That's all I can see as problematic - maybe I read the output wrong
and there are others?

i.e. all other xfs_alloc_vextent() callers are either in metadata
context (so don't use the workqueue) or commit the transaction
directly after xfs_bmapi_write returns so will unlock the AGF buffer
before calling into xfs_bmapi_write a second time.

If these are the only loops, then patch 3 is all that is
necessary to avoid the problem of blocking on workqueue resource
while we are already on the workqueue, right? i.e. bmbt allocation
is a metadata allocation, even though the context is a data
allocation, and ensuring it is metadata means that the current
transaction won't get blocked waiting for workqueue resources...

What am I missing?

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Mark Tinguely
2012-09-25 15:14:16 UTC
Permalink
Post by Dave Chinner
Post by Mark Tinguely
Date: Mon, 24 Sep 2012 12:11:59 -0500
Subject: Re: [PATCH 0/3] xfs: allocation worker causes freelist buffer lock
hang
Hi Mark,
...
Post by t***@sgi.com
I traced the callers of xfs_alloc_vextent(), noting transaction creation,
commits and cancels; I noted loops in the callers and which were marked
as metadata/userdata. I can send this document to reviewers.
I'd like to see the doc of xfs_alloc_vextent callers and which of them loop.
Can you post your doc to the list?
Regards,
Ben
Here are the Linux 3.6.x callers of xfs_alloc_vextent() stopping at the
transaction commit/cancel level.
XFS_BMAPI_METADATA *not* being set implies user data.
Userdata being set is the community designated indication that an allocate
worker is needed to prevent the overflow of the kernel stack.
Calling xfs_alloc_vextent() several times with the same transaction can cause
a dead lock if a new allocation worker can not be allocated. I noted where the
loops occur. xfs_alloc_vextent() can call itself, those calls must be in the
same allocation worker.
As a bonus, consolidating the loops into one worker actually gives a slight
performance advantage.
Can you quantify it?
I was comparing the bonnie and iozone benchmark outputs. I will see if
someone can enlighten me on how to quantify those numbers.
Post by Dave Chinner
Post by Mark Tinguely
Sorry this is wider than 80 characters wide.
---
xfs_bmap_btalloc(xfs_bmalloca_t)
! xfs_bmap_alloc(xfs_bmalloca_t)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! xfs_bmapi_write(xfs_trans_t ...) loops over above
! ! ! ! xfs_attr_rmtval_set(xfs_da_args_t) loops over bmapi_write (xfs_attr_set_int) **XFS_BMAPI_METADATA**
! ! ! ! xfs_da_grow_inode_int(xfs_da_args_t) loops over bmapi_write **XFS_BMAPI_METADATA**
! ! ! ! xfs_qm_dqalloc(xfs_trans_t ...) does a xfs_bmap_finish/cancel **XFS_BMAPI_METADATA**
! ! ! ! xfs_iomap_write_direct(...) alloc trans, xfs_trans_commit/cancel
! ! ! ! xfs_iomap_write_allocate(...) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_iomap_write_unwritten(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_growfs_rt_alloc(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_symlink(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_alloc_file_space(...) alloc trans, xfs_trans_commit/cancel safe loop
So the only data path callers though here are
xfs_iomap_write_direct(), xfs_iomap_write_allocate() and
xfs_iomap_write_unwritten() and xfs_alloc_file_space(). Everything
else is metadata, so won't use
Correct. Only these use the worker. All the other paths directly call
the __xfs_alloc_vextent() routine. I saved the xfs_bmalloca and
xfs_alloc_arg when allocating a buffer. I am quite confident that this is
the only path that causes the problem.
Post by Dave Chinner
Post by Mark Tinguely
xfs_bmap_extents_to_btree(xfs_trans_t ...)<- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_extents(xfs_trans_t ...)
! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_bmap_add_extent_delay_real(fs_bmalloca)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! !<see above>
! xfs_bmap_add_extent_unwritten_real(xfs_trans_t ...)
! ! xfs_bmapi_convert_unwritten(xfs_bmalloca ...)
! ! ! xfs_bmapi_write(xfs_trans ...) calls xfs_bmapi_convert_unwritten in loop XFS_BMAPI_METADATA
! ! ! ! ...<see above>
.....
Post by Mark Tinguely
! xfs_bmap_add_extent_hole_real(xfs_bmalloca ...)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ...<see above>
So it's bmbt modification that looks to be the problem here, right?
Post by Mark Tinguely
xfs_bmap_local_to_extents(xfs_trans_t ...)<- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_local(xfs_trans_t ..)
! ! xfs_bmap_add_attrfork(...) trans alloc, bmap_finish trans_commit/cancel
! xfs_bmapi_write(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ...<see above>
Same here.
That's all I can see as problematic - maybe I read the output wrong
and there's others?
i.e. all other xfs_alloc_vextent() callers are either in metadata
context (so don't use the workqueue) or commit the transaction
directly after xfs_bmapi_write returns so will unlock the AGF buffer
before calling into xfs_bmapi_write a second time.
If these are the only loops, then patch 3 is all that is
necessary to avoid the problem of blocking on workqueue resource
while we are already on the workqueue, right? i.e. bmbt allocation
is a metadata allocation, even though the context is a data
allocation, and ensuring it is metadata means that the current
transaction won't get blocked waiting for workqueue resources...
What am I missing?
Patch 3 is a clean up. The userdata flag is used by the allocation
routines and its value should be correct for those routines. I
discovered the uninitialized values while tracing the calling list. The
fact that the worker routine was reading an uninitialized stack value
and randomly creating a worker based on it is not a factor in the
problem, because those paths do not loop on xfs_alloc_vextent().

Patch 2 moves the worker so that when the worker servicing a data
allocation request gets the lock, it will hold the worker over the loop
in xfs_bmapi_write() until it can do a xfs_trans_commit/cancel. If it
does not have the lock, then that worker will wait until it can get the
lock.

Your patch hangs when limiting the available workers on test 109 on the
3 machines (2 x86_64 and a x86_32) that I tried it on. I am trying a
fourth machine to be sure.

Thanks,

--Mark.
Dave Chinner
2012-09-25 22:01:10 UTC
Post by Mark Tinguely
Post by Dave Chinner
Post by Mark Tinguely
Date: Mon, 24 Sep 2012 12:11:59 -0500
Subject: Re: [PATCH 0/3] xfs: allocation worker causes freelist buffer lock
hang
Hi Mark,
...
Post by t***@sgi.com
I traced the callers of xfs_alloc_vextent(), noting transaction creation,
commits and cancels; I noted loops in the callers and which were marked
as metadata/userdata. I can send this document to reviewers.
I'd like to see the doc of xfs_alloc_vextent callers and which of them loop.
Can you post your doc to the list?
Regards,
Ben
Here are the Linux 3.6.x callers of xfs_alloc_vextent() stopping at the
transaction commit/cancel level.
XFS_BMAPI_METADATA *not* being set implies user data.
Userdata being set is the community designated indication that an allocate
worker is needed to prevent the overflow of the kernel stack.
Calling xfs_alloc_vextent() several times with the same transaction can cause
a dead lock if a new allocation worker can not be allocated. I noted where the
loops occur. xfs_alloc_vextent() can call itself, those calls must be in the
same allocation worker.
As a bonus, consolidating the loops into one worker actually gives a slight
performance advantage.
Can you quantify it?
I was comparing the bonnie and iozone benchmarks outputs. I will see
if someone can enlighten me on how to quantify those numbers.
Ugh.

Don't bother. Those are two of the worst offenders in the "useless
benchmarks for regression testing" category. Yeah, they *look* like
they give decent numbers, but I've wasted so much time looking at
results from these benchmarks only to find they do basic things wrong
and give numbers that vary simply because you've made a change that
increases or decreases the CPU cache footprint of a code path.

e.g. IOZone uses the same memory buffer as the source/destination of
all its IO, and does not touch the contents of it at all. Hence for
small IO, the buffer stays resident in the CPU caches and gives
unrealistically high throughput results. Worse is the fact that CPU
cache residency of the buffer can change according to the kernel
code path taken, so you can get massive changes in throughput just
by changing the layout of the code without changing any logic....

IOZone can be useful if you know exactly what you are doing and
using it to test a specific code path with a specific set of
configurations. e.g. comparing ext3/4/xfs/btrfs on the same kernel
and storage is fine. However, the moment you start using it to
compare different kernels, it's a total crap shoot....
Post by Mark Tinguely
Post by Dave Chinner
Post by Mark Tinguely
Sorry this is wider than 80 characters wide.
---
xfs_bmap_btalloc(xfs_bmalloca_t)
! xfs_bmap_alloc(xfs_bmalloca_t)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! xfs_bmapi_write(xfs_trans_t ...) loops over above
! ! ! ! xfs_attr_rmtval_set(xfs_da_args_t) loops over bmapi_write (xfs_attr_set_int) **XFS_BMAPI_METADATA**
! ! ! ! xfs_da_grow_inode_int(xfs_da_args_t) loops over bmapi_write **XFS_BMAPI_METADATA**
! ! ! ! xfs_qm_dqalloc(xfs_trans_t ...) does a xfs_bmap_finish/cancel **XFS_BMAPI_METADATA**
! ! ! ! xfs_iomap_write_direct(...) alloc trans, xfs_trans_commit/cancel
! ! ! ! xfs_iomap_write_allocate(...) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_iomap_write_unwritten(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_growfs_rt_alloc(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_symlink(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_alloc_file_space(...) alloc trans, xfs_trans_commit/cancel safe loop
So the only data path callers though here are
xfs_iomap_write_direct(), xfs_iomap_write_allocate() and
xfs_iomap_write_unwritten() and xfs_alloc_file_space(). Everything
else is metadata, so won't use
Correct. Only these use the worker. All the other paths directly
call the __xfs_alloc_vextent() routine. I saved the xfs_bmalloca and
fs_alloc_arg when allocating a buffer. I am quite confident that
this is the only path that causes the problem.
Of course - they are the only paths that do userdata allocation :/
Post by Mark Tinguely
Post by Dave Chinner
Post by Mark Tinguely
xfs_bmap_extents_to_btree(xfs_trans_t ...)<- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_extents(xfs_trans_t ...)
! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_bmap_add_extent_delay_real(fs_bmalloca)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! !<see above>
! xfs_bmap_add_extent_unwritten_real(xfs_trans_t ...)
! ! xfs_bmapi_convert_unwritten(xfs_bmalloca ...)
! ! ! xfs_bmapi_write(xfs_trans ...) calls xfs_bmapi_convert_unwritten in loop XFS_BMAPI_METADATA
! ! ! ! ...<see above>
.....
Post by Mark Tinguely
! xfs_bmap_add_extent_hole_real(xfs_bmalloca ...)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ...<see above>
So it's bmbt modification that looks to be the problem here, right?
Post by Mark Tinguely
xfs_bmap_local_to_extents(xfs_trans_t ...)<- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_local(xfs_trans_t ..)
! ! xfs_bmap_add_attrfork(...) trans alloc, bmap_finish trans_commit/cancel
! xfs_bmapi_write(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ...<see above>
Same here.
That's all I can see as problematic - maybe I read the output wrong
and there's others?
i.e. all other xfs_alloc_vextent() callers are either in metadata
context (so don't use the workqueue) or commit the transaction
directly after xfs_bmapi_write returns so will unlock the AGF buffer
before calling into xfs_bmapi_write a second time.
If these are the only loops, then patch 3 is all that is
necessary to avoid the problem of blocking on workqueue resource
while we are already on the workqueue, right? i.e. bmbt allocation
is a metadata allocation, even though the context is a data
allocation, and ensuring it is metadata means that the current
transaction won't get blocked waiting for workqueue resources...
What am I missing?
Patch 3 is a clean up.
I disagree - allocating metadata using the userdata tag is clearly
wrong, and AFAICT is a real deadlock.
Post by Mark Tinguely
userdata is used by the allocation routines
and its value should be correct for those routines. I discovered the
uninitialized values tracing the calling list. The fact that the
worker routine was using and randomly creating a worker based on the
stack value is not a factor in the problem because those paths do
not loop on xfs_alloc_vextent().
I guess I don't understand what you mean by "loop on
xfs_alloc_vextent()" then.

The problem I see above is this:

thread 1 worker 1 worker 2..max
xfs_bmapi_write(userdata)
xfs_bmapi_allocate(user)
xfs_alloc_vextent(user)
wait

_xfs_alloc_vextent()
locks AGF
_xfs_alloc_vextent()
blocks on AGF lock
completes allocation

<returns with AGF locked in transaction>
xfs_bmap_add_extent_hole_real
xfs_bmap_extents_to_btree
xfs_alloc_vextent(user)
wait

<deadlock as no more workers available>

The moment xfs_bmap_extents_to_btree() is converted to a metadata
allocation, this deadlock goes away because we end up doing:

xfs_bmap_add_extent_hole_real
xfs_bmap_extents_to_btree
xfs_alloc_vextent(meta)
__xfs_alloc_vextent

and continuing to the end of xfs_bmapi_write() without ever waiting
on a worker thread while holding the AGF locked.

I cannot see any other path where we call xfs_alloc_vextent() with
the AGF locked within an active transaction with userdata set, which
is why I'm saying that I don't understand what you mean by
"loops on xfs_alloc_vextent()".
Post by Mark Tinguely
Patch 2 moves the worker so that when the worker servicing data
allocation request gets the lock, it will hold the worker over the
loop in xfs_bmapi_write() until it can do a xfs_trans_commit/cancel.
If it does not have the lock, then that worker will wait until it
can get the lock.
Your patch hangs when limiting the available workers on test 109 on
the 3 machines (2 x86_64 and a x86_32) that I tried it on. I am
trying a fourth machine to be sure.
Yeah, looks like the loop in __xfs_alloc_vextent is not terminating
correctly. I'll fix it.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Mark Tinguely
2012-09-26 14:14:14 UTC
<deletes>
Post by Dave Chinner
Post by Mark Tinguely
Post by Dave Chinner
Post by Mark Tinguely
As a bonus, consolidating the loops into one worker actually gives a slight
performance advantage.
Can you quantify it?
I was comparing the bonnie and iozone benchmarks outputs. I will see
if someone can enlighten me on how to quantify those numbers.
Ugh.
Don't bother. Those are two of the worst offenders in the "useless
benchmarks for regression testing" category. Yeah, they *look* like
they give decent numbers, but I've wasted so much time looking at
results from these benhmarks only to find they do basic things wrong
and give numbers that vary simple because you've made a change that
increases or decreases the CPU cache footprint of a code path.
e.g. IOZone uses the same memory buffer as the source/destination of
all it's IO, and does not touch the contents of it at all. Hence for
small IO, the buffer stays resident in the CPU caches and gives
unrealsitically high throughput results. Worse is the fact that CPU
cache residency of the buffer can change according to the kernel
code path taken, so you can get massive changes in throughput just
by changing the layout of the code without changing any logic....
IOZone can be useful if you know exactly what you are doing and
using it to test a specific code path with a specific set of
configurations. e.g. comparing ext3/4/xfs/btrfs on the same kernel
and storage is fine. However, the moment you start using it to
compare different kernels, it's a total crap shoot....
does anyone have a good benchmark XFS should use to share performance
results? A number we can agree shows that a series does not degrade the
filesystem?

lies, damn lies, statistics and then filesystem benchmarks?! :)
Post by Dave Chinner
I guess I don't understand what you mean by "loop on
xfs_alloc_vextent()" then.
thread 1 worker 1 worker 2..max
xfs_bmapi_write(userdata)
loops here calling xfs_bmapi_allocate()
Post by Dave Chinner
xfs_bmapi_allocate(user)
xfs_alloc_vextent(user)
wait
_xfs_alloc_vextent()
locks AGF
first loop it takes the lock
one of the next times through the above
loop it cannot get a worker. Deadlock here.

I saved the xfs_bmalloca and xfs_alloc_arg when
allocating a buffer to verify the paths.
Post by Dave Chinner
_xfs_alloc_vextent()
blocks on AGF lock
completes allocation
<returns with AGF locked in transaction>
xfs_bmap_add_extent_hole_real
xfs_bmap_extents_to_btree
xfs_alloc_vextent(user)
wait
This does not need a worker; since it is in the same
transaction, all locks on the AGF buffer are recursive locks.
No wait here.
Post by Dave Chinner
<deadlock as no more workers available>
<deletes>

--Mark.
Dave Chinner
2012-09-26 23:41:49 UTC
Post by Mark Tinguely
<deletes>
Post by Dave Chinner
Post by Mark Tinguely
Post by Dave Chinner
Post by Mark Tinguely
As a bonus, consolidating the loops into one worker actually gives a slight
performance advantage.
Can you quantify it?
I was comparing the bonnie and iozone benchmarks outputs. I will see
if someone can enlighten me on how to quantify those numbers.
Ugh.
Don't bother. Those are two of the worst offenders in the "useless
benchmarks for regression testing" category. Yeah, they *look* like
they give decent numbers, but I've wasted so much time looking at
results from these benhmarks only to find they do basic things wrong
and give numbers that vary simple because you've made a change that
increases or decreases the CPU cache footprint of a code path.
e.g. IOZone uses the same memory buffer as the source/destination of
all it's IO, and does not touch the contents of it at all. Hence for
small IO, the buffer stays resident in the CPU caches and gives
unrealsitically high throughput results. Worse is the fact that CPU
cache residency of the buffer can change according to the kernel
code path taken, so you can get massive changes in throughput just
by changing the layout of the code without changing any logic....
IOZone can be useful if you know exactly what you are doing and
using it to test a specific code path with a specific set of
configurations. e.g. comparing ext3/4/xfs/btrfs on the same kernel
and storage is fine. However, the moment you start using it to
compare different kernels, it's a total crap shoot....
does anyone have a good benchmark XFS should use to share
performance results? A number that we can agree a series does not
degrade the filesystem..
ffsb and fio with benchmark style workload definitions, fsmark when
using total runtime rather than files/s for comparison. I also have a
bunch of hand-written scripts that I've written specifically for
testing stuff like directory traversal and IO
parallelism scalability.

The problem is that common benchmarks like iozone, postmark,
bonnie++ and dbench are really just load generators that output
numbers - they are all very sensitive to many non-filesystem
changes (e.g. VM tunings) and so have to be run "just right" to
produce meaningful numbers. That "just right" changes from kernel
to kernel, machine to machine (running the same kernel!) and
filesystem to filesystem.

Dbench is particularly sensitive to VM changes and tunings, as well as
log IO latency. e.g. running it on the same system on different
regions of the same disk gives significantly different results because
writes to the log are slower on the inner regions of disk. Indeed,
just changing the log stripe unit without changing anything else can
have a marked effect on results...

I have machines here that I can't use IOzone at all for IO sizes of
less than 4MB because the timing loops it uses don't have enough
granularity to accurately measure syscall times, and hence it
doesn't report throughputs reliably. The worst part is that in the
results output, it doesn't tell you it had problems - you have to
look in the log files to get that information and realise that the
entire benchmark run was compromised...
Post by Mark Tinguely
lies, damn lies, statistics and then filesystem benchmarks?! :)
I think I've said that myself many times in the past. ;)
Post by Mark Tinguely
Post by Dave Chinner
I guess I don't understand what you mean by "loop on
xfs_alloc_vextent()" then.
thread 1 worker 1 worker 2..max
xfs_bmapi_write(userdata)
loops here calling xfs_bmapi_alloc()
I don't think it can for userdata allocations. Metadata, yes,
userdata, no.
Post by Mark Tinguely
Post by Dave Chinner
xfs_bmapi_allocate(user)
xfs_alloc_vextent(user)
wait
_xfs_alloc_vextent()
locks AGF
first loop it takes the lock
one of the next times through the above
loop it cannot get a worker. deadlock here.
I saved the xfs_bmalloca and fs_alloc_arg when
allocating a buffer to verify the paths.
Post by Dave Chinner
_xfs_alloc_vextent()
blocks on AGF lock
completes allocation
<returns with AGF locked in transaction>
And then xfs_bmapi_allocate() increments bma->nallocs after each
successful allocation. Hence the only way we can loop there for
userdata allocations is if the nmap parameter passed in is > 1.

When it returns to xfs_bmapi_write(), it trims and updates the map,
and then this code prevents it from looping:

/*
* If we're done, stop now. Stop when we've allocated
* XFS_BMAP_MAX_NMAP extents no matter what. Otherwise
* the transaction may get too big.
*/
if (bno >= end || n >= *nmap || bma.nallocs >= *nmap)
break;

Because the userdata callers use these values for *nmap:

xfs_iomap_write_direct nimaps = 1;
xfs_iomap_write_allocate nimaps = 1;
xfs_iomap_write_unwritten nimaps = 1;
xfs_alloc_file_space nimaps = 1;

So basically it will break out of the loop after a single
allocation, and hence should not be getting stuck on a second
allocation in xfs_bmapi_allocate()/xfs_alloc_vextent() with the AGF
locked from this code path.
Post by Mark Tinguely
Post by Dave Chinner
xfs_bmap_add_extent_hole_real
xfs_bmap_extents_to_btree
xfs_alloc_vextent(user)
wait
this does not need a worker, and since in the same
I agree that it does not *need* a worker, but it will *use* a worker
thread if args->userdata is not initialised to zero.
Post by Mark Tinguely
transaction all locks to the AGF buffer are recursive locks.
Right, the worker won't block on the AGF lock, but if it can't get a
worker because they are all blocked on the AGF lock, then it
deadlocks.

As I said before, I cannot see any other path that will trigger a
userdata allocation with the AGF locked. So does patch 3 by itself
solve the problem?

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Mark Tinguely
2012-09-27 20:10:38 UTC
Permalink
Post by Dave Chinner
Post by Mark Tinguely
<deletes>
Post by Dave Chinner
Post by Mark Tinguely
Post by Dave Chinner
Post by Mark Tinguely
As a bonus, consolidating the loops into one worker actually gives a slight
performance advantage.
Can you quantify it?
I was comparing the bonnie and iozone benchmark outputs. I will see
if someone can enlighten me on how to quantify those numbers.
Ugh.
Don't bother. Those are two of the worst offenders in the "useless
benchmarks for regression testing" category. Yeah, they *look* like
they give decent numbers, but I've wasted so much time looking at
results from these benchmarks only to find they do basic things wrong
and give numbers that vary simply because you've made a change that
increases or decreases the CPU cache footprint of a code path.
e.g. IOZone uses the same memory buffer as the source/destination of
all its IO, and does not touch the contents of it at all. Hence for
small IO, the buffer stays resident in the CPU caches and gives
unrealistically high throughput results. Worse is the fact that CPU
cache residency of the buffer can change according to the kernel
code path taken, so you can get massive changes in throughput just
by changing the layout of the code without changing any logic....
IOZone can be useful if you know exactly what you are doing and
using it to test a specific code path with a specific set of
configurations. e.g. comparing ext3/4/xfs/btrfs on the same kernel
and storage is fine. However, the moment you start using it to
compare different kernels, it's a total crap shoot....
does anyone have a good benchmark XFS should use to share
performance results? A number that we can agree a series does not
degrade the filesystem?
ffsb and fio with benchmark style workload definitions, fsmark when
using total runtime rather than files/s for comparison. I also have a
bunch of hand-written scripts written specifically for testing stuff
like directory traversal and IO parallelism scalability.
The problem is that common benchmarks like iozone, postmark,
bonnie++ and dbench are really just load generators that output
numbers - they are all very sensitive to many non-filesystem
changes (e.g. VM tunings) and so have to be run "just right" to
produce meaningful numbers. That "just right" changes from kernel
to kernel, machine to machine (running the same kernel!) and
filesystem to filesystem.
Dbench is particularly sensitive to VM changes and tunings, as well as
log IO latency. e.g. running it on the same system on different
regions of the same disk gives significantly different results because
writes to the log are slower on the inner regions of disk. Indeed,
just changing the log stripe unit without changing anything else can
have a marked effect on results...
I have machines here that I can't use IOzone at all for IO sizes of
less than 4MB because the timing loops it uses don't have enough
granularity to accurately measure syscall times, and hence it
doesn't report throughputs reliably. The worst part is that in the
results output, it doesn't tell you it had problems - you have to
look in the log files to get that information and realise that the
entire benchmark run was compromised...
Post by Mark Tinguely
lies, damn lies, statistics and then filesystem benchmarks?! :)
I think I've said that myself many times in the past. ;)
Thank-you for the information.
Post by Dave Chinner
Post by Mark Tinguely
Post by Dave Chinner
I guess I don't understand what you mean by "loop on
xfs_alloc_vextent()" then.
thread 1 worker 1 worker 2..max
xfs_bmapi_write(userdata)
loops here calling xfs_bmapi_alloc()
I don't think it can for userdata allocations. Metadata, yes,
userdata, no.
Post by Mark Tinguely
Post by Dave Chinner
xfs_bmapi_allocate(user)
xfs_alloc_vextent(user)
wait
_xfs_alloc_vextent()
locks AGF
first loop it takes the lock
one of the next times through the above
loop it cannot get a worker. deadlock here.
I saved the xfs_bmalloca and fs_alloc_arg when
allocating a buffer to verify the paths.
Post by Dave Chinner
_xfs_alloc_vextent()
blocks on AGF lock
completes allocation
<returns with AGF locked in transaction>
And then xfs_bmapi_allocate() increments bma->nallocs after each
successful allocation. Hence the only way we can loop there for
userdata allocations is if the nmap parameter passed in is > 1.
When it returns to xfs_bmapi_write(), it trims and updates the map,
/*
* If we're done, stop now. Stop when we've allocated
* XFS_BMAP_MAX_NMAP extents no matter what. Otherwise
* the transaction may get too big.
*/
if (bno >= end || n >= *nmap || bma.nallocs >= *nmap)
break;
xfs_iomap_write_direct nimaps = 1;
xfs_iomap_write_allocate nimaps = 1;
xfs_iomap_write_unwritten nimaps = 1;
xfs_alloc_file_space nimaps = 1;
So basically it will break out of the loop after a single
allocation, and hence should not be getting stuck on a second
allocation in xfs_bmapi_allocate()/xfs_alloc_vextent() with the AGF
locked from this code path.
Post by Mark Tinguely
Post by Dave Chinner
xfs_bmap_add_extent_hole_real
xfs_bmap_extents_to_btree
xfs_alloc_vextent(user)
wait
this does not need a worker, and since in the same
I agree that it does not *need* a worker, but it will *use* a worker
thread if args->userdata is not initialised to zero.
Post by Mark Tinguely
transaction all locks to the AGF buffer are recursive locks.
Right, the worker won't block on the AGF lock, but if it can't get a
worker because they are all blocked on the AGF lock, then it
deadlocks.
As I said before, I cannot see any other path that will trigger a
userdata allocation with the AGF locked. So does patch 3 by itself
solve the problem?
I thought I tried it, but I will retry it when I have a free machine.


--Mark.
Dave Chinner
2012-09-28 03:08:47 UTC
Permalink
Post by Mark Tinguely
Your patch hangs when limiting the available workers on test 109 on
the 3 machines (2 x86_64 and a x86_32) that I tried it on. I am
trying a fourth machine to be sure.
Updated version attached - it was just a loop exit case that didn't
work, and that case is only triggered on xbf_low allocations (i.e.
ENOSPC).

It doesn't cause any problems here on 3.6-rc7 (combined with your
patch 3), and the two patches also fix the same hangs on my RHEL
test kernels that only have one worker thread per CPU.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com

xfs: don't block xfs_alloc_wq on AGF locks

From: Dave Chinner <***@redhat.com>

When the workqueue runs out of worker threads because all the
allocations are blocked on the AGF lock, the AGF lock holder cannot
run to complete the allocation transaction that will unlock the AGF.
As a result, allocations stop making progress. Raising the
workqueue concurrency limit does not solve the problem, just raises
the number of allocations that need to block before we hit the
starvation issue.

To avoid workqueue starvation, instead of blocking on the AGF lock
only try to lock the AGF when called from worker thread context. If
we fail to get the lock on all the AGs we are allowed to try, return
an EAGAIN error and have the worker requeue the allocation work for
a short time in the future when progress has been made. Because we
no longer block worker threads, we avoid the issue of starving
allocations requests of service and hence will always make progress.

Because the trylock semantics are now slightly different, add a
"try-harder" flag to ensure that we avoid conflating the new
"trylock" with the existing "avoid allocating data in a metadata
preferred AG" semantics. The "try-harder" flag is now used to
indicate data allocation in metadata-preferred AGs is allowed.

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_alloc.c | 68 ++++++++++++++++++++++++++++++++++++----------------
fs/xfs/xfs_alloc.h | 3 ++-
2 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index 0287f3b..57fad5f 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -1764,9 +1764,9 @@ xfs_alloc_fix_freelist(
xfs_trans_t *tp; /* transaction pointer */

mp = args->mp;
-
pag = args->pag;
tp = args->tp;
+
if (!pag->pagf_init) {
if ((error = xfs_alloc_read_agf(mp, tp, args->agno, flags,
&agbp)))
@@ -1775,7 +1775,7 @@ xfs_alloc_fix_freelist(
ASSERT(flags & XFS_ALLOC_FLAG_TRYLOCK);
ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
args->agbp = NULL;
- return 0;
+ return EAGAIN;
}
} else
agbp = NULL;
@@ -1786,7 +1786,7 @@ xfs_alloc_fix_freelist(
* try harder at this point
*/
if (pag->pagf_metadata && args->userdata &&
- (flags & XFS_ALLOC_FLAG_TRYLOCK)) {
+ (flags & XFS_ALLOC_FLAG_TRYHARD)) {
ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
args->agbp = NULL;
return 0;
@@ -1822,7 +1822,7 @@ xfs_alloc_fix_freelist(
ASSERT(flags & XFS_ALLOC_FLAG_TRYLOCK);
ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
args->agbp = NULL;
- return 0;
+ return EAGAIN;
}
}
/*
@@ -2209,7 +2209,8 @@ xfs_alloc_read_agf(
*/
int /* error */
__xfs_alloc_vextent(
- xfs_alloc_arg_t *args) /* allocation argument structure */
+ xfs_alloc_arg_t *args,
+ bool trylock)
{
xfs_agblock_t agsize; /* allocation group size */
int error;
@@ -2250,6 +2251,7 @@ __xfs_alloc_vextent(
}
minleft = args->minleft;

+ flags = trylock ? XFS_ALLOC_FLAG_TRYLOCK : 0;
switch (type) {
case XFS_ALLOCTYPE_THIS_AG:
case XFS_ALLOCTYPE_NEAR_BNO:
@@ -2260,7 +2262,7 @@ __xfs_alloc_vextent(
args->agno = XFS_FSB_TO_AGNO(mp, args->fsbno);
args->pag = xfs_perag_get(mp, args->agno);
args->minleft = 0;
- error = xfs_alloc_fix_freelist(args, 0);
+ error = xfs_alloc_fix_freelist(args, flags);
args->minleft = minleft;
if (error) {
trace_xfs_alloc_vextent_nofix(args);
@@ -2310,7 +2312,8 @@ __xfs_alloc_vextent(
args->agno = XFS_FSB_TO_AGNO(mp, args->fsbno);
args->type = XFS_ALLOCTYPE_THIS_AG;
sagno = 0;
- flags = 0;
+ trylock = 0;
+ flags |= XFS_ALLOC_FLAG_TRYHARD;
} else {
if (type == XFS_ALLOCTYPE_START_AG)
args->type = XFS_ALLOCTYPE_THIS_AG;
@@ -2329,7 +2332,7 @@ __xfs_alloc_vextent(
if (no_min) args->minleft = 0;
error = xfs_alloc_fix_freelist(args, flags);
args->minleft = minleft;
- if (error) {
+ if (error && error != EAGAIN) {
trace_xfs_alloc_vextent_nofix(args);
goto error0;
}
@@ -2373,10 +2376,18 @@ __xfs_alloc_vextent(
trace_xfs_alloc_vextent_allfailed(args);
break;
}
- if (flags == 0) {
+ if (flags == XFS_ALLOC_FLAG_TRYHARD) {
no_min = 1;
+ } else if (trylock &&
+ flags == XFS_ALLOC_FLAG_TRYLOCK) {
+ flags |= XFS_ALLOC_FLAG_TRYHARD;
+ } else if (trylock &&
+ (flags & XFS_ALLOC_FLAG_TRYHARD)) {
+ error = EAGAIN;
+ trace_xfs_alloc_vextent_noagbp(args);
+ goto error0;
} else {
- flags = 0;
+ flags = XFS_ALLOC_FLAG_TRYHARD;
if (type == XFS_ALLOCTYPE_START_BNO) {
args->agbno = XFS_FSB_TO_AGBNO(mp,
args->fsbno);
@@ -2418,28 +2429,45 @@ error0:
return error;
}

+/*
+ * The worker is not allowed to block indefinitely on AGF locks. This prevents
+ * worker thread starvation from preventing allocation progress from being made.
+ * This occurs when the transaction holding the AGF cannot be scheduled to run
+ * in a worker because all worker threads are blocked on the AGF lock and no
+ * more can be allocated.
+ */
static void
xfs_alloc_vextent_worker(
struct work_struct *work)
{
- struct xfs_alloc_arg *args = container_of(work,
+ struct delayed_work *dwork = container_of(work,
+ struct delayed_work, work);
+ struct xfs_alloc_arg *args = container_of(dwork,
struct xfs_alloc_arg, work);
unsigned long pflags;
+ int result;

/* we are in a transaction context here */
current_set_flags_nested(&pflags, PF_FSTRANS);

- args->result = __xfs_alloc_vextent(args);
- complete(args->done);
+ result = __xfs_alloc_vextent(args, true);
+ if (result == EAGAIN) {
+ /* requeue for a later retry */
+ queue_delayed_work(xfs_alloc_wq, &args->work,
+ msecs_to_jiffies(10));
+ } else {
+ args->result = result;
+ complete(args->done);
+ }

current_restore_flags_nested(&pflags, PF_FSTRANS);
}

/*
* Data allocation requests often come in with little stack to work on. Push
- * them off to a worker thread so there is lots of stack to use. Metadata
- * requests, OTOH, are generally from low stack usage paths, so avoid the
- * context switch overhead here.
+ * them off to a worker thread so there is lots of stack to use if we are not
+ * already in a worker thread context. Metadata requests, OTOH, are generally
+ * from low stack usage paths, so avoid the context switch overhead here.
*/
int
xfs_alloc_vextent(
@@ -2447,13 +2475,13 @@ xfs_alloc_vextent(
{
DECLARE_COMPLETION_ONSTACK(done);

- if (!args->userdata)
- return __xfs_alloc_vextent(args);
+ if (!args->userdata || (current->flags & PF_WQ_WORKER))
+ return __xfs_alloc_vextent(args, false);


args->done = &done;
- INIT_WORK_ONSTACK(&args->work, xfs_alloc_vextent_worker);
- queue_work(xfs_alloc_wq, &args->work);
+ INIT_DELAYED_WORK_ONSTACK(&args->work, xfs_alloc_vextent_worker);
+ queue_delayed_work(xfs_alloc_wq, &args->work, 0);
wait_for_completion(&done);
return args->result;
}
diff --git a/fs/xfs/xfs_alloc.h b/fs/xfs/xfs_alloc.h
index 93be4a6..047a239 100644
--- a/fs/xfs/xfs_alloc.h
+++ b/fs/xfs/xfs_alloc.h
@@ -54,6 +54,7 @@ typedef unsigned int xfs_alloctype_t;
*/
#define XFS_ALLOC_FLAG_TRYLOCK 0x00000001 /* use trylock for buffer locking */
#define XFS_ALLOC_FLAG_FREEING 0x00000002 /* indicate caller is freeing extents*/
+#define XFS_ALLOC_FLAG_TRYHARD 0x00000004 /* also use metadata AGs for data */

/*
* In order to avoid ENOSPC-related deadlock caused by
@@ -121,7 +122,7 @@ typedef struct xfs_alloc_arg {
char userdata; /* set if this is user data */
xfs_fsblock_t firstblock; /* io first block allocated */
struct completion *done;
- struct work_struct work;
+ struct delayed_work work;
int result;
} xfs_alloc_arg_t;
Mark Tinguely
2012-10-01 22:10:23 UTC
Permalink
v2 remove the architecture conditional.

The AGF hang is caused when the process that holds the AGF buffer
lock cannot get a worker. The allocation worker pool is blocked
waiting to take the AGF buffer lock.

Move the allocation worker call so that multiple calls to
xfs_alloc_vextent() for a particular transaction are contained
within a single worker.
---
With the xfs_alloc_arg structure zeroed, the AGF hang occurs in
xfs_bmap_btalloc() due to a secondary call to xfs_alloc_vextent().
These calls to xfs_alloc_vextent() try different strategies to
allocate the extent if the previous allocation attempt failed.

I still prefer this patch's approach. It also limits the number of
worker context switches when xfs_alloc_vextent() is called multiple
times within a transaction. The intent of the patch is to move the
allocation worker as reasonably close to the xfs_trans_alloc() -
xfs_trans_commit / xfs_trans_cancel() calls as possible.

I have ported this patch to Linux 3.0.x. Linux 2.6.x will be the same
as the Linux 3.0 port.

This patch allows an easy addition of an architecture limit on the
allocation worker for those that choose to do so.

Signed-off-by: Mark Tinguely <***@sgi.com>
---
fs/xfs/xfs_alloc.c | 42 -------------------------------------
fs/xfs/xfs_alloc.h | 3 --
fs/xfs/xfs_bmap.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++--
fs/xfs/xfs_bmap.h | 16 ++++++++++++++
4 files changed, 75 insertions(+), 46 deletions(-)

Index: b/fs/xfs/xfs_alloc.c
===================================================================
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -2211,7 +2211,7 @@ xfs_alloc_read_agf(
* group or loop over the allocation groups to find the result.
*/
int /* error */
-__xfs_alloc_vextent(
+xfs_alloc_vextent(
xfs_alloc_arg_t *args) /* allocation argument structure */
{
xfs_agblock_t agsize; /* allocation group size */
@@ -2421,46 +2421,6 @@ error0:
return error;
}

-static void
-xfs_alloc_vextent_worker(
- struct work_struct *work)
-{
- struct xfs_alloc_arg *args = container_of(work,
- struct xfs_alloc_arg, work);
- unsigned long pflags;
-
- /* we are in a transaction context here */
- current_set_flags_nested(&pflags, PF_FSTRANS);
-
- args->result = __xfs_alloc_vextent(args);
- complete(args->done);
-
- current_restore_flags_nested(&pflags, PF_FSTRANS);
-}
-
-/*
- * Data allocation requests often come in with little stack to work on. Push
- * them off to a worker thread so there is lots of stack to use. Metadata
- * requests, OTOH, are generally from low stack usage paths, so avoid the
- * context switch overhead here.
- */
-int
-xfs_alloc_vextent(
- struct xfs_alloc_arg *args)
-{
- DECLARE_COMPLETION_ONSTACK(done);
-
- if (!args->userdata)
- return __xfs_alloc_vextent(args);
-
-
- args->done = &done;
- INIT_WORK_ONSTACK(&args->work, xfs_alloc_vextent_worker);
- queue_work(xfs_alloc_wq, &args->work);
- wait_for_completion(&done);
- return args->result;
-}
-
/*
* Free an extent.
* Just break up the extent address and hand off to xfs_free_ag_extent
Index: b/fs/xfs/xfs_alloc.h
===================================================================
--- a/fs/xfs/xfs_alloc.h
+++ b/fs/xfs/xfs_alloc.h
@@ -120,9 +120,6 @@ typedef struct xfs_alloc_arg {
char isfl; /* set if is freelist blocks - !acctg */
char userdata; /* set if this is user data */
xfs_fsblock_t firstblock; /* io first block allocated */
- struct completion *done;
- struct work_struct work;
- int result;
} xfs_alloc_arg_t;

/*
Index: b/fs/xfs/xfs_bmap.c
===================================================================
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -48,7 +48,6 @@
#include "xfs_vnodeops.h"
#include "xfs_trace.h"

-
kmem_zone_t *xfs_bmap_free_item_zone;

/*
@@ -4820,7 +4819,7 @@ xfs_bmapi_convert_unwritten(
* blocks then the call will fail (return NULLFSBLOCK in "firstblock").
*/
int
-xfs_bmapi_write(
+__xfs_bmapi_write(
struct xfs_trans *tp, /* transaction pointer */
struct xfs_inode *ip, /* incore inode */
xfs_fileoff_t bno, /* starting file offs. mapped */
@@ -5044,6 +5043,63 @@ error0:
return error;
}

+static void
+xfs_bmapi_write_worker(
+ struct work_struct *work)
+{
+ struct xfs_bmw_wkr *bw = container_of(work,
+ struct xfs_bmw_wkr, work);
+ unsigned long pflags;
+
+ /* we are in a transaction context here */
+ current_set_flags_nested(&pflags, PF_FSTRANS);
+
+ bw->result = __xfs_bmapi_write(bw->tp, bw->ip, bw->bno, bw->len,
+ bw->flags, bw->firstblock, bw->total,
+ bw->mval, bw->nmap, bw->flist);
+ complete(bw->done);
+
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
+}
+
+int
+xfs_bmapi_write(
+ struct xfs_trans *tp, /* transaction pointer */
+ struct xfs_inode *ip, /* incore inode */
+ xfs_fileoff_t bno, /* starting file offs. mapped */
+ xfs_filblks_t len, /* length to map in file */
+ int flags, /* XFS_BMAPI_... */
+ xfs_fsblock_t *firstblock, /* first allocated block
+ controls a.g. for allocs */
+ xfs_extlen_t total, /* total blocks needed */
+ struct xfs_bmbt_irec *mval, /* output: map values */
+ int *nmap, /* i/o: mval size/count */
+ struct xfs_bmap_free *flist) /* i/o: list extents to free */
+{
+ struct xfs_bmw_wkr bw;
+ DECLARE_COMPLETION_ONSTACK(done);
+
+ if (flags & XFS_BMAPI_METADATA)
+ return __xfs_bmapi_write(tp, ip, bno, len, flags, firstblock,
+ total, mval, nmap, flist);
+ /* initialize the worker argument list structure */
+ bw.tp = tp;
+ bw.ip = ip;
+ bw.bno = bno;
+ bw.len = len;
+ bw.flags = flags;
+ bw.firstblock = firstblock;
+ bw.total = total;
+ bw.mval = mval;
+ bw.nmap = nmap;
+ bw.flist = flist;
+ bw.done = &done;
+ INIT_WORK_ONSTACK(&bw.work, xfs_bmapi_write_worker);
+ queue_work(xfs_alloc_wq, &bw.work);
+ wait_for_completion(&done);
+ return bw.result;
+}
+
/*
* Unmap (remove) blocks from a file.
* If nexts is nonzero then the number of extents to remove is limited to
Index: b/fs/xfs/xfs_bmap.h
===================================================================
--- a/fs/xfs/xfs_bmap.h
+++ b/fs/xfs/xfs_bmap.h
@@ -135,6 +135,22 @@ typedef struct xfs_bmalloca {
char conv; /* overwriting unwritten extents */
} xfs_bmalloca_t;

+struct xfs_bmw_wkr {
+ struct xfs_trans *tp; /* transaction pointer */
+ struct xfs_inode *ip; /* incore inode */
+ xfs_fileoff_t bno; /* starting file offs. mapped */
+ xfs_filblks_t len; /* length to map in file */
+ int flags; /* XFS_BMAPI_... */
+ xfs_fsblock_t *firstblock; /* first allocblock controls */
+ xfs_extlen_t total; /* total blocks needed */
+ struct xfs_bmbt_irec *mval; /* output: map values */
+ int *nmap; /* i/o: mval size/count */
+ struct xfs_bmap_free *flist; /* bmap freelist */
+ struct completion *done; /* worker completion ptr */
+ struct work_struct work; /* worker */
+ int result; /* worker function result */
+} ;
+
/*
* Flags for xfs_bmap_add_extent*.
*/
Dave Chinner
2012-10-01 23:10:27 UTC
Permalink
Post by Mark Tinguely
v2 remove the architecture conditional.
Version stuff goes after the first --- line where the diffstat lies.
It doesn't belong in the commit message.
Post by Mark Tinguely
The AGF hang is caused when the process that holds the AGF buffer
lock cannot get a worker. The allocation worker pool are blocked
waiting to take the AGF buffer lock.
Move the allocation worker call so that multiple calls to
xfs_alloc_vextent() for a particular transaction are contained
within a single worker.
---
With the xfs_alloc_arg structure zeroed, the AGF hang occurs in
xfs_bmap_btalloc() due to a secondary call to xfs_alloc_vextent().
How, exactly? You need to describe the exact hang so that everyone
understands what the problem is that is being fixed. This doesn't
tell me what the hang is that is being fixed. Document it in a call
timeline that shows when the locks are taken, and where they
subsequently hang....
Post by Mark Tinguely
These calls to xfs_alloc_vextent() try different strategies to
allocate the extent if the previous allocation attempt failed.
I suspect you are talking about this code chain:

if ((error = xfs_alloc_vextent(&args)))
return error;
if (tryagain && args.fsbno == NULLFSBLOCK) {
.....
if ((error = xfs_alloc_vextent(&args)))
return error;
}
if (isaligned && args.fsbno == NULLFSBLOCK) {
.....
if ((error = xfs_alloc_vextent(&args)))
return error;
}
.....

but I can't be certain from the description...
Post by Mark Tinguely
I still prefer this patch's approach. It also limits the number of
worker context switches when xfs_alloc_vextent() is called multiple
times within a transaction. The intent of the patch is to move the
allocation worker as reasonably close to the xfs_trans_alloc() -
xfs_trans_commit / xfs_trans_cancel() calls as possible.
Except, as I've said before, it also adds context switches to
unwritten extent conversion that already occurs in a worker thread
that has no stack pressure (i.e. adds unnecessary latency via
context switches to IO completion), and it also pushes all realtime
device allocation into a worker thread. Once again, that will add
unpredictable latency to the allocation path (bad for realtime) when
no stack pressure actually exists.

These were particular concerns for placing the stack switch in
xfs_alloc_vextent() in the first place - to only switch stacks when
allocation was going to occur for allocations that are likely to
smash the stack. xfs_bmapi_write() is too high level to avoid
this problem in xfs_bmap_btalloc() with minimum impact because it
also captures operations that don't pass through the typical worst
case stack path or don't have stack pressure.

If we need to avoid the above problem in xfs_bmap_btalloc() for user
data allocation, then move the worker hand-off up one function from
xfs_alloc_vextent() to xfs_bmap_btalloc() - it's a precise fit for
the problem that (I think) has been described above. It's also a
simpler patch because it doesn't need to create a new worker args
structure - just add the completion to the struct xfs_bmalloca ....
Post by Mark Tinguely
I have ported this patch to Linux 3.0.x. Linux 2.6.x will be the same
as the Linux 3.0 port.
Not really relevant to a TOT commit, especially as the underlying
patch isn't in 3.0.x or 2.6.x. Indeed, if you want to back port it
and this fix to anything prior to 2.6.38, then you are going to need
the EAGAIN version I posted because the workqueue infrastructure is
vastly different and blocking workers on locks is guaranteed to have
serious performance impact, even if it doesn't deadlock.
Post by Mark Tinguely
This patch allows an easy addition of an architecture limit on the
allocation worker for those that choose to do so.
Not relevant. It's no different to the xfs_alloc_vextent
worker code that it replaces.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Ben Myers
2012-09-27 22:48:52 UTC
Permalink
Hey Mark,
Post by Mark Tinguely
Date: Mon, 24 Sep 2012 12:11:59 -0500
Subject: Re: [PATCH 0/3] xfs: allocation worker causes freelist buffer lock
hang
Hi Mark,
...
Post by t***@sgi.com
I traced the callers of xfs_alloc_vextent(), noting transaction creation,
commits and cancels; I noted loops in the callers and which were marked
as metadata/userdata. I can send this document to reviewers.
I'd like to see the doc of xfs_alloc_vextent callers and which of them loop.
Can you post your doc to the list?
Regards,
Ben
Here are the Linux 3.6.x callers of xfs_alloc_vextent() stopping at the
transaction commit/cancel level.
Thanks. Here I am going to make some notes to help me to understand the
deadlock as you have described. I'm going to make some idiotic, redundant, and
obvious commentary for my own understanding and benefit, and then go and spam
the list with it.
Post by Mark Tinguely
XFS_BMAPI_METADATA *not* being set implies user data.
Userdata being set is the community-designated indication that an
allocation worker is needed to prevent overflow of the kernel stack.
Even when XFS_BMAPI_METADATA is set, this doesn't necessarily translate into
userdata being clear in struct xfs_alloc_arg. This is due to declaring that
structure on the stack and then forgetting to zero it. You have fixed all
occurrences of that in patch 3 of this series. They are:

xfs_alloc_fix_freelist,
xfs_bmap_btalloc, *
xfs_bmap_extents_to_btree,
xfs_bmap_local_to_extents,
xfs_ialloc_ag_alloc

In each of those five stacks we could have intended not to use a worker thread
for the allocation, but we used one anyway due to having forgotten to zero the
structure.

* In xfs_bmap_btalloc, args.userdata is set from ap->userdata before the first
call to xfs_alloc_vextent. As long as ap->userdata is initialized properly
this one is ok.
Post by Mark Tinguely
Calling xfs_alloc_vextent() several times with the same transaction can cause
a dead lock if a new allocation worker can not be allocated. I noted where the
loops occur. xfs_alloc_vextent() can call itself, those calls must be in the
same allocation worker.
This deadlock is characterized by a thread having taken the agf buffer lock in
this code path:

__xfs_alloc_vextent
xfs_alloc_fix_freelist
xfs_alloc_read_agf
xfs_read_agf

The agf buffer gets logged during allocation and stays locked until transaction
commit or cancel.

The deadlock comes when the transaction which holds the agf lock blocks waiting
for a worker, and all of the workers are blocked on the agf lock.

So... because the work being done in __xfs_alloc_vextent will always have need
of the agf buffer lock, we must never hold the agf lock and then go after a
worker thread for allocation. The availability of a worker thread cannot be
guaranteed.

It's not just a problem with loops. We risk deadlock on every call to
xfs_alloc_vextent in the same transaction while the agf lock is held. e.g.
multiple tries in xfs_bmap_btalloc.
Post by Mark Tinguely
As a bonus, consolidating the loops into one worker actually gives a slight
performance advantage.
Sorry this is wider than 80 characters.
---
xfs_bmap_btalloc(xfs_bmalloca_t)
args.userdata = ap->userdata; * not worried about patch 3 here
Post by Mark Tinguely
! xfs_bmap_alloc(xfs_bmalloca_t)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! xfs_bmapi_write(xfs_trans_t ...) loops over above
Here we risk deadlock in the loop on xfs_bmapi_allocate. That appears to be
limited by the number of maps requested.
Post by Mark Tinguely
! ! ! ! xfs_attr_rmtval_set(xfs_da_args_t) loops over bmapi_write (xfs_attr_set_int) **XFS_BMAPI_METADATA**
! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
xfs_attr_rmtval_set calls xfs_bmapi_write in a loop. XFS_BMAPI_METADATA flag
was set, so this cannot be using a worker. No risk of this deadlock here.
Post by Mark Tinguely
! ! ! ! xfs_da_grow_inode_int(xfs_da_args_t) loops over bmapi_write **XFS_BMAPI_METADATA**
! ! ! ! ! xfs_dir2_grow_inode(struct xfs_da_args)
! ! ! ! ! ! xfs_dir2_sf_to_block(xfs_da_args_t)
! ! ! ! ! ! ! xfs_bmap_add_attrfork_local(xfs_trans_t ..)
! ! ! ! ! ! ! ! xfs_bmap_add_attrfork(...) trans alloc, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_block_addname(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_leaf_to_node(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t) <also above>
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_block_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_node_addname_int(xfs_da_args_t ...)
! ! ! ! ! ! ! xfs_dir2_node_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! xfs_da_grow_inode(xfs_da_args)
! ! ! ! ! ! xfs_attr_shortform_to_leaf(xfs_da_args_t)
! ! ! ! ! ! ! xfs_attr_set_int(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_attr_leaf_to_node(xfs_da_args_t)
! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_attr_leaf_split(xfs_da_state_t ...)
! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_node_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_da_root_split(xfs_da_state_t ...) <above>
! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_da_node_split(xfs_da_state_t ...)
! ! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_dir2_block_to_leaf(xfs_da_args_t ...)
! ! ! ! ! ! ! xfs_dir2_block_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_leafn_split(xfs_da_state_t...)
! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
No risk of deadlock here due to XFS_BMAPI_METADATA flag.
Post by Mark Tinguely
! ! ! ! xfs_qm_dqalloc(xfs_trans_t ...) does a xfs_bmap_finish/cancel **XFS_BMAPI_METADATA**
! ! ! ! ! xfs_qm_dqtobp(xfs_trans_t ...)
! ! ! ! ! ! xfs_qm_dqread(...) does the trans_alloc/commit/cancel
No risk of deadlock here due to XFS_BMAPI_METADATA flag.
Post by Mark Tinguely
! ! ! ! xfs_iomap_write_direct(...) alloc trans, xfs_trans_commit/cancel
This allocates its own transaction and will use a worker since it isn't
metadata. Hmm.
Post by Mark Tinguely
! ! ! ! xfs_iomap_write_allocate(...) alloc trans, xfs_trans_commit/cancel safe loop
So this is a safe loop because the transaction alloc/commit is within the
loop... and nmap = 1.
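The safe shape noted above can be sketched the same way (assumption: an abstract toy model, not the real functions — `alloc_in_worker` and `iomap_write_allocate` here are stand-ins): because the transaction, and hence the AGF lock, lives entirely inside one loop pass, the caller never waits on a worker while holding the lock the worker needs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

agf_lock = threading.Lock()
pool = ThreadPoolExecutor(max_workers=1)

def alloc_in_worker():
    with agf_lock:               # worker can always take the lock: nobody
        return 1                 # is holding it across the wait below

def iomap_write_allocate(blocks_needed):
    done = 0
    while done < blocks_needed:
        # a fresh trans_alloc ... trans_commit brackets each pass, so no
        # AGF lock is held while we wait on the worker
        done += pool.submit(alloc_in_worker).result()   # nmap == 1
    return done

n = iomap_write_allocate(3)
pool.shutdown()
```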

Pfff. This is in the stack you posted for this series. So I'm a little confused.

I haven't got through the full list yet. I'll send this today to keep things
moving and pick up again tomorrow.

Regards,
Ben
Post by Mark Tinguely
! ! ! ! xfs_iomap_write_unwritten(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_growfs_rt_alloc(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_symlink(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_alloc_file_space(...) alloc trans, xfs_trans_commit/cancel safe loop
xfs_bmap_extents_to_btree(xfs_trans_t ...) <- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_extents(xfs_trans_t ...)
! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_bmap_add_extent_delay_real(fs_bmalloca)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! <see above>
! xfs_bmap_add_extent_unwritten_real(xfs_trans_t ...)
! ! xfs_bmapi_convert_unwritten(xfs_bmalloca ...)
! ! ! xfs_bmapi_write(xfs_trans ...) calls xfs_bmapi_convert_unwritten in loop XFS_BMAPI_METADATA
! ! ! ! ... <see above>
! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! xfs_attr_rmtval_remove(xfs_da_args_t) loops calling xfs_bunmapi **XFS_BMAPI_METADATA**
! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! xfs_attr_node_removename(xfs_da_args_t)
! ! ! ! ! xfs_attr_remove_int(...) allocates trans does a xfs_trans_commit/cancel
! ! ! xfs_dir2_shrink_inode(xfs_da_args_t) **XFS_BMAPI_METADATA**
! ! ! ! xfs_dir2_leaf_trim_data(xfs_da_args_t)
! ! ! ! ! xfs_dir2_leaf_to_block(xfs_da_args_t ...)
! ! ! ! ! ! xfs_dir2_leaf_removename(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_removename(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_dir2_leaf_trim_data(xfs_da_args_t)
! ! ! ! ! xfs_dir2_leaf_to_block(xfs_da_args_t ...)
! ! ! ! ! ! xfs_dir2_leaf_removename(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_removename(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! ! ! xfs_dir2_node_to_leaf(xfs_da_state_t)
! ! ! ! ! ! ! xfs_dir2_node_removename(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_removename(xfs_trans_t ...) creates the xfs_da_args_t
! ! ! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_dir2_node_to_leaf(xfs_da_state_t)
! ! ! ! ! xfs_dir2_node_removename(xfs_da_args_t)
! ! ! ! ! ! xfs_dir_removename(xfs_trans_t ...) creates the xfs_da_args_t
! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! xfs_bmap_punch_delalloc_range(...) loops calling xfs_bunmapi with NULL tp
! ! ! xfs_da_shrink_inode(xfs_da_args_t) loops calling xfs_bunmapi **XFS_BMAPI_METADATA**
! ! ! ! xfs_dir2_leaf_to_block(xfs_da_args_t ...)
! ! ! ! ! xfs_dir2_leaf_removename(xfs_da_args_t)
! ! ! ! ! ! xfs_dir_removename(xfs_trans ...)
! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! xfs_itruncate_extents(xfs_trans ...) loops calling xfs_bunmapi with new tp
! ! ! xfs_inactive_symlink_rmt(..., xfs_trans_t) does a trans_commit and trans_dup
! ! ! xfs_free_file_space(...) loops calling xfs_bmapi with new tp
! xfs_bmap_add_extent_hole_real(xfs_bmalloca ...)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ... <see above>
! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ... <see above>
xfs_bmap_local_to_extents(xfs_trans_t ...) <- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_local(xfs_trans_t ..)
! ! xfs_bmap_add_attrfork(...) trans alloc, bmap_finish trans_commit/cancel
! xfs_bmapi_write(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ... <see above>
xfs_ialloc_ag_alloc(xfs_trans_t ...) userdata == 0
! xfs_dialloc(xfs_trans ...) loops over the above
! ! xfs_ialloc(xfs_trans ...)
! ! ! xfs_dir_ialloc(xfs_trans ...)
! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
xfs_bmbt_alloc_block(xfs_btree_cur, ...) userdata == 0
! xfs_btree_split(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_free_ag_extent(xfs_trans ...)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_bmap_add_extent_delay_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_add_extent_hole_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_add_extent_unwritten_real(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ... <see above>
! ! ! ! ! ! xfs_bmapi_convert_unwritten(xfs_bmalloca, ...)
! ! ! ! ! ! ! xfs_bmapi_write(xfs_trans_t ...) loop over the above
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_del_extent(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_ialloc_ag_alloc(xfs_trans, ...) loops over the above
! ! ! ! ! ! xfs_dialloc(xfs_trans ...) loops over the above
! ! ! ! ! ! ! xfs_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_dir_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! xfs_btree_new_iroot(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull(xfs_btree_cur, ...)
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! xfs_bmap_add_attrfork_btree(xfs_trans_t ...)
! ! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_btree_new_root(xfs_btree_cur, ...)
! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
xfs_inobt_alloc_block(xfs_btree_cur, ...) userdata == 0
! xfs_btree_split(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_free_ag_extent(xfs_trans ...)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_bmap_add_extent_delay_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_add_extent_hole_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_add_extent_unwritten_real(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ... <see above>
! ! ! ! ! ! xfs_bmapi_convert_unwritten(xfs_bmalloca, ...)
! ! ! ! ! ! ! xfs_bmapi_write(xfs_trans_t ...) loop over the above
! ! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_bmap_del_extent(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ... <see above>
! ! ! ! ! xfs_ialloc_ag_alloc(xfs_trans, ...) loops over the above
! ! ! ! ! ! xfs_dialloc(xfs_trans ...) loops over the above
! ! ! ! ! ! ! xfs_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_dir_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! xfs_btree_new_iroot(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull(xfs_btree_cur, ...)
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! xfs_bmap_add_attrfork_btree(xfs_trans_t ...)
! ! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_btree_new_root(xfs_btree_cur, ...)
! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
--Mark.
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs
Mark Tinguely
2012-09-27 23:17:14 UTC
Permalink
Post by Ben Myers
Hey Mark,
Post by Mark Tinguely
Date: Mon, 24 Sep 2012 12:11:59 -0500
Subject: Re: [PATCH 0/3] xfs: allocation worker causes freelist buffer lock
hang
Hi Mark,
...
Post by t***@sgi.com
I traced the callers of xfs_alloc_vextent(), noting transaction creation,
commits and cancels; I noted loops in the callers and which were marked
as metadata/userdata. I can send this document to reviewers.
I'd like to see the doc of xfs_alloc_vextent callers and which of them loop.
Can you post your doc to the list?
Regards,
Ben
Here are the Linux 3.6.x callers of xfs_alloc_vextent() stopping at the
transaction commit/cancel level.
Thanks. Here I am going to make some notes to help me understand the
deadlock as you have described it. I'm going to make some idiotic, redundant, and
obvious commentary for my own understanding and benefit, and then go and spam
the list with it.
Post by Mark Tinguely
XFS_BMAPI_METADATA *not* being set implies user data.
Userdata being set is the community-designated indication that an allocation
worker is needed to prevent overflow of the kernel stack.
Even when XFS_BMAPI_METADATA is set, this doesn't necessarily translate into
userdata being clear in struct xfs_alloc_arg. This is due to declaring that
structure on the stack and then forgetting to zero it. You have fixed all of these:
xfs_alloc_fix_freelist,
xfs_bmap_btalloc, *
xfs_bmap_extents_to_btree,
xfs_bmap_local_to_extents,
xfs_ialloc_ag_alloc
In each of those five stacks we could have intended not to use a worker thread
for the allocation, but we used one anyway due to having forgotten to zero the
structure.
* In xfs_bmap_btalloc, args.userdata is set from ap->userdata before the first
call to xfs_alloc_vextent. As long as ap->userdata is initialized properly
this one is ok.
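The forgotten-zeroing bug can be modeled abstractly (assumption: a Python sketch of the C failure mode, not XFS code — in C, a struct xfs_alloc_arg declared on the stack without initialization inherits whatever garbage a previous frame left there, so an unassigned userdata field can read as nonzero; the reused dict below plays the role of that stale stack memory):

```python
stack_slot = {}                      # stands in for reused C stack memory

def alloc_vextent(args):
    # A worker is used only when the allocation looks like userdata.
    return "worker" if args.get("userdata") else "inline"

def buggy_metadata_alloc():
    # forgot to clear userdata: a stale value from an earlier "frame"
    # survives, so a metadata allocation is handed to a worker anyway
    return alloc_vextent(stack_slot)

def fixed_metadata_alloc():
    stack_slot["userdata"] = 0       # the memset(&args, 0, sizeof(args)) fix
    return alloc_vextent(stack_slot)

stack_slot["userdata"] = 1           # an earlier userdata allocation left this
buggy = buggy_metadata_alloc()
fixed = fixed_metadata_alloc()
```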
Post by Mark Tinguely
Calling xfs_alloc_vextent() several times with the same transaction can cause
a deadlock if a new allocation worker cannot be allocated. I noted where the
loops occur. xfs_alloc_vextent() can call itself; those calls must be in the
same allocation worker.
This deadlock is characterized by a thread having taken the agf buffer lock in
this code path:
__xfs_alloc_vextent
xfs_alloc_fix_freelist
xfs_alloc_read_agf
xfs_read_agf
The agf buffer gets logged during allocation and stays locked until transaction
commit or cancel.
The deadlock comes when the transaction which holds the agf lock blocks waiting
for a worker, and all of the workers are blocked on the agf lock.
So... because the work being done in __xfs_alloc_vextent will always have need
of the agf buffer lock, we must never hold the agf lock and then go after a
worker thread for allocation. The availability of a worker thread cannot be
guaranteed.
It's not just a problem with loops. We risk deadlock on every call to
xfs_alloc_vextent in the same transaction while the agf lock is held. e.g.
multiple tries in xfs_bmap_btalloc.
Post by Mark Tinguely
As a bonus, consolidating the loops into one worker actually gives a slight
performance advantage.
Sorry, this is wider than 80 characters.
---
xfs_bmap_btalloc(xfs_bmalloca_t)
args.userdata = ap->userdata; * not worried about patch 3 here
Post by Mark Tinguely
! xfs_bmap_alloc(xfs_bmalloca_t)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! xfs_bmapi_write(xfs_trans_t ...) loops over above
Here we risk deadlock in the loop on xfs_bmapi_allocate. That appears to be
limited by the number of maps requested.
Post by Mark Tinguely
! ! ! ! xfs_attr_rmtval_set(xfs_da_args_t) loops over bmapi_write (xfs_attr_set_int) **XFS_BMAPI_METADATA**
! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
xfs_attr_rmtval_set calls xfs_bmapi_write in a loop. XFS_BMAPI_METADATA flag
was set, so this cannot be using a worker. No risk of this deadlock here.
Post by Mark Tinguely
! ! ! ! xfs_da_grow_inode_int(xfs_da_args_t) loops over bmapi_write **XFS_BMAPI_METADATA**
! ! ! ! ! xfs_dir2_grow_inode(struct xfs_da_args)
! ! ! ! ! ! xfs_dir2_sf_to_block(xfs_da_args_t)
! ! ! ! ! ! ! xfs_bmap_add_attrfork_local(xfs_trans_t ..)
! ! ! ! ! ! ! ! xfs_bmap_add_attrfork(...) trans alloc, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_block_addname(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_leaf_to_node(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t)<also above>
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_block_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_node_addname_int(xfs_da_args_t ...)
! ! ! ! ! ! ! xfs_dir2_node_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! xfs_da_grow_inode(xfs_da_args_t)
! ! ! ! ! ! xfs_attr_shortform_to_leaf(xfs_da_args_t)
! ! ! ! ! ! ! xfs_attr_set_int(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_attr_leaf_to_node(xfs_da_args_t)
! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_attr_leaf_split(xfs_da_state_t ...)
! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_node_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir2_leaf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_da_root_split(xfs_da_state_t ...)<above>
! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_da_node_split(xfs_da_state_t ...)
! ! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! xfs_dir2_block_to_leaf(xfs_da_args_t ...)
! ! ! ! ! ! ! xfs_dir2_block_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! xfs_dir2_sf_addname(xfs_da_args_t)
! ! ! ! ! ! ! ! ! xfs_dir_createname(xfs_trans_t)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_dir_canenter(xfs_trans_t ...)
! ! ! ! ! ! ! ! ! ! xfs_rename(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_link(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! xfs_dir2_leafn_split(xfs_da_state_t...)
! ! ! ! ! ! ! xfs_da_split(xfs_da_state_t) loops over xfs_attr_leaf_split
! ! ! ! ! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
No risk of deadlock here due to XFS_BMAPI_METADATA flag.
Post by Mark Tinguely
! ! ! ! xfs_qm_dqalloc(xfs_trans_t ...) does a xfs_bmap_finish/cancel **XFS_BMAPI_METADATA**
! ! ! ! ! xfs_qm_dqtobp(xfs_trans_t ...)
! ! ! ! ! ! xfs_qm_dqread(...) does the trans_alloc/commit/cancel
No risk of deadlock here due to XFS_BMAPI_METADATA flag.
Post by Mark Tinguely
! ! ! ! xfs_iomap_write_direct(...) alloc trans, xfs_trans_commit/cancel
This allocates its own transaction and will use a worker, since it isn't
metadata. Hmm.
Post by Mark Tinguely
! ! ! ! xfs_iomap_write_allocate(...) alloc trans, xfs_trans_commit/cancel safe loop
So this is a safe loop because the transaction alloc/commit is within the
loop... and nmap = 1.
Pfff. This is in the stack you posted for this series. So I'm a little confused.
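The "safe loop" shape described here can be sketched as follows, with hypothetical helper names standing in for the real XFS transaction API: because each iteration allocates and commits its own transaction, no AGF buffer lock is held across the next blocking allocation.

```c
#include <assert.h>

/*
 * Illustrative model of the safe loop pattern (hypothetical helpers,
 * not the real XFS transaction API): a fresh transaction per
 * iteration means all buffer locks are released before the next
 * blocking allocation, so a hand-off to a worker cannot deadlock
 * on a lock the caller still holds.
 */
struct trans { int locks_held; };

static void trans_alloc(struct trans *tp)  { tp->locks_held = 0; }
static void do_one_alloc(struct trans *tp) { tp->locks_held++; }
static void trans_commit(struct trans *tp) { tp->locks_held = 0; }

/* Returns the number of buffer locks still held after the loop. */
static int safe_alloc_loop(int iterations)
{
	struct trans tp = { 0 };

	for (int i = 0; i < iterations; i++) {
		trans_alloc(&tp);	/* transaction allocated inside the loop */
		do_one_alloc(&tp);	/* single mapping per pass (nmap = 1) */
		trans_commit(&tp);	/* commit drops locks before looping */
	}
	return tp.locks_held;
}
```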
I haven't got through the full list yet. I'll send this today to keep things
moving and pick up again tomorrow.
Regards,
Ben
Post by Mark Tinguely
! ! ! ! xfs_iomap_write_unwritten(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_growfs_rt_alloc(..) alloc trans, xfs_trans_commit/cancel safe loop
! ! ! ! xfs_symlink(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_alloc_file_space(...) alloc trans, xfs_trans_commit/cancel safe loop
xfs_bmap_extents_to_btree(xfs_trans_t ...) <- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_extents(xfs_trans_t ...)
! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_bmap_add_extent_delay_real(xfs_bmalloca)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! !<see above>
! xfs_bmap_add_extent_unwritten_real(xfs_trans_t ...)
! ! xfs_bmapi_convert_unwritten(xfs_bmalloca ...)
! ! ! xfs_bmapi_write(xfs_trans ...) calls xfs_bmapi_convert_unwritten in loop XFS_BMAPI_METADATA
! ! ! ! ...<see above>
! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! xfs_attr_rmtval_remove(xfs_da_args_t) loops calling xfs_bunmapi **XFS_BMAPI_METADATA**
! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! xfs_attr_node_addname(xfs_da_args_t) trans_roll
! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! xfs_attr_leaf_addname(xfs_da_args_t) may do a trans_roll
! ! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! ! xfs_attr_set_int(..) does the trans_alloc/commit/cancel
! ! ! ! xfs_attr_node_removename(xfs_da_args_t)
! ! ! ! ! xfs_attr_remove_int(...) allocates trans does a xfs_trans_commit/cancel
! ! ! xfs_dir2_shrink_inode(xfs_da_args_t) **XFS_BMAPI_METADATA**
! ! ! ! xfs_dir2_leaf_trim_data(xfs_da_args_t)
! ! ! ! ! xfs_dir2_leaf_to_block(xfs_da_args_t ...)
! ! ! ! ! ! xfs_dir2_leaf_removename(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_removename(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_dir2_leaf_trim_data(xfs_da_args_t)
! ! ! ! ! xfs_dir2_leaf_to_block(xfs_da_args_t ...)
! ! ! ! ! ! xfs_dir2_leaf_removename(xfs_da_args_t)
! ! ! ! ! ! ! xfs_dir_removename(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! ! ! xfs_dir2_node_to_leaf(xfs_da_state_t)
! ! ! ! ! ! ! xfs_dir2_node_removename(xfs_da_args_t)
! ! ! ! ! ! ! ! xfs_dir_removename(xfs_trans_t ...) creates the xfs_da_args_t
! ! ! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! ! xfs_dir2_node_to_leaf(xfs_da_state_t)
! ! ! ! ! xfs_dir2_node_removename(xfs_da_args_t)
! ! ! ! ! ! xfs_dir_removename(xfs_trans_t ...) creates the xfs_da_args_t
! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! xfs_bmap_punch_delalloc_range(...) loops calling xfs_bunmapi with NULL tp
! ! ! xfs_da_shrink_inode(xfs_da_args_t) loops calling xfs_bunmapi **XFS_BMAPI_METADATA**
! ! ! ! xfs_dir2_leaf_to_block(xfs_da_args_t ...)
! ! ! ! ! xfs_dir2_leaf_removename(xfs_da_args_t)
! ! ! ! ! ! xfs_dir_removename(xfs_trans ...)
! ! ! ! ! ! ! xfs_remove(...) allocates trans does a xfs_trans_commit/cancel
! ! ! xfs_itruncate_extents(xfs_trans ...) loops calling xfs_bunmapi with new tp
! ! ! xfs_inactive_symlink_rmt(..., xfs_trans_t) does a trans_commit and trans_dup
! ! ! xfs_free_file_space(...) loops calling xfs_bmapi with new tp
! xfs_bmap_add_extent_hole_real(xfs_bmalloca ...)
! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ...<see above>
! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ...<see above>
xfs_bmap_local_to_extents(xfs_trans_t ...) <- set userdata to 0 (patch 3)
! xfs_bmap_add_attrfork_local(xfs_trans_t ..)
! ! xfs_bmap_add_attrfork(...) trans alloc, bmap_finish trans_commit/cancel
! xfs_bmapi_write(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ...<see above>
xfs_ialloc_ag_alloc(xfs_trans_t ...) userdata == 0
! xfs_dialloc(xfs_trans ...) loops over the above
! ! xfs_ialloc(xfs_trans ...)
! ! ! xfs_dir_ialloc(xfs_trans ...)
! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
xfs_bmbt_alloc_block(xfs_btree_cur, ...) userdata == 0
! xfs_btree_split(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_free_ag_extent(xfs_trans ...)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_bmap_add_extent_delay_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ...<see above>
! ! ! ! ! xfs_bmap_add_extent_hole_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ...<see above>
! ! ! ! ! xfs_bmap_add_extent_unwritten_real(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ...<see above>
! ! ! ! ! ! xfs_bmapi_convert_unwritten(xfs_bmalloca, ...)
! ! ! ! ! ! ! xfs_bmapi_write(xfs_trans_t ...) loop over the above
! ! ! ! ! ! ! ! ...<see above>
! ! ! ! ! xfs_bmap_del_extent(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ...<see above>
! ! ! ! ! xfs_ialloc_ag_alloc(xfs_trans, ...) loops over the above
! ! ! ! ! ! xfs_dialloc(xfs_trans ...) loops over the above
! ! ! ! ! ! ! xfs_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_dir_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! xfs_btree_new_iroot(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull(xfs_btree_cur, ...)
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! xfs_bmap_add_attrfork_btree(xfs_trans_t ...)
! ! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_btree_new_root(xfs_btree_cur, ...)
! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
xfs_inobt_alloc_block(xfs_btree_cur, ...) userdata == 0
! xfs_btree_split(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_free_ag_extent(xfs_trans ...)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! xfs_bmap_add_extent_delay_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ...<see above>
! ! ! ! ! xfs_bmap_add_extent_hole_real(xfs_bmalloca)
! ! ! ! ! ! xfs_bmapi_allocate(xfs_bmalloca_t, ...)
! ! ! ! ! ! ! ! ...<see above>
! ! ! ! ! xfs_bmap_add_extent_unwritten_real(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ...<see above>
! ! ! ! ! ! xfs_bmapi_convert_unwritten(xfs_bmalloca, ...)
! ! ! ! ! ! ! xfs_bmapi_write(xfs_trans_t ...) loop over the above
! ! ! ! ! ! ! ! ...<see above>
! ! ! ! ! xfs_bmap_del_extent(xfs_trans, ...)
! ! ! ! ! ! xfs_bunmapi(xfs_trans_t ...) XFS_BMAPI_METADATA
! ! ! ! ! ! ! ...<see above>
! ! ! ! ! xfs_ialloc_ag_alloc(xfs_trans, ...) loops over the above
! ! ! ! ! ! xfs_dialloc(xfs_trans ...) loops over the above
! ! ! ! ! ! ! xfs_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! xfs_dir_ialloc(xfs_trans ...)
! ! ! ! ! ! ! ! ! xfs_create(...) trans allocated, bmap_finish trans_commit/cancel
! ! ! ! ! ! ! ! ! xfs_symlink(...) trans allocated, bmap_finish trans_commit/cancel
! xfs_btree_new_iroot(xfs_btree_cur, ...)
! ! xfs_btree_make_block_unfull(xfs_btree_cur, ...)
! ! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! xfs_bmap_add_attrfork_btree(xfs_trans_t ...)
! ! ! xfs_bmap_add_attrfork(...) allocates trans does a xfs_trans_commit/cancel
! xfs_btree_new_root(xfs_btree_cur, ...)
! ! xfs_btree_insrec(xfs_btree_cur, ...)
! ! ! xfs_btree_insert(xfs_btree_cur, ...)
! ! ! xfs_alloc_fixup_trees(xfs_btree_cur, ...)
! ! ! ! xfs_alloc_ag_vextent_exact(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_near(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
! ! ! ! xfs_alloc_ag_vextent_size(xfs_alloc_arg)
! ! ! ! ! xfs_alloc_ag_vextent(xfs_alloc_arg)
! ! ! ! ! ! __xfs_alloc_vextent(xfs_alloc_arg)
--Mark.
Adding just patch 3 does not prevent the hang.

penguin9:~/xfstests-dev # cat /proc/11432/stack
[<e0d8e688>] xfs_alloc_vextent+0x58/0x80 [xfs]
[<e0d99319>] xfs_bmap_btalloc+0x319/0x9b0 [xfs]
[<e0d999ca>] xfs_bmap_alloc+0x1a/0x30 [xfs]
[<e0da61be>] xfs_bmapi_allocate+0xce/0x390 [xfs]
[<e0da6a38>] xfs_bmapi_write+0x5b8/0xb90 [xfs]
[<e0d7d5f8>] xfs_iomap_write_allocate+0x1b8/0x420 [xfs]
[<e0d6d962>] xfs_map_blocks+0x282/0x410 [xfs]
[<e0d6e8de>] xfs_vm_writepage+0x27e/0x660 [xfs]
[<c02e676b>] __writepage+0xb/0x30
[<c02e6f8f>] write_cache_pages+0x18f/0x3e0
[<c02e7217>] generic_writepages+0x37/0x50
[<e0d6cc5f>] xfs_vm_writepages+0x2f/0x40 [xfs]
[<c02e7f4c>] do_writepages+0x1c/0x30
[<c02e046f>] __filemap_fdatawrite_range+0x4f/0x60
[<c02e0fb6>] filemap_fdatawrite_range+0x26/0x30
[<e0d78a4e>] xfs_flush_pages+0x6e/0xe0 [xfs]
[<e0d75e1b>] xfs_file_buffered_aio_write+0x11b/0x1c0 [xfs]
[<e0d75fab>] xfs_file_aio_write+0xeb/0x170 [xfs]
[<c031e884>] do_sync_write+0x94/0xd0
[<c031f26a>] vfs_write+0x8a/0x160
[<c031f67b>] sys_pwrite64+0x7b/0x80
[<c068bb65>] syscall_call+0x7/0xb
[<ffffffff>] 0xffffffff

penguin9:~/xfstests-dev # cat /proc/11434/stack
[<e0d8e688>] xfs_alloc_vextent+0x58/0x80 [xfs]
[<e0d99319>] xfs_bmap_btalloc+0x319/0x9b0 [xfs]
[<e0d999ca>] xfs_bmap_alloc+0x1a/0x30 [xfs]
[<e0da61be>] xfs_bmapi_allocate+0xce/0x390 [xfs]
[<e0da6a38>] xfs_bmapi_write+0x5b8/0xb90 [xfs]
[<e0d7d5f8>] xfs_iomap_write_allocate+0x1b8/0x420 [xfs]
[<e0d6d962>] xfs_map_blocks+0x282/0x410 [xfs]
[<e0d6e8de>] xfs_vm_writepage+0x27e/0x660 [xfs]
[<c02e676b>] __writepage+0xb/0x30
[<c02e6f8f>] write_cache_pages+0x18f/0x3e0
[<c02e7217>] generic_writepages+0x37/0x50
[<e0d6cc5f>] xfs_vm_writepages+0x2f/0x40 [xfs]
[<c02e7f4c>] do_writepages+0x1c/0x30
[<c02e046f>] __filemap_fdatawrite_range+0x4f/0x60
[<c02e0fb6>] filemap_fdatawrite_range+0x26/0x30
[<e0d78a4e>] xfs_flush_pages+0x6e/0xe0 [xfs]
[<e0d75e1b>] xfs_file_buffered_aio_write+0x11b/0x1c0 [xfs]
[<e0d75fab>] xfs_file_aio_write+0xeb/0x170 [xfs]
[<c031e884>] do_sync_write+0x94/0xd0
[<c031f26a>] vfs_write+0x8a/0x160
[<c031f67b>] sys_pwrite64+0x7b/0x80
[<c068bb65>] syscall_call+0x7/0xb
[<ffffffff>] 0xffffffff

penguin9:~/xfstests-dev # cat /proc/11440/stack
[<e0d8e688>] xfs_alloc_vextent+0x58/0x80 [xfs]
[<e0d99ba8>] xfs_bmap_extents_to_btree+0x198/0x5d0 [xfs]
[<e0da5a47>] xfs_bmap_add_extent_delay_real+0x2057/0x2700 [xfs]
[<e0da6357>] xfs_bmapi_allocate+0x267/0x390 [xfs]
[<e0da6a38>] xfs_bmapi_write+0x5b8/0xb90 [xfs]
[<e0d7d5f8>] xfs_iomap_write_allocate+0x1b8/0x420 [xfs]
[<e0d6d962>] xfs_map_blocks+0x282/0x410 [xfs]
[<e0d6e8de>] xfs_vm_writepage+0x27e/0x660 [xfs]
[<c02e676b>] __writepage+0xb/0x30
[<c02e6f8f>] write_cache_pages+0x18f/0x3e0
[<c02e7217>] generic_writepages+0x37/0x50
[<e0d6cc5f>] xfs_vm_writepages+0x2f/0x40 [xfs]
[<c02e7f4c>] do_writepages+0x1c/0x30
[<c02e046f>] __filemap_fdatawrite_range+0x4f/0x60
[<c02e0fb6>] filemap_fdatawrite_range+0x26/0x30
[<e0d78a4e>] xfs_flush_pages+0x6e/0xe0 [xfs]
[<e0d75e1b>] xfs_file_buffered_aio_write+0x11b/0x1c0 [xfs]
[<e0d75fab>] xfs_file_aio_write+0xeb/0x170 [xfs]
[<c031e884>] do_sync_write+0x94/0xd0
[<c031f26a>] vfs_write+0x8a/0x160
[<c031f67b>] sys_pwrite64+0x7b/0x80
[<c068bb65>] syscall_call+0x7/0xb
[<ffffffff>] 0xffffffff

penguin9:~/xfstests-dev # cat /proc/11443/stack
[<e0d8e688>] xfs_alloc_vextent+0x58/0x80 [xfs]
[<e0d99319>] xfs_bmap_btalloc+0x319/0x9b0 [xfs]
[<e0d999ca>] xfs_bmap_alloc+0x1a/0x30 [xfs]
[<e0da61be>] xfs_bmapi_allocate+0xce/0x390 [xfs]
[<e0da6a38>] xfs_bmapi_write+0x5b8/0xb90 [xfs]
[<e0d7d5f8>] xfs_iomap_write_allocate+0x1b8/0x420 [xfs]
[<e0d6d962>] xfs_map_blocks+0x282/0x410 [xfs]
[<e0d6e8de>] xfs_vm_writepage+0x27e/0x660 [xfs]
[<c02e676b>] __writepage+0xb/0x30
[<c02e6f8f>] write_cache_pages+0x18f/0x3e0
[<c02e7217>] generic_writepages+0x37/0x50
[<e0d6cc5f>] xfs_vm_writepages+0x2f/0x40 [xfs]
[<c02e7f4c>] do_writepages+0x1c/0x30
[<c02e046f>] __filemap_fdatawrite_range+0x4f/0x60
[<c02e0fb6>] filemap_fdatawrite_range+0x26/0x30
[<e0d78a4e>] xfs_flush_pages+0x6e/0xe0 [xfs]
[<e0d75e1b>] xfs_file_buffered_aio_write+0x11b/0x1c0 [xfs]
[<e0d75fab>] xfs_file_aio_write+0xeb/0x170 [xfs]
[<c031e884>] do_sync_write+0x94/0xd0
[<c031f26a>] vfs_write+0x8a/0x160
[<c031f67b>] sys_pwrite64+0x7b/0x80
[<c068bb65>] syscall_call+0x7/0xb
[<ffffffff>] 0xffffffff

I will run an x86_64 machine overnight to see if I can get a hang
on that box too.

We are dealing with multiple AGs here, so analysing this one may take
more time.

Rarely but sometimes, this machine will save a core upon receiving an NMI.
I will try to get as much information before trying that.

--Mark.
Mark Tinguely
2012-09-27 23:27:08 UTC
Permalink
Post by Mark Tinguely
I will run an x86_64 machine overnight to see if I can get a hang
on that box too.
We are dealing with multiple AGs here, so analysing this one may take
more time.
Rarely but sometimes, this machine will save a core upon receiving an NMI.
I will try to get as much information before trying that.
--Mark.
Sorry, I am tired. This does not have to be multiple AGs.

Anyway, the clearing of the xfs_alloc_args structure does not prevent
the hang. More tomorrow.
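For context, "clearing the structure" means fully zeroing the on-stack allocation arguments before use, so a stale userdata value cannot make a metadata allocation look like userdata. A sketch of that idea, with a simplified stand-in struct rather than the real xfs_alloc_arg layout:

```c
#include <assert.h>
#include <string.h>

/*
 * Sketch of "clearing the structure" (simplified stand-in, not the
 * real xfs_alloc_arg layout): zeroing the whole struct guarantees
 * userdata == 0, i.e. the allocation is treated as metadata and is
 * not handed to the worker. As reported in this thread, this alone
 * did not prevent the hang.
 */
struct alloc_arg {
	int userdata;
	int minlen;
	int maxlen;
};

static void init_alloc_arg(struct alloc_arg *args)
{
	memset(args, 0, sizeof(*args));
}
```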

--Mark.