Discussion:
XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
Josef 'Jeff' Sipek
2013-08-21 15:24:58 UTC
We've started experimenting with larger directory block sizes to avoid
directory fragmentation. Everything seems to work fine, except that the log
is spammed with these lovely debug messages:

XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
From looking at the code, it looks like each of those messages (there
are thousands) equates to 100 trips through the allocation retry loop. My
guess is that the larger blocks require multi-page allocations, which are
harder to satisfy. This is with a 3.10 kernel.
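
For reference, the loop I'm talking about is the retry loop in kmem_alloc()
in fs/xfs/kmem.c; paraphrasing from memory of the 3.10 code (so the details
below may be slightly off), it looks roughly like this:

void *
kmem_alloc(size_t size, xfs_km_flags_t flags)
{
	int	retries = 0;
	gfp_t	lflags = kmem_flags_convert(flags);
	void	*ptr;

	do {
		ptr = kmalloc(size, lflags);
		if (ptr || (flags & (KM_MAYFAIL | KM_NOSLEEP)))
			return ptr;
		/* every 100 failed attempts, complain and keep retrying */
		if (!(++retries % 100))
			xfs_err(NULL,
		"possible memory allocation deadlock in %s (mode:0x%x)",
				__func__, lflags);
		congestion_wait(BLK_RW_ASYNC, HZ/50);
	} while (1);
}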

The hardware is something like this (I can find out the exact config if you want):

32 cores
128 GB RAM
LSI 9271-8i RAID (one big RAID-60 with 36 disks, partitioned)

As I hinted at earlier, we end up with pretty big directories. We can
semi-reliably trigger this when we run rsync on the data between two
(identical) hosts over 10GbitE.

# xfs_info /dev/sda9
meta-data=/dev/sda9              isize=256    agcount=6, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1454213211, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

/proc/slabinfo: https://www.copy.com/s/1x1yZFjYO2EI/slab.txt
sysrq m output: https://www.copy.com/s/mYfMYfJJl2EB/sysrq-m.txt


While I realize that the message itself isn't fatal, it does mean that the
system is having a hard time allocating memory. This could potentially lead
to bad performance, or even an actual deadlock. Do you have any suggestions?

Thanks,

Jeff.
--
The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all progress
depends on the unreasonable man.
- George Bernard Shaw
Dave Chinner
2013-08-22 02:25:44 UTC
Post by Josef 'Jeff' Sipek
We've started experimenting with larger directory block sizes to avoid
directory fragmentation. Everything seems to work fine, except that the log
is spammed with these lovely debug messages:
XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
From looking at the code, it looks like each of those messages (there
are thousands) equates to 100 trips through the allocation retry loop. My
guess is that the larger blocks require multi-page allocations, which are
harder to satisfy. This is with a 3.10 kernel.
No, larger blocks simply require more single pages. The buffer cache
does not require multi-page allocation at all. So, mode = 0x250,
which means ___GFP_NOWARN | ___GFP_IO | ___GFP_WAIT, which is also
known as a GFP_NOFS allocation context.
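
(For anyone decoding the mode by hand, the relevant bits from
include/linux/gfp.h in kernels of this vintage are roughly the following;
the exact values can change between versions:)

/* Illustrative 3.10-era values from include/linux/gfp.h. */
#include <stdio.h>

#define ___GFP_WAIT	0x10u	/* reclaim may sleep */
#define ___GFP_IO	0x40u	/* reclaim may start disk I/O */
#define ___GFP_NOWARN	0x200u	/* suppress allocation failure warnings */

int main(void)
{
	/* GFP_NOFS == __GFP_WAIT | __GFP_IO: the allocation may block and
	 * do I/O, but must not recurse back into the filesystem. */
	printf("mode = 0x%x\n", ___GFP_NOWARN | ___GFP_IO | ___GFP_WAIT);	/* 0x250 */
	return 0;
}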

So, it's entirely possible that your memory is full of cached
filesystem data and metadata, and the allocation that needs more
can't reclaim them.
Post by Josef 'Jeff' Sipek
32 cores
128 GB RAM
LSI 9271-8i RAID (one big RAID-60 with 36 disks, partitioned)
As I hinted at earlier, we end up with pretty big directories. We can
semi-reliably trigger this when we run rsync on the data between two
(identical) hosts over 10GbitE.
# xfs_info /dev/sda9
meta-data=/dev/sda9              isize=256    agcount=6, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1454213211, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
/proc/slabinfo: https://www.copy.com/s/1x1yZFjYO2EI/slab.txt
Hmmm. You're using filestreams. That's unusual.

The only major slab cache is the buffer_head slab, with ~12 million
active bufferheads, so that means you've got at least 47-48GB of
data in the page cache...

And there are only ~35000 xfs_buf items in the slab, so the metadata
cache isn't very big and reclaim from it isn't a problem; nor are the
inode caches, as there are only 130,000 cached inodes.
Post by Josef 'Jeff' Sipek
sysrq m output: https://www.copy.com/s/mYfMYfJJl2EB/sysrq-m.txt
27764401 total pagecache pages

which indicates that you've got close to 110GB of pages in the page
cache. Hmmm, and 24-25GB of dirty pages in memory.
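
(Ballpark arithmetic behind those figures, assuming 4k pages and roughly
one buffer_head per cached page; the counts are rounded from the linked
slabinfo and sysrq-m output, so the results only line up approximately:)

#include <stdio.h>

int main(void)
{
	unsigned long long page_size = 4096;		/* 4k pages */
	unsigned long long bufheads  = 12000000ULL;	/* ~12 million, rounded from slabinfo */
	unsigned long long pages     = 27764401ULL;	/* total pagecache pages, from sysrq-m */

	printf("bufferhead-backed data: ~%llu GB\n",
	       bufheads * page_size / 1000000000);	/* ~49 GB */
	printf("total page cache:      ~%llu GB\n",
	       pages * page_size / 1000000000);		/* ~113 GB */
	return 0;
}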

You know, I'd be suspecting a memory reclaim problem here to do with
having large amounts of dirty memory in the page cache. I don't
think the underlying cause is going to be the filesystem code, as
the warning should never be emitted if memory reclaim is making
progress. Perhaps you could try lowering all the dirty memory
thresholds to see if that allows memory reclaim to make more
progress because there are fewer dirty pages in memory...
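
The knobs I mean are /proc/sys/vm/dirty_ratio and
/proc/sys/vm/dirty_background_ratio (or their *_bytes variants). Purely as
an illustration of what "lowering the thresholds" means (the specific
values are just something to experiment with, not a recommendation):

/*
 * Hypothetical illustration only: drop the dirty-page thresholds well
 * below the defaults, equivalent to running (as root)
 *	sysctl -w vm.dirty_background_ratio=2
 *	sysctl -w vm.dirty_ratio=5
 */
#include <stdio.h>

static void set_vm_knob(const char *name, const char *value)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return;
	}
	fputs(value, f);
	fclose(f);
}

int main(void)
{
	set_vm_knob("dirty_background_ratio", "2");	/* start background writeback earlier */
	set_vm_knob("dirty_ratio", "5");		/* throttle writers sooner */
	return 0;
}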

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Josef 'Jeff' Sipek
2013-08-22 15:07:17 UTC
Post by Dave Chinner
Post by Josef 'Jeff' Sipek
We've started experimenting with larger directory block sizes to avoid
directory fragmentation. Everything seems to work fine, except that the log
is spammed with these lovely debug messages:
XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
From looking at the code, it looks like each of those messages (there
are thousands) equates to 100 trips through the allocation retry loop. My
guess is that the larger blocks require multi-page allocations, which are
harder to satisfy. This is with a 3.10 kernel.
No, larger blocks simply require more single pages. The buffer cache
does not require multi-page allocation at all. So, mode = 0x250,
which means ___GFP_NOWARN | ___GFP_IO | ___GFP_WAIT, which is also
known as a GFP_NOFS allocation context.
Doh! Not sure why I didn't remember the fact that directories are no
different from regular files...

...
Post by Dave Chinner
Post by Josef 'Jeff' Sipek
/proc/slabinfo: https://www.copy.com/s/1x1yZFjYO2EI/slab.txt
Hmmm. You're using filestreams. That's unusual.
Right. I keep forgetting about that.
Post by Dave Chinner
Post by Josef 'Jeff' Sipek
sysrq m output: https://www.copy.com/s/mYfMYfJJl2EB/sysrq-m.txt
27764401 total pagecache pages
which indicates that you've got close to 110GB of pages in the page
cache. Hmmm, and 24-25GB of dirty pages in memory.
You know, I'd be suspecting a memory reclaim problem here to do with
having large amounts of dirty memory in the page cache. I don't
think the underlying cause is going to be the filesystem code, as
the warning should never be emitted if memory reclaim is making
progress. Perhaps you could try lowering all the dirty memory
thresholds to see if that allows memory reclaim to make more
progress because there are fewer dirty pages in memory...
Yep. This makes perfect sense. Amusingly enough, we don't read much of the
data, so the page cache is really just there to buffer the writes because
I/O is slow. We'll play with the dirty memory thresholds and see if that helps.

Thanks!

Jeff.
--
All parts should go together without forcing. You must remember that the
parts you are reassembling were disassembled by you. Therefore, if you
can’t get them together again, there must be a reason. By all means, do not
use a hammer.
— IBM Manual, 1925