Discussion:
bad performance of touch/cp on an XFS filesystem
Zhang Qiang
2014-08-25 03:34:34 UTC
Permalink
Dear XFS community & developers,

I am using CentOS 6.3 with XFS as the base filesystem, on RAID5 hardware
storage.

Detailed environment as follows:
OS: CentOS 6.3
Kernel: kernel-2.6.32-279.el6.x86_64
XFS mount options (mount output): /dev/sdb1 on /data type xfs
(rw,noatime,nodiratime,nobarrier)

Detailed symptoms:

# df
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 29G 17G 11G 61% /
/dev/sdb1 893G 803G 91G 90% /data
/dev/sda4 2.2T 1.6T 564G 75% /data1

# time touch /data1/1111
real 0m23.043s
user 0m0.001s
sys 0m0.349s

# perf top
Events: 6K cycles
16.96% [xfs] [k] xfs_inobt_get_rec
11.95% [xfs] [k] xfs_btree_increment
11.16% [xfs] [k] xfs_btree_get_rec
7.39% [xfs] [k] xfs_btree_get_block
5.02% [xfs] [k] xfs_dialloc
4.87% [xfs] [k] xfs_btree_rec_offset
4.33% [xfs] [k] xfs_btree_readahead
4.13% [xfs] [k] _xfs_buf_find
4.05% [kernel] [k] intel_idle
2.89% [xfs] [k] xfs_btree_rec_addr
1.04% [kernel] [k] kmem_cache_free


It seems that some XFS kernel functions are consuming a lot of CPU time
(xfs_inobt_get_rec, xfs_btree_increment, etc.)

I found a bug in Bugzilla [1]; is that the same issue as this one?

Any constructive suggestions about this issue would be greatly appreciated,
as it's really hard to reproduce on another system and it's not possible
to upgrade that production machine.


[1] https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=813137

Thanks in advance
Qiang
Dave Chinner
2014-08-25 05:18:01 UTC
Permalink
Post by Zhang Qiang
Dear XFS community & developers,
I am using CentOS 6.3 and xfs as base file system and use RAID5 as hardware
storage.
OS: CentOS 6.3
Kernel: kernel-2.6.32-279.el6.x86_64
XFS option info(df output): /dev/sdb1 on /data type xfs
(rw,noatime,nodiratime,nobarrier)
# df
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 29G 17G 11G 61% /
/dev/sdb1 893G 803G 91G 90% /data
/dev/sda4 2.2T 1.6T 564G 75% /data1
# time touch /data1/1111
real 0m23.043s
user 0m0.001s
sys 0m0.349s
# perf top
Events: 6K cycles
16.96% [xfs] [k] xfs_inobt_get_rec
11.95% [xfs] [k] xfs_btree_increment
11.16% [xfs] [k] xfs_btree_get_rec
7.39% [xfs] [k] xfs_btree_get_block
5.02% [xfs] [k] xfs_dialloc
4.87% [xfs] [k] xfs_btree_rec_offset
4.33% [xfs] [k] xfs_btree_readahead
4.13% [xfs] [k] _xfs_buf_find
4.05% [kernel] [k] intel_idle
2.89% [xfs] [k] xfs_btree_rec_addr
1.04% [kernel] [k] kmem_cache_free
It seems that some xfs kernel function spend much time (xfs_inobt_get_rec,
xfs_btree_increment, etc.)
I found a bug in bugzilla [1], is that is the same issue like this?
No.
Post by Zhang Qiang
It's very greatly appreciated if you can give constructive suggestion about
this issue, as It's really hard to reproduce from another system and it's
not possible to do upgrade on that online machine.
You've got very few free inodes, widely distributed in the allocated
inode btree. The CPU time above is the btree search for the next
free inode.

This is the issue solved by this series of recent commits to add a
new on-disk free inode btree index:

53801fd xfs: enable the finobt feature on v5 superblocks
0c153c1 xfs: report finobt status in fs geometry
a3fa516 xfs: add finobt support to growfs
3efa4ff xfs: update the finobt on inode free
2b64ee5 xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
6dd8638 xfs: use and update the finobt on inode allocation
0aa0a75 xfs: insert newly allocated inode chunks into the finobt
9d43b18 xfs: update inode allocation/free transaction reservations for finobt
aafc3c2 xfs: support the XFS_BTNUM_FINOBT free inode btree type
8e2c84d xfs: reserve v5 superblock read-only compat. feature bit for finobt
57bd3db xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers

Which is of no help to you, however, because it's not available in
any CentOS kernel.

There's really not much you can do to avoid the problem once you've
punched random freespace holes in the allocated inode btree. It
generally doesn't affect many people; those that it does affect are
normally using XFS as an object store indexed by a hard link farm
(e.g. various backup programs do this).

If you dump the superblock via xfs_db, the difference between icount
and ifree will give you an idea of how much "needle in a haystack"
searching is going on. You can probably narrow it down to a specific
AG by dumping the AGI headers and checking the same thing. Filling
in all the holes (by creating a bunch of zero-length files in the
appropriate AGs) might take some time, but it should make the
problem go away until you remove more files and create random
free inode holes again...
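
For example, something along these lines (a sketch only; field names can
vary a little between xfsprogs versions, and -r opens the device
read-only, so on a mounted filesystem the numbers may be slightly stale):

# xfs_db -r -c "sb 0" -c "p icount ifree" /dev/sda4

icount minus ifree is the number of allocated inodes the search has to
wade through; the AGI headers ("agi 0", "agi 1", ...) can be dumped the
same way to break that down per AG.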

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Zhang Qiang
2014-08-25 08:09:05 UTC
Permalink
Post by Zhang Qiang
Post by Zhang Qiang
Dear XFS community & developers,
I am using CentOS 6.3 and xfs as base file system and use RAID5 as
hardware
Post by Zhang Qiang
storage.
OS: CentOS 6.3
Kernel: kernel-2.6.32-279.el6.x86_64
XFS option info(df output): /dev/sdb1 on /data type xfs
(rw,noatime,nodiratime,nobarrier)
# df
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 29G 17G 11G 61% /
/dev/sdb1 893G 803G 91G 90% /data
/dev/sda4 2.2T 1.6T 564G 75% /data1
# time touch /data1/1111
real 0m23.043s
user 0m0.001s
sys 0m0.349s
# perf top
Events: 6K cycles
16.96% [xfs] [k] xfs_inobt_get_rec
11.95% [xfs] [k] xfs_btree_increment
11.16% [xfs] [k] xfs_btree_get_rec
7.39% [xfs] [k] xfs_btree_get_block
5.02% [xfs] [k] xfs_dialloc
4.87% [xfs] [k] xfs_btree_rec_offset
4.33% [xfs] [k] xfs_btree_readahead
4.13% [xfs] [k] _xfs_buf_find
4.05% [kernel] [k] intel_idle
2.89% [xfs] [k] xfs_btree_rec_addr
1.04% [kernel] [k] kmem_cache_free
It seems that some xfs kernel function spend much time
(xfs_inobt_get_rec,
Post by Zhang Qiang
xfs_btree_increment, etc.)
I found a bug in bugzilla [1], is that is the same issue like this?
No.
Post by Zhang Qiang
It's very greatly appreciated if you can give constructive suggestion
about
Post by Zhang Qiang
this issue, as It's really hard to reproduce from another system and it's
not possible to do upgrade on that online machine.
You've got very few free inodes, widely distributed in the allocated
inode btree. The CPU time above is the btree search for the next
free inode.
This is the issue solved by this series of recent commits to add a
[Qiang] This means that if I want to fix this issue, I have to apply the
following patches and build my own kernel.

Since the on-disk structure has changed, do I also have to re-create the
XFS filesystem? Is there any userspace tool to convert the old on-disk
format to the new one, so that I don't need to back up and restore the
existing data?
Post by Zhang Qiang
53801fd xfs: enable the finobt feature on v5 superblocks
0c153c1 xfs: report finobt status in fs geometry
a3fa516 xfs: add finobt support to growfs
3efa4ff xfs: update the finobt on inode free
2b64ee5 xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
6dd8638 xfs: use and update the finobt on inode allocation
0aa0a75 xfs: insert newly allocated inode chunks into the finobt
9d43b18 xfs: update inode allocation/free transaction reservations for finobt
aafc3c2 xfs: support the XFS_BTNUM_FINOBT free inode btree type
8e2c84d xfs: reserve v5 superblock read-only compat. feature bit for finobt
57bd3db xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
Which is of no help to you, however, because it's not available in
any CentOS kernel.
[Qiang] Do you think it's possible to just backport these patches to
kernel 2.6.32 (CentOS 6.3) to fix this issue?

Or would it be better to backport them to the 3.10 kernel used in CentOS 7.0?
Post by Zhang Qiang
There's really not much you can do to avoid the problem once you've
punched random freespace holes in the allocated inode btree. IT
generally doesn't affect many people; those that it does affect are
normally using XFS as an object store indexed by a hard link farm
(e.g. various backup programs do this).
OK, I see.

Could you please guide me on how to reproduce this issue easily? I have
tried using a 500G XFS partition and filling about 98% of the space, but
still can't reproduce it. Can you think of an easy way to do it?
Post by Zhang Qiang
If you dump the superblock via xfs_db, the difference between icount
and ifree will give you idea of how much "needle in a haystack"
searching is going on. You can probably narrow it down to a specific
AG by dumping the AGI headers and checking the same thing. filling
in all the holes (by creating a bunch of zero length files in the
appropriate AGs) might take some time, but it should make the
problem go away until you remove more filesystem and create random
free inode holes again...
I will try to investigate the issue in detail.

Thanks for your kind response.
Qiang
Post by Zhang Qiang
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2014-08-25 08:56:16 UTC
Permalink
Post by Zhang Qiang
Post by Zhang Qiang
Post by Zhang Qiang
Dear XFS community & developers,
I am using CentOS 6.3 and xfs as base file system and use RAID5 as
hardware
Post by Zhang Qiang
storage.
OS: CentOS 6.3
Kernel: kernel-2.6.32-279.el6.x86_64
XFS option info(df output): /dev/sdb1 on /data type xfs
(rw,noatime,nodiratime,nobarrier)
....
Post by Zhang Qiang
Post by Zhang Qiang
Post by Zhang Qiang
It's very greatly appreciated if you can give constructive suggestion
about
Post by Zhang Qiang
this issue, as It's really hard to reproduce from another system and it's
not possible to do upgrade on that online machine.
You've got very few free inodes, widely distributed in the allocated
inode btree. The CPU time above is the btree search for the next
free inode.
This is the issue solved by this series of recent commits to add a
[Qiang] This meas that if I want to fix this issue, I have to apply the
following patches and build my own kernel.
Yes. Good luck, even I wouldn't attempt to do that.

And then use xfsprogs 3.2.1, and make a new filesystem that enables
metadata CRCs and the free inode btree feature.
Post by Zhang Qiang
As the on-disk structure has been changed, so should I also re-create xfs
filesystem again?
Yes, you need to download the latest xfsprogs (3.2.1) to be able to
make it with the necessary feature bits set.
Post by Zhang Qiang
is there any user space tools to convert old disk
filesystem to new one, and don't need to backup and restore currently data?
No, we don't write utilities to mangle on disk formats. dump, mkfs
and restore is far more reliable than any "in-place conversion" code
we could write. It will probably be faster, too.
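
A rough outline of that workflow, purely as an illustration (the dump
file location is made up, and the mkfs feature flags should be checked
against the xfsprogs 3.2.1 man page before use):

# xfsdump -l 0 -f /backup/data1.dump /data1
# umount /data1
# mkfs.xfs -f -m crc=1,finobt=1 /dev/sda4
# mount /dev/sda4 /data1
# xfsrestore -f /backup/data1.dump /data1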
Post by Zhang Qiang
Post by Zhang Qiang
Which is of no help to you, however, because it's not available in
any CentOS kernel.
[Qiang] Do you think if it's possible to just backport these patches to
kernel 6.2.32 (CentOS 6.3) to fix this issue?
Or it's better to backport to 3.10 kernel, used in CentOS 7.0?
You can try, but if you break it you get to keep all the pieces
yourself. Eventually someone who maintains the RHEL code will do a
backport that will trickle down to CentOS. If you need it any
sooner, then you'll need to do it yourself, or upgrade to RHEL
and ask your support contact for it to be included in RHEL 7.1....
Post by Zhang Qiang
Post by Zhang Qiang
There's really not much you can do to avoid the problem once you've
punched random freespace holes in the allocated inode btree. IT
generally doesn't affect many people; those that it does affect are
normally using XFS as an object store indexed by a hard link farm
(e.g. various backup programs do this).
OK, I see.
Could you please guide me to reproduce this issue easily? as I have tried
to use a 500G xfs partition, and use about 98 % spaces, but still can't
reproduce this issue. Is there any easy way from your mind?
Search the archives for the test cases that were used for the patch
set. There's a performance test case documented in the review
discussions.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Zhang Qiang
2014-08-25 09:05:33 UTC
Permalink
Great, thank you.
icount = 220619904
ifree = 26202919

So free inodes make up about 12% of the total, which doesn't seem that few.

Are you still sure the patches can fix this issue?

Here's the detailed xfs_db info:

# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4 isize=256 agcount=4, agsize=142272384 blks
= sectsz=512 attr=2, projid32bit=0
data = bsize=4096 blocks=569089536, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=277875, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
# umount /dev/sda4
# xfs_db /dev/sda4
xfs_db> sb 0
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = 129
rsumino = 130
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 0
imax_pct = 5
icount = 220619904
ifree = 26202919
fdblocks = 147805479
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 1
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 2
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = null
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 3
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa


Thanks
Qiang
Post by Dave Chinner
Post by Zhang Qiang
Post by Zhang Qiang
Post by Zhang Qiang
Dear XFS community & developers,
I am using CentOS 6.3 and xfs as base file system and use RAID5 as
hardware
Post by Zhang Qiang
storage.
OS: CentOS 6.3
Kernel: kernel-2.6.32-279.el6.x86_64
XFS option info(df output): /dev/sdb1 on /data type xfs
(rw,noatime,nodiratime,nobarrier)
....
Post by Zhang Qiang
Post by Zhang Qiang
Post by Zhang Qiang
It's very greatly appreciated if you can give constructive suggestion
about
Post by Zhang Qiang
this issue, as It's really hard to reproduce from another system and
it's
Post by Zhang Qiang
Post by Zhang Qiang
Post by Zhang Qiang
not possible to do upgrade on that online machine.
You've got very few free inodes, widely distributed in the allocated
inode btree. The CPU time above is the btree search for the next
free inode.
This is the issue solved by this series of recent commits to add a
[Qiang] This meas that if I want to fix this issue, I have to apply the
following patches and build my own kernel.
Yes. Good luck, even I wouldn't attempt to do that.
And then use xfsprogs 3.2.1, and make a new filesystem that enables
metadata CRCs and the free inode btree feature.
Post by Zhang Qiang
As the on-disk structure has been changed, so should I also re-create xfs
filesystem again?
Yes, you need to download the latest xfsprogs (3.2.1) to be able to
make it with the necessary feature bits set.
Post by Zhang Qiang
is there any user space tools to convert old disk
filesystem to new one, and don't need to backup and restore currently
data?
No, we don't write utilities to mangle on disk formats. dump, mkfs
and restore is far more reliable than any "in-place conversion" code
we could write. It will probably be faster, too.
Post by Zhang Qiang
Post by Zhang Qiang
Which is of no help to you, however, because it's not available in
any CentOS kernel.
[Qiang] Do you think if it's possible to just backport these patches to
kernel 6.2.32 (CentOS 6.3) to fix this issue?
Or it's better to backport to 3.10 kernel, used in CentOS 7.0?
You can try, but if you break it you get to keep all the pieces
yourself. Eventually someone who maintains the RHEL code will do a
backport that will trickle down to CentOS. If you need it any
sooner, then you'll need to do it yourself, or upgrade to RHEL
and ask your support contact for it to be included in RHEL 7.1....
Post by Zhang Qiang
Post by Zhang Qiang
There's really not much you can do to avoid the problem once you've
punched random freespace holes in the allocated inode btree. IT
generally doesn't affect many people; those that it does affect are
normally using XFS as an object store indexed by a hard link farm
(e.g. various backup programs do this).
OK, I see.
Could you please guide me to reproduce this issue easily? as I have tried
to use a 500G xfs partition, and use about 98 % spaces, but still can't
reproduce this issue. Is there any easy way from your mind?
Search the archives for the test cases that were used for the patch
set. There's a performance test case documented in the review
discussions.
Cheers,
Dave.
--
Dave Chinner
Zhang Qiang
2014-08-25 08:47:39 UTC
Permalink
I have checked icount and ifree, and found that about 11.8 percent of the
inodes are free, so free inodes shouldn't be too scarce.

Here's the detailed log; any new clues?

# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4 isize=256 agcount=4, agsize=142272384 blks
= sectsz=512 attr=2, projid32bit=0
data = bsize=4096 blocks=569089536, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=277875, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
# umount /dev/sda4
# xfs_db /dev/sda4
xfs_db> sb 0
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = 129
rsumino = 130
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 0
imax_pct = 5
icount = 220619904
ifree = 26202919
fdblocks = 147805479
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 1
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 2
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = null
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 3
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
Post by Zhang Qiang
Post by Zhang Qiang
Dear XFS community & developers,
I am using CentOS 6.3 and xfs as base file system and use RAID5 as
hardware
Post by Zhang Qiang
storage.
OS: CentOS 6.3
Kernel: kernel-2.6.32-279.el6.x86_64
XFS option info(df output): /dev/sdb1 on /data type xfs
(rw,noatime,nodiratime,nobarrier)
# df
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 29G 17G 11G 61% /
/dev/sdb1 893G 803G 91G 90% /data
/dev/sda4 2.2T 1.6T 564G 75% /data1
# time touch /data1/1111
real 0m23.043s
user 0m0.001s
sys 0m0.349s
# perf top
Events: 6K cycles
16.96% [xfs] [k] xfs_inobt_get_rec
11.95% [xfs] [k] xfs_btree_increment
11.16% [xfs] [k] xfs_btree_get_rec
7.39% [xfs] [k] xfs_btree_get_block
5.02% [xfs] [k] xfs_dialloc
4.87% [xfs] [k] xfs_btree_rec_offset
4.33% [xfs] [k] xfs_btree_readahead
4.13% [xfs] [k] _xfs_buf_find
4.05% [kernel] [k] intel_idle
2.89% [xfs] [k] xfs_btree_rec_addr
1.04% [kernel] [k] kmem_cache_free
It seems that some xfs kernel function spend much time
(xfs_inobt_get_rec,
Post by Zhang Qiang
xfs_btree_increment, etc.)
I found a bug in bugzilla [1], is that is the same issue like this?
No.
Post by Zhang Qiang
It's very greatly appreciated if you can give constructive suggestion
about
Post by Zhang Qiang
this issue, as It's really hard to reproduce from another system and it's
not possible to do upgrade on that online machine.
You've got very few free inodes, widely distributed in the allocated
inode btree. The CPU time above is the btree search for the next
free inode.
This is the issue solved by this series of recent commits to add a
53801fd xfs: enable the finobt feature on v5 superblocks
0c153c1 xfs: report finobt status in fs geometry
a3fa516 xfs: add finobt support to growfs
3efa4ff xfs: update the finobt on inode free
2b64ee5 xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
6dd8638 xfs: use and update the finobt on inode allocation
0aa0a75 xfs: insert newly allocated inode chunks into the finobt
9d43b18 xfs: update inode allocation/free transaction reservations for finobt
aafc3c2 xfs: support the XFS_BTNUM_FINOBT free inode btree type
8e2c84d xfs: reserve v5 superblock read-only compat. feature bit for finobt
57bd3db xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
Which is of no help to you, however, because it's not available in
any CentOS kernel.
There's really not much you can do to avoid the problem once you've
punched random freespace holes in the allocated inode btree. IT
generally doesn't affect many people; those that it does affect are
normally using XFS as an object store indexed by a hard link farm
(e.g. various backup programs do this).
If you dump the superblock via xfs_db, the difference between icount
and ifree will give you idea of how much "needle in a haystack"
searching is going on. You can probably narrow it down to a specific
AG by dumping the AGI headers and checking the same thing. filling
in all the holes (by creating a bunch of zero length files in the
appropriate AGs) might take some time, but it should make the
problem go away until you remove more filesystem and create random
free inode holes again...
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2014-08-25 09:08:43 UTC
Permalink
Post by Zhang Qiang
I have checked icount and ifree, but I found there are about 11.8 percent
free, so the free inode should not be too few.
Here's the detail log, any new clue?
# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4 isize=256 agcount=4, agsize=142272384
4 AGs
Post by Zhang Qiang
icount = 220619904
ifree = 26202919
And 220 million inodes. There's your problem - that's an average
of 55 million inodes per AGI btree assuming you are using inode64.
If you are using inode32, then the inodes will be in 2 btrees, or
maybe even only one.

Any way you look at it, searching btrees with tens of millions of
entries is going to consume a *lot* of CPU time. So, really, the
state your fs is in is probably unfixable without mkfs. And really,
that's probably pushing the boundaries of what xfsdump and
xfsrestore can support - it's going to take a long time to dump and
restore that data....

With that many inodes, I'd be considering moving to 32 or 64 AGs to
keep the btree size down to a more manageable size. The free inode
btree would also help, but, really, 220M inodes in a 2TB filesystem
is really pushing the boundaries of sanity.....
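
As a rough illustration of the difference the AG count makes: with
inode32 all ~220M inodes live in AG 0's btree; with inode64 and 4 AGs
that drops to roughly 220619904 / 4 ≈ 55 million inodes per AGI btree;
with 32 AGs it would be around 220619904 / 32 ≈ 7 million per tree. Each
"find the next free inode" search then walks a much smaller structure
(these are averages, of course, not a guarantee of how allocation will
actually spread the inodes).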

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Zhang Qiang
2014-08-25 10:31:10 UTC
Permalink
Post by Dave Chinner
Post by Zhang Qiang
I have checked icount and ifree, but I found there are about 11.8 percent
free, so the free inode should not be too few.
Here's the detail log, any new clue?
# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4 isize=256 agcount=4, agsize=142272384
4 AGs
Yes.
Post by Dave Chinner
Post by Zhang Qiang
icount = 220619904
ifree = 26202919
And 220 million inodes. There's your problem - that's an average
of 55 million inodes per AGI btree assuming you are using inode64.
If you are using inode32, then the inodes will be in 2 btrees, or
maybe even only one.
You are right, all the inodes are in one AG.

BTW, why do all inodes end up in one AG with inode32 even though I
allocated 4 AGs? Sorry, I'm not very familiar with XFS yet.
Post by Dave Chinner
Anyway you look at it, searching btrees with tens of millions of
entries is going to consume a *lot* of CPU time. So, really, the
state your fs is in is probably unfixable without mkfs. And really,
that's probably pushing the boundaries of what xfsdump and
xfs-restore can support - it's going to take a long tiem to dump and
restore that data....
That sounds reasonable.
Post by Dave Chinner
With that many inodes, I'd be considering moving to 32 or 64 AGs to
keep the btree size down to a more manageable size. The free inode
btree would also help, but, really, 220M inodes in a 2TB filesystem
Post by Dave Chinner
is really pushing the boundaries of sanity.....
So a better inode count per AG would be about 5M; are there any documents
about these options where I can learn more?

I will spend more time learning how to use XFS and its internals, and try
to contribute code.

Thanks for your help.
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2014-08-25 22:26:57 UTC
Permalink
Post by Zhang Qiang
Post by Dave Chinner
Post by Zhang Qiang
I have checked icount and ifree, but I found there are about 11.8 percent
free, so the free inode should not be too few.
Here's the detail log, any new clue?
# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4 isize=256 agcount=4, agsize=142272384
4 AGs
Yes.
Post by Dave Chinner
Post by Zhang Qiang
icount = 220619904
ifree = 26202919
And 220 million inodes. There's your problem - that's an average
of 55 million inodes per AGI btree assuming you are using inode64.
If you are using inode32, then the inodes will be in 2 btrees, or
maybe even only one.
You are right, all inodes stay on one AG.
BTW, why i allocate 4 AGs, and all inodes stay in one AG for inode32?,
Because the top addresses in the 2nd AG go over 32 bits, hence only
AG 0 can be used for inodes. Changing to inode64 will give you some
relief, but any time allocation occurs in AG0 it will be slow, i.e.
you'll be trading always slow for "unpredictably slow".
Post by Zhang Qiang
Post by Dave Chinner
With that many inodes, I'd be considering moving to 32 or 64 AGs to
keep the btree size down to a more manageable size. The free inode
btree would also help, but, really, 220M inodes in a 2TB filesystem
Post by Dave Chinner
is really pushing the boundaries of sanity.....
So the better inodes size in one AG is about 5M,
Not necessarily. But for your storage it's almost certainly going to
minimise the problem you are seeing.
Post by Zhang Qiang
is there any documents
about these options I can learn more?
http://xfs.org/index.php/XFS_Papers_and_Documentation

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Greg Freemyer
2014-08-25 22:46:31 UTC
Permalink
Post by Dave Chinner
Post by Zhang Qiang
Post by Dave Chinner
Post by Zhang Qiang
I have checked icount and ifree, but I found there are about 11.8 percent
free, so the free inode should not be too few.
Here's the detail log, any new clue?
# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4 isize=256 agcount=4, agsize=142272384
4 AGs
Yes.
Post by Dave Chinner
Post by Zhang Qiang
icount = 220619904
ifree = 26202919
And 220 million inodes. There's your problem - that's an average
of 55 million inodes per AGI btree assuming you are using inode64.
If you are using inode32, then the inodes will be in 2 btrees, or
maybe even only one.
You are right, all inodes stay on one AG.
BTW, why i allocate 4 AGs, and all inodes stay in one AG for inode32?,
Because the top addresses in the 2nd AG go over 32 bits, hence only
AG 0 can be used for inodes. Changing to inode64 will give you some
relief, but any time allocation occurs in AG0 is will be slow. i.e.
you'll be trading always slow for "unpredictably slow".
Post by Zhang Qiang
Post by Dave Chinner
With that many inodes, I'd be considering moving to 32 or 64 AGs to
keep the btree size down to a more manageable size. The free inode
btree would also help, but, really, 220M inodes in a 2TB filesystem
Post by Dave Chinner
is really pushing the boundaries of sanity.....
So the better inodes size in one AG is about 5M,
Not necessarily. But for your storage it's almost certainly going to
minimise the problem you are seeing.
Post by Zhang Qiang
is there any documents
about these options I can learn more?
http://xfs.org/index.php/XFS_Papers_and_Documentation
Given the apparently huge number of small files, would he likely see a
big performance increase if he replaced that 2TB of rust with SSD?

Greg
Dave Chinner
2014-08-26 02:37:55 UTC
Permalink
Post by Greg Freemyer
Post by Dave Chinner
Post by Zhang Qiang
Post by Dave Chinner
Post by Zhang Qiang
I have checked icount and ifree, but I found there are about 11.8 percent
free, so the free inode should not be too few.
Here's the detail log, any new clue?
# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4 isize=256 agcount=4, agsize=142272384
4 AGs
Yes.
Post by Dave Chinner
Post by Zhang Qiang
icount = 220619904
ifree = 26202919
And 220 million inodes. There's your problem - that's an average
of 55 million inodes per AGI btree assuming you are using inode64.
If you are using inode32, then the inodes will be in 2 btrees, or
maybe even only one.
You are right, all inodes stay on one AG.
BTW, why i allocate 4 AGs, and all inodes stay in one AG for inode32?,
Because the top addresses in the 2nd AG go over 32 bits, hence only
AG 0 can be used for inodes. Changing to inode64 will give you some
relief, but any time allocation occurs in AG0 is will be slow. i.e.
you'll be trading always slow for "unpredictably slow".
Post by Zhang Qiang
Post by Dave Chinner
With that many inodes, I'd be considering moving to 32 or 64 AGs to
keep the btree size down to a more manageable size. The free inode
btree would also help, but, really, 220M inodes in a 2TB filesystem
Post by Dave Chinner
is really pushing the boundaries of sanity.....
So the better inodes size in one AG is about 5M,
Not necessarily. But for your storage it's almost certainly going to
minimise the problem you are seeing.
Post by Zhang Qiang
is there any documents
about these options I can learn more?
http://xfs.org/index.php/XFS_Papers_and_Documentation
Given the apparently huge number of small files would he likely see a
big performance increase if he replaced that 2TB or rust with SSD.
Doubt it - the profiles showed the allocation being CPU bound
searching the metadata that indexes all those inodes. Those same
profiles showed all the signs that it was hitting the buffer
cache most of the time, too, which is why it was CPU bound....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Zhang Qiang
2014-08-26 10:04:52 UTC
Permalink
Thanks Dave/Greg for your analysis and suggestions.

To summarize what I should do next:

- back up my data using xfsdump
- rebuild the filesystem using mkfs with agcount=32 for the 2T disk
- mount the filesystem with the inode64,nobarrier options
- apply the patches that add the free inode btree to the on-disk structure

As we have about ~100 servers to back up, this will take a lot of effort;
do you have any other suggestions?

What I am testing (ongoing):
- create a new 2T partition and filesystem
- create small files until the space is nearly full, then remove some of
them randomly
- check the performance of touch/cp on files
- apply the patches and verify them

I have gathered more data from the server:

1) flush all caches (echo 3 > /proc/sys/vm/drop_caches) and umount the filesystem
2) mount the filesystem and test with touch:
* the first touch of a new file takes about ~23s
* a second touch takes about ~0.1s

Here's the perf data:
First touch command:

Events: 435 cycles
+ 7.51% touch [xfs] [k] xfs_inobt_get_rec
+ 5.61% touch [xfs] [k] xfs_btree_get_block
+ 5.38% touch [xfs] [k] xfs_btree_increment
+ 4.26% touch [xfs] [k] xfs_btree_get_rec
+ 3.73% touch [kernel.kallsyms] [k] find_busiest_group
+ 3.43% touch [xfs] [k] _xfs_buf_find
+ 2.72% touch [xfs] [k] xfs_btree_readahead
+ 2.38% touch [xfs] [k] xfs_trans_buf_item_match
+ 2.34% touch [xfs] [k] xfs_dialloc
+ 2.32% touch [kernel.kallsyms] [k] generic_make_request
+ 2.09% touch [xfs] [k] xfs_btree_rec_offset
+ 1.75% touch [kernel.kallsyms] [k] kmem_cache_alloc
+ 1.63% touch [kernel.kallsyms] [k] cpumask_next_and
+ 1.41% touch [sd_mod] [k] sd_prep_fn
+ 1.41% touch [kernel.kallsyms] [k] get_page_from_freelist
+ 1.38% touch [kernel.kallsyms] [k] __alloc_pages_nodemask
+ 1.27% touch [kernel.kallsyms] [k] scsi_request_fn
+ 1.22% touch [kernel.kallsyms] [k] blk_queue_bounce
+ 1.20% touch [kernel.kallsyms] [k] cfq_should_idle
+ 1.10% touch [xfs] [k] xfs_btree_rec_addr
+ 1.03% touch [kernel.kallsyms] [k] cfq_dispatch_requests
+ 1.00% touch [kernel.kallsyms] [k] _spin_lock_irqsave
+ 0.94% touch [kernel.kallsyms] [k] memcpy
+ 0.86% touch [kernel.kallsyms] [k] swiotlb_map_sg_attrs
+ 0.84% touch [kernel.kallsyms] [k] alloc_pages_current
+ 0.82% touch [kernel.kallsyms] [k] submit_bio
+ 0.81% touch [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion
+ 0.77% touch [kernel.kallsyms] [k] blk_peek_request
+ 0.73% touch [xfs] [k] xfs_btree_setbuf
+ 0.73% touch [megaraid_sas] [k] MR_GetPhyParams
+ 0.73% touch [kernel.kallsyms] [k] run_timer_softirq
+ 0.71% touch [kernel.kallsyms] [k] pick_next_task_rt
+ 0.71% touch [kernel.kallsyms] [k] init_request_from_bio
+ 0.70% touch [kernel.kallsyms] [k] thread_return
+ 0.69% touch [kernel.kallsyms] [k] cfq_set_request
+ 0.67% touch [kernel.kallsyms] [k] mempool_alloc
+ 0.66% touch [xfs] [k] xfs_buf_hold
+ 0.66% touch [kernel.kallsyms] [k] find_next_bit
+ 0.62% touch [kernel.kallsyms] [k] cfq_insert_request
+ 0.61% touch [kernel.kallsyms] [k] scsi_init_io
+ 0.60% touch [megaraid_sas] [k] MR_BuildRaidContext
+ 0.59% touch [kernel.kallsyms] [k] policy_zonelist
+ 0.59% touch [kernel.kallsyms] [k] elv_insert
+ 0.58% touch [xfs] [k] xfs_buf_allocate_memory


Second touch command:


Events: 105 cycles
+ 20.92% touch [xfs] [k] xfs_inobt_get_rec
+ 14.27% touch [xfs] [k] xfs_btree_get_rec
+ 12.21% touch [xfs] [k] xfs_btree_get_block
+ 12.12% touch [xfs] [k] xfs_btree_increment
+ 9.86% touch [xfs] [k] xfs_btree_readahead
+ 7.87% touch [xfs] [k] _xfs_buf_find
+ 4.93% touch [xfs] [k] xfs_btree_rec_addr
+ 4.12% touch [xfs] [k] xfs_dialloc
+ 3.03% touch [kernel.kallsyms] [k] clear_page_c
+ 2.96% touch [xfs] [k] xfs_btree_rec_offset
+ 1.31% touch [kernel.kallsyms] [k] kmem_cache_free
+ 1.03% touch [xfs] [k] xfs_trans_buf_item_match
+ 0.99% touch [kernel.kallsyms] [k] _atomic_dec_and_lock
+ 0.99% touch [xfs] [k] xfs_inobt_get_maxrecs
+ 0.99% touch [xfs] [k] xfs_buf_unlock
+ 0.99% touch [xfs] [k] kmem_zone_alloc
+ 0.98% touch [kernel.kallsyms] [k] kmem_cache_alloc
+ 0.28% touch [kernel.kallsyms] [k] pgd_alloc
+ 0.17% touch [kernel.kallsyms] [k] page_fault
+ 0.01% touch [kernel.kallsyms] [k] native_write_msr_safe

I have compared the memory usage, and it seems that XFS tries to load the
inode bmap blocks the first time, which takes a lot of time. Is that the
reason the first touch operation takes so long?

Thanks
Qiang
Post by Zhang Qiang
Post by Greg Freemyer
Post by Dave Chinner
Post by Zhang Qiang
Post by Dave Chinner
Post by Zhang Qiang
I have checked icount and ifree, but I found there are about 11.8
percent
Post by Greg Freemyer
Post by Dave Chinner
Post by Zhang Qiang
Post by Dave Chinner
Post by Zhang Qiang
free, so the free inode should not be too few.
Here's the detail log, any new clue?
# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4 isize=256 agcount=4,
agsize=142272384
Post by Greg Freemyer
Post by Dave Chinner
Post by Zhang Qiang
Post by Dave Chinner
4 AGs
Yes.
Post by Dave Chinner
Post by Zhang Qiang
icount = 220619904
ifree = 26202919
And 220 million inodes. There's your problem - that's an average
of 55 million inodes per AGI btree assuming you are using inode64.
If you are using inode32, then the inodes will be in 2 btrees, or
maybe even only one.
You are right, all inodes stay on one AG.
BTW, why i allocate 4 AGs, and all inodes stay in one AG for inode32?,
Because the top addresses in the 2nd AG go over 32 bits, hence only
AG 0 can be used for inodes. Changing to inode64 will give you some
relief, but any time allocation occurs in AG0 is will be slow. i.e.
you'll be trading always slow for "unpredictably slow".
Post by Zhang Qiang
Post by Dave Chinner
With that many inodes, I'd be considering moving to 32 or 64 AGs to
keep the btree size down to a more manageable size. The free inode
btree would also help, but, really, 220M inodes in a 2TB filesystem
Post by Dave Chinner
is really pushing the boundaries of sanity.....
So the better inodes size in one AG is about 5M,
Not necessarily. But for your storage it's almost certainly going to
minimise the problem you are seeing.
Post by Zhang Qiang
is there any documents
about these options I can learn more?
http://xfs.org/index.php/XFS_Papers_and_Documentation
Given the apparently huge number of small files would he likely see a
big performance increase if he replaced that 2TB or rust with SSD.
Doubt it - the profiles showed the allocation being CPU bound
searching the metadata that indexes all those inodes. Those same
profiles showed all the signs that it was hitting the buffer
cache most of the time, too, which is why it was CPU bound....
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2014-08-26 13:13:54 UTC
Permalink
Post by Zhang Qiang
Thanks Dave/Greg for your analysis and suggestions.
- backup my data using xfsdump
- rebuilt filesystem using mkfs with options: agcount=32 for 2T disk
- mount filesystem with option inode64,nobarrier
Ok up to here.
Post by Zhang Qiang
- applied patches about adding free list inode on disk structure
No, don't do that. You're almost certain to get it wrong and corrupt
your filesystems and lose data.
Post by Zhang Qiang
As we have about ~100 servers need back up, so that will take much effort,
do you have any other suggestion?
Just remount them with inode64. Nothing else. Over time as you add
and remove files the inodes will redistribute across all 4 AGs.
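
Something like this, as a sketch (on a 2.6.32-era kernel the inode64
option may only take effect on a fresh mount rather than a live remount,
so plan for a brief unmount):

# umount /data1
# mount -o inode64,noatime,nodiratime,nobarrier /dev/sda4 /data1

plus the matching /etc/fstab entry so the option persists across reboots:

/dev/sda4  /data1  xfs  inode64,noatime,nodiratime,nobarrier  0 0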
Post by Zhang Qiang
- created a new 2T partition filesystem
- try to create small files and fill whole spaces then remove some of them
randomly
- check the performance of touch/cp files
- apply patches and verify it.
1) flush all cache(echo 3 > /proc/sys/vm/drop_caches), and umount filesystem
2) mount filesystem and testing with touch command
* The first touch new file command take about ~23s
* second touch command take about ~0.1s.
So it's cache population that is your issue. You didn't say that
first time around, which means the diagnosis was wrong. Again, it's having to
search a btree with 220 million inodes in it to find the first free
inode, and that btree has to be pulled in from disk and searched.
Once it's cached, then each subsequent allocation will be much
faster because the majority of the tree being searched will already
be in cache...
Post by Zhang Qiang
I have compared the memory used, it seems that xfs try to load inode bmap
block for the first time, which take much time, is that the reason to take
so much time for the first touch operation?
No. Reading the AGI btree to find the first free inode to allocate
is what is taking the time. If you spread the inodes out over 4 AGs
(using inode64) then the overhead of the first read will go down
proportionally. Indeed, that is one of the reasons for using more
AGs than 4 for filesystems like this.

Still, I can't help but wonder why you are using a filesystem to
store hundreds of millions of tiny files, when a database is far
better suited to storing and indexing this type and quantity of
data....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Zhang Qiang
2014-08-27 08:53:17 UTC
Permalink
Post by Dave Chinner
Post by Zhang Qiang
Thanks Dave/Greg for your analysis and suggestions.
- backup my data using xfsdump
- rebuilt filesystem using mkfs with options: agcount=32 for 2T disk
- mount filesystem with option inode64,nobarrier
Ok up to here.
Post by Zhang Qiang
- applied patches about adding free list inode on disk structure
No, don't do that. You're almost certain to get it wrong and corrupt
your filesysetms and lose data.
Post by Zhang Qiang
As we have about ~100 servers need back up, so that will take much
effort,
Post by Zhang Qiang
do you have any other suggestion?
Just remount them with inode64. Nothing else. Over time as you add
and remove files the inodes will redistribute across all 4 AGs.
OK.

How can I see the number of inodes in each AG? Here are my checking
steps:

1) Check the unmounted filesystem first:
[***@fstest data1]# xfs_db -c "sb 0" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 421793920
ifree = 41
[***@fstest data1]# xfs_db -c "sb 1" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 0
ifree = 0
[***@fstest data1]# xfs_db -c "sb 2" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 0
ifree = 0
[***@fstest data1]# xfs_db -c "sb 3" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 0
ifree = 0
2) mount it with inode64 and create many files:

[***@fstest /]# mount -o inode64,nobarrier /dev/sdb1 /data
[***@fstest /]# cd /data/tmp/
[***@fstest tmp]# fdtree.bash -d 16 -l 2 -f 100 -s 1
[***@fstest /]# umount /data

3) Check with xfs_db again:

[***@fstest data1]# xfs_db -f -c "sb 0" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 421821504
ifree = 52
[***@fstest data1]# xfs_db -f -c "sb 1" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 0
ifree = 0

So it seems the inodes are only in the first AG. Or is icount/ifree not
the right thing to check, and if so, how should I check how many inodes
are in each AG?

I am looking for a way to improve performance on the current filesystem
and kernel just by remounting with inode64, and trying to work out how to
redistribute the inodes evenly across all AGs.

Is there a good way to do that? For example, backing up half of the data
to another device, removing it, and then copying it back?
Post by Dave Chinner
Post by Zhang Qiang
- created a new 2T partition filesystem
- try to create small files and fill whole spaces then remove some of
them
Post by Zhang Qiang
randomly
- check the performance of touch/cp files
- apply patches and verify it.
1) flush all cache(echo 3 > /proc/sys/vm/drop_caches), and umount
filesystem
Post by Zhang Qiang
2) mount filesystem and testing with touch command
* The first touch new file command take about ~23s
* second touch command take about ~0.1s.
So it's cache population that is your issue. You didn't say that
first time around, which means the diagnosis was wrong. Again, it's having to
search a btree with 220 million inodes in it to find the first free
inode, and that btree has to be pulled in from disk and searched.
Once it's cached, then each subsequent allocation will be much
faster becaue the majority of the tree being searched will already
be in cache...
Post by Zhang Qiang
I have compared the memory used, it seems that xfs try to load inode bmap
block for the first time, which take much time, is that the reason to
take
Post by Zhang Qiang
so much time for the first touch operation?
No. reading the AGI btree to find the first free inode to allocate
is what is taking the time. If you spread the inodes out over 4 AGs
(using inode64) then the overhead of the first read will go down
proportionally. Indeed, that is one of the reasons for using more
AGs than 4 for filesystems lik ethis.
OK, I see.
Post by Dave Chinner
Still, I can't help but wonder why you are using a filesystem to
store hundreds of millions of tiny files, when a database is far
better suited to storing and indexing this type and quantity of
data....
OK. These are the back-end servers of a social networking website,
actually the CDN infrastructure, with servers located in different
cities. We have a global sync script so that all 100 servers hold the
same data.

Each server uses RAID10 and XFS (CentOS 6.3).

About 3M files (~50K each) are generated every day, and we track the path
of each file in a database.

Do you have any suggestions to improve our solution?
Post by Dave Chinner
Cheers,
Dave.
--
Dave Chinner
Dave Chinner
2014-08-28 02:08:19 UTC
Permalink
Post by Zhang Qiang
Post by Dave Chinner
Post by Zhang Qiang
Thanks Dave/Greg for your analysis and suggestions.
- backup my data using xfsdump
- rebuilt filesystem using mkfs with options: agcount=32 for 2T disk
- mount filesystem with option inode64,nobarrier
Ok up to here.
Post by Zhang Qiang
- applied patches about adding free list inode on disk structure
No, don't do that. You're almost certain to get it wrong and corrupt
your filesysetms and lose data.
Post by Zhang Qiang
As we have about ~100 servers need back up, so that will take much
effort,
Post by Zhang Qiang
do you have any other suggestion?
Just remount them with inode64. Nothing else. Over time as you add
and remove files the inodes will redistribute across all 4 AGs.
OK.
How I can see the layout number of inodes on each AGs? Here's my checking
'icount|ifree'
icount = 421793920
ifree = 41
'icount|ifree'
icount = 0
ifree = 0
That's wrong. You need to check the AGI headers, not the superblock.
Only the primary superblock gets updated, and it's the aggregate of
all the AGI values, not the per-AG values.
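
For example, something like this (a sketch; the AGI fields of interest
are "count" and "freecount"):

# for ag in 0 1 2 3; do
>   xfs_db -r -c "agi $ag" -c "p count freecount" /dev/sdb1
> done

That prints the allocated and free inode counts for each AG, which is the
per-AG breakdown the primary superblock can't give you.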

And, BTW, that's *421 million* inodes in that filesystem. Almost
twice as many as the filesystem you started showing problems on...
Post by Zhang Qiang
OK, this is a social networking website back end servers, actually the CDN
infrastructure, and different server located different cities.
We have a global sync script to make all these 100 servers have the same
data.
For each server we use RAID10 and XFS (CentOS6.3).
There are about 3M files (50K in size) generated every day, and we track
the path of each files in database.
I'd suggest you are overestimating the size of the files being
stored by an order of magnitude: 200M files at 50k in size is 10TB,
not 1.5TB.

But you've confirmed exactly what I thought - you're using the
filesystem as an anonymous object store for hundreds of millions of
small objects and that's exactly the situation I'd expect to see
these problems....
Post by Zhang Qiang
Do you have any suggestions to improve our solution?
TANSTAAFL.

I've given you some stuff to try; worst case is reformatting and
recopying all the data around. I don't really have much time to do
much more than that - talk to Red Hat (because you are using CentOS)
if you want help with a more targeted solution to your problem...

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com