Thanks Dave/Greg for your analysis and suggestions.
Here is a summary of what I should do next:
- back up my data using xfsdump
- rebuild the filesystem with mkfs, using agcount=32 for the 2TB disk
- mount the filesystem with the inode64,nobarrier options
- apply the patches that add the free inode btree to the on-disk structure
(a rough per-server sketch of this procedure is below)
Since we have about 100 servers to back up, this will take a lot of
effort. Do you have any other suggestions?
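Roughly, per server (the device name, mount point and dump file path here
are only placeholders, not our real ones):

# 1) dump the filesystem to external storage
xfsdump -l 0 -f /backup/data1.dump /data1

# 2) rebuild the filesystem with more allocation groups
umount /data1
mkfs.xfs -f -d agcount=32 /dev/sda4

# 3) remount with 64-bit inode numbers and barriers disabled
mount -o inode64,nobarrier /dev/sda4 /data1

# 4) restore the data
xfsrestore -f /backup/data1.dump /data1

(plus booting a kernel with the free inode btree patches applied, which is
not shown here)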
What I am testing (ongoing; a rough sketch is after this list):
- create a new filesystem on a 2TB partition
- create small files until the whole space is filled, then remove some of
them randomly
- check the performance of touch/cp on files
- apply the patches and verify them.
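The test looks roughly like this (the device, mount point, file size and
counts are made up for illustration):

mkfs.xfs -f /dev/sdb1
mount /dev/sdb1 /mnt/test

# fill the filesystem with small files until it is (nearly) full
i=0
while dd if=/dev/zero of=/mnt/test/file.$i bs=4k count=1 2>/dev/null; do
    i=$((i + 1))
done

# remove roughly half of the files at random
ls /mnt/test | shuf | head -n $((i / 2)) | xargs -I{} rm /mnt/test/{}

# then time the metadata operations
time touch /mnt/test/new_file
time cp /mnt/test/new_file /mnt/test/new_file.copy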
I have collected more data from the server (the exact command sequence is
sketched below):
1) flush all caches (echo 3 > /proc/sys/vm/drop_caches) and unmount the
filesystem
2) mount the filesystem again and test with the touch command
* The first touch of a new file takes about ~23s
* The second touch takes about ~0.1s.
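Concretely, the sequence was roughly the following (the file names here
are just examples):

sync
echo 3 > /proc/sys/vm/drop_caches
umount /data1
mount /dev/sda4 /data1

time touch /data1/test_file_1   # first touch, cold cache: ~23s
time touch /data1/test_file_2   # second touch, warm cache: ~0.1s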
Here's the perf data:
First touch command:
Events: 435 cycles
+ 7.51% touch [xfs] [k] xfs_inobt_get_rec
+ 5.61% touch [xfs] [k] xfs_btree_get_block
+ 5.38% touch [xfs] [k] xfs_btree_increment
+ 4.26% touch [xfs] [k] xfs_btree_get_rec
+ 3.73% touch [kernel.kallsyms] [k] find_busiest_group
+ 3.43% touch [xfs] [k] _xfs_buf_find
+ 2.72% touch [xfs] [k] xfs_btree_readahead
+ 2.38% touch [xfs] [k] xfs_trans_buf_item_match
+ 2.34% touch [xfs] [k] xfs_dialloc
+ 2.32% touch [kernel.kallsyms] [k] generic_make_request
+ 2.09% touch [xfs] [k] xfs_btree_rec_offset
+ 1.75% touch [kernel.kallsyms] [k] kmem_cache_alloc
+ 1.63% touch [kernel.kallsyms] [k] cpumask_next_and
+ 1.41% touch [sd_mod] [k] sd_prep_fn
+ 1.41% touch [kernel.kallsyms] [k] get_page_from_freelist
+ 1.38% touch [kernel.kallsyms] [k] __alloc_pages_nodemask
+ 1.27% touch [kernel.kallsyms] [k] scsi_request_fn
+ 1.22% touch [kernel.kallsyms] [k] blk_queue_bounce
+ 1.20% touch [kernel.kallsyms] [k] cfq_should_idle
+ 1.10% touch [xfs] [k] xfs_btree_rec_addr
+ 1.03% touch [kernel.kallsyms] [k] cfq_dispatch_requests
+ 1.00% touch [kernel.kallsyms] [k] _spin_lock_irqsave
+ 0.94% touch [kernel.kallsyms] [k] memcpy
+ 0.86% touch [kernel.kallsyms] [k] swiotlb_map_sg_attrs
+ 0.84% touch [kernel.kallsyms] [k] alloc_pages_current
+ 0.82% touch [kernel.kallsyms] [k] submit_bio
+ 0.81% touch [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion
+ 0.77% touch [kernel.kallsyms] [k] blk_peek_request
+ 0.73% touch [xfs] [k] xfs_btree_setbuf
+ 0.73% touch [megaraid_sas] [k] MR_GetPhyParams
+ 0.73% touch [kernel.kallsyms] [k] run_timer_softirq
+ 0.71% touch [kernel.kallsyms] [k] pick_next_task_rt
+ 0.71% touch [kernel.kallsyms] [k] init_request_from_bio
+ 0.70% touch [kernel.kallsyms] [k] thread_return
+ 0.69% touch [kernel.kallsyms] [k] cfq_set_request
+ 0.67% touch [kernel.kallsyms] [k] mempool_alloc
+ 0.66% touch [xfs] [k] xfs_buf_hold
+ 0.66% touch [kernel.kallsyms] [k] find_next_bit
+ 0.62% touch [kernel.kallsyms] [k] cfq_insert_request
+ 0.61% touch [kernel.kallsyms] [k] scsi_init_io
+ 0.60% touch [megaraid_sas] [k] MR_BuildRaidContext
+ 0.59% touch [kernel.kallsyms] [k] policy_zonelist
+ 0.59% touch [kernel.kallsyms] [k] elv_insert
+ 0.58% touch [xfs] [k] xfs_buf_allocate_memory
Second touch command:
Events: 105 cycles
+ 20.92% touch [xfs] [k] xfs_inobt_get_rec
+ 14.27% touch [xfs] [k] xfs_btree_get_rec
+ 12.21% touch [xfs] [k] xfs_btree_get_block
+ 12.12% touch [xfs] [k] xfs_btree_increment
+ 9.86% touch [xfs] [k] xfs_btree_readahead
+ 7.87% touch [xfs] [k] _xfs_buf_find
+ 4.93% touch [xfs] [k] xfs_btree_rec_addr
+ 4.12% touch [xfs] [k] xfs_dialloc
+ 3.03% touch [kernel.kallsyms] [k] clear_page_c
+ 2.96% touch [xfs] [k] xfs_btree_rec_offset
+ 1.31% touch [kernel.kallsyms] [k] kmem_cache_free
+ 1.03% touch [xfs] [k] xfs_trans_buf_item_match
+ 0.99% touch [kernel.kallsyms] [k] _atomic_dec_and_lock
+ 0.99% touch [xfs] [k] xfs_inobt_get_maxrecs
+ 0.99% touch [xfs] [k] xfs_buf_unlock
+ 0.99% touch [xfs] [k] kmem_zone_alloc
+ 0.98% touch [kernel.kallsyms] [k] kmem_cache_alloc
+ 0.28% touch [kernel.kallsyms] [k] pgd_alloc
+ 0.17% touch [kernel.kallsyms] [k] page_fault
+ 0.01% touch [kernel.kallsyms] [k] native_write_msr_safe
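(For reference, profiles like the above can be captured with something
along the following lines; these may not be the exact options I used:)

perf record -e cycles -o perf.first -- touch /data1/new_file_1
perf report -i perf.first

perf record -e cycles -o perf.second -- touch /data1/new_file_2
perf report -i perf.second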
I have also compared the memory usage; it seems that XFS has to load the
inode bmap blocks the first time, which takes a long time. Is that the
reason the first touch operation is so slow?
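One way I could try to confirm that (device and file names below are just
placeholders) is to watch the read traffic during the first touch and
compare the XFS slab caches afterwards:

iostat -x sda 1 > /tmp/iostat.log &
IOSTAT_PID=$!
time touch /data1/cold_file
kill $IOSTAT_PID

# XFS buffer/inode cache usage
grep -E 'xfs_buf|xfs_inode' /proc/slabinfo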
Thanks
Qiang
Post by Zhang Qiang
percent
Post by Greg Freemyer
Post by Dave Chinner
Post by Zhang Qiang
Post by Dave Chinner
Post by Zhang Qiang
free, so the free inode should not be too few.
Here's the detailed log, any new clue?
# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4 isize=256 agcount=4, agsize=142272384
Post by Greg Freemyer
Post by Dave Chinner
Post by Zhang Qiang
Yes.
Post by Dave Chinner
Post by Zhang Qiang
icount = 220619904
ifree = 26202919
And 220 million inodes. There's your problem - that's an average
of 55 million inodes per AGI btree assuming you are using inode64.
If you are using inode32, then the inodes will be in 2 btrees, or
maybe even only one.
You are right, all inodes stay in one AG.
BTW, why do all inodes stay in one AG for inode32 when I allocated 4 AGs?
Because the top addresses in the 2nd AG go over 32 bits, hence only
AG 0 can be used for inodes. Changing to inode64 will give you some
relief, but any time allocation occurs in AG0 it will be slow. i.e.
you'll be trading always slow for "unpredictably slow".
Post by Zhang Qiang
Post by Dave Chinner
With that many inodes, I'd be considering moving to 32 or 64 AGs to
keep the btree size down to a more manageable size. The free inode
btree would also help, but, really, 220M inodes in a 2TB filesystem
is really pushing the boundaries of sanity.....
So a better number of inodes per AG would be about 5M,
Not necessarily. But for your storage it's almost certainly going to
minimise the problem you are seeing.
Post by Zhang Qiang
Are there any documents about these options where I can learn more?
http://xfs.org/index.php/XFS_Papers_and_Documentation
Post by Greg Freemyer
Given the apparently huge number of small files, would he likely see a
big performance increase if he replaced that 2TB of rust with SSD?
Doubt it - the profiles showed the allocation being CPU bound
searching the metadata that indexes all those inodes. Those same
profiles showed all the signs that it was hitting the buffer
cache most of the time, too, which is why it was CPU bound....
Cheers,
Dave.
--
Dave Chinner