Ralf Liebenow
2009-01-30 22:23:59 UTC
Hello!
I use XFS heavily on an incremental backup server (rsync's --link-dest option
creates hardlinks to unchanged files), so I have about 10 million files
on my TB hard disk. To remove old versions, a nightly "rm -rf" job deletes about
a million hardlinks/files.
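To make the workload concrete, here is a minimal Python sketch of the same pattern: snapshots that hard-link unchanged files, plus a pruning step that removes old snapshots. It is only an illustration, not the actual backup job, which uses rsync and "rm -rf". The function names (snapshot, prune, demo) and the directory names are hypothetical, and comparing file contents is a simplification of how rsync decides that a file is unchanged.

```python
import filecmp
import os
import shutil
import tempfile


def snapshot(source, dest, previous=None):
    """Copy `source` into `dest`, hard-linking files unchanged since `previous`."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.join(dest, rel), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(dest, rel, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)       # unchanged: another name for the same inode
            else:
                shutil.copy2(src, dst)  # new or changed: a real copy


def prune(backup_root, keep):
    """Delete all but the newest `keep` snapshots (names must sort by age)."""
    snaps = sorted(os.listdir(backup_root))
    doomed = snaps[:-keep] if keep > 0 else snaps
    for name in doomed:
        # Same effect as "rm -rf": every entry in the tree is unlinked.
        shutil.rmtree(os.path.join(backup_root, name))
    return doomed


def demo():
    """Take two snapshots, prune the older one, and report link counts."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "data")
        root = os.path.join(tmp, "backups")
        os.makedirs(src)
        os.makedirs(root)
        with open(os.path.join(src, "a.txt"), "w") as f:
            f.write("unchanged")
        with open(os.path.join(src, "b.txt"), "w") as f:
            f.write("v1")
        day1 = os.path.join(root, "day-1")
        day2 = os.path.join(root, "day-2")
        snapshot(src, day1)
        with open(os.path.join(src, "b.txt"), "w") as f:
            f.write("v2")
        snapshot(src, day2, previous=day1)
        linked = os.stat(os.path.join(day2, "a.txt")).st_nlink
        copied = os.stat(os.path.join(day2, "b.txt")).st_nlink
        removed = prune(root, keep=1)
        remaining = os.stat(os.path.join(day2, "a.txt")).st_nlink
        return linked, copied, removed, remaining


if __name__ == "__main__":
    print(demo())
```

Removing a snapshot mostly drops link counts rather than freeing data, but every removed name is still an inode and directory update, so deleting a million entries is a metadata-heavy job for the filesystem.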
After a while I had regular oopses, so I updated the system to make sure it is
on a current version.
It is now SuSE 11.1 64-bit with SuSE's kernel 2.6.27.7-9-default.
The server is a quad-core 64-bit Intel machine with 8 GB RAM.
(I have VMware Server 2 installed, so its modules show up in the kmsg output,
but the oops also happens without them.)
Now the "rm -rf" job sometimes oopses the kernel and gets stuck (there is no
other measurable I/O traffic on that system). /proc/kmsg gives:
cat /proc/kmsg
<0>general protection fault: 0000 [1] SMP
<0>last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
<4>CPU 3
<4>Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device binfmt_misc vmnet(N) vsock(N) vmci(N) vmmon(N) nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs microcode fuse loop dm_mod snd_hda_intel st r8169 snd_pcm snd_timer osst snd_page_alloc ppdev iTCO_wdt mii shpchp button rtc_cmos snd_hwdep pci_hotplug parport_pc rtc_core sky2 ohci1394 intel_agp rtc_lib snd i2c_i801 iTCO_vendor_support ieee1394 parport pcspkr i2c_core sg soundcore raid456 async_xor async_memcpy async_tx xor raid0 sd_mod crc_t10dif ehci_hcd uhci_hcd usbcore edd raid1 xfs fan ahci libata dock aic79xx scsi_transport_spi scsi_mod thermal processor thermal_sys hwmon
<4>Supported: No
<4>Pid: 5176, comm: xfssyncd Tainted: G 2.6.27.7-9-default #1
<4>RIP: 0010:[<ffffffff80230865>] [<ffffffff80230865>] __wake_up_common+0x29/0x76
<4>RSP: 0018:ffff880114df9d30 EFLAGS: 00010086
<4>RAX: 7fff8800255b8a70 RBX: ffff8800255b8a60 RCX: 0000000000000000
<4>RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff8800255b8a68
<4>RBP: ffff880114df9d60 R08: 7fff8800255b8a58 R09: 0000000000000282
<4>R10: 0000000000000002 R11: ffff8800255b87c0 R12: 0000000000000001
<4>R13: 0000000000000282 R14: ffff8800255b8a70 R15: 0000000000000000
<4>FS: 0000000000000000(0000) GS:ffff88012fba0ec0(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 00007f28d42a2000 CR3: 0000000124e34000 CR4: 00000000000006e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process xfssyncd (pid: 5176, threadinfo ffff880114df8000, task ffff88012bc1e0c0)
<4>Stack: 0000000300000000 ffff8800255b8a60 ffff8800255b8a68 0000000000000282
<4> ffff88012d802000 0000000000000001 ffff880114df9d90 ffffffff8023219a
<4> 0000000000000286 0000000000000000 ffff88006ef1d240 ffff88012aca3800
<4>Call Trace:
<4> [<ffffffff8023219a>] complete+0x38/0x4b
<4> [<ffffffffa00f5316>] xfs_iflush+0x73/0x2ab [xfs]
<4> [<ffffffffa010a7a2>] xfs_finish_reclaim+0x12a/0x168 [xfs]
<4> [<ffffffffa010a871>] xfs_finish_reclaim_all+0x91/0xcb [xfs]
<4> [<ffffffffa010925c>] xfs_syncsub+0x50/0x22b [xfs]
<4> [<ffffffffa0118a3a>] xfs_sync_worker+0x17/0x36 [xfs]
<4> [<ffffffffa01189d4>] xfssyncd+0x15d/0x1ac [xfs]
<4> [<ffffffff8025434d>] kthread+0x47/0x73
<4> [<ffffffff8020d7b9>] child_rip+0xa/0x11
<4>
<4>
<0>Code: c9 c3 55 48 89 e5 41 57 4d 89 c7 41 56 4c 8d 77 08 41 55 41 54 41 89 d4 53 48 83 ec 08 89 75 d4 89 4d d0 48 8b 47 08 4c 8d 40 e8 <49> 8b 40 18 48 8d 58 e8 eb 2d 45 8b 28 4c 89 f9 8b 55 d0 8b 75
<1>RIP [<ffffffff80230865>] __wake_up_common+0x29/0x76
<4> RSP <ffff880114df9d30>
<4>---[ end trace a069bd11f2b4e6ab ]---
It _always_ gets stuck at the same place, in "complete" called from xfssyncd, so I
don't think it is hardware related.
I also ran xfs_repair after every oops and reboot, so the filesystem itself
should be consistent.
I initially used the default settings for mkfs.xfs and mount. I now use different
settings but still get the same oops, so the settings seem to be unrelated.
What do you recommend? Has this bug already been addressed by one of the
hundreds of fixes I've seen on the mailing list? Should I try a stock 2.6.28
kernel?
Thanks in advance!
Ralf
--
theCode AG
HRB 78053, Amtsgericht Charlottenbg
USt-IdNr.: DE204114808
Vorstand: Ralf Liebenow, Michael Oesterreich, Peter Witzel
Aufsichtsratsvorsitzender: Wolf von Jaduczynski
Oranienstr. 10-11, 10997 Berlin [×]
fon +49 30 617 897-0 fax -10
***@theCo.de http://www.theCo.de