bad primary superblock - bad magic number !!!

Discussion:

stress_buster

2010-05-10 14:54:00 UTC

My HP ProLiant DL185 server hangs or crashes, and sometimes does not boot
correctly...

[***@localhost dev]# xfs_repair -n /dev/cciss/c0d0p1
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
.....................................................................................Sorry,
could not find valid secondary superblock
Exiting now.

[***@localhost dev]# xfs_repair -n /dev/cciss/c0d2
Phase 1 - find and verify superblock...
superblock read failed, offset 0, size 524288, ag 0, rval -1

fatal error -- Input/output error

[***@localhost dev]# xfs_db /dev/cciss/c0d0p1
xfs_db: /dev/cciss/c0d0p1 is not a valid XFS filesystem (unexpected SB magic
number 0x00000000)

The next time, the server didn't even boot up properly.

I managed to capture the messages and traces dumped to the console. See below.

end_request: I/O error, dev cciss/c0d2, sector 0
end_request: I/O error, dev cciss/c0d2, sector 0
end_request: I/O error, dev cciss/c0d2, sector 1
cciss: cmd f6c00000 has CHECK CONDITION sense key = 0x4
end_request: I/O error, dev cciss/c0d3, sector 0
cciss: cmd f6c00000 has CHECK CONDITION sense key = 0x4
end_request: I/O error, dev cciss/c0d3, sector 0
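
For reference, the sense key in the cciss lines above can be looked up in the standard SCSI sense key table, where 0x4 is HARDWARE ERROR. A minimal sketch of that lookup follows; sense_key_name is a hypothetical helper written for illustration, not part of the cciss driver or any HP tool.

```sh
#!/bin/sh
# Map a SCSI sense key (hex, e.g. 0x4) to its standard name.
# sense_key_name is a hypothetical helper, not part of the cciss driver.
sense_key_name() {
    # Shell arithmetic accepts C-style hex constants such as 0x4.
    case $(( $1 )) in
        0)  echo "NO SENSE" ;;
        1)  echo "RECOVERED ERROR" ;;
        2)  echo "NOT READY" ;;
        3)  echo "MEDIUM ERROR" ;;
        4)  echo "HARDWARE ERROR" ;;
        5)  echo "ILLEGAL REQUEST" ;;
        6)  echo "UNIT ATTENTION" ;;
        7)  echo "DATA PROTECT" ;;
        8)  echo "BLANK CHECK" ;;
        9)  echo "VENDOR SPECIFIC" ;;
        10) echo "COPY ABORTED" ;;
        11) echo "ABORTED COMMAND" ;;
        13) echo "VOLUME OVERFLOW" ;;
        14) echo "MISCOMPARE" ;;
        *)  echo "OTHER" ;;
    esac
}

sense_key_name 0x4   # prints: HARDWARE ERROR
```

A hardware-error sense key is reported by the device or controller, which is consistent with a problem below the filesystem layer, though by itself it does not say which component is at fault.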

Backtrace from SysRq-w:

SysRq : Show Blocked State
f7ad1e40 00203082 f7853b90 e54af7f0 e54af948 cba30e00 00000001 00000020
e5ad2250 00000000 000000ff e5ad2250 00000000 00000000 00000000 7fffffff
e55afe00 e55afd44 e55afe04 c05ab1c5 256e2000 00000000 e56e2000 00000000
Call Trace:
[<c05ab1c5>] schedule_timeout+0x13/0x86
[<c05ab095>] wait_for_common+0xb9/0x103
[<c021a4b6>] default_wake_function+0x0/0x8
[<c0409473>] cciss_ioctl+0x6fb/0xd1e
[<c0207852>] read_tsc+0x6/0x22
[<c02335a6>] getnstimeofday+0x4a/0xca
[<c023618a>] tick_dev_program_event+0x1e/0x8c
[<c026c316>] dput+0x31/0xf7
[<c026570c>] __link_path_walk+0x9fd/0xb2b
[<c038442f>] blkdev_driver_ioctl+0x4b/0x5b
[<c054420b>] igmp_rcv+0x38f/0x496
[<c0384ad6>] blkdev_ioctl+0x697/0x6e5
[<c054420b>] igmp_rcv+0x38f/0x496
[<c054420b>] igmp_rcv+0x38f/0x496
[<c027e02d>] do_open+0x1d9/0x258
[<c027e21a>] blkdev_open+0x0/0x4d
[<c027e23f>] blkdev_open+0x25/0x4d
[<c025c3a5>] __dentry_open+0x13b/0x212
[<c025c498>] nameidata_to_filp+0x1c/0x2c
[<c02667c3>] do_filp_open+0x350/0x64d
[<c023786c>] do_futex+0x8a/0x6ee
[<c024feaa>] handle_mm_fault+0x4e0/0x4ea
[<c054420b>] igmp_rcv+0x38f/0x496
[<c027d871>] block_ioctl+0x13/0x16
[<c027d85e>] block_ioctl+0x0/0x16
[<c026744c>] vfs_ioctl+0x1c/0x5d
[<c02676c6>] do_vfs_ioctl+0x239/0x247
[<c025c203>] do_sys_open+0xae/0xb6
[<c0267715>] sys_ioctl+0x41/0x58
[<c0203759>] sysenter_do_call+0x12/0x25
[<c054420b>] igmp_rcv+0x38f/0x496

Please help.

Thanks in advance,
David

--
View this message in context: http://old.nabble.com/bad-primary-superblock---bad-magic-number-%21%21%21-tp28512276p28512276.html
Sent from the Xfs - General mailing list archive at Nabble.com.

Eric Sandeen

2010-05-10 15:42:13 UTC


Post by stress_buster
my hp proliant DL185 server hangs/crashes and sometimes do not boot
correctly...
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!
attempting to find secondary superblock...
.....................................................................................Sorry,
could not find valid secondary superblock
Exiting now.
Phase 1 - find and verify superblock...
superblock read failed, offset 0, size 524288, ag 0, rval -1
fatal error -- Input/output error
xfs_db: /dev/cciss/c0d0p1 is not a valid XFS filesystem (unexpected SB magic
number 0x00000000)
The next time, server didnt even boot up alright
i've managed to capture the msgs & traces dumped to console. See below
end_request: I/O error, dev cciss/c0d2, sector 0
end_request: I/O error, dev cciss/c0d2, sector 0
end_request: I/O error, dev cciss/c0d2, sector 1
cciss: cmd f6c00000 has CHECK CONDITION sense key = 0x4
end_request: I/O error, dev cciss/c0d3, sector 0
cciss: cmd f6c00000 has CHECK CONDITION sense key = 0x4
end_request: I/O error, dev cciss/c0d3, sector 0

You seem to have serious storage problems that are not XFS related.

You'll need to get that resolved.

-Eric
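
For context on the xfs_db message quoted above: an XFS superblock begins with the 4-byte magic "XFSB" (0x58465342), so a reported magic of 0x00000000 means the first bytes of the partition read back as zeros. The sketch below shows one read-only way to look at those bytes. The helper names sb_magic and is_xfs_magic are hypothetical, and the demo runs on a scratch file rather than a real device.

```sh
#!/bin/sh
# Print the first 4 bytes of a device or image as 8 hex digits.
# sb_magic is a hypothetical helper; it only reads, never writes.
sb_magic() {
    dd if="$1" bs=4 count=1 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# Compare against the XFS superblock magic "XFSB" (0x58465342).
is_xfs_magic() {
    [ "$(sb_magic "$1")" = "58465342" ]
}

# Demo on a scratch file instead of a real device.
tmp=$(mktemp)
printf 'XFSB' > "$tmp"
is_xfs_magic "$tmp" && echo "magic ok"
printf '\0\0\0\0' > "$tmp"
is_xfs_magic "$tmp" || echo "bad magic: 0x$(sb_magic "$tmp")"
rm -f "$tmp"
```

On the server itself the argument would be the partition, for example sb_magic /dev/cciss/c0d0p1 run as root. If the underlying storage returns zeros or I/O errors at offset 0, no filesystem tool can be expected to find a superblock there.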

Leo Davis

2010-05-10 18:11:45 UTC


Many Thanks.

I agree. Each time I destroy and re-create the RAID, everything shows up as good, only for it to break again.
So I was wondering whether those traces point to anything. My prime suspect is the hard drives, but those XFS messages confused me.
Apologies for posting before carrying out more tests.

Thanks,
leo


Emmanuel Florac

2010-05-10 20:22:11 UTC


Post by Leo Davis
I agree. I destroy and re-create raid and everything would show up
GOOD, only for it to break again. So was wondering whether those
traces would point to anything.... my prime suspect is hard drives,
but those xfs msgs confused me.

Check each hard drive separately with the manufacturer's utility (SeaTools,
etc.). At least one of them must be seriously ill.

--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <***@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------

Leo Davis

2010-05-12 09:19:16 UTC


I haven't had much success testing the hard drives. I tried MHDD and SeaTools, with no luck yet.

Meanwhile I re-created the RAID, and everything shows up OK for now.

Previously the messages shown were:

end_request: I/O error, dev cciss/c0d2, sector 0
end_request: I/O error, dev cciss/c0d2, sector 0
end_request: I/O error, dev cciss/c0d2, sector 1

That seems to indicate that the problem is with the disk or array: it
is unable to read the beginning of the device.
So if I run "dd if=/dev/random of=/dev/cciss/c0d2", that should fail
and thereby confirm that the drive or array has issues. Does that make
sense?

thanks


Emmanuel Florac

2010-05-12 12:29:02 UTC


On Wed, 12 May 2010 02:19:16 -0700 (PDT)

Post by Leo Davis
havent had much success with testing the hard drives, tried mhdd &
seatools with no luck yet.

What do you mean? Do the tools report any problem with the drives?

Post by Leo Davis
So is I do a - dd if=/dev/random of=dev/cciss/c0d2 , that should fail
and therby confirm that the drive or array has issues...do i make any
sense here?

Uh, you should do it the other way around (read from the device instead
of writing to it) to avoid destroying the filesystem:

dd if=/dev/cciss/c0d2 of=/dev/null bs=131072

If no error occurs it should be OK.
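
The read-only check above can be wrapped so that the result does not depend on spotting an error message by eye, since dd exits with a non-zero status when a read fails. This is a minimal sketch; read_test is a hypothetical wrapper name, and the demo uses a scratch file so it can be run safely anywhere.

```sh
#!/bin/sh
# Read a whole device or file and report whether dd hit an error.
# read_test is a hypothetical wrapper; it never writes to its argument.
read_test() {
    if dd if="$1" of=/dev/null bs=131072 2>/dev/null; then
        echo "read OK: $1"
    else
        echo "read FAILED: $1"
        return 1
    fi
}

# Demo on a scratch file; on the server the argument would be a LUN
# such as /dev/cciss/c0d2 (needs root).
tmp=$(mktemp)
printf 'test data' > "$tmp"
read_test "$tmp"
rm -f "$tmp"
```

A clean pass only shows that every block was readable at that moment. It does not exercise writes, and an intermittent fault may not show up in a single run.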

Leo Davis

2010-05-12 12:45:03 UTC


Post by Leo Davis
So is I do a - dd if=/dev/random of=dev/cciss/c0d2 , that should fail
and therby confirm that the drive or array has issues...do i make any
sense here?
Post by Emmanuel Florac
Uh, you should try the other way around to avoid breaking the
filesystem :
dd if=/dev/cciss/c0d2 of=/dev/null bs=131072
If no error occurs it should be OK.

# dd if=/dev/cciss/c0d2 of=/dev/null bs=131072
796+1 records in
796+1 records out
#

It doesn't show any errors here.

Post by Leo Davis
havent had much success with testing the hard drives, tried mhdd &
seatools with no luck yet.
Post by Emmanuel Florac
What do you mean? Does the tools report any problem with the drives?

MHDD doesn't detect the drives, probably because of an issue with the chipset. I'm still looking for tools.

thanks


Leo Davis

2010-05-13 08:38:25 UTC


Post by Emmanuel Florac
Uh, you should try the other way around to avoid breaking the
filesystem :
dd if=/dev/cciss/c0d2 of=/dev/null bs=131072
If no error occurs it should be OK.

I did that on all 4 LUNs:

# dd if=/dev/cciss/c0d2 of=/dev/null bs=131072
796+1 records in
796+1 records out

# dd if=/dev/cciss/c0d0 of=/dev/null bs=131072
796+1 records in
796+1 records out

# dd if=/dev/cciss/c0d1 of=/dev/null bs=131072
68675509+1 records in
68675509+1 records out

# dd if=/dev/cciss/c0d3 of=/dev/null bs=131072
68675509+1 records in
68675509+1 records out
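
The record counts above also say how much of each LUN was actually read. Multiplying the number of full records by the block size gives a lower bound on the size, because the trailing "+1" is a partial record of unknown length. A minimal sketch of that arithmetic, with dd_full_bytes as a hypothetical helper name:

```sh
#!/bin/sh
# Bytes covered by the full records in a dd summary line "N+1 records in".
# dd_full_bytes is a hypothetical helper; the partial "+1" record is
# ignored, so the result is a lower bound on the device size.
dd_full_bytes() {
    echo $(( $1 * $2 ))
}

dd_full_bytes 796 131072        # c0d0 and c0d2: prints 104333312
dd_full_bytes 68675509 131072   # c0d1 and c0d3: prints 9001436315648
```

So c0d0 and c0d2 are each about 104 MB (99.5 MiB), while c0d1 and c0d3 are each about 9.0 TB (roughly 8.2 TiB), and all four were read end to end without dd reporting an error.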

I also had a serial cable attached to my P800 controller to capture any traces. This is what it picked up:

/dev/cciss/c0d0: [05/12 13:38:28]Int13 BIOS unit 0x81 = CISS LUN 0x0000004000000000
/dev/cciss/c0d0: [05/12 13:38:28]Int13 BIOS unit 0x82 = CISS LUN 0x0100004000000000
/dev/cciss/c0d0: [05/12 13:38:28]Int13 BIOS unit 0x83 = CISS LUN 0x0200004000000000
/dev/cciss/c0d0: [05/12 13:38:28]Int13 BIOS unit 0x84 = CISS LUN 0x0300004000000000
/dev/cciss/c0d0: [05/13 09:13:03]PR=030fefb8h D245 Op=1c PLErr=04 IopErr=30 S=00 STag=0x018d HashAddr=0x00e59c6c PLLog=0x31190000
/dev/cciss/c0d0: [05/13 09:21:04]Ctlr SCSI Request, Illegal CDB Opcode=0x3c
/dev/cciss/c0d0: [05/13 09:21:08]BadReq:CDB0-15=260008000000A200A000000000000000,LUN=00000000L00000000H
/dev/cciss/c0d0: [05/13 09:21:08]BadReq:CDB0-15=260009000000A200A000000000000000,LUN=00000000L00000000H
/dev/cciss/c0d0: [05/13 09:21:08]BadReq:CDB0-15=26000A000000A200A000000000000000,LUN=00000000L00000000H
/dev/cciss/c0d0: [05/13 09:21:08]BadReq:CDB0-15=26000B000000A200A000000000000000,LUN=00000000L00000000H
/dev/cciss/c0d0: [05/13 09:21:08]BadReq:CDB0-15=26000C000000A200A000000000000000
..the spew continues..

Any thoughts here?
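
To make the BadReq lines easier to compare, the 32 hex digits after "CDB0-15=" can be split into the 16 bytes of the command descriptor block, where byte 0 is the operation code. This is only a reading aid under that assumption; cdb_bytes and cdb_opcode are hypothetical helpers, and the sketch makes no claim about what the controller does with these requests.

```sh
#!/bin/sh
# Split the hex string after "CDB0-15=" into space-separated bytes.
# cdb_bytes is a hypothetical helper for reading the log, nothing more.
cdb_bytes() {
    echo "$1" | sed 's/../& /g; s/ $//'
}

# Byte 0 of a CDB is the operation code.
cdb_opcode() {
    echo "0x$(echo "$1" | cut -c1-2)"
}

cdb_bytes  260008000000A200A000000000000000
cdb_opcode 260008000000A200A000000000000000
```

Split this way, the logged requests differ only in byte 2, which counts up through 08, 09, 0A, 0B and 0C.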


Emmanuel Florac

2010-05-13 09:41:51 UTC


Post by Leo Davis
/dev/cciss/c0d0: [05/13
09:21:08]BadReq:CDB0-15=26000C000000A200A000000000000000 ..the spew
continues..
Any thoughts here?

If I understand correctly, c0d0 represents a drive (the first one).
Apparently this drive is dead, or close. You should probably ditch it.

Leo Davis

2010-05-13 11:00:33 UTC


Post by Emmanuel Florac
If I understand correctly, c0d0 represents a drive (the first one).
Apparently this drive is dead, or close. You should probably ditch it.

Nope, c0d0 means controller number (c0) and logical drive number (d0),
so it's 12 disks in 2 partitions, c0d0 and c0d1.
c0d0 holds configuration information.
I boot from a different device; the RAID set is used only for storing data.
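
The naming described above, with the p<N> partition suffix seen earlier in /dev/cciss/c0d0p1, can be pulled apart mechanically. A minimal sketch, assuming names of the form c<controller>d<logical drive>[p<partition>]; parse_cciss is a hypothetical helper, not a tool shipped with the driver.

```sh
#!/bin/sh
# Split a cciss device name of the form c<ctrl>d<drive>[p<part>].
# parse_cciss is a hypothetical helper based on the naming described above.
parse_cciss() {
    name=${1##*/}                      # drop any /dev/cciss/ prefix
    case $name in
        c[0-9]*d[0-9]*) ;;
        *) echo "not a cciss name: $1" >&2; return 1 ;;
    esac
    rest=${name#c}                     # e.g. 0d0p1
    ctrl=${rest%%d*}                   # digits before the "d"
    rest=${rest#*d}                    # e.g. 0p1
    drive=${rest%%p*}                  # digits before an optional "p"
    case $rest in
        *p*) part=${rest#*p} ;;
        *)   part=none ;;
    esac
    echo "controller=$ctrl logical_drive=$drive partition=$part"
}

parse_cciss /dev/cciss/c0d0p1   # prints: controller=0 logical_drive=0 partition=1
parse_cciss /dev/cciss/c0d2     # prints: controller=0 logical_drive=2 partition=none
```

The point for this thread is that each cXdY node is a logical drive presented by the controller, not a single physical disk, so an error on c0d0 cannot be pinned to one drive from the name alone.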

cheers


Emmanuel Florac

2010-05-13 14:03:20 UTC


Post by Leo Davis
nope, c0d0 represents ( ControllerNumber[c0] LogicalDriveNumber[d0] )
so its 12 disks in 2 partitions- c0d0 and c0d1
c0d0 holds configuration information
I boot from a different device, the raid set is used only for storing data.

Oh, OK. I don't understand how a whole array may generate errors. Maybe
the controller's bad then?

Stan Hoeppner

2010-05-13 17:23:13 UTC


Post by Leo Davis

Post by Emmanuel Florac
If I understand correctly, c0d0 represents a drive (the first one).
Apparently this drive is dead, or close. You should probably ditch it.

Which model SmartArray controller is this? Is it SCSI, SAS, or SATA?

If SAS or SATA, is there an expander in the enclosure?

What model is the external drive enclosure which houses the 12 drives?

--
Stan