Discussion: XFS write cache flush policy
Lin Li
2012-12-06 08:51:15 UTC
Hi, guys. I recently suffered a huge data loss after a power cut on an XFS
partition. The problem was that I copied a lot of files (roughly 20GB) to
an XFS partition; then, 10 hours later, there was an unexpected power cut.
As a result, all these newly copied files disappeared as if they had never
been copied. I tried to check and repair the partition, but xfs_check
reports no error at all. So I guess the problem is that the metadata for
these files was all kept in the cache (64MB) and was never committed to
the hard disk.

What is the cache flush policy for XFS? Does it always reserve some fixed
space in the cache for metadata? I ask because I thought that, since I
copied such a huge amount of data, at least some of these files must have
been fully committed to the hard disk; the cache is only 64MB, after all.
But in reality all of them were lost. The only possibility I can think of
is that some part of the cache is reserved for metadata, so even when the
cache is completely filled, this part is never written to the disk. Am I
right?
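
Put differently: is the only safe pattern something like the following
sketch (paths are examples), where an explicit flush is issued after the
copy?

- snip -
# Copy, then force all dirty data and metadata out to stable storage.
# Only after 'sync' returns is the copy safe against a power cut.
cp -a /src/files /mnt/xfs/
sync
- snip -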
Matthias Schniedermeyer
2012-12-08 19:29:27 UTC
Post by Lin Li
Hi, guys. I recently suffered a huge data loss after a power cut on an XFS
partition. The problem was that I copied a lot of files (roughly 20GB) to
an XFS partition; then, 10 hours later, there was an unexpected power cut.
As a result, all these newly copied files disappeared as if they had never
been copied. I tried to check and repair the partition, but xfs_check
reports no error at all. So I guess the problem is that the metadata for
these files was all kept in the cache (64MB) and was never committed to
the hard disk.
What is the cache flush policy for XFS? Does it always reserve some fixed
space in the cache for metadata? I ask because I thought that, since I
copied such a huge amount of data, at least some of these files must have
been fully committed to the hard disk; the cache is only 64MB, after all.
But in reality all of them were lost. The only possibility I can think of
is that some part of the cache is reserved for metadata, so even when the
cache is completely filled, this part is never written to the disk. Am I
right?
I have had the same problem several times.

The latest was just an hour ago.
I was copying one HDD onto another: a plain rsync -a /src/ /tgt/. Both HDDs
are 3TB SATA drives in USB3 enclosures, with a dm-crypt layer in between.
About 45 minutes into the copy, the target HDD disconnected for a moment.
45 minutes means something over 200GB had been copied, each file being
about 900MB.
After remounting the filesystem there were exactly 0 files.

After that I started a "while true; do sync ; done" loop in the
background.
And just while I was writing this email the HDD disconnected a second
time. But this time the files up until the last 'sync' were retained.
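
(The loop really is just that one-liner; a slightly gentler sketch with a
bounded window, in case anyone wants to reproduce it:)

- snip -
#!/bin/sh
# Flush dirty data and metadata every 30 seconds, so that at most ~30s
# of writes can be lost when the device drops off the bus. The 30s
# interval is an example; my actual loop syncs back-to-back.
while true; do
    sync
    sleep 30
done
- snip -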

And something like this has happened to me at least half a dozen times in
the last few months. I think the first time was with kernel 3.5.X, when I
was actually booting into 3.6 with a plain "reboot" (the filesystem might
not have been unmounted cleanly); after the reboot, the changes of about
the last half hour were gone. E.g. I had renamed a directory about 15
minutes before I rebooted, and after the reboot the directory had its
old name back.

The kernel in all but (maybe) one case was between 3.6 and 3.6.2 (the
current one); the first time MIGHT have been something around 3.5.8, but
I'm not sure. The HDDs were connected either by plain SATA (AHCI) or by a
USB3 enclosure. All affected filesystems were/are behind a dm-crypt layer.
--
Matthias
Michael Monnerie
2012-12-08 19:40:07 UTC
Post by Matthias Schniedermeyer
I have had the same problem several times.
I'd like to chime in here with a similar issue: I have Linux on my
desktop, XFS as the home partition, 16GB RAM. One day my system froze, with
no chance to do a buffer flush via SysRq-S/U/B; I had to press the reset
button (no power off, just reset). Upon restart, *lots* of files were
gone, destroyed, etc., and my KDE desktop wouldn't work anymore. Luckily
I have backups and could restore - but this just shouldn't happen, and
it was *much* better with older kernels. What is the problem, that
metadata isn't written to disk occasionally? I was on 3.6.6 when that
happened, now on 3.6.8, so very recent.
--
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531
XING: https://www.xing.com/profile/Michael_Monnerie
Twitter: @MichaelMonnerie https://twitter.com/MichaelMonnerie
FaceBook: https://www.facebook.com/michael.monnerie
LinkedIn: http://lnkd.in/uGx6ug
Joe Landman
2012-12-08 19:51:11 UTC
Post by Michael Monnerie
Post by Matthias Schniedermeyer
I have had the same problem several times.
I'd like to chime in here with a similar issue: I have Linux on my
desktop, XFS as the home partition, 16GB RAM. One day my system froze, with
no chance to do a buffer flush via SysRq-S/U/B; I had to press the reset
button (no power off, just reset). Upon restart, *lots* of files were
gone, destroyed, etc., and my KDE desktop wouldn't work anymore. Luckily
I have backups and could restore - but this just shouldn't happen, and
it was *much* better with older kernels. What is the problem, that
metadata isn't written to disk occasionally? I was on 3.6.6 when that
happened, now on 3.6.8, so very recent.
I am not sure this is xfs specific ... I think I've had this problem on
ext3 in the 3.x (x >= 5) region ... though I am trying to disambiguate
this from an mdadm 3.2.6 bug.
Matthias Schniedermeyer
2012-12-08 19:53:04 UTC
Post by Michael Monnerie
Post by Matthias Schniedermeyer
I have had the same problem several times.
I'd like to chime in here with a similar issue: I have Linux on my
desktop, XFS as the home partition, 16GB RAM. One day my system froze, with
no chance to do a buffer flush via SysRq-S/U/B; I had to press the reset
button (no power off, just reset). Upon restart, *lots* of files were
gone, destroyed, etc., and my KDE desktop wouldn't work anymore. Luckily
I have backups and could restore - but this just shouldn't happen, and
it was *much* better with older kernels. What is the problem, that
metadata isn't written to disk occasionally? I was on 3.6.6 when that
happened, now on 3.6.8, so very recent.
Now that you say it, this is what happened one of the other times.
Luckily I had made a backup just before I rebooted, so after
xfs_repair'ing the partition (the only time I have had to repair anything
for as long as I have been using XFS) I restored my home directory from
backup to get my desktop into a usable state again.
--
Matthias
Lin Li
2012-12-09 07:19:37 UTC
I used to use JFS. JFS used to (and maybe still does) have the problem of
an indefinitely deferred cache flush, i.e. the write cache will not be
flushed until the "flush trigger" fires. But in the case of JFS, if I copy
a lot of reasonably large files, it seems that only the last one will have
part of its content remain in the cache. With XFS, it "seems" to me that
the filesystem dedicates some cache space to the metadata; even as new
data is loaded into the cache, the metadata part seems not to be flushed.

At least, in your case, xfs_repair can detect errors, but in my case, it
does not find anything.
Post by Matthias Schniedermeyer
[...]
Dave Chinner
2012-12-10 01:01:41 UTC
Post by Matthias Schniedermeyer
Post by Michael Monnerie
[...]
Now that you say it, this is what happened one of the other times.
Luckily I had made a backup just before I rebooted, so after
xfs_repair'ing the partition (the only time I have had to repair anything
for as long as I have been using XFS) I restored my home directory from
backup to get my desktop into a usable state again.
So the hard lockup caused your filesystem to be corrupted, and after
you ran repair lots of files were missing?

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

People, please be precise about what happened when reporting
problems. Random "me too" posts with ambiguous information in
them do not help solve problems.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Michael Monnerie
2012-12-10 20:14:47 UTC
Post by Michael Monnerie
I'd like to chime in here with a similar issue: I have Linux on my
desktop, XFS as the home partition, 16GB RAM. One day my system froze,
with no chance to do a buffer flush via SysRq-S/U/B; I had to press the
reset button (no power off, just reset). Upon restart, lots of files
were gone, destroyed, etc., and my KDE desktop wouldn't work anymore.
Similar problem yesterday: this time I could stop KDE and get a root
session. I killed all processes by hand, because one XFS partition was
still busy despite "lsof" not showing anything. At last there were only
a handful of processes left; when I killed "rsyncd", disk I/O suddenly
went to 100%. It looked like it was writing a lot of buffers; the disk
was in full activity for about 15s. I've seen this strange behaviour
before (long, heavy disk activity on reboot), but only now could I trace
it down to rsyncd.

I have rsyncd running here as a target; my server is backed up to it once
per night. So it's strange to see it holding "something" open. I've moved
the backup target dir to another partition now, to see whether I can catch
that behaviour again.
--
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531
XING: https://www.xing.com/profile/Michael_Monnerie
Twitter: @MichaelMonnerie https://twitter.com/MichaelMonnerie
FaceBook: https://www.facebook.com/michael.monnerie
LinkedIn: http://lnkd.in/uGx6ug
Dave Chinner
2012-12-10 00:58:20 UTC
Post by Matthias Schniedermeyer
Post by Lin Li
Hi, guys. I recently suffered a huge data loss after a power cut on an XFS
partition. [...]
I have had the same problem several times.
The latest was just an hour ago.
I was copying one HDD onto another: a plain rsync -a /src/ /tgt/. Both HDDs
are 3TB SATA drives in USB3 enclosures, with a dm-crypt layer in between.
About 45 minutes into the copy, the target HDD disconnected for a moment.
45 minutes means something over 200GB had been copied, each file being
about 900MB.
After remounting the filesystem there were exactly 0 files.
This sounds like an entirely different problem to what the OP
reported. Did the filesystem have an error returned? i.e. did it
shut down (what's in dmesg)? Did you run repair in between the
shutdown and remount? How many files in that 200GB of data?

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Basically, you have an IO error situation, and you have dm-crypt
in between, buffering an unknown amount of changes. In my experience,
data loss events are rarely filesystem problems when USB drives or
dm-crypt are involved...
Post by Matthias Schniedermeyer
After that I started a "while true; do sync ; done" loop in the
background.
And just while I was writing this email the HDD disconnected a second
time. But this time the files up until the last 'sync' were retained.
Exactly as I'd expect.
Post by Matthias Schniedermeyer
And something like this has happened to me at least half a dozen times in
the last few months. I think the first time was with kernel 3.5.X, when I
was actually booting into 3.6 with a plain "reboot" (the filesystem might
not have been unmounted cleanly); after the reboot, the changes of about
the last half hour were gone. E.g. I had renamed a directory about 15
minutes before I rebooted, and after the reboot the directory had its
old name back.
The kernel in all but (maybe) one case was between 3.6 and 3.6.2 (the
current one); the first time MIGHT have been something around 3.5.8, but
I'm not sure. The HDDs were connected either by plain SATA (AHCI) or by a
USB3 enclosure. All affected filesystems were/are behind a dm-crypt layer.
Given that dm-crypt is the common factor here, I'd start by ruling
that out. i.e. reproduce the problem without dm-crypt being used.
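
Something along these lines would do (a sketch; the device name and paths
are examples):

- snip -
# Reproducer sketch without dm-crypt; adjust the device and paths.
mkfs.xfs -f /dev/sdX1
mount /dev/sdX1 /mnt/test
rsync -a /src/ /mnt/test/ &
sleep 300                       # let a few hundred files be copied
# ... yank the USB cable here; the disk may reappear under a new name ...
wait                            # rsync exits with an I/O error
umount /mnt/test
mount /dev/sdX1 /mnt/test       # log recovery runs during this mount
find /mnt/test -type f | wc -l  # compare against what rsync reported
- snip -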

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Matthias Schniedermeyer
2012-12-10 09:12:39 UTC
Post by Dave Chinner
Post by Matthias Schniedermeyer
Post by Lin Li
Hi, guys. I recently suffered a huge data loss after a power cut on an XFS
partition. [...]
I have had the same problem several times. [...]
After remounting the filesystem there were exactly 0 files.
This sounds like an entirely different problem to what the OP
reported.
To me it sounds only like different timing.
Otherwise I don't see much difference between files vanishing after a few
hours (of inactivity) and after a few minutes (while still being active).
Post by Dave Chinner
Did the filesystem have an error returned?
No.
Post by Dave Chinner
i.e. did it shut down (what's in dmesg)?
There's not much XFS could have done after the block device vanished.
A disappearing/reappearing block device gets a new name because the old
name is still "in use"; the old block device only gets cleaned up after
umounting the filesystem and closing the dm-crypt device.

When the USB3 HDD disconnected, it reappeared a moment later under a new
name; it bounced between sdc <-> sdf.

In syslog it's a plain "USB disconnect, device number XX" message,
followed by the standard bombardment of new-device-found messages. In
between there are some error messages, but as it's practically a yanked
and replugged cable, a little complaining by the kernel is to be expected.
Post by Dave Chinner
Did you run repair in between the shutdown and remount?
No.

XFS (dm-3): Mounting Filesystem
XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): Ending recovery (logdev: internal)
Post by Dave Chinner
How many files in that 200GB of data?
At 0.9GB/file, at least 220.
Post by Dave Chinner
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
Basically, you have an IO error situation, and you have dm-crypt
in between, buffering an unknown amount of changes. In my experience,
data loss events are rarely filesystem problems when USB drives or
dm-crypt are involved...
I don't know the inner workings of dm-*, but shouldn't it behave
transparently and rely on the block layer for buffering?
Post by Dave Chinner
Post by Matthias Schniedermeyer
After that I started a "while true; do sync ; done" loop in the
background.
And just while I was writing this email the HDD disconnected a second
time. But this time the files up until the last 'sync' were retained.
Exactly as I'd expect.
Post by Matthias Schniedermeyer
[...]
Given that dm-crypt is the common factor here, I'd start by ruling
that out. i.e. reproduce the problem without dm-crypt being used.
That's a slight problem for me; practically everything I have is
encrypted.

Now that I think about it, maybe dm-crypt really is to blame; up until a
few months ago I was using loop-AES. After dm-crypt gained the capability
to emulate it, I moved over to dm-crypt, because the loop-AES support in
Debian got worse over time. I didn't have any problems until after I
moved to dm-crypt, but OTOH I'm not the only one using dm-crypt. But
OTOOH maybe not so many people use the loop-AES compatibility mode.
--
Matthias
Eric Sandeen
2012-12-10 20:54:47 UTC
Post by Matthias Schniedermeyer
Post by Dave Chinner
Post by Matthias Schniedermeyer
Post by Lin Li
Hi, guys. I recently suffered a huge data loss after a power cut on an XFS
partition. [...]
I have had the same problem several times. [...]
After remounting the filesystem there were exactly 0 files.
This sounds like an entirely different problem to what the OP
reported.
To me it sounds only like different timing.
Otherwise I don't see much difference between files vanishing after a few
hours (of inactivity) and after a few minutes (while still being active).
Post by Dave Chinner
Did the filesystem have an error returned?
No.
Post by Dave Chinner
i.e. did it shut down (what's in dmesg)?
There's not much XFS could have done after the block device vanished.
except to shut down...
Post by Matthias Schniedermeyer
A disappearing/reappearing block device gets a new name because the old
name is still "in use"; the old block device only gets cleaned up after
umounting the filesystem and closing the dm-crypt device.
When the USB3 HDD disconnected, it reappeared a moment later under a new
name; it bounced between sdc <-> sdf.
In syslog it's a plain "USB disconnect, device number XX" message,
followed by the standard bombardment of new-device-found messages. In
between there are some error messages, but as it's practically a yanked
and replugged cable, a little complaining by the kernel is to be expected.
Sure, but Dave asked if the filesystem shut down. XFS messages would
tell you that; *were* there messages from XFS in the log from the event?
Sometimes "a little complaining" can be quite informative. :)
Post by Matthias Schniedermeyer
Post by Dave Chinner
Did you run repair in between the shutdown and remount?
No.
XFS (dm-3): Mounting Filesystem
XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): Ending recovery (logdev: internal)
Post by Dave Chinner
How many files in that 200GB of data?
At 0.9GB/file, at least 220.
Post by Dave Chinner
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
Basically, you have an IO error situation, and you have dm-crypt
in between, buffering an unknown amount of changes. In my experience,
data loss events are rarely filesystem problems when USB drives or
dm-crypt are involved...
I don't know the inner workings of dm-*, but shouldn't it behave
transparently and rely on the block layer for buffering?
I think that's partly why Dave asked you to test it, to check
that theory ;)
Post by Matthias Schniedermeyer
Post by Dave Chinner
Post by Matthias Schniedermeyer
After that I started a "while true; do sync ; done" loop in the
background.
And just while I was writing this email the HDD disconnected a second
time. But this time the files up until the last 'sync' were retained.
Exactly as I'd expect.
Post by Matthias Schniedermeyer
[...]
Given that dm-crypt is the common factor here, I'd start by ruling
that out. i.e. reproduce the problem without dm-crypt being used.
That's a slight problem for me; practically everything I have is
encrypted.
But this is an external drive; you could run a similar test with unencrypted
data on a different hard drive, to try to get to the bottom of this
problem, right?

Thanks,
-Eric
Post by Matthias Schniedermeyer
[...]
Matthias Schniedermeyer
2012-12-10 21:45:11 UTC
Post by Eric Sandeen
Post by Matthias Schniedermeyer
[...]
Post by Dave Chinner
i.e. did it shut down (what's in dmesg)?
There's not much XFS could have done after the block device vanished.
except to shut down...
Which it eventually did.

This is everything from the "disconnect" up to the point where the syslog
got quiet again. It took XFS nearly a minute to realize the block device
had gone away. And the impression of "a moment" that stuck in my mind was
actually 30 seconds. That's slightly longer than "a moment". :-)

This is only from the first time.

- snip -
Dec 8 19:33:15 leeloo kernel: [4823478.632190] usb 2-4: USB disconnect, device number 9
Dec 8 19:33:25 leeloo kernel: [4823488.440268] quiet_error: 183252 callbacks suppressed
Dec 8 19:33:25 leeloo kernel: [4823488.440271] Buffer I/O error on device dm-5, logical block 116125685
Dec 8 19:33:25 leeloo kernel: [4823488.440272] lost page write due to I/O error on dm-5
Dec 8 19:33:25 leeloo kernel: [4823488.440274] Buffer I/O error on device dm-5, logical block 116125686
Dec 8 19:33:25 leeloo kernel: [4823488.440274] lost page write due to I/O error on dm-5
Dec 8 19:33:25 leeloo kernel: [4823488.440275] Buffer I/O error on device dm-5, logical block 116125687
Dec 8 19:33:25 leeloo kernel: [4823488.440276] lost page write due to I/O error on dm-5
Dec 8 19:33:25 leeloo kernel: [4823488.440277] Buffer I/O error on device dm-5, logical block 116125688
Dec 8 19:33:25 leeloo kernel: [4823488.440277] lost page write due to I/O error on dm-5
Dec 8 19:33:25 leeloo kernel: [4823488.440278] Buffer I/O error on device dm-5, logical block 116125689
Dec 8 19:33:25 leeloo kernel: [4823488.440279] lost page write due to I/O error on dm-5
Dec 8 19:33:25 leeloo kernel: [4823488.440280] Buffer I/O error on device dm-5, logical block 116125690
Dec 8 19:33:25 leeloo kernel: [4823488.440280] lost page write due to I/O error on dm-5
Dec 8 19:33:25 leeloo kernel: [4823488.440281] Buffer I/O error on device dm-5, logical block 116125691
Dec 8 19:33:25 leeloo kernel: [4823488.440282] lost page write due to I/O error on dm-5
Dec 8 19:33:25 leeloo kernel: [4823488.440282] Buffer I/O error on device dm-5, logical block 116125692
Dec 8 19:33:25 leeloo kernel: [4823488.440283] lost page write due to I/O error on dm-5
Dec 8 19:33:25 leeloo kernel: [4823488.440284] Buffer I/O error on device dm-5, logical block 116125693
Dec 8 19:33:25 leeloo kernel: [4823488.440284] lost page write due to I/O error on dm-5
Dec 8 19:33:25 leeloo kernel: [4823488.440285] Buffer I/O error on device dm-5, logical block 116125694
Dec 8 19:33:25 leeloo kernel: [4823488.440286] lost page write due to I/O error on dm-5
Dec 8 19:33:45 leeloo kernel: [4823509.007306] scsi 143:0:0:0: [sdc] Unhandled error code
Dec 8 19:33:45 leeloo kernel: [4823509.007308] scsi 143:0:0:0: [sdc]
Dec 8 19:33:45 leeloo kernel: [4823509.007309] Result: hostbyte=0x05 driverbyte=0x00
Dec 8 19:33:45 leeloo kernel: [4823509.007310] scsi 143:0:0:0: [sdc] CDB:
Dec 8 19:33:45 leeloo kernel: [4823509.007311] cdb[0]=0x2a: 2a 00 37 57 8e 00 00 00 f0 00
Dec 8 19:33:45 leeloo kernel: [4823509.007315] end_request: I/O error, dev sdc, sector 928484864
Dec 8 19:33:45 leeloo kernel: [4823509.007322] scsi 143:0:0:0: rejecting I/O to offline device
Dec 8 19:33:45 leeloo kernel: [4823509.007324] scsi 143:0:0:0: [sdc] killing request
Dec 8 19:33:45 leeloo kernel: [4823509.008018] scsi 143:0:0:0: [sdc] Unhandled error code
Dec 8 19:33:45 leeloo kernel: [4823509.008019] scsi 143:0:0:0: [sdc]
Dec 8 19:33:45 leeloo kernel: [4823509.008020] Result: hostbyte=0x01 driverbyte=0x00
Dec 8 19:33:45 leeloo kernel: [4823509.008021] scsi 143:0:0:0: [sdc] CDB:
Dec 8 19:33:45 leeloo kernel: [4823509.008021] cdb[0]=0x2a: 2a 00 37 57 8e f0 00 00 f0 00
Dec 8 19:33:45 leeloo kernel: [4823509.008024] end_request: I/O error, dev sdc, sector 928485104
Dec 8 19:33:45 leeloo kernel: [4823509.008032] quiet_error: 28666 callbacks suppressed
Dec 8 19:33:45 leeloo kernel: [4823509.008033] Buffer I/O error on device dm-5, logical block 116050587
Dec 8 19:33:45 leeloo kernel: [4823509.008033] lost page write due to I/O error on dm-5
Dec 8 19:33:45 leeloo kernel: [4823509.008035] Buffer I/O error on device dm-5, logical block 116050588
Dec 8 19:33:45 leeloo kernel: [4823509.008036] lost page write due to I/O error on dm-5
Dec 8 19:33:45 leeloo kernel: [4823509.008037] Buffer I/O error on device dm-5, logical block 116050589
Dec 8 19:33:45 leeloo kernel: [4823509.008037] lost page write due to I/O error on dm-5
Dec 8 19:33:45 leeloo kernel: [4823509.008038] Buffer I/O error on device dm-5, logical block 116050590
Dec 8 19:33:45 leeloo kernel: [4823509.008039] lost page write due to I/O error on dm-5
Dec 8 19:33:45 leeloo kernel: [4823509.008040] Buffer I/O error on device dm-5, logical block 116050591
Dec 8 19:33:45 leeloo kernel: [4823509.008040] lost page write due to I/O error on dm-5
Dec 8 19:33:45 leeloo kernel: [4823509.008041] Buffer I/O error on device dm-5, logical block 116050592
Dec 8 19:33:45 leeloo kernel: [4823509.008042] lost page write due to I/O error on dm-5
Dec 8 19:33:45 leeloo kernel: [4823509.008043] Buffer I/O error on device dm-5, logical block 116050593
Dec 8 19:33:45 leeloo kernel: [4823509.008043] lost page write due to I/O error on dm-5
Dec 8 19:33:45 leeloo kernel: [4823509.008044] Buffer I/O error on device dm-5, logical block 116050594
Dec 8 19:33:45 leeloo kernel: [4823509.008045] lost page write due to I/O error on dm-5
Dec 8 19:33:45 leeloo kernel: [4823509.008046] Buffer I/O error on device dm-5, logical block 116050595
Dec 8 19:33:45 leeloo kernel: [4823509.008046] lost page write due to I/O error on dm-5
Dec 8 19:33:45 leeloo kernel: [4823509.008047] Buffer I/O error on device dm-5, logical block 116050596
Dec 8 19:33:45 leeloo kernel: [4823509.008048] lost page write due to I/O error on dm-5
Dec 8 19:33:45 leeloo kernel: [4823509.224036] usb 2-4: new SuperSpeed USB device number 11 using xhci_hcd
Dec 8 19:33:45 leeloo kernel: [4823509.235665] usb 2-4: New USB device found, idVendor=174c, idProduct=5106
Dec 8 19:33:45 leeloo kernel: [4823509.235667] usb 2-4: New USB device strings: Mfr=2, Product=3, SerialNumber=1
Dec 8 19:33:45 leeloo kernel: [4823509.235669] usb 2-4: Product: AS2105
Dec 8 19:33:45 leeloo kernel: [4823509.235670] usb 2-4: Manufacturer: ASMedia
Dec 8 19:33:45 leeloo kernel: [4823509.235672] usb 2-4: SerialNumber: WD-WMXXXXXXXXXX
Dec 8 19:33:45 leeloo kernel: [4823509.236341] scsi145 : usb-storage 2-4:1.0
Dec 8 19:33:46 leeloo kernel: [4823510.238640] scsi 145:0:0:0: Direct-Access WDC WD30 EZRX-00DC0B0 80.0 PQ: 0 ANSI: 5
Dec 8 19:33:46 leeloo kernel: [4823510.238764] sd 145:0:0:0: Attached scsi generic sg2 type 0
Dec 8 19:33:46 leeloo kernel: [4823510.238916] sd 145:0:0:0: [sdf] Very big device. Trying to use READ CAPACITY(16).
Dec 8 19:33:46 leeloo kernel: [4823510.239036] sd 145:0:0:0: [sdf] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
Dec 8 19:33:46 leeloo kernel: [4823510.239275] sd 145:0:0:0: [sdf] Write Protect is off
Dec 8 19:33:46 leeloo kernel: [4823510.239278] sd 145:0:0:0: [sdf] Mode Sense: 23 00 00 00
Dec 8 19:33:46 leeloo kernel: [4823510.239511] sd 145:0:0:0: [sdf] No Caching mode page present
Dec 8 19:33:46 leeloo kernel: [4823510.239513] sd 145:0:0:0: [sdf] Assuming drive cache: write through
Dec 8 19:33:46 leeloo kernel: [4823510.239773] sd 145:0:0:0: [sdf] Very big device. Trying to use READ CAPACITY(16).
Dec 8 19:33:46 leeloo kernel: [4823510.240372] sd 145:0:0:0: [sdf] No Caching mode page present
Dec 8 19:33:46 leeloo kernel: [4823510.240374] sd 145:0:0:0: [sdf] Assuming drive cache: write through
Dec 8 19:33:47 leeloo kernel: [4823510.897149] sdf: sdf1
Dec 8 19:33:47 leeloo kernel: [4823510.897492] sd 145:0:0:0: [sdf] Very big device. Trying to use READ CAPACITY(16).
Dec 8 19:33:47 leeloo kernel: [4823510.898087] sd 145:0:0:0: [sdf] No Caching mode page present
Dec 8 19:33:47 leeloo kernel: [4823510.898089] sd 145:0:0:0: [sdf] Assuming drive cache: write through
Dec 8 19:33:47 leeloo kernel: [4823510.898090] sd 145:0:0:0: [sdf] Attached SCSI disk
Dec 8 19:33:50 leeloo kernel: [4823514.018803] quiet_error: 630666 callbacks suppressed
Dec 8 19:33:50 leeloo kernel: [4823514.018805] Buffer I/O error on device dm-5, logical block 161908073
Dec 8 19:33:50 leeloo kernel: [4823514.018806] lost page write due to I/O error on dm-5
Dec 8 19:33:50 leeloo kernel: [4823514.018808] Buffer I/O error on device dm-5, logical block 161908074
Dec 8 19:33:50 leeloo kernel: [4823514.018808] lost page write due to I/O error on dm-5
Dec 8 19:33:50 leeloo kernel: [4823514.018809] Buffer I/O error on device dm-5, logical block 161908075
Dec 8 19:33:50 leeloo kernel: [4823514.018810] lost page write due to I/O error on dm-5
Dec 8 19:33:50 leeloo kernel: [4823514.018811] Buffer I/O error on device dm-5, logical block 161908076
Dec 8 19:33:50 leeloo kernel: [4823514.018811] lost page write due to I/O error on dm-5
Dec 8 19:33:50 leeloo kernel: [4823514.018812] Buffer I/O error on device dm-5, logical block 161908077
Dec 8 19:33:50 leeloo kernel: [4823514.018813] lost page write due to I/O error on dm-5
Dec 8 19:33:50 leeloo kernel: [4823514.018814] Buffer I/O error on device dm-5, logical block 161908078
Dec 8 19:33:50 leeloo kernel: [4823514.018814] lost page write due to I/O error on dm-5
Dec 8 19:33:50 leeloo kernel: [4823514.018815] Buffer I/O error on device dm-5, logical block 161908079
Dec 8 19:33:50 leeloo kernel: [4823514.018815] lost page write due to I/O error on dm-5
Dec 8 19:33:50 leeloo kernel: [4823514.018816] Buffer I/O error on device dm-5, logical block 161908080
Dec 8 19:33:50 leeloo kernel: [4823514.018817] lost page write due to I/O error on dm-5
Dec 8 19:33:50 leeloo kernel: [4823514.018818] Buffer I/O error on device dm-5, logical block 161908081
Dec 8 19:33:50 leeloo kernel: [4823514.018818] lost page write due to I/O error on dm-5
Dec 8 19:33:50 leeloo kernel: [4823514.018819] Buffer I/O error on device dm-5, logical block 161908082
Dec 8 19:33:50 leeloo kernel: [4823514.018820] lost page write due to I/O error on dm-5
Dec 8 19:33:58 leeloo kernel: [4823521.715578] quiet_error: 85723 callbacks suppressed
Dec 8 19:33:58 leeloo kernel: [4823521.715581] Buffer I/O error on device dm-5, logical block 184699823
Dec 8 19:33:58 leeloo kernel: [4823521.715581] lost page write due to I/O error on dm-5
Dec 8 19:33:58 leeloo kernel: [4823521.715583] Buffer I/O error on device dm-5, logical block 184699824
Dec 8 19:33:58 leeloo kernel: [4823521.715584] lost page write due to I/O error on dm-5
Dec 8 19:33:58 leeloo kernel: [4823521.715585] Buffer I/O error on device dm-5, logical block 184699825
Dec 8 19:33:58 leeloo kernel: [4823521.715585] lost page write due to I/O error on dm-5
Dec 8 19:33:58 leeloo kernel: [4823521.715586] Buffer I/O error on device dm-5, logical block 184699826
Dec 8 19:33:58 leeloo kernel: [4823521.715587] lost page write due to I/O error on dm-5
Dec 8 19:33:58 leeloo kernel: [4823521.715588] Buffer I/O error on device dm-5, logical block 184699827
Dec 8 19:33:58 leeloo kernel: [4823521.715588] lost page write due to I/O error on dm-5
Dec 8 19:33:58 leeloo kernel: [4823521.715589] Buffer I/O error on device dm-5, logical block 184699828
Dec 8 19:33:58 leeloo kernel: [4823521.715590] lost page write due to I/O error on dm-5
Dec 8 19:33:58 leeloo kernel: [4823521.715591] Buffer I/O error on device dm-5, logical block 184699829
Dec 8 19:33:58 leeloo kernel: [4823521.715591] lost page write due to I/O error on dm-5
Dec 8 19:33:58 leeloo kernel: [4823521.715592] Buffer I/O error on device dm-5, logical block 184699830
Dec 8 19:33:58 leeloo kernel: [4823521.715592] lost page write due to I/O error on dm-5
Dec 8 19:33:58 leeloo kernel: [4823521.715593] Buffer I/O error on device dm-5, logical block 184699831
Dec 8 19:33:58 leeloo kernel: [4823521.715594] lost page write due to I/O error on dm-5
Dec 8 19:33:58 leeloo kernel: [4823521.715595] Buffer I/O error on device dm-5, logical block 184699832
Dec 8 19:33:58 leeloo kernel: [4823521.715595] lost page write due to I/O error on dm-5
Dec 8 19:34:03 leeloo kernel: [4823526.789092] quiet_error: 322000 callbacks suppressed
Dec 8 19:34:03 leeloo kernel: [4823526.789095] Buffer I/O error on device dm-5, logical block 184786877
Dec 8 19:34:03 leeloo kernel: [4823526.789095] lost page write due to I/O error on dm-5
Dec 8 19:34:03 leeloo kernel: [4823526.789097] Buffer I/O error on device dm-5, logical block 184786878
Dec 8 19:34:03 leeloo kernel: [4823526.789098] lost page write due to I/O error on dm-5
Dec 8 19:34:03 leeloo kernel: [4823526.789099] Buffer I/O error on device dm-5, logical block 184786879
Dec 8 19:34:03 leeloo kernel: [4823526.789099] lost page write due to I/O error on dm-5
Dec 8 19:34:03 leeloo kernel: [4823526.789100] Buffer I/O error on device dm-5, logical block 184786880
Dec 8 19:34:03 leeloo kernel: [4823526.789101] lost page write due to I/O error on dm-5
Dec 8 19:34:03 leeloo kernel: [4823526.789101] Buffer I/O error on device dm-5, logical block 184786881
Dec 8 19:34:03 leeloo kernel: [4823526.789102] lost page write due to I/O error on dm-5
Dec 8 19:34:03 leeloo kernel: [4823526.789103] Buffer I/O error on device dm-5, logical block 184786882
Dec 8 19:34:03 leeloo kernel: [4823526.789103] lost page write due to I/O error on dm-5
Dec 8 19:34:03 leeloo kernel: [4823526.789104] Buffer I/O error on device dm-5, logical block 184786883
Dec 8 19:34:03 leeloo kernel: [4823526.789105] lost page write due to I/O error on dm-5
Dec 8 19:34:03 leeloo kernel: [4823526.789106] Buffer I/O error on device dm-5, logical block 184786884
Dec 8 19:34:03 leeloo kernel: [4823526.789106] lost page write due to I/O error on dm-5
Dec 8 19:34:03 leeloo kernel: [4823526.789107] Buffer I/O error on device dm-5, logical block 184786885
Dec 8 19:34:03 leeloo kernel: [4823526.789108] lost page write due to I/O error on dm-5
Dec 8 19:34:03 leeloo kernel: [4823526.789109] Buffer I/O error on device dm-5, logical block 184786886
Dec 8 19:34:03 leeloo kernel: [4823526.789109] lost page write due to I/O error on dm-5
Dec 8 19:34:07 leeloo kernel: [4823530.941221] XFS (dm-5): metadata I/O error: block 0x8 ("xfs_buf_iodone_callbacks") error 19 numblks 8
Dec 8 19:34:07 leeloo kernel: [4823530.970765] XFS (dm-5): metadata I/O error: block 0xaea85300 ("xlog_iodone") error 19 numblks 64
Dec 8 19:34:07 leeloo kernel: [4823530.970768] XFS (dm-5): xfs_do_force_shutdown(0x2) called from line 1074 of file /xssd/usr_src/linux/fs/xfs/xfs_log.c. Return address = 0xffffffff8128ee79
Dec 8 19:34:07 leeloo kernel: [4823530.970904] XFS (dm-5): Log I/O Error Detected. Shutting down filesystem
Dec 8 19:34:07 leeloo kernel: [4823530.970906] XFS (dm-5): xfs_log_force: error 5 returned.
Dec 8 19:34:07 leeloo kernel: [4823530.970906] XFS (dm-5): Please umount the filesystem and rectify the problem(s)
Dec 8 19:34:07 leeloo kernel: [4823530.971033] XFS (dm-5): metadata I/O error: block 0xaea85340 ("xlog_iodone") error 19 numblks 64
Dec 8 19:34:07 leeloo kernel: [4823530.971034] XFS (dm-5): xfs_do_force_shutdown(0x2) called from line 1074 of file /xssd/usr_src/linux/fs/xfs/xfs_log.c. Return address = 0xffffffff8128ee79
Dec 8 19:34:07 leeloo kernel: [4823530.971158] XFS (dm-5): metadata I/O error: block 0xaea85380 ("xlog_iodone") error 19 numblks 64
Dec 8 19:34:07 leeloo kernel: [4823530.971159] XFS (dm-5): xfs_do_force_shutdown(0x2) called from line 1074 of file /xssd/usr_src/linux/fs/xfs/xfs_log.c. Return address = 0xffffffff8128ee79
Dec 8 19:34:07 leeloo kernel: [4823530.971208] XFS (dm-5): metadata I/O error: block 0xaea853c0 ("xlog_iodone") error 19 numblks 64
Dec 8 19:34:07 leeloo kernel: [4823530.971209] XFS (dm-5): xfs_do_force_shutdown(0x2) called from line 1074 of file /xssd/usr_src/linux/fs/xfs/xfs_log.c. Return address = 0xffffffff8128ee79
Dec 8 19:34:07 leeloo kernel: [4823531.243692] XFS (dm-5): xfs_log_force: error 5 returned.
Dec 8 19:34:07 leeloo kernel: [4823531.243699] XFS (dm-5): xfs_do_force_shutdown(0x1) called from line 1160 of file /xssd/usr_src/linux/fs/xfs/xfs_buf.c. Return address = 0xffffffff8123f23f
- snip -

There were also a second, third and fourth time with that HDD/enclosure.
The third one was actually "interesting": I had to reboot the computer
to recover from it. Reboot in that case also meant that the kernel got
updated, to 3.6.9. And the fourth time was also kind of interesting,
because the machine spontaneously rebooted.

After that the copy went through. And another copy of >2TB from another
set of HDDs went through that night without a hitch.
The verify run of the above HDD also went through without a hitch.
Post by Eric Sandeen
Post by Matthias Schniedermeyer
A disappearing/reappearing block device gets a new name because the old
name is still "in use"; the old block device only gets cleaned up after
umounting the filesystem and closing the dm-crypt device.
When the USB3 HDD disconnected, it reappeared a moment later under a new
name; it bounced between sdc <-> sdf.
In syslog it's a plain "USB disconnect, device number XX" message,
followed by the standard bombardment of new-device-found messages. In
between there are some error messages, but as it's practically a yanked
and replugged cable, a little complaining by the kernel is to be expected.
Sure, but Dave asked if the filesystem shut down. XFS messages would
tell you that; *were* there messages from XFS in the log from the event?
Sometimes "a little complaining" can be quite informative. :)
OK. See above.
Post by Eric Sandeen
Post by Matthias Schniedermeyer
Post by Dave Chinner
Did you run repair in between the shutdown and remount?
No.
XFS (dm-3): Mounting Filesystem
XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): Ending recovery (logdev: internal)
Post by Dave Chinner
How many files in that 200GB of data?
At 0.9GB/file, at least 220.
Post by Dave Chinner
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
Basically, you have an IO error situation, and you have dm-crypt
in between, buffering an unknown amount of changes. In my experience,
data loss events are rarely filesystem problems when USB drives or
dm-crypt are involved...
I don't know the inner workings of dm-*, but shouldn't it behave
transparently and rely on the block layer for buffering?
I think that's partly why Dave asked you to test it, to check
that theory ;)
Currently I'm in the process of replacing a bunch of HDDs, so I won't
get to that for at least a few days. And even then I can't test it
EXACTLY, because I don't (at the moment) have any free HDDs identical to
the one that was part of the above log messages.

But I can test one of the old HDDs before I throw them out, with the
exact enclosure that was part of the above log messages.
Post by Eric Sandeen
Post by Matthias Schniedermeyer
Post by Dave Chinner
Post by Matthias Schniedermeyer
After that I started a "while true; do sync ; done" loop in the
background.
And just while I was writing this email the HDD disconnected a second
time. But this time the files up until the last 'sync' were retained.
Exactly as I'd expect.
Post by Matthias Schniedermeyer
[...]
Given that dm-crypt is the common factor here, I'd start by ruling
that out. i.e. reproduce the problem without dm-crypt being used.
That's a slight problem for me; practically everything I have is
encrypted.
But this is an external drive; you could run a similar test with unencrypted
data on a different hard drive, to try to get to the bottom of this
problem, right?
Will try, but I guess I will have to "emulate" the disconnect by
physically yanking out the cable; it's not like random errors are
predictable. ;-)
--
Matthias
Dave Chinner
2012-12-11 00:25:54 UTC
Post by Matthias Schniedermeyer
Post by Dave Chinner
Did you run repair in between the shutdown and remount?
No.
XFS (dm-3): Mounting Filesystem
XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): Ending recovery (logdev: internal)
Post by Dave Chinner
How many files in that 200GB of data?
At 0.9GB/file at least 220.
Post by Dave Chinner
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
So that I can make some sense of the errors, can you please post the
rest of the information that link asks for?

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Matthias Schniedermeyer
2012-12-11 08:46:06 UTC
Post by Dave Chinner
Post by Matthias Schniedermeyer
Post by Dave Chinner
Did you run repair in between the shutdown and remount?
No.
XFS (dm-3): Mounting Filesystem
XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): Ending recovery (logdev: internal)
Post by Dave Chinner
How many files in that 200GB of data?
At 0.9GB/file at least 220.
Post by Dave Chinner
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
So that I can make some sense of the errors, can you please post the
rest of the information that link asks for?
- kernel version (uname -a)
For the reported case:
Linux zz 3.6.2 #128 SMP x86_64 GNU/Linux
Userspace is 32bit.

- xfsprogs version (xfs_repair -V)
xfs_repair version 3.1.7

- number of CPUs
1 Socket, 4 cores, HT (Core i7 3770)

- contents of /proc/meminfo
Too late for that.
- contents of /proc/mounts
Too late for that 2:
But as all the externally mounted HDDs are mounted identically, it would
look like this:
/dev/mapper/zz /misc/zz xfs rw,noatime,attr2,allocsize=64k,noquota 0 0

- contents of /proc/partitions
Too late for that 3:

The relevant entries would have been like this:
8 80 2930266584 sdf
8 81 2147482623 sdf1
254 5 2930266580 dm-5

- RAID layout (hardware and/or software)
- LVM configuration
No RAID or LVM.
A dm-crypt layer, though:

keyfile=<file with the 65 encryption keys in clear text>
keysize=128
offset=8
device=/dev/sdf
map=zz
cryptsetup loopaesOpen --key-file=$keyfile --key-size $keysize --offset $offset $device $map

I don't use a partition but the whole disc except for the first 8
sectors/4k.

- type of disks you are using
Western Digital WD30EZRX SATA drive.
Connected by SuperSpeed USB3-enclosure.

- write cache status of drives
This is from another one; it's an identical model & enclosure:
write-caching = 1 (on)
Interesting that hdparm -W works over USB.

- size of BBWC and mode it is running in
None

- xfs_info output on the filesystem in question
Technically this is from another drive, but I had formatted it with
identical options:
mkfs.xfs -l size=1024b -s size=4096 ...

- snip -
meta-data=/dev/mapper/zz         isize=256    agcount=32, agsize=22892708 blks
         =                       sectsz=4096  attr=2
data     =                       bsize=4096   blocks=732566645, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=1024, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
- snip -

- dmesg output showing all error messages and stack traces
See previous mail.
--
Matthias
Matthias Schniedermeyer
2012-12-14 11:19:24 UTC
Post by Eric Sandeen
Post by Matthias Schniedermeyer
Post by Dave Chinner
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
Basically, you have an IO error situation, and you have dm-crypt
in between, buffering an unknown amount of changes. In my experience,
data loss events are rarely filesystem problems when USB drives or
dm-crypt are involved...
I don't know the inner workings of dm-*, but shouldn't it behave
transparently and rely on the block layer for buffering?
I think that's partly why Dave asked you to test it, to check
that theory ;)
To test that theory.

Technically this is a different machine from the original, but I tried to
recreate as much of the original circumstances as possible.
Kernel is 3.6.7.

First I recreated the circumstances.
I plugged an HDD I'm throwing out into the enclosure that was the most
problematic, created the dm-crypt layer & filesystem as reported, and
started copying.

In all testes I didn't supply any mount-options!

1)
After a few minutes I "emulated" the problem by unplugging the cable.
At that point about 40 files had been copied, but only 25 were there after
I replugged the cable.

2)
BUT the directory structure had changed in the meantime; the first 22
files were in another directory I didn't have the first time. In the
first test all >=200 files were in the same directory.

So I retested by copying just the directory with which I had my original
trouble.
This time I used a timer: after a little over 5 minutes, 23 files had been
copied, and after replugging only the same 3 files as in the first try
were retained.

3)
This time I ditched the dm-crypt layer.
I mkfs'ed with the same parameters on a plain 100GB partition.

I copied the same files as in 2); after 5 minutes, 24 files had been
copied, and after re-plugging the same 3 files were retained.


At this point the amateur in me says: dm-crypt is "transparent".

A new kernel was released, so I retried with 3.7.0 on a plain partition.

4)
Same as 3)

The only difference is that 3.7.0 appears to be much quicker to pass on
the error: the rsync process was "happily" proceeding under 3.6.7 until I
manually cancelled it a few seconds after unplugging the cable.
With 3.7.0 it immediately stopped with an Input/Output error.

5)
Same as 3/4)

A second before unplugging I 'ls -l'ed the directory; all files copied
so far were visible at that point.

6)
Same as 5)

But this time I issued a 'sync' at about the halfway point.
This time a total of 13 files were retained; an ls -l just before the
sync showed 12 files. But the sync took 20 seconds, so the 13th file
must have been completed between the start and the finish of the sync
command.


In conclusion, the amateur in me says:
the data is never sent to the drive. Note that none of these tests
include a power failure, only a connection failure.
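
(If I wanted to bound the loss to a single file rather than relying on a
background sync loop, the obvious sketch would be a per-file flush; paths
are examples:)

- snip -
#!/bin/sh
# Copy one file at a time and sync after each one, so that at most the
# file currently in flight can be lost on a disconnect. Paths are
# examples; assumes /src/dir contains only regular files.
for f in /src/dir/*; do
    cp -- "$f" /mnt/test/dir/ && sync
done
- snip -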
--
Matthias
Ric Wheeler
2012-12-14 13:57:11 UTC
Post by Matthias Schniedermeyer
Post by Eric Sandeen
Post by Matthias Schniedermeyer
Post by Dave Chinner
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
Basically, you have an IO error situation, and you have dm-crypt
in between, buffering an unknown amount of changes. In my experience,
data loss events are rarely filesystem problems when USB drives or
dm-crypt are involved...
I don't know the inner workings of dm-*, but shouldn't it behave
transparently and rely on the block layer for buffering?
I think that's partly why Dave asked you to test it, to check
that theory ;)
To test that theory.
Technically this is a different machine from the original, but I tried to
recreate as much of the original circumstances as possible.
Kernel is 3.6.7.
First I recreated the circumstances.
I plugged an HDD I'm throwing out into the enclosure that was the most
problematic, created the dm-crypt layer & filesystem as reported, and
started copying.
In all testes I didn't supply any mount-options!
1)
After a few minutes I "emulated" the problem by unplugging the cable.
At that point about 40 files had been copied, but only 25 were there after
I replugged the cable.
Just a note - depending on the drive and its firmware, unplugging a cable
is *not* the same as a power loss, since the firmware detects the loss of
link and immediately writes back any volatile cache data to the platter
(and it still has power, so that is easy for it to do :)).

You really should drop power to the enclosure to get a "mean" test :)
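
And if you want to know how much the drive could be holding back, hdparm
can query (and, for a test, disable) the volatile write cache, assuming
the enclosure passes the commands through - as yours apparently does:

- snip -
hdparm -W /dev/sdX      # report the current write-cache setting
hdparm -W 0 /dev/sdX    # disable the volatile write cache for a test
hdparm -W 1 /dev/sdX    # re-enable it afterwards
- snip -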

Ric
Post by Matthias Schniedermeyer
[...]
Matthias Schniedermeyer
2012-12-14 16:19:35 UTC
Permalink
Post by Ric Wheeler
Post by Matthias Schniedermeyer
Post by Eric Sandeen
Post by Matthias Schniedermeyer
Post by Dave Chinner
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
Basically, you have an IO error situation, and you have dm-crypt
in-between buffering an unknown amount of changes. In my experience,
data loss events are rarely filesystem problems when USB drives or
dm-crypt is involved...
I don't know the inner workings of dm-*, but shouldn't it behave
transparently and rely on the block-layer for buffering?
I think that's partly why Dave asked you to test it, to check
that theory ;)
To test that theory.
Technically this is a different machine from the original, but i tried
to recreate the original circumstances as closely as possible.
Kernel is 3.6.7
First i recreated the circumstances.
I plugged a HDD i'm throwing out into the enclosure that was the most
problematic, created the dm-crypt-layer & filesystem as reported and
started copying.
In all testes i didn't supply any mount-options!
1)
After a few minutes i "emulated" the problem by unplugging the cable.
At that point about 40 files were copied, but only 25 were there after
i replugged the cable.
Just a note - depending on the drive and its firmware, unplugging a
cable is *not* the same as a power loss since the firmware detects
the loss of link and immediately writes back any volatile cache data
to platter (and it has power, so that is easy for it to do :)).
You really should drop power to the enclosure to get a "mean" test :)
The power-loss case is irrelevant for me.

I know what i can expect from a power loss: all my drives have
'write-cache on' and (up to) 64MB of cache. As a disabled write cache
isn't an option (performance sucks), the ensuing data loss would be
expected, and there are procedures in place to prevent any permanent
data losses (except for the right 2 discs dying at the same time).
--
Matthias
Dave Chinner
2012-12-15 22:16:22 UTC
Permalink
Post by Matthias Schniedermeyer
Post by Eric Sandeen
Post by Matthias Schniedermeyer
Post by Dave Chinner
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
Basically, you have an IO error situation, and you have dm-crypt
in-between buffering an unknown amount of changes. In my experience,
data loss events are rarely filesystem problems when USB drives or
dm-crypt is involved...
I don't know the inner workings of dm-*, but shouldn't it behave
transparently and rely on the block-layer for buffering?
I think that's partly why Dave asked you to test it, to check
that theory ;)
To test that theory.
Technically this is a different machine from the original, but i tried
to recreate the original circumstances as closely as possible.
Kernel is 3.6.7
First i recreated the circumstances.
I plugged a HDD i'm throwing out into the enclosure that was the most
problematic, created the dm-crypt-layer & filesystem as reported and
started copying.
In all testes i didn't supply any mount-options!
That's one for the spell-check fail file.... :P
Post by Matthias Schniedermeyer
1)
After a few minutes i "emulated" the problem by unplugging the cable.
At that point about 40 files were copied, but only 25 were there after
i replugged the cable.
Ok, so you've basically got a situation where there is a single
directory block being modified repeatedly as new files are created.
Which means that there are a significant number of changes being
aggregated in memory that may not have been written to the log. When
you pull the cable, those changes are lost because they can't be
written to disk any more.
Post by Matthias Schniedermeyer
2)
BUT the directory-structure had changed in the meantime: the first 22
files were in another directory i didn't have the first time.
I don't understand what you are saying here. Can you please add "ls
-l" output of the directory structure before and after so that I can
observe exactly what you are trying to describe?
Post by Matthias Schniedermeyer
In the
first test all >=200 files were in the same directory.
So i retested by just copying the directory with which i had my original
trouble.
This time i used a timer, and after a little over 5 minutes 23 files were
copied; after replugging, only the same 3 files as from the first try
were retained.
Which usually means one of two things:
1. the metadata changes never got written to the log; or
2. log recovery discarded them.

.....
Post by Matthias Schniedermeyer
6)
Same as 5)
But this time i issued a 'sync' at about the halfway-point.
This time a total of 13 files were retained; an ls -l just before the
sync showed 12 files. But the sync took 20 seconds, so the 13th file
must have been completed between the start and finish of the sync
command.
Which doesn't rule out either possibility. Log recovery can discard
transactions for many reasons, one of them being a 19 year
old bug that isn't yet fixed in 3.7 (only just merged into
3.8-rc).....

Given that the same directory blocks are being continually
rewritten, it's entirely possible that this is occurring here - they
keep getting moved forward in the log, past the push target that is
set every 30s.

What you need to do is mount the filesystem after replugging it with
the mount options "-o ro,norecovery" to see what is on disk before
log recovery is run. If the files are all there, then it's a log
write or log recovery problem. If the files are not present, then the
metadata has not been written to disk and they aren't in the log,
either.
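
Something like this minimal sketch, with the device and mountpoint from
the earlier tests:

mount -o ro,norecovery /dev/sdb1 /mnt   # on-disk state, no log replay
ls -lR /mnt
umount /mnt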

If the files are on disk prior to log recovery running, then you need
to dump the log to a file using xfs_logprint and send it to me so I
can analyse the content of the log.
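
For example (the output filename is only an illustration):

xfs_logprint -C /tmp/sdb1-log.dump /dev/sdb1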

If the files are not in the directory on disk before recovery, then
I suspect we are going to need an event trace to determine the
sequence of events leading up to the unplug event. Having a copy of the
log would be handy in that case, too.
Post by Matthias Schniedermeyer
The data is never sent to the drive, as all these tests DON'T include a
power failure, only a connection failure.
<sigh>

Like most amateurs you've jumped to the obvious conclusion without
considering all the other possibilities that could give the same
result.

There's a reason that we ask for specific information in
bug reports - paraphrasing or describing problems in words is
ambiguous and tainted by your perception of what the problem is. Not
to mention that what a user thinks is irrelevant is often
exactly the critical detail an expert is looking to find.

What I'm saying is that playing the "armchair expert" simply makes
it harder for the real experts (us) to understand what your problem
is. You may be right in the end that metadata hasn't been written,
but we have to understand *why* the metadata wasn't written to be
able to fix the problem, and that takes a whole lot more analysis
than just guessing...

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Matthias Schniedermeyer
2012-12-16 10:30:25 UTC
Permalink
Post by Dave Chinner
Like most amateurs you've jumped to the obvious conclusion without
considering all the other possibilities that could give the same
result.
I have a test-case, so you can cut out the amateur.
Partition is 100GB at the beginning of a 1.5TB SATA HDD connected via a
USB3 enclosure:
Machine has 4GB of memory and is running a vanilla 3.7.0 kernel.

mkfs.xfs -l size=1024b -s size=4096 /dev/sdb1
mount /dev/sdb1 /mnt
mkdir /mnt/a
cd /mnt/a
for dat in `seq 1 40`; do dd if=/dev/zero of=$dat bs=1k count=900k ; done

Then i started a timer and waited for 5 minutes.
Then i yanked out the cable, my machine was writing to the 24th file at
that point.

umount /mnt
<replug cable>
mount /dev/sdb1 /mnt
ls -l /mnt
<In words: Nothing there, not even a>

I tried it 3 times, all times with the same result that not even "a" was
there.

In the last run i did an `ls -l /mnt/a` at the 4-minute point; it showed
several files with 900MB.
--
Matthias
Matthias Schniedermeyer
2012-12-16 11:10:46 UTC
Permalink
Post by Matthias Schniedermeyer
Post by Dave Chinner
Like most amateurs you've jumped to the obvious conclusion without
considering all the other possibilities that could give the same
result.
I have a test-case, so you can cut out the amateur.
Partition is 100GB at the beginning of a 1.5TB SATA HDD connected by
Machine has 4GB of memory and is running a vanilla 3.7.0 kernel.
mkfs.xfs -l size=1024b -s size=4096 /dev/sdb1
mount /dev/sdb1 /mnt
mkdir /mnt/a
cd /mnt/a
for dat in `seq 1 40`; do dd if=/dev/zero of=$dat bs=1k count=900k ; done
Then i started a timer and waited for 5 minutes.
Then i yanked out the cable, my machine was writing to the 24th file at
that point.
umount /mnt
<replug cable>
mount /dev/sdb1 /mnt
ls -l /mnt
<In words: Nothing there, not even a>
I tried it 3 times, all times with the same result that not even "a" was
there.
In the last run i did an `ls -l /mnt/a` at the 4-minute point; it showed
several files with 900MB.
And boy is it unsafe to use XFS currently.

Modified test-case. Same as above, except for this loop:
for dat in `seq 1 10`; do dd if=/dev/zero of=$dat bs=1k count=900k ; done
After it ran through i waited 5 minutes, IOW there were 5 minutes of
nothing going on.
Yank out cable, yada, yada, yada.

Result: Same as above.

This proves (at least for the amateur in me) that it is the same bug
the OP had.


Variant 3.
Same as 2, but with a 'sync' after the for-loop.
After 'sync' returns, i immediately yank out the cable.
After replugging, the result is as expected: the directory and all files
exist.


So as far as i can tell, currently there is a 'sync' missing somewhere.
It is there at least in kernels 3.6 & 3.7, maybe 3.5.
Could it be the removal of pdflush?
--
Matthias
Dave Chinner
2012-12-16 20:48:47 UTC
Permalink
Post by Matthias Schniedermeyer
Post by Matthias Schniedermeyer
Post by Dave Chinner
Like most amateurs you've jumped to the obvious conclusion without
considering all the other possibilities that could give the same
result.
I have a test-case, so you can cut out the amateur.
Yes, I understand your test case. That doesn't mean I can reproduce
your problem, or have the time to do so. I'm asking you to give me
information because you can obviously reproduce the problem reliably
and I've got a bunch of other stuff to do at the same time. There's
one of me, and a lot of you (users), and I simply don't scale to
doing everything myself.
Post by Matthias Schniedermeyer
Post by Matthias Schniedermeyer
Partition is 100GB at the beginning of a 1.5TB SATA HDD connected by
Machine has 4GB of memory and is running a vanilla 3.7.0 kernel.
mkfs.xfs -l size=1024b -s size=4096 /dev/sdb1
If you use the defaults, does the problem go away?
Post by Matthias Schniedermeyer
Post by Matthias Schniedermeyer
mount /dev/sdb1 /mnt
mkdir /mnt/a
cd /mnt/a
for dat in `seq 1 40`; do dd if=/dev/zero of=$dat bs=1k count=900k ; done
Then i started a timer and waited for 5 minutes.
Then i yanked out the cable, my machine was writing to the 24th file at
that point.
umount /mnt
<replug cable>
mount /dev/sdb1 /mnt
ls -l /mnt
<In words: Nothing there, not even a>
Can you provide the information I asked - what was there before the
unmount, what is there before recovery, what is there after
recovery? And depending on what occurred, send me the log file so I
can see what was actually supposed to be replayed by log recovery?
Post by Matthias Schniedermeyer
Variant 3.
Same as 2, but with a 'sync' after the for-loop.
After 'sync' returns, i immediately yank out the cable.
After replugging, the result is as expected: the directory and all files
exist.
And when you tell the system to write everything, it works just
fine because it forces the log to a consistent state for recovery
purposes.
Post by Matthias Schniedermeyer
So as far as i can tell, currently there is a 'sync' missing somewhere.
No, we don't know what the problem is yet.
Post by Matthias Schniedermeyer
Could it be the removal of pdflush?
Which happened in 2.6.32, so is rather unlikely.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Matthias Schniedermeyer
2012-12-16 23:22:51 UTC
Permalink
Post by Dave Chinner
Post by Matthias Schniedermeyer
Partition is 100GB at the beginning of a 1.5TB SATA HDD connected by
Machine has 4GB of memory and is running a vanilla 3.7.0 kernel.
mkfs.xfs -l size=1024b -s size=4096 /dev/sdb1
If you use the defaults, does the problem go away?
No changes.
Post by Dave Chinner
Post by Matthias Schniedermeyer
mount /dev/sdb1 /mnt
mkdir /mnt/a
cd /mnt/a
for dat in `seq 1 40`; do dd if=/dev/zero of=$dat bs=1k count=900k ; done
Then i started a timer and waited for 5 minutes.
Then i yanked out the cable, my machine was writing to the 24th file at
that point.
umount /mnt
<replug cable>
mount /dev/sdb1 /mnt
ls -l /mnt
<In words: Nothing there, not even a>
Can you provide the information I asked - what was there before the
unmount,
With time reduced to 2 minutes and everything put in a script (see below).

After about 2 minutes:
ls -lR /mnt
/mnt:
total 0
drwxr-xr-x 2 root root 105 Dec 17 00:13 a

/mnt/a:
total 11186176
-rw-r--r-- 1 root root 943718400 Dec 17 00:11 1
-rw-r--r-- 1 root root 943718400 Dec 17 00:13 10
-rw-r--r-- 1 root root 943718400 Dec 17 00:13 11
-rw-r--r-- 1 root root 891450368 Dec 17 00:13 12
-rw-r--r-- 1 root root 943718400 Dec 17 00:11 2
-rw-r--r-- 1 root root 943718400 Dec 17 00:11 3
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 4
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 5
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 6
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 7
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 8
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 9

I yanked out the cable immediately after the ls.
Post by Dave Chinner
what is there before recovery
xfs_logprint:
data device: 0x811
log device: 0x811 daddr: 104857632 length: 102400

cycle: 1 version: 2 lsn: 1,0 tail_lsn: 1,0
length of Log Record: 20 prev offset: -1 num ops: 1
uuid: b22d154a-b580-4d3a-bc09-ac0c62d39d39 format: little endian linux
h_size: 32768
----------------------------------------------------------------------------
Oper (0): tid: b0c0d0d0 len: 8 clientid: LOG flags: UNMOUNT
Unmount filesystem

============================================================================
xfs_logprint: skipped 4096 cleared blocks in range: 2 - 4097
xfs_logprint: skipped 98302 zeroed blocks in range: 4098 - 102399
xfs_logprint: physical end of log
============================================================================
xfs_logprint: logical end of log
============================================================================


mount /dev/sdb1 /mnt -oro,norecovery
ls -lR /mnt
/mnt:
total 0

So nothing.
Post by Dave Chinner
what is there after recovery?
Same. Nothing.
The syslog message says the filesystem was clean. (See below)
Post by Dave Chinner
And depending on what occurred, send me the log file so I
can see what was actually supposed to be replayed by log recovery?
Regarding the 'event trace' from your other mail, i have no idea what
that is or how to do it.

This is the syslog from the first mount till the third mount.

Start:
Dec 17 00:11:26 localhost kernel: [ 204.152835] XFS (sdb1): Mounting Filesystem
Dec 17 00:11:26 localhost kernel: [ 204.233355] XFS (sdb1): Ending clean mount
Dec 17 00:11:33 localhost kernel: [ 211.471071] usb 3-1: reset SuperSpeed USB device number 2 using xhci_hcd
Dec 17 00:11:33 localhost kernel: [ 211.485718] usb 3-1: Parent hub missing LPM exit latency info. Power management will be impacted.
Dec 17 00:11:33 localhost kernel: [ 211.488182] xhci_hcd 0000:04:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff880113106100
Dec 17 00:11:33 localhost kernel: [ 211.489358] xhci_hcd 0000:04:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff880113106140
Dec 17 00:12:59 localhost kernel: [ 296.364776] usb 3-1: reset SuperSpeed USB device number 2 using xhci_hcd
Dec 17 00:12:59 localhost kernel: [ 296.379306] usb 3-1: Parent hub missing LPM exit latency info. Power management will be impacted.
Dec 17 00:12:59 localhost kernel: [ 296.381685] xhci_hcd 0000:04:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff880113106100
Dec 17 00:12:59 localhost kernel: [ 296.382906] xhci_hcd 0000:04:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff880113106140

Disconnect:
Dec 17 00:13:28 localhost kernel: [ 325.548129] usb 3-1: USB disconnect, device number 2
Dec 17 00:13:28 localhost kernel: [ 325.549906] sd 12:0:0:0: [sdb] Unhandled error code
Dec 17 00:13:28 localhost kernel: [ 325.551073] sd 12:0:0:0: [sdb]
Dec 17 00:13:28 localhost kernel: [ 325.552230] Result: hostbyte=0x01 driverbyte=0x00
Dec 17 00:13:28 localhost kernel: [ 325.553399] sd 12:0:0:0: [sdb] CDB:
Dec 17 00:13:28 localhost kernel: [ 325.554530] cdb[0]=0x2a: 2a 00 01 51 fd 98 00 00 f0 00
Dec 17 00:13:28 localhost kernel: [ 325.555708] end_request: I/O error, dev sdb, sector 22150552
Dec 17 00:13:28 localhost kernel: [ 325.556971] sd 12:0:0:0: [sdb] Unhandled error code
Dec 17 00:13:28 localhost kernel: [ 325.558118] sd 12:0:0:0: [sdb]
Dec 17 00:13:28 localhost kernel: [ 325.559229] Result: hostbyte=0x01 driverbyte=0x00
Dec 17 00:13:28 localhost kernel: [ 325.560340] sd 12:0:0:0: [sdb] CDB:
Dec 17 00:13:28 localhost kernel: [ 325.561449] cdb[0]=0x2a: 2a 00 01 51 fe 88 00 00 f0 00
Dec 17 00:13:28 localhost kernel: [ 325.562628] end_request: I/O error, dev sdb, sector 22150792
Dec 17 00:13:28 localhost kernel: [ 325.562636] Buffer I/O error on device sdb1, logical block 2613279
Dec 17 00:13:28 localhost kernel: [ 325.562637] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562638] Buffer I/O error on device sdb1, logical block 2613280
Dec 17 00:13:28 localhost kernel: [ 325.562638] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562639] Buffer I/O error on device sdb1, logical block 2613281
Dec 17 00:13:28 localhost kernel: [ 325.562639] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562640] Buffer I/O error on device sdb1, logical block 2613282
Dec 17 00:13:28 localhost kernel: [ 325.562640] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562641] Buffer I/O error on device sdb1, logical block 2613283
Dec 17 00:13:28 localhost kernel: [ 325.562641] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562642] Buffer I/O error on device sdb1, logical block 2613284
Dec 17 00:13:28 localhost kernel: [ 325.562642] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562643] Buffer I/O error on device sdb1, logical block 2613285
Dec 17 00:13:28 localhost kernel: [ 325.562643] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562644] Buffer I/O error on device sdb1, logical block 2613286
Dec 17 00:13:28 localhost kernel: [ 325.562644] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562645] Buffer I/O error on device sdb1, logical block 2613287
Dec 17 00:13:28 localhost kernel: [ 325.562645] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562646] Buffer I/O error on device sdb1, logical block 2613288
Dec 17 00:13:28 localhost kernel: [ 325.562646] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.647364] XFS (sdb1): metadata I/O error: block 0x6400022 ("xlog_iodone") error 5 numblks 64
Dec 17 00:13:28 localhost kernel: [ 325.648559] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1074 of file /xssd/usr_src/linux/fs/xfs/xfs_log.c. Return address = 0xffffffff812a1e49
Dec 17 00:13:28 localhost kernel: [ 325.649804] XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
Dec 17 00:13:28 localhost kernel: [ 325.651037] XFS (sdb1): Please umount the filesystem and rectify the problem(s)
Dec 17 00:13:28 localhost kernel: [ 325.652341] XFS (sdb1): xfs_log_force: error 5 returned.
Dec 17 00:13:28 localhost kernel: [ 325.653564] XFS (sdb1): xfs_do_force_shutdown(0x1) called from line 1160 of file /xssd/usr_src/linux/fs/xfs/xfs_buf.c. Return address = 0xffffffff8125198f
Dec 17 00:13:28 localhost kernel: [ 325.937819] XFS (sdb1): xfs_log_force: error 5 returned.
Dec 17 00:13:28 localhost kernel: [ 325.939105] XFS (sdb1): xfs_do_force_shutdown(0x1) called from line 1160 of file /xssd/usr_src/linux/fs/xfs/xfs_buf.c. Return address = 0xffffffff8125198f
Dec 17 00:13:28 localhost kernel: [ 325.940427] XFS (sdb1): xfs_log_force: error 5 returned.
Dec 17 00:13:28 localhost kernel: [ 325.941768] XFS (sdb1): xfs_log_force: error 5 returned.

Reconnect:
Dec 17 00:13:41 localhost kernel: [ 338.939079] usb 3-1: new SuperSpeed USB device number 3 using xhci_hcd
Dec 17 00:13:41 localhost kernel: [ 338.953410] usb 3-1: Parent hub missing LPM exit latency info. Power management will be impacted.
Dec 17 00:13:41 localhost kernel: [ 338.959394] usb 3-1: New USB device found, idVendor=174c, idProduct=5106
Dec 17 00:13:41 localhost kernel: [ 338.960791] usb 3-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
Dec 17 00:13:41 localhost kernel: [ 338.962135] usb 3-1: Product: AS2105
Dec 17 00:13:41 localhost kernel: [ 338.963470] usb 3-1: Manufacturer: ASMedia
Dec 17 00:13:41 localhost kernel: [ 338.964800] usb 3-1: SerialNumber: 5XW02D4M
Dec 17 00:13:41 localhost kernel: [ 338.966977] scsi13 : usb-storage 3-1:1.0
Dec 17 00:13:42 localhost kernel: [ 339.970156] scsi 13:0:0:0: Direct-Access ST315005 41AS CC32 PQ: 0 ANSI: 5
Dec 17 00:13:42 localhost kernel: [ 339.971635] sd 13:0:0:0: Attached scsi generic sg1 type 0
Dec 17 00:13:42 localhost kernel: [ 339.973202] sd 13:0:0:0: [sdb] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
Dec 17 00:13:42 localhost kernel: [ 339.975256] sd 13:0:0:0: [sdb] Write Protect is off
Dec 17 00:13:42 localhost kernel: [ 339.976639] sd 13:0:0:0: [sdb] Mode Sense: 23 00 00 00
Dec 17 00:13:42 localhost kernel: [ 339.977358] sd 13:0:0:0: [sdb] No Caching mode page present
Dec 17 00:13:42 localhost kernel: [ 339.978700] sd 13:0:0:0: [sdb] Assuming drive cache: write through
Dec 17 00:13:42 localhost kernel: [ 339.981747] sd 13:0:0:0: [sdb] No Caching mode page present
Dec 17 00:13:42 localhost kernel: [ 339.983143] sd 13:0:0:0: [sdb] Assuming drive cache: write through
Dec 17 00:13:42 localhost kernel: [ 339.998266] sdb: sdb1
Dec 17 00:13:42 localhost kernel: [ 340.001383] sd 13:0:0:0: [sdb] No Caching mode page present
Dec 17 00:13:42 localhost kernel: [ 340.002734] sd 13:0:0:0: [sdb] Assuming drive cache: write through
Dec 17 00:13:42 localhost kernel: [ 340.004043] sd 13:0:0:0: [sdb] Attached SCSI disk

mount -oro,norecovery:
Dec 17 00:13:54 localhost kernel: [ 351.902543] XFS (sdb1): Mounting filesystem in no-recovery mode. Filesystem will be inconsistent.

mount:
Dec 17 00:14:02 localhost kernel: [ 359.860013] XFS (sdb1): Mounting Filesystem
Dec 17 00:14:02 localhost kernel: [ 359.908198] XFS (sdb1): Ending clean mount


- test.sh -
#!/bin/bash

mkfs.xfs -f /dev/sdb1
mount /dev/sdb1 /mnt
mkdir /mnt/a

(sleep 2m ; ls -lR /mnt)&

for dat in `seq 1 40`; do dd if=/dev/zero of=/mnt/a/$dat bs=1k count=900k ; done

ls -lR /mnt
umount /mnt

echo "Press Enter"
read ABC

xfs_logprint /dev/sdb1

mount /dev/sdb1 /mnt -oro,norecovery
ls -lR /mnt
umount /mnt

echo "Press Enter"
read ABC

mount /dev/sdb1 /mnt
ls -lR /mnt
umount /mnt
- test.sh -
--
Matthias
Dave Chinner
2012-12-17 23:24:41 UTC
Permalink
Post by Matthias Schniedermeyer
Post by Dave Chinner
Can you provide the information I asked - what was there before the
unmount,
With time reduced to 2 minutes and everything put in a script (see below).
ls -lR /mnt
total 0
drwxr-xr-x 2 root root 105 Dec 17 00:13 a
total 11186176
-rw-r--r-- 1 root root 943718400 Dec 17 00:11 1
-rw-r--r-- 1 root root 943718400 Dec 17 00:13 10
-rw-r--r-- 1 root root 943718400 Dec 17 00:13 11
-rw-r--r-- 1 root root 891450368 Dec 17 00:13 12
-rw-r--r-- 1 root root 943718400 Dec 17 00:11 2
-rw-r--r-- 1 root root 943718400 Dec 17 00:11 3
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 4
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 5
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 6
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 7
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 8
-rw-r--r-- 1 root root 943718400 Dec 17 00:12 9
I yanked out the cable immediately after the ls.
Post by Dave Chinner
what is there before recovery
data device: 0x811
log device: 0x811 daddr: 104857632 length: 102400
cycle: 1 version: 2 lsn: 1,0 tail_lsn: 1,0
length of Log Record: 20 prev offset: -1 num ops: 1
uuid: b22d154a-b580-4d3a-bc09-ac0c62d39d39 format: little endian linux
h_size: 32768
----------------------------------------------------------------------------
Oper (0): tid: b0c0d0d0 len: 8 clientid: LOG flags: UNMOUNT
Unmount filesystem
============================================================================
xfs_logprint: skipped 4096 cleared blocks in range: 2 - 4097
xfs_logprint: skipped 98302 zeroed blocks in range: 4098 - 102399
xfs_logprint: physical end of log
============================================================================
xfs_logprint: logical end of log
============================================================================
Not what I was after (i.e. the raw dump of the log to a file via
xfs_logprint -C), but in this case it serves to indicate that nothing
ever got written to the log. That's not exactly what I was
expecting, given that you've been reporting that some files were
there after remount.
Post by Matthias Schniedermeyer
mount /dev/sdb1 /mnt -oro,norecovery
ls -lR /mnt
total 0
So nothing.
If nothing got written to the log, then no metadata got written
to disk because it is pinned in memory until the log is written.
Post by Matthias Schniedermeyer
Post by Dave Chinner
And depending on what occurred, send me the log file so I
can see what was actually supposed to be replayed by log recovery?
Regarding to the 'event trace' from your other mail, i have no idea what
or how to do that.
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
Post by Matthias Schniedermeyer
Dec 17 00:13:28 localhost kernel: [ 325.562642] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562643] Buffer I/O error on device sdb1, logical block 2613285
Dec 17 00:13:28 localhost kernel: [ 325.562643] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562644] Buffer I/O error on device sdb1, logical block 2613286
Dec 17 00:13:28 localhost kernel: [ 325.562644] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562645] Buffer I/O error on device sdb1, logical block 2613287
Dec 17 00:13:28 localhost kernel: [ 325.562645] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562646] Buffer I/O error on device sdb1, logical block 2613288
Dec 17 00:13:28 localhost kernel: [ 325.562646] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.647364] XFS (sdb1): metadata I/O error: block 0x6400022 ("xlog_iodone") error 5 numblks 64
Dec 17 00:13:28 localhost kernel: [ 325.648559] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1074 of file /xssd/usr_src/linux/fs/xfs/xfs_log.c. Return address = 0xffffffff812a1e49
Dec 17 00:13:28 localhost kernel: [ 325.649804] XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
And there's why nothing got written to the log - you pulled the plug
before it was flushed. But that also implies it wasn't flushed for a
couple of minutes, which is not correct. Log flushes should be
occurring every 30s when the filesystem is only slightly busy like this.
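
(That 30s cadence is the xfssyncd interval, which is tunable; assuming
the usual procfs location, you can inspect it with:

cat /proc/sys/fs/xfs/xfssyncd_centisecs   # default 3000, i.e. 30 seconds
)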

Hmmmm, that rings a bell. I think I came across this when doing a
RHEL6 backport recently, and decided that nothing needed doing
upstream because there were already patches pending to fix it.
That is, there is a logic bug in xfs_sync_worker(), and that was
fixed by the xfssyncd-die-die-die patchset that is now in 3.8-rc1.

So, two things: firstly, can you check that the oldest file that is
lost on 3.8-rc1 is less than 30s old, and secondly, whether this
patch gives the same result as 3.8-rc1 on 3.6/3.7:

diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 9500caf..7bf85e8 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -400,7 +400,7 @@ xfs_sync_worker(
* cancel_delayed_work_sync on this work queue before tearing down
* the ail and the log in xfs_log_unmount.
*/
- if (!(mp->m_super->s_flags & MS_ACTIVE) &&
+ if ((mp->m_super->s_flags & MS_ACTIVE) &&
!(mp->m_flags & XFS_MOUNT_RDONLY)) {
/* dgc: errors ignored here */
if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&
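
(For testing, a hunk like the above should apply from the top of the
kernel tree with the usual -p1 strip; the diff filename here is
hypothetical:

cd /usr/src/linux
patch -p1 < xfs_sync_worker.diff
)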


Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Matthias Schniedermeyer
2012-12-18 00:26:24 UTC
Permalink
Post by Dave Chinner
Post by Matthias Schniedermeyer
Dec 17 00:13:28 localhost kernel: [ 325.562642] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562643] Buffer I/O error on device sdb1, logical block 2613285
Dec 17 00:13:28 localhost kernel: [ 325.562643] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562644] Buffer I/O error on device sdb1, logical block 2613286
Dec 17 00:13:28 localhost kernel: [ 325.562644] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562645] Buffer I/O error on device sdb1, logical block 2613287
Dec 17 00:13:28 localhost kernel: [ 325.562645] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562646] Buffer I/O error on device sdb1, logical block 2613288
Dec 17 00:13:28 localhost kernel: [ 325.562646] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.647364] XFS (sdb1): metadata I/O error: block 0x6400022 ("xlog_iodone") error 5 numblks 64
Dec 17 00:13:28 localhost kernel: [ 325.648559] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1074 of file /xssd/usr_src/linux/fs/xfs/xfs_log.c. Return address = 0xffffffff812a1e49
Dec 17 00:13:28 localhost kernel: [ 325.649804] XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
And there's why nothing got written to the log - you pulled the plug
before it was flushed. But that also implies it wasn't flushed for a
couple of minutes, which is not correct. Log flushes should be
occurring every 30s when the filesystem is only slightly busy like this.
Hmmmm, that rings a bell. I think I came across this when doing a
RHEL6 backport recently, and decided that nothing needed doing
upstream because there were already patches pending to fix it.
That is, there is a logic bug in xfs_sync_worker(), and that was
fixed by the xfssyncd-die-die-die patchset that is now in 3.8-rc1.
One 'bleeding edge' kernel, so new that it isn't even called 3.8-rc1.
;-)

I would call that successful.

ls -l /mnt/a
total 10661888
drwxr-xr-x 2 root root 105 Dec 18 00:55 .
drwxr-xr-x 3 root root 14 Dec 18 00:53 ..
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 1
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 10
-rw-r--r-- 1 root root 738263040 Dec 18 00:55 11
-rw-r--r-- 1 root root 322990080 Dec 18 00:55 12
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 2
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 3
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 4
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 5
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 6
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 7
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 8
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 9

The 12th file was being written when i yanked out the cable. Strange
that the 11th file isn't complete; i would expect correct ordering of
events. Is this normal/expected behaviour for XFS?

The 'ls -l /mnt/a' before i yanked out the cable was this:
/mnt/a:
total 11186176
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 1
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 10
-rw-r--r-- 1 root root 943718400 Dec 18 00:55 11
-rw-r--r-- 1 root root 665326592 Dec 18 00:55 12
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 2
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 3
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 4
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 5
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 6
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 7
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 8
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 9
Post by Dave Chinner
So, two things: firstly, can you check that the oldest file that is
lost on 3.8-rc1 is less than 30s old, and secondly, whether this
I would say: A definite yes
3.7.1 + patch

/mnt/a:
total 8425472
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 1
-rw-r--r-- 1 root root 92282880 Dec 18 01:13 10
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 2
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 3
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 4
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 5
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 6
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 7
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 8
-rw-r--r-- 1 root root 507555840 Dec 18 01:13 9

Looks the same to me, only different timing with regard to the 30s windows.

This was just before the yanking:
/mnt/a:
total 11186176
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 1
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 10
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 11
-rw-r--r-- 1 root root 869729280 Dec 18 01:14 12
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 2
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 3
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 4
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 5
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 6
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 7
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 8
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 9
Post by Dave Chinner
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 9500caf..7bf85e8 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -400,7 +400,7 @@ xfs_sync_worker(
* cancel_delayed_work_sync on this work queue before tearing down
* the ail and the log in xfs_log_unmount.
*/
- if (!(mp->m_super->s_flags & MS_ACTIVE) &&
+ if ((mp->m_super->s_flags & MS_ACTIVE) &&
!(mp->m_flags & XFS_MOUNT_RDONLY)) {
/* dgc: errors ignored here */
if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&
Do you need any logs or other things?
--
Matthias
Dave Chinner
2012-12-18 20:20:43 UTC
Permalink
Post by Matthias Schniedermeyer
Post by Dave Chinner
Post by Matthias Schniedermeyer
Dec 17 00:13:28 localhost kernel: [ 325.562642] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562643] Buffer I/O error on device sdb1, logical block 2613285
Dec 17 00:13:28 localhost kernel: [ 325.562643] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562644] Buffer I/O error on device sdb1, logical block 2613286
Dec 17 00:13:28 localhost kernel: [ 325.562644] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562645] Buffer I/O error on device sdb1, logical block 2613287
Dec 17 00:13:28 localhost kernel: [ 325.562645] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.562646] Buffer I/O error on device sdb1, logical block 2613288
Dec 17 00:13:28 localhost kernel: [ 325.562646] lost page write due to I/O error on sdb1
Dec 17 00:13:28 localhost kernel: [ 325.647364] XFS (sdb1): metadata I/O error: block 0x6400022 ("xlog_iodone") error 5 numblks 64
Dec 17 00:13:28 localhost kernel: [ 325.648559] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1074 of file /xssd/usr_src/linux/fs/xfs/xfs_log.c. Return address = 0xffffffff812a1e49
Dec 17 00:13:28 localhost kernel: [ 325.649804] XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
And there's why nothing got written to the log - you pulled the plug
before it was flushed. But that also implies it wasn't flushed for a
couple of minutes, which is not correct. Log flushes should be
occurring every 30s when the filesystem is only slightly busy like this.
Hmmmm, that rings a bell. I think I came across this when doing a
RHEL6 backport recently, and decided that nothing needed doing
upstream because there were already patches pending to fix it.
That is, there is a logic bug in xfs_sync_worker(), and that was
fixed by the xfssyncd-die-die-die patchset that is now in 3.8-rc1.
One 'bleeding edge' kernel, so new that it isn't even called 3.8-rc1.
;-)
I would call that successful.
Ok, thank you for testing it :)
Post by Matthias Schniedermeyer
ls -l /mnt/a
total 10661888
drwxr-xr-x 2 root root 105 Dec 18 00:55 .
drwxr-xr-x 3 root root 14 Dec 18 00:53 ..
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 1
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 10
-rw-r--r-- 1 root root 738263040 Dec 18 00:55 11
-rw-r--r-- 1 root root 322990080 Dec 18 00:55 12
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 2
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 3
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 4
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 5
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 6
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 7
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 8
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 9
The 12th file was being written when i yanked out the cable. Strange
that the 11th file isn't complete; i would expect correct ordering of
events. Is this normal/expected behaviour for XFS?
Yes, it is. The file size is not updated until the data is written,
and data from previous files is still being written when the next
file is created and the application starts to write to it. So you
get file size updates for previous files occurring while you are
creating new files, and the result is as you see above - the file
sizes reflect the data that is guaranteed to be written to disk...
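
(If you want each file's data and size pinned down as it completes, a
variant of the test loop that fsyncs per file - assuming GNU dd - would
be:

for dat in `seq 1 40`; do dd if=/dev/zero of=/mnt/a/$dat bs=1k count=900k conv=fsync ; done

Then a yanked cable should normally only cost you the file in flight.)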
Post by Matthias Schniedermeyer
total 11186176
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 1
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 10
-rw-r--r-- 1 root root 943718400 Dec 18 00:55 11
-rw-r--r-- 1 root root 665326592 Dec 18 00:55 12
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 2
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 3
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 4
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 5
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 6
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 7
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 8
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 9
Yup, makes sense. File sizes in memory indicate that there is about
500MB of dirty data in memory that was lost when you pulled the
plug. That doesn't seem unreasonable for a periodic 30s update.
Post by Matthias Schniedermeyer
Post by Dave Chinner
So, two things: firstly, can you check that the oldest file that is
lost on 3.8-rc1 is less than 30s old, and secondly, whether this
I would say: A definite yes
Ok, I'll send a patch for review for the stable kernels.
Post by Matthias Schniedermeyer
Looks the same to me, only different timing with regard to the 30s windows.
*nod*
Post by Matthias Schniedermeyer
Do you need any logs or other things?
No, the change of behaviour is sufficient to identify this as the
source of your problem. Thank you for sticking at this long enough
for me to isolate the bug!

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Matthias Schniedermeyer
2012-12-18 00:34:38 UTC
Permalink
Post by Dave Chinner
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 9500caf..7bf85e8 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -400,7 +400,7 @@ xfs_sync_worker(
* cancel_delayed_work_sync on this work queue before tearing down
* the ail and the log in xfs_log_unmount.
*/
- if (!(mp->m_super->s_flags & MS_ACTIVE) &&
+ if ((mp->m_super->s_flags & MS_ACTIVE) &&
!(mp->m_flags & XFS_MOUNT_RDONLY)) {
/* dgc: errors ignored here */
if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&
This also appears to fix the other case, where the activity ceases
sharply and the log is still not written after minutes.

After writing 10 files, waiting a minute, yanking ... all 10 files were
there.
So the OP-case MIGHT have been this same error.
But that's the amateur talking again.
--
Matthias
Dave Chinner
2012-12-18 20:29:14 UTC
Permalink
Post by Matthias Schniedermeyer
Post by Dave Chinner
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 9500caf..7bf85e8 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -400,7 +400,7 @@ xfs_sync_worker(
* cancel_delayed_work_sync on this work queue before tearing down
* the ail and the log in xfs_log_unmount.
*/
- if (!(mp->m_super->s_flags & MS_ACTIVE) &&
+ if ((mp->m_super->s_flags & MS_ACTIVE) &&
!(mp->m_flags & XFS_MOUNT_RDONLY)) {
/* dgc: errors ignored here */
if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&
This also appears to fix the other case, where the activity ceases
sharply and the log is still not written after minutes.
After writing 10 files, waiting a minute, yanking ... all 10 files were
there.
So the OP-case MIGHT have been this same error.
But that's the amateur talking again.
I kinda deserved that, didn't I? ;)

But now I understand the problem, I agree with you that the OP was
probably seeing the same bug. I understand the cause, and can
explain exactly how it would cause both sets of symptoms reported...

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Matthias Schniedermeyer
2012-12-19 01:04:45 UTC
Permalink
Post by Dave Chinner
Post by Matthias Schniedermeyer
Post by Dave Chinner
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 9500caf..7bf85e8 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -400,7 +400,7 @@ xfs_sync_worker(
* cancel_delayed_work_sync on this work queue before tearing down
* the ail and the log in xfs_log_unmount.
*/
- if (!(mp->m_super->s_flags & MS_ACTIVE) &&
+ if ((mp->m_super->s_flags & MS_ACTIVE) &&
!(mp->m_flags & XFS_MOUNT_RDONLY)) {
/* dgc: errors ignored here */
if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&
This also appears to fix the other case, where the activity ceases
sharply and the log is still not written after minutes.
After writing 10 files, waiting a minute, yanking ... all 10 files were
there.
So the OP-case MIGHT have been this same error.
But that's the amateur talking again.
I kinda deserved that, didn't I? ;)
But now I understand the problem, I agree with you that the OP was
probably seeing the same bug. I understand the cause, and can
explain exactly how it would cause both sets of symptoms reported...
Great.

That means less lost time in the future, when a USB-disc "decides" to go
MIA.
The record was about 45 minutes lost, or something over 200GB just
going up in smoke (without the smoke).


At least until such a bug is reintroduced in the future.
This bug was introduced in 3.5(*) and existed up to 3.7, and if i
understand you correctly it was fixed more or less by accident for 3.8.

I'd say there is definitely something amiss in the test-suite; this is
basic functionality that appears untested to me. (I don't know what the
test-suite contains, only that it exists)

At least i'd count a dropped connection or power failure (The only
difference is that in the latter case the cache MAY get dropped,
otherwise i'd say both cases are basically the same) among the basic
functionality that should be assured by a journaling filesystem.



*:
This is the snippet that introduced the part of the if-condition that
you changed above.
This is from `git diff v3.4..v3.5 -- fs/xfs/xfs_sync.c`

+ * We shouldn't write/force the log if we are in the mount/unmount
+ * process or on a read only filesystem. The workqueue still needs to be
+ * active in both cases, however, because it is used for inode reclaim
+ * during these times. Use the MS_ACTIVE flag to avoid doing anything
+ * during mount. Doing work during unmount is avoided by calling
+ * cancel_delayed_work_sync on this work queue before tearing down
+ * the ail and the log in xfs_log_unmount.
+ */
+ if (!(mp->m_super->s_flags & MS_ACTIVE) &&
+ !(mp->m_flags & XFS_MOUNT_RDONLY)) {

git blame dates the lines at 2012-05-21
--
Matthias
Dave Chinner
2012-12-19 22:43:28 UTC
Permalink
Post by Matthias Schniedermeyer
Post by Dave Chinner
Post by Matthias Schniedermeyer
Post by Dave Chinner
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 9500caf..7bf85e8 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -400,7 +400,7 @@ xfs_sync_worker(
* cancel_delayed_work_sync on this work queue before tearing down
* the ail and the log in xfs_log_unmount.
*/
- if (!(mp->m_super->s_flags & MS_ACTIVE) &&
+ if ((mp->m_super->s_flags & MS_ACTIVE) &&
!(mp->m_flags & XFS_MOUNT_RDONLY)) {
/* dgc: errors ignored here */
if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&
This also appears to fix the other case, where the activity ceases
sharply and the log is still not written after minutes.
After writing 10 files, waiting a minute, yanking ... all 10 files were
there.
So the OP-case MIGHT have been this same error.
But that's the amateur talking again.
I kinda deserved that, didn't I? ;)
But now I understand the problem, I agree with you that the OP was
probably seeing the same bug. I understand the cause, and can
explain exactly how it would cause both sets of symptoms reported...
Great.
That means less lost time in the future, when a USB-disc "decides" to go
MIA.
The record was about 45 minutes lost, or something over 200GB just
going up in smoke (without the smoke).
At least until such a bug is reintroduced in the future.
This bug was introduced in 3.5(*) and existed up to 3.7, and if i
understand you correctly it was fixed more or less by accident for 3.8.
.....
Post by Matthias Schniedermeyer
git blame dates the lines at 2012-05-21
Sure, and the review discussion of this very change identified the
whole startup/shutdown code as needing to be rewritten to fix
systemic problems that we'd been patching repeatedly:

http://oss.sgi.com/archives/xfs/2012-06/msg00064.html

IOWs, while the bug was introduced by accident due to the code being
nasty and difficult to follow, it wasn't fixed by accident. It was
fixed as a result of review, reflection and redesign - three
fundamental properties of the iterative engineering process we use
here...
Post by Matthias Schniedermeyer
I'd say there is definitely something amiss in the test-suite; this is
basic functionality that appears untested to me. (I don't know what the
test-suite contains, only that it exists)
This failure behaviour is definitely tested - the tests that were
triggering all the unmount problems that we were trying to fix were
the ones that simulate filesystem shutdowns and check that the data
is still there after the filesystem is remounted. i.e. very similar
tests to what you are running here.
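
(Those tests drive the XFS_IOC_GOINGDOWN ioctl; the same shutdown can
be triggered by hand with xfs_io in expert mode - a sketch, assuming
the default shutdown variant that does not flush the log:

xfs_io -x -c shutdown /mnt    # simulate a crash without flushing the log
umount /mnt
mount /dev/sdb1 /mnt          # log recovery runs on this mount
)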

The thing is, no matter how much testing we do, there are always
cases that will elude us. In this case, it's the fact that you have
to slowly create files and write lots of data to trigger the
problem. If you create enough files quickly enough, then the log
gets written automatically, and the problem you are seeing is not
exposed.

Many of these tests were written before delayed logging existed, so
file creations were comparatively slow compared to now, so the
balance of tests changes as we change the kernel code. That is, a
test that once may have exposed this problem due to the timing of
operations may not expose it now because the timing is different.
Hence what we once tested may not get exercised effectively now.
Once we identify issues like this, we generally write new tests to
reproduce and cover the issue...

That's the reality of the situation - testing is never going to be
100% reliable, even if we have tests that specifically cover
behaviour like this....
Post by Matthias Schniedermeyer
At least i'd count a dropped connection or power failure (The only
difference is that in the latter case the cache MAY get dropped,
otherwise i'd say both cases are basically the same) among the basic
functionality that should be assured by a journaling filesystem.
A journalling filesystem doesn't guarantee that you won't lose any
data on crash, power fail or permanent IO errors. All journalling
guarantees is that the filesystem is *consistent* after recovery.
i.e. you don't have to run xfs_repair after such a failure to
ensure it is not corrupted. In all your testing, you have not seen
the filesystem become corrupted, so the journalling has fulfilled
its guarantee.
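
That consistency can be verified without modifying anything, e.g. after
one of your replug tests:

umount /mnt
xfs_repair -n /dev/sdb1    # no-modify mode: report problems, fix nothing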

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Matthias Schniedermeyer
2012-12-20 01:40:14 UTC
Permalink
Post by Dave Chinner
Post by Matthias Schniedermeyer
At least i'd count a dropped connection or power failure (The only
difference is that in the latter case the cache MAY get dropped,
otherwise i'd say both cases are basically the same) among the basic
functionality that should be assured by a journaling fileystem.
A journalling filesystem doesn't guarantee that you won't lose any
data on crash, power fail or permanent IO errors. All journalling
guarantees is that the filesystem is *consistent* after recovery.
i.e. you don't have to run xfs_repair after such a failure to
ensure it is not corrupted. In all your testing, you have not seen
the filesystem become corrupted, so the journalling has fulfilled
its guarantee.
I know that the basic property of journaling is to prevent
corruption of the metadata, IOW to prevent excessive time to check the
fs after e.g. a power failure.

I'm not arguing that a journaling FS can't lose any data, only that the
loss should stay "within reason". I think you can agree that hundreds of
files containing hundreds of gigabytes of data is well outside
"reasonable losses". It doesn't have to be a pony or unicorn, but it
should at least be a mule. :-)

Overall i'm quite satisfied with the performance of XFS over the years,
otherwise it wouldn't be the filesystem for over 99.999% of all the
storage capacity i have. The only thing for which i don't use XFS are
"/boot"-partitions, which amount to a few hundreds of MB, whereas XFS
accounts for well over 100TB.

About corruptions: I haven't had any corruption in my dozens of tests,
BUT once my metadata got corrupted by this bug! (Or maybe the
umount-thing that you mentioned.)

It was the first time ever that i had to use xfs_repair (XFS refused to
mount the fs) and i have been using XFS pretty much since it got ported
to Linux.

I "lost" a few files that i had to recover from my last backup (i didn't
bother to look through lost+found). Which luckily i made just minutes
before rebooting the machine in question. I had thought about issuing a
'sync', because the previous incarnation of the bug flashed before my
inner eye. But by the time i had shutdown X and got to the
command-prompt i had already forgotten to type 'sync' and went straight
for 'reboot' and was slightly irritated when my machine did finished
booting correctly, because a secondary-filesystem was MIA. (The
root-filesystem was OK, but my /home was MIA)

The most "pain" i can remeber before this episode was the 0-fill
"thing", which bit me at least once. Other than that it's been smooth
sailing all these years.

In other words: Overall it's still very good work and i will rely on XFS
for the foreseeable future.
--
Matthias
Stan Hoeppner
2012-12-20 05:48:48 UTC
Permalink
Post by Matthias Schniedermeyer
I had thought about issuing a
'sync', because the previous incarnation of the bug flashed before my
inner eye. But by the time i had shut down X and got to the
command-prompt i had already forgotten to type 'sync' and went straight
for 'reboot'
When did you last read the man pages for reboot and shutdown? reboot
issues a sync automatically unless you specify

-n Don't sync before reboot or halt. Note that the kernel and
storage drivers may still sync.

(note the disclaimer)

and even though the shutdown man page doesn't state so explicitly, I'm
pretty darn sure that shutdown issues a sync before halting or rebooting.

Thus, you shouldn't have to manually call sync before issuing your
favorite system restart command. If sync isn't being called, maybe
there's an issue with the versions of these commands shipped with your
distro, or other type of problem.
--
Stan
Matthias Schniedermeyer
2012-12-20 10:18:23 UTC
Permalink
Post by Stan Hoeppner
Post by Matthias Schniedermeyer
I had thought about issuing a
'sync', because the previous incarnation of the bug flashed before my
inner eye. But by the time i had shut down X and got to the
command-prompt i had already forgotten to type 'sync' and went straight
for 'reboot'
When did you last read the man pages for reboot and shutdown? reboot
issues a sync automatically unless you specify
-n Don't sync before reboot or halt. Note that the kernel and
storage drivers may still sync.
(note the disclaimer)
and even though the shutdown man page doesn't state so explicitly, I'm
pretty darn sure that shutdown issues a sync before halting or rebooting.
Thus, you shouldn't have to manually call sync before issuing your
favorite system restart command. If sync isn't being called, maybe
there's an issue with the versions of these commands shipped with your
distro, or other type of problem.
That's strange, i looked into the source of /sbin/halt and it does
contain a call to 'sync'.

So it either didn't work, or i hit the umount-bug that David mentioned.

When i rebooted to 3.7(.0), from a buggy 3.6.2, i 'sync'ed manually
before rebooting and also manually 'umount'ed all "physical" FSes except
the rootfs, and this time i didn't get a corruption. All
mount-syslog-lines for the 3.7 boot say the filesystem was clean.

So when it is time to reboot to 3.7.2, which hopefully contains the
bugfix, i still intend to do a wetware (half-)shutdown before the actual
'reboot'. The statistic is 2 out of 3 in favor of corruptions in the
last 3 reboots of this particular computer.
--
Matthias
Dave Chinner
2012-12-10 00:45:14 UTC
Permalink
Post by Lin Li
Hi, Guys. I recently suffered a huge data loss on power cut on an XFS
partition. The problem was that I copied a lot of files (roughly 20Gb) to
an XFS partition, then 10 hours later, I got an unexpected power cut. As a
result, all these newly copied files disappeared as if they had never been
copied. I tried to check and repair the partition, but xfs_check reports no
error at all. So I guess the problem is that the meta data for these files
were all kept in the cache (64Mb) and were never committed to the hard
disk.
This will have absolutely nothing to do with disk cache flush
policy.

It sounds very much like a journal recovery issue where a set of
changes is not recovered due to a problem with the transaction in the
journal. Indeed, I recently fixed a 19 year old bug in the journal
write code that could cause exactly this sort of symptom:

commit d35e88faa3b0fc2cea35c3b2dca358b5cd09b45f
Author: Dave Chinner <***@redhat.com>
Date: Mon Oct 8 21:56:12 2012 +1100

xfs: only update the last_sync_lsn when a transaction completes

The log write code stamps each iclog with the current tail LSN in
the iclog header so that recovery knows where to find the tail of
thelog once it has found the head. Normally this is taken from the
first item on the AIL - the log item that corresponds to the oldest
active item in the log.

The problem is that when the AIL is empty, the tail lsn is derived
from the l_last_sync_lsn, which is the LSN of the last iclog to
be written to the log. In most cases this doesn't happen, because
the AIL is rarely empty on an active filesystem. However, when it
does, it opens up an interesting case when the transaction being
committed to the iclog spans multiple iclogs.

That is, the first iclog is stamped with the l_last_sync_lsn, and IO
is issued. Then the next iclog is setup, the changes copied into the
iclog (takes some time), and then the l_last_sync_lsn is stamped
into the header and IO is issued. This is still the same
transaction, so the tail lsn of both iclogs must be the same for log
recovery to find the entire transaction to be able to replay it.

The problem arises in that the iclog buffer IO completion updates
the l_last_sync_lsn with its own LSN. Therefore, if the first iclog
completes its IO before the second iclog is filled and has the tail
lsn stamped in it, it will stamp the LSN of the first iclog into
its tail lsn field. If the system fails at this point, log recovery
will not see a complete transaction, so the transaction will not be
replayed.

The fix is simple - the l_last_sync_lsn is updated when an iclog
buffer IO completes, and this is incorrect. The l_last_sync_lsn
should be updated when a transaction is completed by an iclog buffer
IO. That is, only iclog buffers that have transaction commit
callbacks attached to them should update the l_last_sync_lsn. This
means that the last_sync_lsn will only move forward when a commit
record is written, not in the middle of a large transaction that is
rolling through multiple iclog buffers.

Signed-off-by: Dave Chinner <***@redhat.com>
Reviewed-by: Mark Tinguely <***@sgi.com>
Reviewed-by: Christoph Hellwig <***@lst.de>
Signed-off-by: Ben Myers <***@sgi.com>

This commit only hit 3.7-rc5, but has not been sent to -stable
kernels because I thought it was only exposed by the 3.7 changes.
However, looking at it we've been changing the code that exposed it
since about 3.4, so it's entirely possible that we did expose it
earlier than 3.7-rc1.

Looks like a stable kernel candidate....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com