[PATCH 2/6] xfs: don't zero partial page cache pages during O_DIRECT writes

Discussion:

[PATCH 2/6] xfs: don't zero partial page cache pages during O_DIRECT writes

Dave Chinner

2014-08-21 05:09:10 UTC

From: Chris Mason <***@fb.com>

xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads. This is different from the other filesystems who
only invalidate pages during DIO writes.

truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page. This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.

buffered reads will find an up to date page with zeros instead of
the data actually on disk.

This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.

[dchinner: catch error and warn if it fails. Comment.]

cc: ***@vger.kernel.org
Signed-off-by: Chris Mason <***@fb.com>
Reviewed-by: Dave Chinner <***@redhat.com>
Signed-off-by: Dave Chinner <***@fromorbit.com>
---
fs/xfs/xfs_file.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 076b170..827cfb2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -296,7 +296,16 @@ xfs_file_read_iter(
xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
return ret;
}
- truncate_pagecache_range(VFS_I(ip), pos, -1);
+
+ /*
+ * Invalidate whole pages. This can return an error if
+ * we fail to invalidate a page, but this should never
+ * happen on XFS. Warn if it does fail.
+ */
+ ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
+ pos >> PAGE_CACHE_SHIFT, -1);
+ WARN_ON_ONCE(ret);
+ ret = 0;
}
xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
}

--
2.0.0

Dave Chinner

2014-08-21 05:09:14 UTC

From: Dave Chinner <***@redhat.com>

If we have delalloc extents on a file before we run a collapse range
opertaion, we sync the range that we are going to collapse to
convert delalloc extents in that region to real extents to simplify
the shift operation.

However, the shift operation then assumes that the extent list is
not going to change as it iterates over the extent list moving
things about. Unfortunately, this isn't true because we can't hold
the ILOCK over all the operations. We can prevent new IO from
modifying the extent list by holding the IOLOCK, but that doesn't
prevent writeback from running....

And when writeback runs, it can convert delalloc extents is the
range of the file prior to the region being collapsed, and this
changes the indexes of all the extents in the file. That causes the
collapse range operation to Go Bad.

The right fix is to rewrite the extent shift operation not to be
dependent on the extent list not changing across the entire
operation, but this is a fairly significant piece of work to do.
Hence, as a short-term workaround for the problem, sync the entire
file before starting a collapse operation to remove all delalloc
ranges from the file and so avoid the problem of concurrent
writeback changing the extent list.

Diagnosed-and-Reported-by: Brian Foster <***@redhat.com>
Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_bmap_util.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index c53cc03..035041d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1460,6 +1460,19 @@ xfs_collapse_file_space(
start_fsb = XFS_B_TO_FSB(mp, offset + len);
shift_fsb = XFS_B_TO_FSB(mp, len);

+ /*
+ * writeback the entire file to prevent concurrent writeback of ranges
+ * outside the collapsing region from changing the extent list.
+ *
+ * XXX: This is a temporary fix until the extent shift loop below is
+ * converted to use offsets and lookups within the ILOCK rather than
+ * carrying around the index into the extent list for the next
+ * iteration.
+ */
+ error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
+ if (error)
+ return error;
+
error = xfs_free_file_space(ip, offset, len);
if (error)
return error;

--
2.0.0

Brian Foster

2014-08-21 13:09:25 UTC

Post by Dave Chinner
If we have delalloc extents on a file before we run a collapse range
opertaion, we sync the range that we are going to collapse to
convert delalloc extents in that region to real extents to simplify
the shift operation.
However, the shift operation then assumes that the extent list is
not going to change as it iterates over the extent list moving
things about. Unfortunately, this isn't true because we can't hold
the ILOCK over all the operations. We can prevent new IO from
modifying the extent list by holding the IOLOCK, but that doesn't
prevent writeback from running....
And when writeback runs, it can convert delalloc extents is the
range of the file prior to the region being collapsed, and this
changes the indexes of all the extents in the file. That causes the
collapse range operation to Go Bad.
The right fix is to rewrite the extent shift operation not to be
dependent on the extent list not changing across the entire
operation, but this is a fairly significant piece of work to do.
Hence, as a short-term workaround for the problem, sync the entire
file before starting a collapse operation to remove all delalloc
ranges from the file and so avoid the problem of concurrent
writeback changing the extent list.
---
fs/xfs/xfs_bmap_util.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index c53cc03..035041d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1460,6 +1460,19 @@ xfs_collapse_file_space(
start_fsb = XFS_B_TO_FSB(mp, offset + len);
shift_fsb = XFS_B_TO_FSB(mp, len);
+ /*
+ * writeback the entire file to prevent concurrent writeback of ranges
+ * outside the collapsing region from changing the extent list.
+ *
+ * XXX: This is a temporary fix until the extent shift loop below is
+ * converted to use offsets and lookups within the ILOCK rather than
+ * carrying around the index into the extent list for the next
+ * iteration.
+ */
+ error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
+ if (error)
+ return error;
+
error = xfs_free_file_space(ip, offset, len);
if (error)
return error;
--
2.0.0
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs

Dave Chinner

2014-08-21 05:09:09 UTC

From: Dave Chinner <***@redhat.com>

generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:

1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)

where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so we fire this assert:

XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676

because the kernel is trying to fall back to buffered IO on the
direct IO path (which XFS does not do).

The real question is this: Why can't that page be invalidated after
it has been written to disk an cleaned?

Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?

OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

cc: <***@kernel.org>
Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_aops.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 11e9b4c..2a316ad 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1753,11 +1753,69 @@ xfs_vm_readpages(
return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
}

+/*
+ * This is basically a copy of __set_page_dirty_buffers() with one
+ * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
+ * dirty, we'll never be able to clean them because we don't write buffers
+ * beyond EOF, and that means we can't invalidate pages that span EOF
+ * that have been marked dirty. Further, the dirty state can leak into
+ * the file interior if the file is extended, resulting in all sorts of
+ * bad things happening as the state does not match the unerlying data.
+ */
+STATIC int
+xfs_vm_set_page_dirty(
+ struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+ struct inode *inode = mapping->host;
+ loff_t end_offset;
+ loff_t offset;
+ int newly_dirty;
+
+ if (unlikely(!mapping))
+ return !TestSetPageDirty(page);
+
+ end_offset = i_size_read(inode);
+ offset = end_offset & PAGE_CACHE_MASK;
+
+ spin_lock(&mapping->private_lock);
+ if (page_has_buffers(page)) {
+ struct buffer_head *head = page_buffers(page);
+ struct buffer_head *bh = head;
+
+ do {
+ if (offset < end_offset)
+ set_buffer_dirty(bh);
+ bh = bh->b_this_page;
+ offset += 1 << inode->i_blkbits;
+ } while (bh != head);
+ }
+ newly_dirty = !TestSetPageDirty(page);
+ spin_unlock(&mapping->private_lock);
+
+ if (newly_dirty) {
+ /* sigh - __set_page_dirty() is static, so copy it here, too */
+ unsigned long flags;
+
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ if (page->mapping) { /* Race with truncate? */
+ WARN_ON_ONCE(!PageUptodate(page));
+ account_page_dirtied(page, mapping);
+ radix_tree_tag_set(&mapping->page_tree,
+ page_index(page), PAGECACHE_TAG_DIRTY);
+ }
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+ }
+ return newly_dirty;
+}
+
const struct address_space_operations xfs_address_space_operations = {
.readpage = xfs_vm_readpage,
.readpages = xfs_vm_readpages,
.writepage = xfs_vm_writepage,
.writepages = xfs_vm_writepages,
+ .set_page_dirty = xfs_vm_set_page_dirty,
.releasepage = xfs_vm_releasepage,
.invalidatepage = xfs_vm_invalidatepage,
.write_begin = xfs_vm_write_begin,

--
2.0.0

Brian Foster

2014-08-21 12:48:24 UTC

Post by Dave Chinner
generic/263 is failing fsx at this point with a page spanning
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
because the kernel is trying to fall back to buffered IO on the
direct IO path (which XFS does not do).
The real question is this: Why can't that page be invalidated after
it has been written to disk an cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.
This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.
---
fs/xfs/xfs_aops.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 11e9b4c..2a316ad 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1753,11 +1753,69 @@ xfs_vm_readpages(
return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
}
+/*
+ * This is basically a copy of __set_page_dirty_buffers() with one
+ * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
+ * dirty, we'll never be able to clean them because we don't write buffers
+ * beyond EOF, and that means we can't invalidate pages that span EOF
+ * that have been marked dirty. Further, the dirty state can leak into
+ * the file interior if the file is extended, resulting in all sorts of
+ * bad things happening as the state does not match the unerlying data.
+ */
+STATIC int
+xfs_vm_set_page_dirty(
+ struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);

This breaks xfs as a kernel module:

$ make -j 8 M=fs/xfs
Building modules, stage 2.
MODPOST 1 modules
WARNING: "page_mapping" [fs/xfs/xfs.ko] undefined!
...

I suppose we could export that symbol, but why wouldn't we just propose
this change to __set_page_dirty_buffers()?

Brian

Post by Dave Chinner
+ struct inode *inode = mapping->host;
+ loff_t end_offset;
+ loff_t offset;
+ int newly_dirty;
+
+ if (unlikely(!mapping))
+ return !TestSetPageDirty(page);
+
+ end_offset = i_size_read(inode);
+ offset = end_offset & PAGE_CACHE_MASK;
+
+ spin_lock(&mapping->private_lock);
+ if (page_has_buffers(page)) {
+ struct buffer_head *head = page_buffers(page);
+ struct buffer_head *bh = head;
+
+ do {
+ if (offset < end_offset)
+ set_buffer_dirty(bh);
+ bh = bh->b_this_page;
+ offset += 1 << inode->i_blkbits;
+ } while (bh != head);
+ }
+ newly_dirty = !TestSetPageDirty(page);
+ spin_unlock(&mapping->private_lock);
+
+ if (newly_dirty) {
+ /* sigh - __set_page_dirty() is static, so copy it here, too */
+ unsigned long flags;
+
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ if (page->mapping) { /* Race with truncate? */
+ WARN_ON_ONCE(!PageUptodate(page));
+ account_page_dirtied(page, mapping);
+ radix_tree_tag_set(&mapping->page_tree,
+ page_index(page), PAGECACHE_TAG_DIRTY);
+ }
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+ }
+ return newly_dirty;
+}
+
const struct address_space_operations xfs_address_space_operations = {
.readpage = xfs_vm_readpage,
.readpages = xfs_vm_readpages,
.writepage = xfs_vm_writepage,
.writepages = xfs_vm_writepages,
+ .set_page_dirty = xfs_vm_set_page_dirty,
.releasepage = xfs_vm_releasepage,
.invalidatepage = xfs_vm_invalidatepage,
.write_begin = xfs_vm_write_begin,
--
2.0.0
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs

Dave Chinner

2014-08-21 22:38:45 UTC

Post by Brian Foster

Post by Dave Chinner
+++ b/fs/xfs/xfs_aops.c
@@ -1753,11 +1753,69 @@ xfs_vm_readpages(
return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
}
+/*
+ * This is basically a copy of __set_page_dirty_buffers() with one
+ * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
+ * dirty, we'll never be able to clean them because we don't write buffers
+ * beyond EOF, and that means we can't invalidate pages that span EOF
+ * that have been marked dirty. Further, the dirty state can leak into
+ * the file interior if the file is extended, resulting in all sorts of
+ * bad things happening as the state does not match the unerlying data.
+ */
+STATIC int
+xfs_vm_set_page_dirty(
+ struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);

$ make -j 8 M=fs/xfs
Building modules, stage 2.
MODPOST 1 modules
WARNING: "page_mapping" [fs/xfs/xfs.ko] undefined!
...

Oh, that should just be:

+ struct address_space *mapping = page->mapping;

Post by Brian Foster
I suppose we could export that symbol, but why wouldn't we just propose
this change to __set_page_dirty_buffers()?

I'm not going to risk breaking other filesystems that have implicit
dependencies on buffers beyond EOF being dirtied.

Cheers,

Dave.

--
Dave Chinner
***@fromorbit.com

Christoph Hellwig

2014-08-21 13:08:12 UTC

Post by Dave Chinner
generic/263 is failing fsx at this point with a page spanning
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
because the kernel is trying to fall back to buffered IO on the
direct IO path (which XFS does not do).
The real question is this: Why can't that page be invalidated after
it has been written to disk an cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

Shouldn't this be fixed in __set_page_buffers_dirty itself? This
doesn't seem an XFS-specific issue.

Anton Altaparmakov

2014-08-21 13:54:30 UTC

Hi,

Post by Christoph Hellwig

Post by Dave Chinner
generic/263 is failing fsx at this point with a page spanning
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
because the kernel is trying to fall back to buffered IO on the
direct IO path (which XFS does not do).
The real question is this: Why can't that page be invalidated after
it has been written to disk an cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

Shouldn't this be fixed in __set_page_buffers_dirty itself? This
doesn't seem an XFS-specific issue.

It is perfectly normal to have dirty buffers beyond EOF. The file system should just clean such buffers so the page then becomes reclaimable (or if the file system is using generic functionality this would be done by the generic functions but XFS is implementing writepage(s) itself and missing this bit).

The generic function you want to see is fs/buffer.c::__block_write_full_page() which does:

if (block > last_block) {
/*
* mapped buffers outside i_size will occur, because
* this page can be outside i_size when there is a
* truncate in progress.
*/
/*
* The buffer was zeroed by block_write_full_page()
*/
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);

Why do you want to add complexity to __set_page_buffers_dirty() given that they are no big deal and we need to be able to cope with their existance due to concurrent truncate anyway?

Best regards,

Anton

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
University of Cambridge Information Services, Roger Needham Building
7 JJ Thomson Avenue, Cambridge, CB3 0RB, UK

Chris Mason

2014-08-21 15:21:34 UTC

Post by Christoph Hellwig

Post by Dave Chinner
generic/263 is failing fsx at this point with a page spanning
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
because the kernel is trying to fall back to buffered IO on the
direct IO path (which XFS does not do).
The real question is this: Why can't that page be invalidated after
it has been written to disk an cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

Shouldn't this be fixed in __set_page_buffers_dirty itself? This
doesn't seem an XFS-specific issue.

block_write_full_page() is invalidating buffers past eof. Probably
because we used to be able to dirty buffers without holding the page
lock, and it's much easier to trust i_size at writepage time.

I think we have the page locked for all the dirties now, so this isn't
as important as in the past?

-chris

Jan Kara

2014-08-21 19:56:32 UTC

Post by Dave Chinner
generic/263 is failing fsx at this point with a page spanning
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
because the kernel is trying to fall back to buffered IO on the
direct IO path (which XFS does not do).
The real question is this: Why can't that page be invalidated after
it has been written to disk an cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.
This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

Well, I'm not sure this is the cleanest way to fix your problem. The
thing is that inode size can change (decrease) after set_page_dirty() has
run and writeback can find the page before truncate_inode_pages() calls
do_invalidatepage() on the last partial page. Now I agree that given
truncate and direct IO are both synchronized using IOLOCK, this change
still fixes your problem. I just wanted to point out that your change
doesn't really make sure won't see dirty buffers in a tail page beyond
i_size.

As Anton has pointed out other filesystems solve the same issue by clearing
the dirty bits beyond EOF in their writepage() function. Also since we
zero-out the tail of the page in writepage() (even in XFS as I checked),
this kind of makes sense to me and has smaller overhead than special
set_page_dirty()...

Honza

Post by Dave Chinner
---
fs/xfs/xfs_aops.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 11e9b4c..2a316ad 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1753,11 +1753,69 @@ xfs_vm_readpages(
return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
}
+/*
+ * This is basically a copy of __set_page_dirty_buffers() with one
+ * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
+ * dirty, we'll never be able to clean them because we don't write buffers
+ * beyond EOF, and that means we can't invalidate pages that span EOF
+ * that have been marked dirty. Further, the dirty state can leak into
+ * the file interior if the file is extended, resulting in all sorts of
+ * bad things happening as the state does not match the unerlying data.
+ */
+STATIC int
+xfs_vm_set_page_dirty(
+ struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+ struct inode *inode = mapping->host;
+ loff_t end_offset;
+ loff_t offset;
+ int newly_dirty;
+
+ if (unlikely(!mapping))
+ return !TestSetPageDirty(page);
+
+ end_offset = i_size_read(inode);
+ offset = end_offset & PAGE_CACHE_MASK;
+
+ spin_lock(&mapping->private_lock);
+ if (page_has_buffers(page)) {
+ struct buffer_head *head = page_buffers(page);
+ struct buffer_head *bh = head;
+
+ do {
+ if (offset < end_offset)
+ set_buffer_dirty(bh);
+ bh = bh->b_this_page;
+ offset += 1 << inode->i_blkbits;
+ } while (bh != head);
+ }
+ newly_dirty = !TestSetPageDirty(page);
+ spin_unlock(&mapping->private_lock);
+
+ if (newly_dirty) {
+ /* sigh - __set_page_dirty() is static, so copy it here, too */
+ unsigned long flags;
+
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ if (page->mapping) { /* Race with truncate? */
+ WARN_ON_ONCE(!PageUptodate(page));
+ account_page_dirtied(page, mapping);
+ radix_tree_tag_set(&mapping->page_tree,
+ page_index(page), PAGECACHE_TAG_DIRTY);
+ }
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+ }
+ return newly_dirty;
+}
+
const struct address_space_operations xfs_address_space_operations = {
.readpage = xfs_vm_readpage,
.readpages = xfs_vm_readpages,
.writepage = xfs_vm_writepage,
.writepages = xfs_vm_writepages,
+ .set_page_dirty = xfs_vm_set_page_dirty,
.releasepage = xfs_vm_releasepage,
.invalidatepage = xfs_vm_invalidatepage,
.write_begin = xfs_vm_write_begin,
--
2.0.0
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs

--
Jan Kara <***@suse.cz>
SUSE Labs, CR

Dave Chinner

2014-08-21 22:33:32 UTC

Post by Jan Kara

Post by Dave Chinner
generic/263 is failing fsx at this point with a page spanning
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
because the kernel is trying to fall back to buffered IO on the
direct IO path (which XFS does not do).
The real question is this: Why can't that page be invalidated after
it has been written to disk an cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.
This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

Well, I'm not sure this is the cleanest way to fix your problem. The
thing is that inode size can change (decrease) after set_page_dirty() has
run and writeback can find the page before truncate_inode_pages() calls
do_invalidatepage() on the last partial page. Now I agree that given
truncate and direct IO are both synchronized using IOLOCK, this change
still fixes your problem. I just wanted to point out that your change
doesn't really make sure won't see dirty buffers in a tail page beyond
i_size.

Truncate is not the problem. The issue is that we can't
*invalidate* pages that have dirty buffers, and I soon realised that
we don't see this problem with truncate because *truncate ignores
invalidation failures*. So now I went looking for how other code
handled this.

Post by Jan Kara
As Anton has pointed out other filesystems solve the same issue by clearing
the dirty bits beyond EOF in their writepage() function. Also since we
zero-out the tail of the page in writepage() (even in XFS as I checked),
this kind of makes sense to me and has smaller overhead than special
set_page_dirty()...

Yes, and that's where I started - block_write_full_page and so
I ended up with this first:

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2a316ad..a9d6afc 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1057,8 +1073,24 @@ xfs_vm_writepage(
do {
int new_ioend = 0;

- if (offset >= end_offset)
+ /*
+ * If the buffer is fully beyond EOF, we need to mark it clean
+ * otherwise we'll leave stale dirty buffers in memory. See the
+ * comment above in the EOF handling about Charlie Foxtrots.
+ */
+ if (offset >= end_offset) {
+ clear_buffer_dirty(bh);
+ ASSERT(!buffer_delay(bh));
+ ASSERT(!buffer_mapped(bh));
+
+ /*
+ * Page is not uptodate until buffers under IO have been
+ * fully processed.
+ */
+ uptodate = 0;
+ continue;
- break;
+ }
+
if (!buffer_uptodate(bh))
uptodate = 0;

But this doesn't solve the invalidation failures - it merely covers
up a symptom of the underlying problem. i.e. fsx went from
failing invalidate on the EOF page at operation #1192 to failing
invalidate on the EOF page at operation #9378.

The I realised that flush/wait/invalidate is actually racy against
mmap, so writepage clearing dirty buffer flags on buffers beyond EOF
is not a solution to the problem - it can still occur. So I added
this check to releasepage() to catch stale dirty buffers beyond EOF
preventing invalidation from succeeding:

@@ -1209,9 +1245,16 @@ xfs_vm_writepages(

/*
* Called to move a page into cleanable state - and from there
- * to be released. The page should already be clean. We always
+ * to be released. We always
* have buffer heads in this call.
*
+ * The page should already be clean, except in the case where EOF falls within
+ * the page and then we can have dirty buffers beyond EOF that nobody can
+ * actually clean. These dirty buffers will cause invalidation failures, but
+ * because they can't be written the should not prevent us from tossing the page
+ * out of cache. Hence if we span EOF, walk the buffers on the page and make
+ * sure they are in a state where try_to_free_buffers() will succeed.
+ *
* Returns 1 if the page is ok to release, 0 otherwise.
*/
STATIC int
@@ -1219,7 +1262,10 @@ xfs_vm_releasepage(
struct page *page,
gfp_t gfp_mask)
{
+ struct inode *inode = page->mapping->host;
int delalloc, unwritten;
+ loff_t end_offset, offset;
+ pgoff_t end_index;

trace_xfs_releasepage(page->mapping->host, page, 0, 0);

@@ -1230,5 +1276,21 @@ xfs_vm_releasepage(
if (WARN_ON_ONCE(unwritten))
return 0;

+ end_offset = i_size_read(inode);
+ end_index = end_offset >> PAGE_CACHE_SHIFT;
+ if (page->index >= end_index) {
+ struct buffer_head *head = page_buffers(page);
+ struct buffer_head *bh;
+
+ offset = end_index << PAGE_CACHE_SHIFT;
+ bh = head;
+ do {
+ if (offset > end_offset)
+ clear_buffer_dirty(bh);
+ bh = bh->b_this_page;
+ offset += 1 << inode->i_blkbits;
+ } while (bh != head);
+ }
+
return try_to_free_buffers(page);
}

Which then moved the fsx invalidation failure out to somewhere
around 105,000 operations.

At which point I realised that I'm playing whack-a-mole with a
fundamental problem: buffers beyond EOF cannot be written, so
dirtying them in the first place is just fundamentally wrong. In XFS we'll
zero them on disk during an extend operation (either in write,
truncate or prealloc), so we're not going to leaak stale data by not
marking them dirty. They may not even be allocated, so we can't
assume that we can write them. So, rather than trying to handle this
dirty-buffers-beyond-EOF case in every situation where we might trip
over it, let's just prevent it from happening in the first place.

That's where the .set_page_dirty() method came about. The "mmap
dirties buffers beyond the EOF" problem is gone. Now, truncate might
have a similar problem with leaving dirty buffers beyond EOF as you
suggest, but I just can't seem to trip over that problem and it
hasn't shown up in the ~500 million fsx ops and ~30 hours of
fsstress that I've run in the past few days. So without being able
to reproduce a problem, I'm extremely wary of trying to "fix" it.

Quite frankly, preventing data corruption is far, far more important
than optimisation and efficiency....

Cheers,

Dave.

--
Dave Chinner
***@fromorbit.com

Jan Kara

2014-08-26 16:06:30 UTC

Post by Dave Chinner

Post by Jan Kara

Post by Dave Chinner
generic/263 is failing fsx at this point with a page spanning
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
because the kernel is trying to fall back to buffered IO on the
direct IO path (which XFS does not do).
The real question is this: Why can't that page be invalidated after
it has been written to disk an cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.
This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

Well, I'm not sure this is the cleanest way to fix your problem. The
thing is that inode size can change (decrease) after set_page_dirty() has
run and writeback can find the page before truncate_inode_pages() calls
do_invalidatepage() on the last partial page. Now I agree that given
truncate and direct IO are both synchronized using IOLOCK, this change
still fixes your problem. I just wanted to point out that your change
doesn't really make sure won't see dirty buffers in a tail page beyond
i_size.

Truncate is not the problem. The issue is that we can't
*invalidate* pages that have dirty buffers, and I soon realised that
we don't see this problem with truncate because *truncate ignores
invalidation failures*. So now I went looking for how other code
handled this.

Post by Jan Kara
As Anton has pointed out other filesystems solve the same issue by clearing
the dirty bits beyond EOF in their writepage() function. Also since we
zero-out the tail of the page in writepage() (even in XFS as I checked),
this kind of makes sense to me and has smaller overhead than special
set_page_dirty()...

Yes, and that's where I started - block_write_full_page and so
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2a316ad..a9d6afc 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1057,8 +1073,24 @@ xfs_vm_writepage(
do {
int new_ioend = 0;
- if (offset >= end_offset)
+ /*
+ * If the buffer is fully beyond EOF, we need to mark it clean
+ * otherwise we'll leave stale dirty buffers in memory. See the
+ * comment above in the EOF handling about Charlie Foxtrots.
+ */
+ if (offset >= end_offset) {
+ clear_buffer_dirty(bh);
+ ASSERT(!buffer_delay(bh));
+ ASSERT(!buffer_mapped(bh));
+
+ /*
+ * Page is not uptodate until buffers under IO have been
+ * fully processed.
+ */
+ uptodate = 0;
+ continue;
- break;
+ }
+
if (!buffer_uptodate(bh))
uptodate = 0;
But this doesn't solve the invalidation failures - it merely covers
up a symptom of the underlying problem. i.e. fsx went from
failing invalidate on the EOF page at operation #1192 to failing
invalidate on the EOF page at operation #9378.
The I realised that flush/wait/invalidate is actually racy against
mmap, so writepage clearing dirty buffer flags on buffers beyond EOF
is not a solution to the problem - it can still occur. So I added

Hum, I have to say I don't quite understand this. I agree with you that
clearing dirty bits in .writepage() is racy wrt mmap. However I don't see
how fsx can ever trigger this. For flush/wait/invalidate to fail, someone
would have to write to the mapping somewhere between flush and invalidate.
However fsx is singlethreaded so I don't see how that can happen.

Post by Dave Chinner
this check to releasepage() to catch stale dirty buffers beyond EOF
@@ -1209,9 +1245,16 @@ xfs_vm_writepages(
/*
* Called to move a page into cleanable state - and from there
- * to be released. The page should already be clean. We always
+ * to be released. We always
* have buffer heads in this call.
*
+ * The page should already be clean, except in the case where EOF falls within
+ * the page and then we can have dirty buffers beyond EOF that nobody can
+ * actually clean. These dirty buffers will cause invalidation failures, but
+ * because they can't be written the should not prevent us from tossing the page
+ * out of cache. Hence if we span EOF, walk the buffers on the page and make
+ * sure they are in a state where try_to_free_buffers() will succeed.
+ *
* Returns 1 if the page is ok to release, 0 otherwise.
*/
STATIC int
@@ -1219,7 +1262,10 @@ xfs_vm_releasepage(
struct page *page,
gfp_t gfp_mask)
{
+ struct inode *inode = page->mapping->host;
int delalloc, unwritten;
+ loff_t end_offset, offset;
+ pgoff_t end_index;
trace_xfs_releasepage(page->mapping->host, page, 0, 0);
@@ -1230,5 +1276,21 @@ xfs_vm_releasepage(
if (WARN_ON_ONCE(unwritten))
return 0;
+ end_offset = i_size_read(inode);
+ end_index = end_offset >> PAGE_CACHE_SHIFT;
+ if (page->index >= end_index) {
+ struct buffer_head *head = page_buffers(page);
+ struct buffer_head *bh;
+
+ offset = end_index << PAGE_CACHE_SHIFT;

Here should be page->index instead of end_index I suppose. However it
shouldn't make a difference wrt your experiment.

Post by Dave Chinner
+ bh = head;
+ do {
+ if (offset > end_offset)
+ clear_buffer_dirty(bh);
+ bh = bh->b_this_page;
+ offset += 1 << inode->i_blkbits;
+ } while (bh != head);
+ }
+
return try_to_free_buffers(page);
}
Which then moved the fsx invalidation failure out to somewhere
around 105,000 operations.
At which point I realised that I'm playing whack-a-mole with a
fundamental problem: buffers beyond EOF cannot be written, so
dirtying them in the first place is just fundamentally wrong. In XFS we'll
zero them on disk during an extend operation (either in write,
truncate or prealloc), so we're not going to leaak stale data by not
marking them dirty. They may not even be allocated, so we can't
assume that we can write them. So, rather than trying to handle this
dirty-buffers-beyond-EOF case in every situation where we might trip
over it, let's just prevent it from happening in the first place.

Sure I understand that. I'm now trying to understand why noone else sees
the problem (we do run xfstests for ext4 and blocksize < page size) and
whether there's something we might want to do for other filesystems as
well. I wonder if there isn't something special going on with XFS
redirtying the page because xfs_imap_valid() returned EAGAIN but looking
into the code I don't see how that could happen.

Post by Dave Chinner
That's where the .set_page_dirty() method came about. The "mmap
dirties buffers beyond the EOF" problem is gone. Now, truncate might
have a similar problem with leaving dirty buffers beyond EOF as you
suggest, but I just can't seem to trip over that problem and it
hasn't shown up in the ~500 million fsx ops and ~30 hours of
fsstress that I've run in the past few days. So without being able
to reproduce a problem, I'm extremely wary of trying to "fix" it.

So truncate is introducing dirty pages beyond EOF only temporarily -
between the moment i_size is set and the moment page is invalidated. Since
direct IO is synchronized with truncate, it cannot see them and you are
fine. Only stuff like writeback or page reclaim can see pages in such
state...

Honza

--
Jan Kara <***@suse.cz>
SUSE Labs, CR

Dave Chinner

2014-08-26 21:38:08 UTC

Post by Jan Kara

Post by Dave Chinner

Post by Jan Kara

Post by Dave Chinner
generic/263 is failing fsx at this point with a page spanning
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
because the kernel is trying to fall back to buffered IO on the
direct IO path (which XFS does not do).
The real question is this: Why can't that page be invalidated after
it has been written to disk an cleaned?

....

Post by Jan Kara

Post by Dave Chinner

Post by Jan Kara
As Anton has pointed out other filesystems solve the same issue by clearing
the dirty bits beyond EOF in their writepage() function. Also since we
zero-out the tail of the page in writepage() (even in XFS as I checked),
this kind of makes sense to me and has smaller overhead than special
set_page_dirty()...

Yes, and that's where I started - block_write_full_page and so
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2a316ad..a9d6afc 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1057,8 +1073,24 @@ xfs_vm_writepage(
do {
int new_ioend = 0;
- if (offset >= end_offset)
+ /*
+ * If the buffer is fully beyond EOF, we need to mark it clean
+ * otherwise we'll leave stale dirty buffers in memory. See the
+ * comment above in the EOF handling about Charlie Foxtrots.
+ */
+ if (offset >= end_offset) {
+ clear_buffer_dirty(bh);
+ ASSERT(!buffer_delay(bh));
+ ASSERT(!buffer_mapped(bh));
+
+ /*
+ * Page is not uptodate until buffers under IO have been
+ * fully processed.
+ */
+ uptodate = 0;
+ continue;
- break;
+ }
+
if (!buffer_uptodate(bh))
uptodate = 0;
But this doesn't solve the invalidation failures - it merely covers
up a symptom of the underlying problem. i.e. fsx went from
failing invalidate on the EOF page at operation #1192 to failing
invalidate on the EOF page at operation #9378.
The I realised that flush/wait/invalidate is actually racy against
mmap, so writepage clearing dirty buffer flags on buffers beyond EOF
is not a solution to the problem - it can still occur. So I added

Hum, I have to say I don't quite understand this. I agree with you that
clearing dirty bits in .writepage() is racy wrt mmap. However I don't see
how fsx can ever trigger this. For flush/wait/invalidate to fail, someone
would have to write to the mapping somewhere between flush and invalidate.
However fsx is singlethreaded so I don't see how that can happen.

Right, I was talking about the general problem, not necessarily the
specific fsx problem I was seeing. i.e. clearing dirty buffers
beyond EOF in writepage is not a robust solution, so regardless of
whatever other problems there are, it's not a solution we can rely
on.

....

Post by Jan Kara

Post by Dave Chinner
At which point I realised that I'm playing whack-a-mole with a
fundamental problem: buffers beyond EOF cannot be written, so
dirtying them in the first place is just fundamentally wrong. In XFS we'll
zero them on disk during an extend operation (either in write,
truncate or prealloc), so we're not going to leaak stale data by not
marking them dirty. They may not even be allocated, so we can't
assume that we can write them. So, rather than trying to handle this
dirty-buffers-beyond-EOF case in every situation where we might trip
over it, let's just prevent it from happening in the first place.

Sure I understand that. I'm now trying to understand why noone else sees
the problem (we do run xfstests for ext4 and blocksize < page size) and
whether there's something we might want to do for other filesystems as
well.

Right, I don't understand where it's coming from, either.

I know there are still problems with stray delalloc blocks even
after making this change because xfs/297 is still tripping over
delayed allocation blocks where there should be none during direct
IO. i.e. this avoids the EOF issues that the direct IO invalidation
changes expose, but I still haven't been able to find the underlying
cause that is still triggering problems within EOF.

This is the disadvantage of smart people writing smart code - most
people (me included) are simply too dumb to be able to debug it. And
so when shit goes wrong, we're left to wonder where the paddle was
hidden...

Post by Jan Kara
I wonder if there isn't something special going on with XFS
redirtying the page because xfs_imap_valid() returned EAGAIN but looking
into the code I don't see how that could happen.

To tell the truth, I'm fed up with having to deal repeatedly with
bufferhead coherency problems and not being able to find the root
cause. We just keep finding problem after problem after problem and
there's no end in sight. So I'm much more inclined to spend time
removing bufferheads from XFS than I am to try to find a needle in a
very large haystack....

Cheers,

Dave.

--
Dave Chinner
***@fromorbit.com

Dave Chinner

2014-08-21 05:09:13 UTC

From: Brian Foster <***@redhat.com>

The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
all subsequent extents down into the specified, previously punched out,
region. This function performs some validation, such as whether a
sufficient hole exists in the target region of the collapse, then shifts
the remaining exents downward.

The exit path of the function currently logs the inode unconditionally.
While we must log the inode (and abort) if an error occurs and the
transaction is dirty, the initial validation paths can generate errors
before the transaction has been dirtied. This creates an unnecessary
filesystem shutdown scenario, as the caller will cancel a transaction
that has been marked dirty.

Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
are made to the inode bmap. Only log the inode in the exit path if
logflags has been set. This ensures we only have to cancel a dirty
transaction if modifications have been made and prevents an unnecessary
filesystem shutdown otherwise.

Signed-off-by: Brian Foster <***@redhat.com>
---
fs/xfs/libxfs/xfs_bmap.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 3a6a700..e5c2518 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5422,7 +5422,7 @@ xfs_bmap_shift_extents(
struct xfs_bmap_free *flist,
int num_exts)
{
- struct xfs_btree_cur *cur;
+ struct xfs_btree_cur *cur = NULL;
struct xfs_bmbt_rec_host *gotp;
struct xfs_bmbt_irec got;
struct xfs_bmbt_irec left;
@@ -5433,7 +5433,7 @@ xfs_bmap_shift_extents(
int error = 0;
int i;
int whichfork = XFS_DATA_FORK;
- int logflags;
+ int logflags = 0;
xfs_filblks_t blockcount = 0;
int total_extents;

@@ -5476,16 +5476,11 @@ xfs_bmap_shift_extents(
}
}

- /* We are going to change core inode */
- logflags = XFS_ILOG_CORE;
if (ifp->if_flags & XFS_IFBROOT) {
cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
cur->bc_private.b.firstblock = *firstblock;
cur->bc_private.b.flist = flist;
cur->bc_private.b.flags = 0;
- } else {
- cur = NULL;
- logflags |= XFS_ILOG_DEXT;
}

/*
@@ -5543,11 +5538,14 @@ xfs_bmap_shift_extents(
blockcount = left.br_blockcount +
got.br_blockcount;
xfs_iext_remove(ip, *current_ext, 1, 0);
+ logflags |= XFS_ILOG_CORE;
if (cur) {
error = xfs_btree_delete(cur, &i);
if (error)
goto del_cursor;
XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
+ } else {
+ logflags |= XFS_ILOG_DEXT;
}
XFS_IFORK_NEXT_SET(ip, whichfork,
XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
@@ -5573,6 +5571,7 @@ xfs_bmap_shift_extents(
got.br_startoff = startoff;
}

+ logflags |= XFS_ILOG_CORE;
if (cur) {
error = xfs_bmbt_update(cur, got.br_startoff,
got.br_startblock,
@@ -5580,6 +5579,8 @@ xfs_bmap_shift_extents(
got.br_state);
if (error)
goto del_cursor;
+ } else {
+ logflags |= XFS_ILOG_DEXT;
}

(*current_ext)++;
@@ -5595,6 +5596,7 @@ del_cursor:
xfs_btree_del_cursor(cur,
error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);

- xfs_trans_log_inode(tp, ip, logflags);
+ if (logflags)
+ xfs_trans_log_inode(tp, ip, logflags);
return error;
}

--
2.0.0

Dave Chinner

2014-08-21 05:09:12 UTC

From: Dave Chinner <***@redhat.com>

Now we are not doing silly things with dirtying buffers beyond EOF
and using invalidation correctly, we can finally reduce the ranges of
writeback and invalidation used by direct IO to match that of the IO
being issued.

Bring the writeback and invalidation ranges back to match the
generic direct IO code - this will greatly reduce the perturbation
of cached data when direct IO and buffered IO are mixed, but still
provide the same buffered vs direct IO coherency behaviour we
currently have.

Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_file.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 19917fa..de5368c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -291,7 +291,7 @@ xfs_file_read_iter(
if (inode->i_mapping->nrpages) {
ret = filemap_write_and_wait_range(
VFS_I(ip)->i_mapping,
- pos, -1);
+ pos, pos + size - 1);
if (ret) {
xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
return ret;
@@ -303,7 +303,8 @@ xfs_file_read_iter(
* happen on XFS. Warn if it does fail.
*/
ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
- pos >> PAGE_CACHE_SHIFT, -1);
+ pos >> PAGE_CACHE_SHIFT,
+ (pos + size - 1) >> PAGE_CACHE_SHIFT);
WARN_ON_ONCE(ret);
ret = 0;
}
@@ -641,7 +642,7 @@ xfs_file_dio_aio_write(

if (mapping->nrpages) {
ret = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
- pos, -1);
+ pos, pos + count - 1);
if (ret)
goto out;
/*
@@ -650,7 +651,8 @@ xfs_file_dio_aio_write(
* happen on XFS. Warn if it does fail.
*/
ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
- pos >> PAGE_CACHE_SHIFT, -1);
+ pos >> PAGE_CACHE_SHIFT,
+ (pos + count - 1) >> PAGE_CACHE_SHIFT);
WARN_ON_ONCE(ret);
ret = 0;
}

--
2.0.0

Brian Foster

2014-08-21 13:09:18 UTC

Post by Dave Chinner
Now we are not doing silly things with dirtying buffers beyond EOF
and using invalidation correctly, we can finally reduce the ranges of
writeback and invalidation used by direct IO to match that of the IO
being issued.
Bring the writeback and invalidation ranges back to match the
generic direct IO code - this will greatly reduce the perturbation
of cached data when direct IO and buffered IO are mixed, but still
provide the same buffered vs direct IO coherency behaviour we
currently have.
---
fs/xfs/xfs_file.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 19917fa..de5368c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -291,7 +291,7 @@ xfs_file_read_iter(
if (inode->i_mapping->nrpages) {
ret = filemap_write_and_wait_range(
VFS_I(ip)->i_mapping,
- pos, -1);
+ pos, pos + size - 1);
if (ret) {
xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
return ret;
@@ -303,7 +303,8 @@ xfs_file_read_iter(
* happen on XFS. Warn if it does fail.
*/
ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
- pos >> PAGE_CACHE_SHIFT, -1);
+ pos >> PAGE_CACHE_SHIFT,
+ (pos + size - 1) >> PAGE_CACHE_SHIFT);
WARN_ON_ONCE(ret);
ret = 0;
}
@@ -641,7 +642,7 @@ xfs_file_dio_aio_write(
if (mapping->nrpages) {
ret = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
- pos, -1);
+ pos, pos + count - 1);
if (ret)
goto out;
/*
@@ -650,7 +651,8 @@ xfs_file_dio_aio_write(
* happen on XFS. Warn if it does fail.
*/
ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
- pos >> PAGE_CACHE_SHIFT, -1);
+ pos >> PAGE_CACHE_SHIFT,
+ (pos + count - 1) >> PAGE_CACHE_SHIFT);
WARN_ON_ONCE(ret);
ret = 0;
}
--
2.0.0
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs

Dave Chinner

2014-08-21 05:09:11 UTC

From: Dave Chinner <***@redhat.com>

Similar to direct IO reads, direct IO writes are using
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.

This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.

cc: ***@vger.kernel.org
Signed-off-by: Dave Chinner <***@redhat.com>
---
fs/xfs/xfs_file.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 827cfb2..19917fa 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -644,7 +644,15 @@ xfs_file_dio_aio_write(
pos, -1);
if (ret)
goto out;
- truncate_pagecache_range(VFS_I(ip), pos, -1);
+ /*
+ * Invalidate whole pages. This can return an error if
+ * we fail to invalidate a page, but this should never
+ * happen on XFS. Warn if it does fail.
+ */
+ ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
+ pos >> PAGE_CACHE_SHIFT, -1);
+ WARN_ON_ONCE(ret);
+ ret = 0;
}

/*

--
2.0.0

Brian Foster

2014-08-21 13:09:10 UTC

Post by Dave Chinner
Similar to direct IO reads, direct IO writes are using
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.
This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.
---
fs/xfs/xfs_file.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 827cfb2..19917fa 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -644,7 +644,15 @@ xfs_file_dio_aio_write(
pos, -1);
if (ret)
goto out;
- truncate_pagecache_range(VFS_I(ip), pos, -1);
+ /*
+ * Invalidate whole pages. This can return an error if
+ * we fail to invalidate a page, but this should never
+ * happen on XFS. Warn if it does fail.
+ */
+ ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
+ pos >> PAGE_CACHE_SHIFT, -1);
+ WARN_ON_ONCE(ret);
+ ret = 0;
}
/*
--
2.0.0
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs

Brian Foster

2014-08-21 13:08:47 UTC

Post by Dave Chinner
xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads. This is different from the other filesystems who
only invalidate pages during DIO writes.
truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page. This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.
buffered reads will find an up to date page with zeros instead of
the data actually on disk.
This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.
[dchinner: catch error and warn if it fails. Comment.]
---
fs/xfs/xfs_file.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 076b170..827cfb2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -296,7 +296,16 @@ xfs_file_read_iter(
xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
return ret;
}
- truncate_pagecache_range(VFS_I(ip), pos, -1);
+
+ /*
+ * Invalidate whole pages. This can return an error if
+ * we fail to invalidate a page, but this should never
+ * happen on XFS. Warn if it does fail.
+ */
+ ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
+ pos >> PAGE_CACHE_SHIFT, -1);
+ WARN_ON_ONCE(ret);
+ ret = 0;
}
xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
}
--
2.0.0
_______________________________________________
xfs mailing list
http://oss.sgi.com/mailman/listinfo/xfs

18 Replies
1 View
Permalink to this page
Disable enhanced parsing

Thread Navigation

Dave Chinner 2014-08-21 05:09:10 UTC

Dave Chinner 2014-08-21 05:09:14 UTC

Brian Foster 2014-08-21 13:09:25 UTC

Dave Chinner 2014-08-21 05:09:09 UTC

Brian Foster 2014-08-21 12:48:24 UTC

Dave Chinner 2014-08-21 22:38:45 UTC

Christoph Hellwig 2014-08-21 13:08:12 UTC

Anton Altaparmakov 2014-08-21 13:54:30 UTC

Chris Mason 2014-08-21 15:21:34 UTC

Jan Kara 2014-08-21 19:56:32 UTC

Dave Chinner 2014-08-21 22:33:32 UTC

Jan Kara 2014-08-26 16:06:30 UTC

Dave Chinner 2014-08-26 21:38:08 UTC

Dave Chinner 2014-08-21 05:09:13 UTC

Dave Chinner 2014-08-21 05:09:12 UTC

Brian Foster 2014-08-21 13:09:18 UTC

Dave Chinner 2014-08-21 05:09:11 UTC

Brian Foster 2014-08-21 13:09:10 UTC

Brian Foster 2014-08-21 13:08:47 UTC

about - legalese

Loading...