OOM kills when running fsstress on CIFS

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

OOM kills when running fsstress on CIFS

Jeff Layton-2
Since 2.6.34, I've been able to consistently reproduce OOM kills when running fsstress (from the LTP suite) on CIFS. I spent some time yesterday and bisected it down to this patch:

---------------------[snip]---------------------
commit 315e995c63a15cb4d4efdbfd70fe2db191917f7a
Author: Nick Piggin <[hidden email]>
Date:   Wed Apr 21 03:18:28 2010 +0000

    [CIFS] use add_to_page_cache_lru
   
    add_to_page_cache_lru is exported, so it should be used. Benefits over
    using a private pagevec: neater code, 128 bytes fewer stack used, percpu
    lru ordering is preserved, and finally don't need to flush pagevec
    before returning so batching may be shared with other LRU insertions.
   
    Signed-off-by: Nick Piggin <[hidden email]>
    Reviewed-by: Dave Kleikamp <[hidden email]>
    Signed-off-by: Steve French <[hidden email]>
---------------------[snip]---------------------

Here's how I've been reproducing it:

Mount up a samba share with -o sec=krb5i,nounix,noserverino

Run: fsstress -d /path/to/dir/on/cifs/ -n 1000 -l0 -p8

...within an hour or two, I start getting OOM kills. After backing out
the patch above, I was able to run the test overnight. I'm not sure yet
what the actual problem is, but there seems to be something wrong with
that patch.

Thoughts?
--
Jeff Layton <[hidden email]>
_______________________________________________
linux-cifs-client mailing list
[hidden email]
https://lists.samba.org/mailman/listinfo/linux-cifs-client
Reply | Threaded
Open this post in threaded view
|

Re: OOM kills when running fsstress on CIFS

Nick Piggin-4
On Tue, May 25, 2010 at 06:57:05AM -0400, Jeff Layton wrote:

> Since 2.6.34, I've been able to consistently reproduce OOM kills when running fsstress (from the LTP suite) on CIFS. I spent some time yesterday and bisected it down to this patch:
>
> ---------------------[snip]---------------------
> commit 315e995c63a15cb4d4efdbfd70fe2db191917f7a
> Author: Nick Piggin <[hidden email]>
> Date:   Wed Apr 21 03:18:28 2010 +0000
>
>     [CIFS] use add_to_page_cache_lru
>    
>     add_to_page_cache_lru is exported, so it should be used. Benefits over
>     using a private pagevec: neater code, 128 bytes fewer stack used, percpu
>     lru ordering is preserved, and finally don't need to flush pagevec
>     before returning so batching may be shared with other LRU insertions.
>    
>     Signed-off-by: Nick Piggin <[hidden email]>
>     Reviewed-by: Dave Kleikamp <[hidden email]>
>     Signed-off-by: Steve French <[hidden email]>
> ---------------------[snip]---------------------
>
> Here's how I've been reproducing it:
>
> Mount up a samba share with -o sec=krb5i,nounix,noserverino
>
> Run: fsstress -d /path/to/dir/on/cifs/ -n 1000 -l0 -p8
>
> ...within an hour or two, I start getting OOM kills. After backing out
> the patch above, I was able to run the test overnight. I'm not sure yet
> what the actual problem is, but there seems to be something wrong with
> that patch.
>
> Thoughts?

Yep, it's my fault. The problem is the refcounting. Previously the
code hands off the references to the LRU, wheras now the lru takes
a new reference. (the other filesystems converted to use this
function seemed to more conventionally open-code lru_cache_add).

Can we get rid of a refcount increment anywhere? Otherwise we'll
need to just drop the references after adding the pages.

_______________________________________________
linux-cifs-client mailing list
[hidden email]
https://lists.samba.org/mailman/listinfo/linux-cifs-client
Reply | Threaded
Open this post in threaded view
|

Re: OOM kills when running fsstress on CIFS

Jeff Layton-2
On Tue, 25 May 2010 21:16:39 +1000
Nick Piggin <[hidden email]> wrote:

> On Tue, May 25, 2010 at 06:57:05AM -0400, Jeff Layton wrote:
> > Since 2.6.34, I've been able to consistently reproduce OOM kills when running fsstress (from the LTP suite) on CIFS. I spent some time yesterday and bisected it down to this patch:
> >
> > ---------------------[snip]---------------------
> > commit 315e995c63a15cb4d4efdbfd70fe2db191917f7a
> > Author: Nick Piggin <[hidden email]>
> > Date:   Wed Apr 21 03:18:28 2010 +0000
> >
> >     [CIFS] use add_to_page_cache_lru
> >    
> >     add_to_page_cache_lru is exported, so it should be used. Benefits over
> >     using a private pagevec: neater code, 128 bytes fewer stack used, percpu
> >     lru ordering is preserved, and finally don't need to flush pagevec
> >     before returning so batching may be shared with other LRU insertions.
> >    
> >     Signed-off-by: Nick Piggin <[hidden email]>
> >     Reviewed-by: Dave Kleikamp <[hidden email]>
> >     Signed-off-by: Steve French <[hidden email]>
> > ---------------------[snip]---------------------
> >
> > Here's how I've been reproducing it:
> >
> > Mount up a samba share with -o sec=krb5i,nounix,noserverino
> >
> > Run: fsstress -d /path/to/dir/on/cifs/ -n 1000 -l0 -p8
> >
> > ...within an hour or two, I start getting OOM kills. After backing out
> > the patch above, I was able to run the test overnight. I'm not sure yet
> > what the actual problem is, but there seems to be something wrong with
> > that patch.
> >
> > Thoughts?
>
> Yep, it's my fault. The problem is the refcounting. Previously the
> code hands off the references to the LRU, wheras now the lru takes
> a new reference. (the other filesystems converted to use this
> function seemed to more conventionally open-code lru_cache_add).
>
> Can we get rid of a refcount increment anywhere? Otherwise we'll
> need to just drop the references after adding the pages.
>

The only caller of this function is cifs_readpages, and I don't see
where it takes any references. I'm guessing that the pages come from
the VFS with the refcount already incremented?

--
Jeff Layton <[hidden email]>
_______________________________________________
linux-cifs-client mailing list
[hidden email]
https://lists.samba.org/mailman/listinfo/linux-cifs-client
Reply | Threaded
Open this post in threaded view
|

Re: OOM kills when running fsstress on CIFS

Nick Piggin-4
On Tue, May 25, 2010 at 07:49:29AM -0400, Jeff Layton wrote:

> On Tue, 25 May 2010 21:16:39 +1000
> Nick Piggin <[hidden email]> wrote:
>
> > On Tue, May 25, 2010 at 06:57:05AM -0400, Jeff Layton wrote:
> > > Since 2.6.34, I've been able to consistently reproduce OOM kills when running fsstress (from the LTP suite) on CIFS. I spent some time yesterday and bisected it down to this patch:
> > >
> > > ---------------------[snip]---------------------
> > > commit 315e995c63a15cb4d4efdbfd70fe2db191917f7a
> > > Author: Nick Piggin <[hidden email]>
> > > Date:   Wed Apr 21 03:18:28 2010 +0000
> > >
> > >     [CIFS] use add_to_page_cache_lru
> > >    
> > >     add_to_page_cache_lru is exported, so it should be used. Benefits over
> > >     using a private pagevec: neater code, 128 bytes fewer stack used, percpu
> > >     lru ordering is preserved, and finally don't need to flush pagevec
> > >     before returning so batching may be shared with other LRU insertions.
> > >    
> > >     Signed-off-by: Nick Piggin <[hidden email]>
> > >     Reviewed-by: Dave Kleikamp <[hidden email]>
> > >     Signed-off-by: Steve French <[hidden email]>
> > > ---------------------[snip]---------------------
> > >
> > > Here's how I've been reproducing it:
> > >
> > > Mount up a samba share with -o sec=krb5i,nounix,noserverino
> > >
> > > Run: fsstress -d /path/to/dir/on/cifs/ -n 1000 -l0 -p8
> > >
> > > ...within an hour or two, I start getting OOM kills. After backing out
> > > the patch above, I was able to run the test overnight. I'm not sure yet
> > > what the actual problem is, but there seems to be something wrong with
> > > that patch.
> > >
> > > Thoughts?
> >
> > Yep, it's my fault. The problem is the refcounting. Previously the
> > code hands off the references to the LRU, wheras now the lru takes
> > a new reference. (the other filesystems converted to use this
> > function seemed to more conventionally open-code lru_cache_add).
> >
> > Can we get rid of a refcount increment anywhere? Otherwise we'll
> > need to just drop the references after adding the pages.
> >
>
> The only caller of this function is cifs_readpages, and I don't see
> where it takes any references. I'm guessing that the pages come from
> the VFS with the refcount already incremented?

Yep. I think we should just page_cache_release after doing the
add_to_page_cache_lru, like the generic code does.

It's a little suboptimal tinkering with refcounts like that, but
a cleanup pass to fix it up all over the tree and allow
add_to_page_cache_lru to take over a reference would be better
place to fix that.

_______________________________________________
linux-cifs-client mailing list
[hidden email]
https://lists.samba.org/mailman/listinfo/linux-cifs-client
Reply | Threaded
Open this post in threaded view
|

Re: OOM kills when running fsstress on CIFS

Jeff Layton-2
On Tue, 25 May 2010 21:54:31 +1000
Nick Piggin <[hidden email]> wrote:

> On Tue, May 25, 2010 at 07:49:29AM -0400, Jeff Layton wrote:
> > On Tue, 25 May 2010 21:16:39 +1000
> > Nick Piggin <[hidden email]> wrote:
> >
> > > On Tue, May 25, 2010 at 06:57:05AM -0400, Jeff Layton wrote:
> > > > Since 2.6.34, I've been able to consistently reproduce OOM kills when running fsstress (from the LTP suite) on CIFS. I spent some time yesterday and bisected it down to this patch:
> > > >
> > > > ---------------------[snip]---------------------
> > > > commit 315e995c63a15cb4d4efdbfd70fe2db191917f7a
> > > > Author: Nick Piggin <[hidden email]>
> > > > Date:   Wed Apr 21 03:18:28 2010 +0000
> > > >
> > > >     [CIFS] use add_to_page_cache_lru
> > > >    
> > > >     add_to_page_cache_lru is exported, so it should be used. Benefits over
> > > >     using a private pagevec: neater code, 128 bytes fewer stack used, percpu
> > > >     lru ordering is preserved, and finally don't need to flush pagevec
> > > >     before returning so batching may be shared with other LRU insertions.
> > > >    
> > > >     Signed-off-by: Nick Piggin <[hidden email]>
> > > >     Reviewed-by: Dave Kleikamp <[hidden email]>
> > > >     Signed-off-by: Steve French <[hidden email]>
> > > > ---------------------[snip]---------------------
> > > >
> > > > Here's how I've been reproducing it:
> > > >
> > > > Mount up a samba share with -o sec=krb5i,nounix,noserverino
> > > >
> > > > Run: fsstress -d /path/to/dir/on/cifs/ -n 1000 -l0 -p8
> > > >
> > > > ...within an hour or two, I start getting OOM kills. After backing out
> > > > the patch above, I was able to run the test overnight. I'm not sure yet
> > > > what the actual problem is, but there seems to be something wrong with
> > > > that patch.
> > > >
> > > > Thoughts?
> > >
> > > Yep, it's my fault. The problem is the refcounting. Previously the
> > > code hands off the references to the LRU, wheras now the lru takes
> > > a new reference. (the other filesystems converted to use this
> > > function seemed to more conventionally open-code lru_cache_add).
> > >
> > > Can we get rid of a refcount increment anywhere? Otherwise we'll
> > > need to just drop the references after adding the pages.
> > >
> >
> > The only caller of this function is cifs_readpages, and I don't see
> > where it takes any references. I'm guessing that the pages come from
> > the VFS with the refcount already incremented?
>
> Yep. I think we should just page_cache_release after doing the
> add_to_page_cache_lru, like the generic code does.
>
> It's a little suboptimal tinkering with refcounts like that, but
> a cleanup pass to fix it up all over the tree and allow
> add_to_page_cache_lru to take over a reference would be better
> place to fix that.
>

Ok, I tested out the patch that I just sent and it seems to have fixed
the problem. Look ok?

--
Jeff Layton <[hidden email]>
_______________________________________________
linux-cifs-client mailing list
[hidden email]
https://lists.samba.org/mailman/listinfo/linux-cifs-client