[PATCH] RFC: vfs cachedir

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH] RFC: vfs cachedir

Samba - samba-technical mailing list
Hi,

A customer is re-exporting a NFS mount through a samba share and is
having some strange issues e.g. when mounting the share on a Window 7
system and running "del /q *" in it, only half the file are deleted.

I know re-exporting NFS mounts is not recommended but I've tried to
debug this anyway.

As it turns out the reason why it fails is because smbd assumes that
directory offsets given by telldir() can be reused and set again by
seekdir().

More specifically when doing FIND_FIRST/FIND_NEXT requests against a
directory with a lot of files, several FIND_NEXT requests have to be
sent to get the full listing, and each one provides an offset into the
results. For SMB2+ code, samba "fakes" smb1 requests out of the smb2
ones and the same code is used.

On a NFS mount (and I suspect on other non-trivial fs), the DIR handle
can be updated after being opened. That means that setting the offset to
the supposedly same value might end up skipping files.

That means that:

- if you take the current offset with telldir
- delete files in the dir
- and seek to that same offset with seekdir

the same offset might point to a different file.

          DIR*
     off.    
        0 file a
       10 file b
   ->  20 file c  
       30 file d

telldir(dir) == 20
...something somewhere delete file a...
seekdir(dir, 20)

          DIR*
     off.  
        0 file b
       10 file c      
   ->  20 file d

next readdir() will return file d instead of file c.

Here is what POSIX[1] says about readdir():

> If a file is removed from or added to the directory after the most
> recent call to opendir() or rewinddir(), whether a subsequent call to
> readdir() returns an entry for that file is unspecified.

So I've written a small VFS (attached) that just caches a DIR handle
when it's opened. It reads all the directory entries in a structure that
is used in place of the DIR handle (which is closed as soon as we read
all the entries). It makes a "snapshot" of the directory, so to
speak. The rewind operation just resets the internal cursor and does not
read the directory again.

This VFS seems to solve the issue in this specific scenario. But I'm
sure it could fail in others (which ones?). Would you let customers use
this VFS? Should it be merged?

1: http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html

--
Aurélien Aptel / SUSE Labs Samba Team
GPG: 1839 CB5F 9F5B FB9B AA97  8C99 03C8 A49B 521B D5D3
SUSE Linux GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] RFC: vfs cachedir

Samba - samba-technical mailing list
On Wed, Apr 19, 2017 at 03:49:24PM +0200, Aurélien Aptel via samba-technical wrote:

> A customer is re-exporting a NFS mount through a samba share and is
> having some strange issues e.g. when mounting the share on a Window 7
> system and running "del /q *" in it, only half the file are deleted.
>
> I know re-exporting NFS mounts is not recommended but I've tried to
> debug this anyway.
>
> As it turns out the reason why it fails is because smbd assumes that
> directory offsets given by telldir() can be reused and set again by
> seekdir().
>
> More specifically when doing FIND_FIRST/FIND_NEXT requests against a
> directory with a lot of files, several FIND_NEXT requests have to be
> sent to get the full listing, and each one provides an offset into the
> results. For SMB2+ code, samba "fakes" smb1 requests out of the smb2
> ones and the same code is used.

Your module looks interesting! :-)

Brainstorming....

Dumb question (I haven't looked at that code for ages): Why do we play
telldir/seekdir tricks at all? Why don't we just opendir/readdir/closedir
in the highlevel code?

If we inevitably have to do the seekdir/telldir (Jeremy?? :-), shouldn't
we put your code into core Samba and do the caching always?

And -- would it help if we didn't do the telldir/seekdir thingy? Also,
isn't readdir() inherently racy? Caching the whole thing will solve
the race against ourselves (which is very likely the 99.9% solution),
but will it also help against others modifying the directory concurrently?

Volker

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] RFC: vfs cachedir

Samba - samba-technical mailing list
In reply to this post by Samba - samba-technical mailing list
On Wed, Apr 19, 2017 at 03:49:24PM +0200, Aurélien Aptel wrote:
> debug this anyway.
>
> As it turns out the reason why it fails is because smbd assumes that
> directory offsets given by telldir() can be reused and set again by
> seekdir().

Yes, that's POSIX.

> More specifically when doing FIND_FIRST/FIND_NEXT requests against a
> directory with a lot of files, several FIND_NEXT requests have to be
> sent to get the full listing, and each one provides an offset into the
> results. For SMB2+ code, samba "fakes" smb1 requests out of the smb2
> ones and the same code is used.
>
> On a NFS mount

> (and I suspect on other non-trivial fs),

No, you would be wrong there. Non-trivial file systems work
hard to get POSIX right. Just not NFS :-).

> the DIR handle
> can be updated after being opened. That means that setting the offset to
> the supposedly same value might end up skipping files.

That's broken w.r.t. POSIX. It's one of the reasons we don't
recommend NFS re-exports. If a a directory is being written
into whilst we have an open handle then yes, you might
miss files. If no one is writing into it then telldir/seekdir
should be reliable.

That's not to say that your VFS module isn't a good idea,
just that the reason for it is non-compliance with POSIX :-).

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] RFC: vfs cachedir

Samba - samba-technical mailing list
In reply to this post by Samba - samba-technical mailing list
On Wed, Apr 19, 2017 at 03:49:24PM +0200, Aurélien Aptel wrote:

>
> So I've written a small VFS (attached) that just caches a DIR handle
> when it's opened. It reads all the directory entries in a structure that
> is used in place of the DIR handle (which is closed as soon as we read
> all the entries). It makes a "snapshot" of the directory, so to
> speak. The rewind operation just resets the internal cursor and does not
> read the directory again.
>
> This VFS seems to solve the issue in this specific scenario. But I'm
> sure it could fail in others (which ones?). Would you let customers use
> this VFS? Should it be merged?
>
> 1: http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html

So this module needs to avoid direct calls to
fdopendir/opendir/readdir/closedir and call into
SMB_VFS_NEXT_XXXXX() functions instead.

Without that it's not stackable.

If you're coming to SambaXP I'm planning to cover
this in my talk on writing VFS modules :-).

Cheers,

        Jeremy.


> SUSE Linux GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
> GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)

> From 484d854add9499161ec6539dc570d24e40699459 Mon Sep 17 00:00:00 2001
> From: Aurelien Aptel <[hidden email]>
> Date: Mon, 10 Apr 2017 18:44:30 +0200
> Subject: [PATCH] vfs: add vfs_cachedir
>
> Signed-off-by: Aurelien Aptel <[hidden email]>
> ---
>  source3/modules/vfs_cachedir.c | 273 +++++++++++++++++++++++++++++++++++++++++
>  source3/modules/wscript_build  |   7 ++
>  source3/wscript                |   2 +-
>  3 files changed, 281 insertions(+), 1 deletion(-)
>  create mode 100644 source3/modules/vfs_cachedir.c
>
> diff --git a/source3/modules/vfs_cachedir.c b/source3/modules/vfs_cachedir.c
> new file mode 100644
> index 00000000000..c0113ce5627
> --- /dev/null
> +++ b/source3/modules/vfs_cachedir.c
> @@ -0,0 +1,273 @@
> +/*
> +   Unix SMB/CIFS implementation.
> +   Cache dir listing operations to have stable enumeration
> +   Copyright (C) 2017 Aurélien Aptel <[hidden email]>
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +*/
> +
> +#include "includes.h"
> +#include "smbd/smbd.h"
> +#include <dirent.h>
> +#include <sys/statvfs.h>
> +
> +#undef DBGC_CLASS
> +#define DBGC_CLASS DBGC_VFS
> +
> +#define INITIAL_CACHE_SIZE 32
> +
> +struct cachedir {
> + size_t count;
> + size_t offset;
> + /* talloc allocated array, actual size embedded in talloc */
> + struct dirent *entries;
> +};
> +
> +static struct cachedir* cachedir_alloc(TALLOC_CTX* ctx)
> +{
> + struct cachedir *dc;
> + dc = talloc_zero(ctx, struct cachedir);
> + if (!dc)
> + return NULL;
> + dc->entries = talloc_array(dc, struct dirent, INITIAL_CACHE_SIZE);
> + if (!dc->entries) {
> + TALLOC_FREE(dc);
> + return NULL;
> + }
> +
> + return dc;
> +}
> +
> +static int cachedir_add(struct cachedir *dc, const struct dirent *dentry)
> +{
> + if (dc->count >= talloc_array_length(dc->entries)) {
> + struct dirent *p = talloc_realloc(dc, dc->entries, struct dirent, dc->count*2);
> + if (!p) {
> + return -1;
> + }
> + dc->entries = p;
> + }
> +
> + memcpy(dc->entries + dc->count, dentry, sizeof(*dentry));
> + dc->count++;
> + return 0;
> +}
> +
> +static int cachedir_read_whole_dir(struct cachedir* dc, DIR* dir)
> +{
> + int r;
> + struct dirent *tmp;
> + int old_errno;
> +
> + old_errno = errno;
> + if (errno) {
> + DBG_WARNING("[cachedir] resetting non-zero errno (%d) before readdir calls\n", errno);
> + errno = 0;
> + }
> +
> + while (1) {
> + DBG_DEBUG("[cachedir] readdir called\n");
> + tmp = readdir(dir);
> + if (tmp == NULL) {
> + if (errno) {
> + DBG_ERR("[cachedir] readdir failed (errno %d)\n", errno);
> + return -1;
> + }
> + break;
> + }
> +
> + r = cachedir_add(dc, tmp);
> + if (r < 0) {
> + DBG_ERR("[cachedir] cachedir_add failed\n");
> + return r;
> + }
> + }
> +
> + errno = old_errno;
> +
> + return 0;
> +}
> +
> +static size_t cachedir_total_size(struct cachedir *dc)
> +{
> + return sizeof(*dc->entries) * dc->count;
> +}
> +
> +/* Directory operations */
> +
> +static DIR *cachedirwrap_opendir(vfs_handle_struct *handle,
> +    const struct smb_filename *smb_fname,
> +    const char *mask,
> +    uint32_t attr)
> +{
> + struct cachedir *dc = NULL;
> + DIR *dh = NULL;
> + DIR *result = NULL;
> + int r;
> +
> + DBG_DEBUG("[cachedir] opendir called\n");
> + dh = opendir(smb_fname->base_name);
> + if (!dh) {
> + DBG_ERR("[cachedir] opendir failed (errno %d)\n", errno);
> + return NULL;
> + }
> +
> + dc = cachedir_alloc(NULL);
> + if (!dc) {
> + DBG_ERR("[cachedir] cachedir_alloc failed\n");
> + goto out;
> + }
> +
> + r = cachedir_read_whole_dir(dc, dh);
> + if (r) {
> + DBG_ERR("[cachedir] cachedir_read_whole_dir failed\n");
> + goto out;
> + }
> +
> + result = (DIR*)dc;
> +
> +out:
> + if (!result) {
> + TALLOC_FREE(dc);
> + }
> +
> + DBG_DEBUG("[cachedir] closedir called\n");
> + r = closedir(dh);
> + if (r) {
> + DBG_ERR("[cachedir] closedir failed (errno %d)\n", errno);
> + }
> + return result;
> +}
> +
> +static DIR *cachedirwrap_fdopendir(vfs_handle_struct *handle,
> + files_struct *fsp,
> + const char *mask,
> + uint32_t attr)
> +{
> + DIR *result = NULL;
> + DIR *dh = NULL;
> + struct cachedir *dc = NULL;
> + int r;
> +
> + DBG_DEBUG("[cachedir] sys_fdopendir called\n");
> + dh = sys_fdopendir(fsp->fh->fd);
> + if (!dh) {
> + DBG_ERR("[cachedir] sys_fdopendir failed (errno %d)\n", errno);
> + return NULL;
> + }
> +
> + dc = cachedir_alloc(NULL);
> + if (!dc) {
> + DBG_ERR("[cachedir] cachedir_alloc failed\n");
> + goto out;
> + }
> +
> + r = cachedir_read_whole_dir(dc, dh);
> + if (r) {
> + DBG_ERR("[cachedir] cachedir_read_whole_dir failed\n");
> + goto out;
> + }
> +
> + result = (DIR*)dc;
> +
> +out:
> + if (!result) {
> + TALLOC_FREE(dc);
> + }
> +
> + DBG_DEBUG("[cachedir] closedir called\n");
> + r = closedir(dh);
> + if (r) {
> + DBG_ERR("[cachedir] closedir failed (errno %d)\n", errno);
> + }
> + return result;
> +}
> +
> +
> +static struct dirent *cachedirwrap_readdir(vfs_handle_struct *handle,
> +          DIR *dirp,
> +  SMB_STRUCT_STAT *sbuf)
> +{
> + struct dirent *result = NULL;
> + struct cachedir *dc = (struct cachedir *)dirp;
> +
> + if (dc->offset >= cachedir_total_size(dc)) {
> + goto out;
> + }
> +
> +
> + result = (struct dirent *)(((uint8_t*)dc->entries) + dc->offset);
> + dc->offset += sizeof(*dc->entries);
> +
> +out:
> + if (sbuf) {
> + /* Default Posix readdir() does not give us stat info.
> + * Set to invalid to indicate we didn't return this info. */
> + SET_STAT_INVALID(*sbuf);
> + }
> +
> + return result;
> +}
> +
> +static NTSTATUS cachedirwrap_readdir_attr(struct vfs_handle_struct *handle,
> +     const struct smb_filename *fname,
> +     TALLOC_CTX *mem_ctx,
> +     struct readdir_attr_data **attr_data)
> +{
> + return NT_STATUS_NOT_SUPPORTED;
> +}
> +
> +static void cachedirwrap_seekdir(vfs_handle_struct *handle, DIR *dirp, long offset)
> +{
> + struct cachedir *dc = (struct cachedir *)dirp;
> + dc->offset = offset;
> +}
> +
> +static long cachedirwrap_telldir(vfs_handle_struct *handle, DIR *dirp)
> +{
> + struct cachedir *dc = (struct cachedir *)dirp;
> + return (long)dc->offset;
> +}
> +
> +static void cachedirwrap_rewinddir(vfs_handle_struct *handle, DIR *dirp)
> +{
> + struct cachedir *dc = (struct cachedir *)dirp;
> + dc->offset = 0;
> +}
> +
> +
> +static int cachedirwrap_closedir(vfs_handle_struct *handle, DIR *dirp)
> +{
> + struct cachedir *dc = (struct cachedir *)dirp;
> + TALLOC_FREE(dc);
> + return 0;
> +}
> +
> +static struct vfs_fn_pointers vfs_cachedir_fns = {
> + .opendir_fn = cachedirwrap_opendir,
> + .fdopendir_fn = cachedirwrap_fdopendir,
> + .readdir_fn = cachedirwrap_readdir,
> + .readdir_attr_fn = cachedirwrap_readdir_attr,
> + .seekdir_fn = cachedirwrap_seekdir,
> + .telldir_fn = cachedirwrap_telldir,
> + .rewind_dir_fn = cachedirwrap_rewinddir,
> + .closedir_fn = cachedirwrap_closedir,
> +};
> +
> +NTSTATUS vfs_cachedir_init(void);
> +NTSTATUS vfs_cachedir_init(void)
> +{
> + return smb_register_vfs(SMB_VFS_INTERFACE_VERSION,
> + "cachedir", &vfs_cachedir_fns);
> +}
> diff --git a/source3/modules/wscript_build b/source3/modules/wscript_build
> index a5d84075872..2b3db888c41 100644
> --- a/source3/modules/wscript_build
> +++ b/source3/modules/wscript_build
> @@ -502,3 +502,10 @@ bld.SAMBA3_MODULE('vfs_fake_dfq',
>                   init_function='',
>                   internal_module=bld.SAMBA3_IS_STATIC_MODULE('vfs_fake_dfq'),
>                   enabled=bld.SAMBA3_IS_ENABLED_MODULE('vfs_fake_dfq'))
> +
> +bld.SAMBA3_MODULE('vfs_cachedir',
> +                 subsystem='vfs',
> +                 source='vfs_cachedir.c',
> +                 init_function='',
> +                 internal_module=bld.SAMBA3_IS_STATIC_MODULE('vfs_cachedir'),
> +                 enabled=bld.SAMBA3_IS_ENABLED_MODULE('vfs_cachedir'))
> diff --git a/source3/wscript b/source3/wscript
> index 78753687431..24c2aa677c8 100644
> --- a/source3/wscript
> +++ b/source3/wscript
> @@ -1683,7 +1683,7 @@ main() {
>                                        vfs_preopen vfs_catia
>                                        vfs_media_harmony vfs_unityed_media vfs_fruit vfs_shell_snap
>                                        vfs_commit vfs_worm vfs_crossrename vfs_linux_xfs_sgid
> -                                      vfs_time_audit vfs_offline
> +                                      vfs_time_audit vfs_offline vfs_cachedir
>                                    '''))
>      default_shared_modules.extend(TO_LIST('auth_script idmap_tdb2 idmap_script'))
>      # these have broken dependencies
> --
> 2.12.0
>


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] RFC: vfs cachedir

Samba - samba-technical mailing list
In reply to this post by Samba - samba-technical mailing list
On Wed, Apr 19, 2017 at 03:49:24PM +0200, Aurélien Aptel via samba-technical wrote:

>
> That means that:
>
> - if you take the current offset with telldir
> - delete files in the dir
> - and seek to that same offset with seekdir
>
> the same offset might point to a different file.
>
>  DIR*
>      off.    
> 0 file a
>        10 file b
>    ->  20 file c  
>        30 file d
>
> telldir(dir) == 20
> ...something somewhere delete file a...
> seekdir(dir, 20)
>
>  DIR*
>      off.  
> 0 file b
>        10 file c      
>    ->  20 file d
>
> next readdir() will return file d instead of file c.
>
> Here is what POSIX[1] says about readdir():
>
> > If a file is removed from or added to the directory after the most
> > recent call to opendir() or rewinddir(), whether a subsequent call to
> > readdir() returns an entry for that file is unspecified.

Oh sorry, missed this. I assumed that you were
having problems on NFS with no one else writing
or deleting in the directory.

Yes of course if someone is writing into the directory
then all bets are off w.r.t. getting coherence out of
readdir etc.

This is a race condition you simply can't win.

Even if you opendir/readdir..../closedir
there's no guarentee that someone didn't
delete one of the names you read with
readdir.

The only way around this is directory
leases, and break the lease to let the
client know someone else is writing into
the directory.

But even with that there's still no way to fix a race
condition like this - all you're doing is
exposing a frozen snapshot that can be
out of date from the moment you return it.

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] RFC: vfs cachedir

Samba - samba-technical mailing list
On 04/19/2017 01:14 PM, Jeremy Allison via samba-technical wrote:

> On Wed, Apr 19, 2017 at 03:49:24PM +0200, Aurélien Aptel via samba-technical wrote:
>>
>> That means that:
>>
>> - if you take the current offset with telldir
>> - delete files in the dir
>> - and seek to that same offset with seekdir
>>
>> the same offset might point to a different file.
>>
>>  DIR*
>>      off.    
>> 0 file a
>>        10 file b
>>    ->  20 file c  
>>        30 file d
>>
>> telldir(dir) == 20
>> ...something somewhere delete file a...
>> seekdir(dir, 20)
>>
>>  DIR*
>>      off.  
>> 0 file b
>>        10 file c      
>>    ->  20 file d
>>
>> next readdir() will return file d instead of file c.
>>
>> Here is what POSIX[1] says about readdir():
>>
>>> If a file is removed from or added to the directory after the most
>>> recent call to opendir() or rewinddir(), whether a subsequent call to
>>> readdir() returns an entry for that file is unspecified.
>
> Oh sorry, missed this. I assumed that you were
> having problems on NFS with no one else writing
> or deleting in the directory.
>
> Yes of course if someone is writing into the directory
> then all bets are off w.r.t. getting coherence out of
> readdir etc.
>
> This is a race condition you simply can't win.
In this case, windows is racing itself.  "del *" or "move *" triggers
it.  As each chunk is returned, all those files are deleted.  _then_ the
search is resumed, but we're lost.

>
> Even if you opendir/readdir..../closedir
> there's no guarentee that someone didn't
> delete one of the names you read with
> readdir.
>
> The only way around this is directory
> leases, and break the lease to let the
> client know someone else is writing into
> the directory.
>
> But even with that there's still no way to fix a race
> condition like this - all you're doing is
> exposing a frozen snapshot that can be
> out of date from the moment you return it.
>
Yep, but frankly we're simply lucky that "del *" works on a local fs.
Apparently ext<n>, btrfs, xfs, etc. are just not updating the directory
from the viewpoint of our open handle as frequently.  Windows cmd.exe
clearly expects the behavior of a frozen snapshot.

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] RFC: vfs cachedir

Samba - samba-technical mailing list
On Wed, Apr 19, 2017 at 03:48:35PM -0400, Jim McDonough wrote:
> >
> Yep, but frankly we're simply lucky that "del *" works on a local fs.
> Apparently ext<n>, btrfs, xfs, etc. are just not updating the directory
> from the viewpoint of our open handle as frequently.  Windows cmd.exe
> clearly expects the behavior of a frozen snapshot.

OK, fair enough. But in that case the complete
directory cache belongs in the upper level smbd
code hanging off the struct smb_Dir pointer I
think. We already have struct name_cache_entry *name_cache
attached there, maybe we should just populate on
open.

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] RFC: vfs cachedir

Samba - samba-technical mailing list
In reply to this post by Samba - samba-technical mailing list
Jim McDonough <[hidden email]> writes:
> Yep, but frankly we're simply lucky that "del *" works on a local fs.
> Apparently ext<n>, btrfs, xfs, etc. are just not updating the directory
> from the viewpoint of our open handle as frequently.  Windows cmd.exe
> clearly expects the behavior of a frozen snapshot.

More on this topic from the kernel ml for those interested:

https://lwn.net/Articles/291114/

--
Aurélien Aptel / SUSE Labs Samba Team
GPG: 1839 CB5F 9F5B FB9B AA97  8C99 03C8 A49B 521B D5D3
SUSE Linux GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)