Huge directory tree: Get files to sync via tools like sysdig

Thomas Güttler
Hi,

we have a huge directory tree.


  * 17M files
  * 2.2 TB of data
  * Only 0.1% changes per day

Current pain: rsync's directory tree traversal takes too long to discover the changed files. Only a few files change.

I discovered the tool sysdig, which could be used to monitor which files were changed.

Then we could feed the list of changed files to rsync and avoid rsync's long directory traversal.
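
A minimal sketch of that idea, assuming the monitoring tool yields absolute paths of changed files below the source root. The function name and its parameters are hypothetical; `--files-from` and `--from0` are existing rsync options. The sketch only builds the command and does not run it, and it does not handle deleted files.

```python
import os


def build_rsync_command(changed_paths, src_root, dest, list_file):
    """Write changed paths to list_file and return an rsync argv that uses it."""
    src_root = os.path.abspath(src_root)
    # rsync reads --files-from entries relative to the source directory.
    rel_paths = sorted(
        {os.path.relpath(os.path.abspath(p), src_root) for p in changed_paths}
    )
    # NUL-separated entries (read with --from0) survive unusual file names.
    with open(list_file, "wb") as fh:
        for rel in rel_paths:
            fh.write(os.fsencode(rel) + b"\0")
    return [
        "rsync", "-a", "--from0", f"--files-from={list_file}",
        src_root + "/", dest,
    ]
```

The returned list can be passed to `subprocess.run()`. With this approach rsync only looks at the listed paths instead of walking all 17M files.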

Does anyone have experience with collecting the changed files using a third-party tool that detects which
files were changed?

Regards,
  Thomas Güttler



--
Thomas Guettler http://www.thomas-guettler.de/

--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Re: Huge directory tree: Get files to sync via tools like sysdig

Axel Kittenberger
> Does anyone have experience with collecting the changed files
> using a third-party tool that detects which files were changed?

I don't know sysdig, but I am the developer of Lsyncd, which does exactly that: it collects file changes via the inotify event mechanism and then calls rsync with a matching filter mask.

However, since you say your directory tree is huge, the main issue is that an inotify watch must be created for every directory, taking about 1 KB of kernel memory per watch. If you have a million directories, that is about 1 GB of unswappable memory.
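
The arithmetic can be sketched as follows. The 1 KB per watch is the rough figure from this message and will vary between kernels and architectures; the function names are made up for illustration. Note that counting the directories is itself a full tree walk.

```python
import os

# Rough per-watch cost quoted above; an assumption, not a measured value.
BYTES_PER_WATCH = 1024


def count_directories(root):
    """Count root plus every directory below it (one inotify watch each)."""
    return sum(1 for _dirpath, _dirs, _files in os.walk(root))


def estimate_watch_memory(num_dirs, bytes_per_watch=BYTES_PER_WATCH):
    """Return the estimated kernel memory for num_dirs watches, in bytes."""
    return num_dirs * bytes_per_watch
```

For one million directories this gives 1,024,000,000 bytes, roughly the 1 GB mentioned above.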

Unfortunately, the Linux kernel doesn't provide a better way yet, and I suppose other tools like sysdig suffer from the same issue. There is fanotify, but it doesn't report move events and thus is not usable for this task.

Kind regards, Axel


Re: Huge directory tree: Get files to sync via tools like sysdig

Ben RUBSON
In reply to this post by Thomas Güttler
> On 09 Feb 2017, at 10:05, Thomas Güttler <[hidden email]> wrote:
>
> Hi,
>
> we have a huge directory tree.
>
>
> * 17M files (number of files)
> * 2.2TBytes of data.
> * Only 0.1% changes per day
>
> Current pain: rsync's directory tree traversal takes too long to discover the changed files.

Hi,

What type of FS is this directory on?

Ben

Re: Huge directory tree: Get files to sync via tools like sysdig

Karl O. Pinc
In reply to this post by Axel Kittenberger
On Thu, 9 Feb 2017 10:55:51 +0100
Axel Kittenberger <[hidden email]> wrote:

> > Does anyone have experience with collecting the changed files
> > using a third-party tool that detects which files were changed?
>
> I don't know sysdig, but I am the developer of Lsyncd, which does
> exactly that: it collects file changes via the inotify event mechanism
> and then calls rsync with a matching filter mask.
>
> However, since you say your directory tree is huge, the main issue
> is that an inotify watch must be created for every directory, taking
> about 1 KB of kernel memory per watch.

Not only that, but inotify is not guaranteed.  (At least not on
3.16.0.  I can't speak for later versions.)  So you might miss some
changes.


Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein


Re: Huge directory tree: Get files to sync via tools like sysdig

Axel Kittenberger
> Not only that, but inotify is not guaranteed.  (At least not on
> 3.16.0.  I can't speak for later versions.)  So you might miss some
> changes.

Got any info on that?

I have noted that MOVED_FROM and MOVED_TO events are not guaranteed to arrive in order, and the file descriptor might even briefly report "no more events" in between them. But I have never heard of anybody encountering an event in a watched directory not being correctly reported without also being notified of the overflow through an OVERFLOW event, which in the case of Lsyncd results in a full rescan of everything.
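
The overflow handling described here can be sketched roughly like this. This is not Lsyncd's code: the `(mask, path)` event tuples and the function name are hypothetical simplifications, while `IN_Q_OVERFLOW` is the real inotify mask bit.

```python
import os

# Mask bit from <sys/inotify.h>: the kernel event queue overflowed.
IN_Q_OVERFLOW = 0x00004000


def plan_sync(events):
    """Decide what to sync from a batch of (mask, path) events.

    Returns ("full", set()) if the kernel reported a queue overflow,
    otherwise ("partial", directories_to_sync).
    """
    dirs = set()
    for mask, path in events:
        if mask & IN_Q_OVERFLOW:
            # Events were dropped, so the collected list cannot be trusted.
            return ("full", set())
        dirs.add(os.path.dirname(path))
    return ("partial", dirs)
```

A "full" result would mean falling back to an ordinary rsync run over the whole tree.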


Re: Huge directory tree: Get files to sync via tools like sysdig

Karl O. Pinc
On Thu, 9 Feb 2017 14:43:57 +0100
Axel Kittenberger <[hidden email]> wrote:

> > Not only that, but inotify is not guaranteed.  (At least not on
> > 3.16.0.  I can't speak for later versions.)  So you might miss some
> > changes.
>
> Got any info on that?
>
> I have noted that MOVED_FROM and MOVED_TO events are not guaranteed to
> arrive in order, and the file descriptor might even briefly report "no
> more events" in between them. But I have never heard of anybody
> encountering an event in a watched directory not being correctly
> reported without also being notified of the overflow through an
> OVERFLOW event, which in the case of Lsyncd results in a full rescan
> of everything.

Not much.  inotify(7) on my system says:

       With careful programming, an application can use inotify to
       efficiently monitor and cache the state of a set of filesystem
       objects.  However, robust applications should allow for the
       fact that bugs in the monitoring logic or races of the kind
       described below may leave the cache inconsistent with the
       filesystem state.  It is probably wise to do some consistency
       checking, and rebuild the cache when inconsistencies are
       detected.

I think one of the pretty much unavoidable race conditions is
sub-directory creation; the sub-directory can have files added
to it before the monitoring process is able to set a watch
on it.  Of course this is an application level race.

I've had incron (which uses inotify) regularly fail to
catch all monitored fs changes on a busy system.  And
the monitored system does not involve creating sub-directories --
and I don't think I'm exceeding the system's inotify event limit
either.  But I could be wrong about either of these.

So perhaps the take-away is that inotify is "hard", or even
"impossible" to rely on as the sole method for change monitoring.
It may not be right to say it's "unreliable" as I did above.
I'm not the expert here.  But I can say that my limited
experience with it makes me want to look very closely
before relying on it.

Regards,

Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein


Re: Huge directory tree: Get files to sync via tools like sysdig

Axel Kittenberger
Directory creation is not a race condition when handled properly.

The application (like Lsyncd) gets a directory creation event, creates a watch for the directory, and then scans the new directory for files or subdirectories already in there; subdirectories are handled recursively.

This way nothing can be missed.
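
A rough sketch of this watch-then-scan order, not taken from Lsyncd itself. `add_watch` is a hypothetical callback standing in for whatever registers the inotify watch, and the function name is made up.

```python
import os


def handle_dir_created(path, add_watch):
    """Watch a newly created directory, then pick up what is already in it.

    Returns the files found by the scan so they can be queued for syncing.
    """
    add_watch(path)  # watch first, so anything created later raises an event
    found = []
    try:
        with os.scandir(path) as it:
            entries = list(it)
    except OSError:
        return found  # the directory vanished again before it could be scanned
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # Subdirectories get the same treatment, recursively.
            found.extend(handle_dir_created(entry.path, add_watch))
        else:
            found.append(entry.path)
    return found
```

A file created between the watch and the scan may show up twice (once as an event, once in the scan), which is harmless when the result is only a list of paths for rsync.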

The general warning of "bugs may be possible" is a no-brainer. Yes, they are always possible, everywhere.

As I said, there are some issues with a "move" (aka rename) being detected as such; sometimes it may be detected as a create/delete pair without properly acknowledging the move within the watched tree. And events may not arrive in the same order as they happened, due to the multi-core nature of modern systems. But other than that, I'm convinced it is fine. And none of this is a real issue for event-based filter list creation to minimize rsync's work.

The only other issue I know of is hard links. If you create a hard link outside the watched directory tree to a file within it, altering the file through that link will not create an event. In that case you simply must not create such links. This has hardly been an issue in most use cases, though.




Re: Huge directory tree: Get files to sync via tools like sysdig

Thomas Güttler
In reply to this post by Ben RUBSON


On 09.02.2017 at 11:05, Ben RUBSON wrote:

>> On 09 Feb 2017, at 10:05, Thomas Güttler <[hidden email]> wrote:
>>
>> Hi,
>>
>> we have a huge directory tree.
>>
>>
>> * 17M files (number of files)
>> * 2.2TBytes of data.
>> * Only 0.1% changes per day
>>
>> Current pain: rsync's directory tree traversal takes too long to discover the changed files.
>
> Hi,
>
> What type of FS is this directory on?

ext4


--
Thomas Guettler http://www.thomas-guettler.de/


Re: Huge directory tree: Get files to sync via tools like sysdig

Ben RUBSON

> On 09 Feb 2017, at 16:10, Thomas Güttler <[hidden email]> wrote:
>
> On 09.02.2017 at 11:05, Ben RUBSON wrote:
>>> On 09 Feb 2017, at 10:05, Thomas Güttler <[hidden email]> wrote:
>>>
>>> Hi,
>>>
>>> we have a huge directory tree.
>>>
>>>
>>> * 17M files (number of files)
>>> * 2.2TBytes of data.
>>> * Only 0.1% changes per day
>>>
>>> Current pain: rsync's directory tree traversal takes too long to discover the changed files.
>>
>> Hi,
>>
>> What type of FS is this directory on?
>
> ext4

Is there any way to use snapshots in your backup strategy?
Or to use a ZFS-ready OS to benefit from an SSD cache (which would store your metadata)?

Re: Huge directory tree: Get files to sync via tools like sysdig

Henri Shustak
As Ben mentioned, ZFS snapshots are one possible approach. Another approach is to have a faster storage system. I have seen considerable speed improvements with rsync on similar data sets by, say, upgrading the storage subsystem.

--------------------------------------------------------------------
This email is protected by LBackup, an open source backup solution
http://www.lbackup.org




Re: Huge directory tree: Get files to sync via tools like sysdig

Steven Levine
In <[hidden email]>, on 02/10/17
   at 12:38 PM, Henri Shustak <[hidden email]> said:

>As Ben mentioned, ZFS snapshots are one possible approach. Another
>approach is to have a faster storage system. I have seen considerable
>speed improvements with rsync on similar data sets by, say, upgrading the
>storage subsystem.

This is true.  In addition, different file systems have different
performance with respect to stat().

A lot depends on what kind of backup is required.  If a full backup
that is accurate to a point in time is required, then something like ZFS
makes sense.  If the systems are servers that do in-memory caching, there
will probably be a need to ensure that their on-disk state is consistent
before the snapshot is taken.

If the only requirement is to ensure that everything known to have changed
on disk is backed up, the event-based solution will avoid the high cost of
stat-ing every file.

It might be interesting to evaluate a mixed solution.  Use events to track
directory changes and let rsync sort out what to do for each directory.
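
As a small sketch of the mixed approach, assuming events arrive as plain paths of changed entries: the events are reduced to their parent directories, and each directory could then be handed to rsync on its own. The function name is hypothetical.

```python
import os
from collections import Counter


def changed_directories(changed_paths):
    """Map each parent directory to the number of changed entries seen in it."""
    return Counter(os.path.dirname(p) for p in changed_paths)
```

The counts are not needed for syncing; they only give a feel for how concentrated the changes are.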

Steven

--
----------------------------------------------------------------------
"Steven Levine" <[hidden email]>  Warp/DIY/BlueLion etc.
www.scoug.com www.arcanoae.com www.warpcave.com
----------------------------------------------------------------------



Re: Huge directory tree: Get files to sync via tools like sysdig

Karl O. Pinc
In reply to this post by Henri Shustak
On Fri, 10 Feb 2017 12:38:32 +1300
Henri Shustak <[hidden email]> wrote:

> As Ben mentioned, ZFS snapshots are one possible approach. Another
> approach is to have a faster storage system. I have seen considerable
> speed improvements with rsync on similar data sets by, say, upgrading
> the storage subsystem.

Another possibility could be to use LVM and lvmcache to throw an SSD in
front of the spinning disks.  This would only improve things if
you didn't otherwise fill up the cache with data -- you want
the cache to contain inodes.  So this might work only if your
SSD cache was larger than whatever amount of data you typically
write between rsync runs, plus enough to hold all the inodes
in your rsync-ed fs.
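
A back-of-the-envelope version of that sizing rule, using the figures from the start of the thread. The 256-byte inode size is an assumption (a common ext4 default, not confirmed for this system), and directory blocks and other metadata are ignored, so the result is at best a lower bound. The function name is made up.

```python
def estimate_cache_size(num_files, total_data_bytes, change_fraction,
                        inode_bytes=256):
    """Rough lower bound for the SSD cache size, in bytes."""
    # Space to hold every inode of the synced filesystem.
    inode_total = num_files * inode_bytes
    # Data written between two rsync runs (here: one day's changes).
    written_between_runs = round(total_data_bytes * change_fraction)
    return inode_total + written_between_runs
```

With 17M files, 2.2 TB of data and 0.1% daily change, this gives about 4.4 GB for inodes plus 2.2 GB of changed data, roughly 6.6 GB in total.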

I've not tried this.  I'm not even certain it's a good idea.  It's
just a thought.

Regards,

Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein


Re: Huge directory tree: Get files to sync via tools like sysdig

Henri Shustak
That sounds like it certainly would not hurt!

--------------------------------------------------------------------
This email is protected by LBackup, an open source backup solution
http://www.lbackup.org



Re: Huge directory tree: Get files to sync via tools like sysdig

Ben RUBSON
In reply to this post by Karl O. Pinc
> On 10 Feb 2017, at 01:21, Karl O. Pinc <[hidden email]> wrote:
>
> On Fri, 10 Feb 2017 12:38:32 +1300
> Henri Shustak <[hidden email]> wrote:
>
>> As Ben mentioned, ZFS snapshots are one possible approach. Another
>> approach is to have a faster storage system. I have seen considerable
>> speed improvements with rsync on similar data sets by, say, upgrading
>> the storage subsystem.
>
> Another possibility could be to use LVM and lvmcache to throw an SSD in
> front of the spinning disks.  This would only improve things if
> you didn't otherwise fill up the cache with data -- you want
> the cache to contain inodes.  So this might work only if your
> SSD cache was larger than whatever amount of data you typically
> write between rsync runs, plus enough to hold all the inodes
> in your rsync-ed fs.
>
> I've not tried this.  I'm not even certain it's a good idea.  It's
> just a thought.

It's also possible to have an SSD cache with ZFS (called the L2ARC).
You can even ask this cache to store only your metadata.

Some (the same?) changes may also be needed on the receiver/server side
(depending on its current setup) to see a performance improvement.

Ben


Alternatives to rsync. Was: Huge directory tree: Get files to sync via tools like sysdig

Thomas Güttler
In reply to this post by Ben RUBSON


On 09.02.2017 at 16:21, Ben RUBSON wrote:

>
>> On 09 Feb 2017, at 16:10, Thomas Güttler <[hidden email]> wrote:
>>
>> On 09.02.2017 at 11:05, Ben RUBSON wrote:
>>>> On 09 Feb 2017, at 10:05, Thomas Güttler <[hidden email]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> we have a huge directory tree.
>>>>
>>>>
>>>> * 17M files (number of files)
>>>> * 2.2TBytes of data.
>>>> * Only 0.1% changes per day
>>>>
>>>> Current pain: rsync's directory tree traversal takes too long to discover the changed files.
>>>
>>> Hi,
>>>
>>> What type of FS is this directory on?
>>
>> ext4
>
> Is there any way to use snapshots in your backup strategy?
> Or to use a ZFS-ready OS to benefit from an SSD cache (which would store your metadata)?

Yes, I think rsync is reaching the limits of its capabilities here. I guess a different strategy is needed.

I see these alternatives to rsync:

  - Incremental snapshots at the block-device level.
  - We get the application ported to access a storage server instead of a file server.
  - ....

Do you see other alternatives?

Regards,
   Thomas Güttler



--
Thomas Guettler http://www.thomas-guettler.de/


Re: Alternatives to rsync. Was: Huge directory tree: Get files to sync via tools like sysdig

Karl O. Pinc
On Fri, 10 Feb 2017 11:38:34 +0100
Thomas Güttler <[hidden email]> wrote:

> Yes, I think rsync is reaching the limits of its capabilities here. I
> guess a different strategy is needed.
>
> I see these alternatives to rsync:
>
>   - Incremental snapshots at the block-device level.
>   - We get the application ported to access a storage server instead
>     of a file server.
>   - ....
>
> Do you see other alternatives?

I thought the Lsyncd + rsync alternative proposed by
Axel Kittenberger was workable.  Depending on your level of
paranoia, you might want to be sure that you could also
run a plain rsync now and again to validate that what
you put in place is doing what you think it is.  I
say this because I've no experience at all with
Lsyncd.
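
One way such a validation pass might look, sketched under the assumption that a dry run (`-n`) with `--itemize-changes` (`-i`) is acceptable as the check. Both are existing rsync options; the helper names are made up, and the parser is deliberately simplistic (it assumes paths do not begin with spaces).

```python
def build_check_command(src, dest):
    """Return an rsync argv that only reports differences (dry run)."""
    return ["rsync", "-a", "-n", "-i", "--delete", src, dest]


def parse_itemized(output):
    """Extract (change_code, path) pairs from --itemize-changes output."""
    changes = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Each line is a change code, whitespace, then the path.
        code, _, path = line.partition(" ")
        changes.append((code, path.lstrip(" ")))
    return changes
```

If the event-based sync is working, a periodic check like this should report few or no differences beyond what changed since the last run.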

Regards,

Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
