Nice little performance improvement

mfc
Hi,
 
In my situation I'm using rsync to back up a server with (currently) about 570,000 files.
These are all small files, and maybe 0.1% of them change, or new ones are added, in
any 15-minute period.
 
I've split the main tree up so rsync can run on sub sub directories of the main tree.
It does each of these sub sub directories sequentially. I would have liked to run
some of these in parallel, but that seems to increase I/O on the main server too much.
 
 
Today I tried the following:
 
For all subsub directories:
    a) Fork a "du -s subsubdirectory" on the destination subsubdirectory
    b) Run rsync on the subsubdirectory
    c) Repeat until done
 
Seems to have improved the time it takes by about 25-30%. It looks like the du can
run ahead of the rsync...so that while rsync is building its file list, the du is warming up
the file cache on the destination. Then when rsync looks to see what it needs to do
on the destination, it can do this more efficiently.
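 
A minimal sketch of the loop described above, assuming both trees are reachable as local paths and that plain "rsync -a" is enough. The function name, its parameters and the rsync options are illustrative assumptions, not the actual script.

```python
import subprocess
from pathlib import Path


def warm_then_sync(src_root, dst_root, subdirs, rsync_args=("-a",),
                   run=subprocess.run, spawn=subprocess.Popen):
    """For each subdirectory: start du on the destination, then run rsync."""
    for sub in subdirs:
        src = Path(src_root) / sub
        dst = Path(dst_root) / sub
        # Fork du in the background. Its only job is to pull destination
        # metadata into the cache, so its output is discarded.
        warmer = spawn(["du", "-s", str(dst)],
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        try:
            # Trailing slashes: sync the contents of src into dst.
            run(["rsync", *rsync_args, f"{src}/", f"{dst}/"], check=True)
        finally:
            warmer.wait()  # reap the du process before the next directory
```

The subdirectories are still handled one at a time; only the du for the current one overlaps with its rsync, which keeps the extra I/O on the destination side.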
 
Looks like a keeper so far. Any other suggestions? (was thinking of a previous
suggestion of setting /proc/sys/vm/vfs_cache_pressure to a low value).
 
Thanks,
 
Mike

--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Re: Nice little performance improvement

Darryl Dixon
> Hi,
>
> In my situation I'm using rsync to backup a server with (currently) about
> 570,000 files. These are all little files and maybe .1% of them change or
> new ones are added in any 15 minute period.
>

Hi Mike,

We have three filesystems that between them have approx 22 million files,
and around 10-20,000 new or changed files every business day.

In order to expeditiously move these new files offsite, we use a modified
version of pyinotify to log all added/altered files across the entire
filesystem(s) and then every five minutes feed the list to rsync with the
--files-from option. This works very effectively and quickly.
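
A rough sketch of the hand-off to rsync, assuming some watcher (not shown) has already collected absolute paths of changed files. The helper names and the temporary list file are assumptions for illustration, not the modified pyinotify setup described above.

```python
import os
import subprocess
import tempfile


def build_files_from_command(list_path, src_root, dest):
    # Paths in the list are taken relative to the source directory argument.
    # With --files-from, -a does not imply -r, which suits a list of files.
    return ["rsync", "-a", f"--files-from={list_path}", f"{src_root}/", dest]


def write_file_list(changed_paths, src_root):
    """Write changed paths, relative to src_root, one per line."""
    root = src_root.rstrip("/") + "/"
    rel = sorted({p[len(root):] for p in changed_paths if p.startswith(root)})
    tmp = tempfile.NamedTemporaryFile("w", suffix=".list", delete=False)
    with tmp:
        tmp.write("".join(f"{p}\n" for p in rel))
    return tmp.name, rel


def push_changes(changed_paths, src_root, dest):
    """Send only the listed files; meant to be called every few minutes."""
    list_path, rel = write_file_list(changed_paths, src_root)
    try:
        if rel:
            cmd = build_files_from_command(list_path, src_root.rstrip("/"), dest)
            subprocess.run(cmd, check=True)
    finally:
        os.unlink(list_path)  # the list is only needed for this one run
    return rel
```

Because rsync is told exactly which files to look at, it never has to scan the rest of the tree on either side.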

regards,
Darryl Dixon
Winterhouse Consulting Ltd
http://www.winterhouseconsulting.com

Re: Nice little performance improvement

Matt McCutchen-7
In reply to this post by mfc
On Thu, 2009-10-15 at 19:07 -0700, Mike Connell wrote:

> Today I tried the following:
>  
> For all subsub directories
>     a) Fork a "du -s subsubdirectory" on the destination subsubdirectory
>     b) Run rsync on the subsubdirectory
>     c) repeat until done
>  
> Seems to have improved the time it takes by about 25-30%. It looks like the
> du can run ahead of the rsync...so that while rsync is building its file
> list, the du is warming up the file cache on the destination. Then when
> rsync looks to see what it needs to do on the destination, it can do this
> more efficiently.

Interesting.  If you're not using incremental recursion (the default in
rsync >= 3.0.0), I can see that the "du" would help by forcing the
destination I/O to overlap the file-list building in time.  But with
incremental recursion, the "du" shouldn't be necessary because rsync
actually overlaps the checking of destination files with the file-list
building on the source.

--
Matt


Re: Nice little performance improvement

mfc
In reply to this post by Darryl Dixon
Hi,

> In order to expeditiously move these new files offsite, we use a modified
> version of pyinotify to log all added/altered files across the entire
> filesystem(s) and then every five minutes feed the list to rsync with the
> --files-from option. This works very effectively and quickly.

Interesting...

How do you tell rsync to delete files that were deleted from the source,
or is that not part of your use case?

Thanks,

Mike

Re: Nice little performance improvement

mfc
In reply to this post by Matt McCutchen-7

Hi,

> Interesting.  If you're not using incremental recursion (the default in
> rsync >= 3.0.0), I can see that the "du" would help by forcing the
> destination I/O to overlap the file-list building in time.  But with
> incremental recursion, the "du" shouldn't be necessary because rsync
> actually overlaps the checking of destination files with the file-list
> building on the source.
>
Ignoring incremental recursion for a moment. It seems to me that anything
that can warm up the file cache before it is needed would be beneficial?

Re: Nice little performance improvement

Darryl Dixon
In reply to this post by mfc
> Hi,
>
>> In order to expeditiously move these new files offsite, we use a modified
>> version of pyinotify to log all added/altered files across the entire
>> filesystem(s) and then every five minutes feed the list to rsync with the
>> --files-from option. This works very effectively and quickly.
>
> Interesting...
>
> How do you tell rsync to delete files that were deleted from the source,
> or is that not part of your use case?

For us, that is not a necessary part of our use case. It would certainly be
possible, however, to capture the delete events and remove the files with
some other helper script rather than use rsync directly (rsync doesn't
give any advantage in that scenario except being able to reuse the
existing network transport mechanism).
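
A minimal sketch of such a helper, assuming it runs on the destination machine and receives the deleted paths relative to the backup root. The function name and the way the list arrives are assumptions; nothing like this was posted in the thread.

```python
import os


def remove_deleted(dest_root, deleted_rel_paths):
    """Remove files under dest_root that a watcher reported as deleted."""
    root = os.path.abspath(dest_root)
    removed = []
    for rel in deleted_rel_paths:
        target = os.path.normpath(os.path.join(root, rel))
        # Refuse the root itself and anything that escapes the tree ("..").
        if target == root or os.path.commonpath([root, target]) != root:
            continue
        # This sketch handles files only; real directories are left alone.
        if os.path.isdir(target) and not os.path.islink(target):
            continue
        try:
            os.remove(target)
            removed.append(rel)
        except FileNotFoundError:
            pass  # already gone on the destination, nothing to do
    return removed
```

The path check matters because the list comes from another machine; a stray "../" entry should never be able to delete something outside the backup tree.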

regards,
Darryl Dixon
Winterhouse Consulting Ltd
http://www.winterhouseconsulting.com

Re: Nice little performance improvement

Jamie Lokier
In reply to this post by mfc
Mike Connell wrote:

>
> Hi,
>
> >Interesting.  If you're not using incremental recursion (the default in
> >rsync >= 3.0.0), I can see that the "du" would help by forcing the
> >destination I/O to overlap the file-list building in time.  But with
> >incremental recursion, the "du" shouldn't be necessary because rsync
> >actually overlaps the checking of destination files with the file-list
> >building on the source.
> >
> Ignoring incremental recursion for a moment. It seems to me that anything
> that can warm up the file cache before it is needed would be beneficial?

No, not if the file cache isn't large enough for the number of files.
E.g. if you have 20 million files and only 256MB RAM, it's likely a bad idea.

Personally I use a program that I wrote about 11 years ago, called
treescan, which pulls in the inodes to cache about twice as fast as
du by using inode number sorting.
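
The idea of inode number sorting can be sketched roughly as follows. This is not treescan, just an illustration under the assumption that the tree sits on a single POSIX filesystem; the function name is made up.

```python
import os


def warm_inode_cache(root):
    """Stat every entry under root in inode-number order; return the count."""
    entries = []
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as it:
                for entry in it:
                    # On POSIX the inode number comes from the directory
                    # listing, so collecting it does not stat the file yet.
                    entries.append((entry.inode(), entry.path))
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError:
            continue  # unreadable directory: skip it
    # Visiting inodes in numeric order tends to follow their on-disk layout,
    # which reduces seeking compared with directory order.
    entries.sort()
    for _, path in entries:
        try:
            os.lstat(path)  # this call pulls the inode into the cache
        except OSError:
            pass  # entry vanished between listing and stat
    return len(entries)
```

Holding every path in memory before sorting is the obvious cost; for very large trees, sorting per directory would be a cheaper compromise.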

-- Jamie
star

Re: Nice little performance improvement

mfc
> No, not if the file cache isn't large enough for the number of files.
> E.g. if you have 20 million files and only 256MB RAM, it's likely a bad idea.
>
Splitting down to the subsub (two levels down) directory level allows a single
subsub rsync to fit in the file cache for me. Warming the cache is beneficial
here; I didn't say it was in every situation.


Re: Nice little performance improvement

Matt McCutchen-7
In reply to this post by mfc
On Sat, 2009-10-17 at 12:13 -0700, Mike Connell wrote:
> > Interesting.  If you're not using incremental recursion (the default in
> > rsync >= 3.0.0), I can see that the "du" would help by forcing the
> > destination I/O to overlap the file-list building in time.  But with
> > incremental recursion, the "du" shouldn't be necessary because rsync
> > actually overlaps the checking of destination files with the file-list
> > building on the source.
>
> Ignoring incremental recursion for a moment.

Don't ignore it; it makes a difference.

> It seems to me that anything
> that can warm up the file cache before it is needed would be beneficial?

I didn't reason it out carefully enough; let's try again...

Warming up the destination file cache decreases the amount of time the
generator spends blocked on I/O.  So the answer is yes, provided that
the generator is the bottleneck.

If incremental recursion is not used, that's almost certainly the case
during the main phase of the rsync run, since the generator is checking
all the destination files but the sender is only processing the small
number of source files that need a transfer.  But with incremental
recursion, the sender and generator are checking files in parallel, so
the sender may be the bottleneck depending on the relative speeds or
disk configurations of the machines.  (I take it that your rsync run is
local.  For remote runs, the network could be the bottleneck.)

--
Matt
