rsync very slow with large include/exclude file list

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

rsync very slow with large include/exclude file list

ray vantassle
I have a sensor collector system (very low-powered slow ARM cpu), and another system which daily pulls the data files from it for processing.  There are about 1000 new files each day.  As part of the processing it decides that certain of the files are of no interest, and adds them to an exclude file, which is used in future rsyncs.  No files are ever deleted from the source system, just the receiver system.  All done in cron jobs at night.

After about 18 months, there are about 350,000 files and the exclude list has about 72,000 filenames.  I recently ran the pulling script manually and thought the system must have died.
Rsync took almost 3 hours.  Trying to narrow down the problem, I removed the "--exclude-file=" option from the rsync command -- and it took only 16 minutes -- including the time to transfer the 72,000 files that are of no interest.
 (con't)


--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
Reply | Threaded
Open this post in threaded view
|

Re: rsync very slow with large include/exclude file list

ray vantassle
I investigated the rsync code and found the reason why.
For every file in the source, it searches the entire filter-list looking to see if that filename is on the exclude/include list.  Most aren't, so it compares (350K - 72K) * 72K names (the non-listed files) plus (72K * 72K/2) names (the ones that are listed), for a total of about  22,608,000,000 strcmp's.  That's 22 BILLION comparisons. (I may have left off a zero there, it might be 220 B).

I'm working on a fix to improve this.  The first phase was to just improve the existing code without changing the methodology.
The set I've been testing with is local-local machine, dry-run, 216K files in the source directory, 25,000 files in the exclude-from list.
The original rsync takes 488 seconds.
The improved code takes 300 seconds.

The next phase was to improve the algorithm of handling large filter_lists.  Change the unsorted linear search to a sorted binary search (skiplist).
This improved code takes 2 seconds.

The original code does 4,492,304,682 strcmp's.
The fully improved code does 6,472,564.  98.5% fewer.

I am cleaning up the code and will submit a patchfile soon.


--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
Reply | Threaded
Open this post in threaded view
|

Re: rsync very slow with large include/exclude file list

Ken Chase
This is similar to using fuzzy / -y in a large directory. O(n^2) behaviour
occurs and can be incredibly slow. No caching of md5's for the directory
occurs, it would seem (or even so, there are O(N^2) comparisons).

/kc


On Mon, Jun 15, 2015 at 06:02:14PM -0500, ray vantassle said:
  >   I investigated the rsync code and found the reason why.
  >   For every file in the source, it searches the entire filter-list looking
  >   to see if that filename is on the exclude/include list.** Most aren't, so
  >   it compares (350K - 72K) * 72K names (the non-listed files) plus (72K *
  >   72K/2) names (the ones that are listed), for a total of about**
  >   22,608,000,000 strcmp's.** That's 22 BILLION comparisons. (I may have left
  >   off a zero there, it might be 220 B).
  >
  >   I'm working on a fix to improve this.** The first phase was to just
  >   improve the existing code without changing the methodology.
  >   The set I've been testing with is local-local machine, dry-run, 216K files
  >   in the source directory, 25,000 files in the exclude-from list.
  >   The original rsync takes 488 seconds.
  >   The improved code takes 300 seconds.
  >
  >   The next phase was to improve the algorithm of handling large
  >   filter_lists.** Change the unsorted linear search to a sorted binary
  >   search (skiplist).
  >   This improved code takes 2 seconds.
  >
  >   The original code does 4,492,304,682 strcmp's.
  >   The fully improved code does 6,472,564.** 98.5% fewer.
  >
  >   I am cleaning up the code and will submit a patchfile soon.

  >--
  >Please use reply-all for most replies to avoid omitting the mailing list.
  >To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
  >Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


--
Ken Chase - [hidden email] skype:kenchase23 +1 416 897 6284 Toronto Canada
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.
--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
Reply | Threaded
Open this post in threaded view
|

Re: rsync very slow with large include/exclude file list

Jukka Jaatinen
This post has NOT been accepted by the mailing list yet.
In reply to this post by ray vantassle
I have a similar problem rsyncing terabytes and millions of files and if I add excludes the process seems to slow done significantly (source files > 50M and excludes < 10). Has this patch been released somewhere?