deduplicate mode?


Harald Dunkel-3
Have you considered introducing a "deduplicate mode" for
rsync that replaces duplicate files in the destination directory
with hard links?

Of course such a feature could bring a lot of problems with
it, but when creating backups it could help to save a lot of
disk space.

--link-dest might be interesting in this context, too.
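
For reference, a minimal sketch of how --link-dest is typically used for
backups: files that are unchanged relative to the previous snapshot are
hard-linked into the new one instead of being copied again. The helper
name and all paths below are hypothetical; only the rsync options -a and
--link-dest are real.

```python
import os


def build_rsync_command(source, new_snapshot, previous_snapshot):
    """Build an rsync command line that hard-links unchanged files.

    All three paths are illustrative. A relative --link-dest directory is
    interpreted relative to the destination, so an absolute path is used.
    """
    return [
        "rsync",
        "-a",
        "--link-dest=" + os.path.abspath(previous_snapshot),
        source.rstrip("/") + "/",  # trailing slash: copy the directory's contents
        new_snapshot,
    ]
```

The resulting list can be handed to subprocess.run(cmd, check=True). Note
that this only links files against the same path in the previous snapshot;
it does not find duplicates elsewhere in the destination.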


Regards
Harri

--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Re: deduplicate mode?

Bernd Hohmann
On 12.12.2016 14:31, Harald Dunkel wrote:

> Have you considered introducing a "deduplicate mode" for
> rsync that replaces duplicate files in the destination directory
> with hard links?

-> http://rsnapshot.org/

Bernd

--
Bernd Hohmann
Organisationsprogrammierer
Höhenstrasse 2 * 61130 Nidderau
Telefon: 06187/900495 * Telefax: 06187/900493
Blog: http://blog.harddiskcafe.de



Re: deduplicate mode?

andy smith-10
In reply to this post by Harald Dunkel-3
Hi Harald,

On Mon, Dec 12, 2016 at 02:31:03PM +0100, Harald Dunkel wrote:
> Have you considered introducing a "deduplicate mode" for
> rsync that replaces duplicate files in the destination directory
> with hard links?

For a month now I have been successfully using the offline
deduplication feature that is currently experimental in XFS to
reduce the size of my rsnapshot backups. Some more info:

    http://strugglers.net/~andy/blog/2017/01/10/xfs-reflinks-and-deduplication/

rsync already hard-links files that do not change between two
backup runs, but reflinks also let me deduplicate files that cycle
between known contents, partially identical files, and identical
regions across multiple directories (from different hosts, for
example).

At the moment it is saving me about 27% of the volume. This of course
depends entirely on what you are backing up.

Also do note that examining the whole tree of files is really hard
on the storage as it hits it with a large amount of random IO.
Especially with slow rotational storage it may well be cheaper just
to buy more capacity. Personally I am using SSD so the performance
vs. capacity trade-off is different.

Not speaking for the rsync developers, but deduplicating all files
within a directory would require rsync to read every file in that
directory, which it normally does not do unless a file is going to
be the target of a transfer. Since other utilities already exist for
examining files and hard-linking duplicates together, or indeed for
doing it inside the filesystem at the block/extent level, maybe
it is not appropriate to put the feature inside rsync.
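
To illustrate what such an external utility has to do, here is a rough
sketch in Python. It is not how any particular existing tool works; the
function names are made up for this example. It groups files by size,
hashes the candidates with SHA-256, and replaces duplicates with hard
links. It has to read every candidate file in full, which is exactly the
random IO cost mentioned above. Also note that hard-linked files share
owner, mode and timestamps, which may or may not be acceptable in a
backup tree.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace duplicate regular files under root with hard links.

    Returns the number of files that were relinked.
    """
    # Only files of equal size can be identical, so group by size first.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    relinked = 0
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        first_seen = {}  # digest -> first path seen with that content
        for path in sorted(paths):
            keeper = first_seen.setdefault(file_digest(path), path)
            if keeper == path:
                continue
            st, kst = os.stat(path), os.stat(keeper)
            if st.st_dev != kst.st_dev:
                continue  # hard links cannot cross filesystems
            if st.st_ino == kst.st_ino:
                continue  # already the same inode
            # Link under a temporary name, then rename over the duplicate,
            # so the path never disappears if something fails halfway.
            tmp = path + ".dedup-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            relinked += 1
    return relinked
```

Calling hardlink_duplicates() on the top of a backup tree would link
identical files across all snapshots and hosts below it, provided they
live on one filesystem.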

Cheers,
Andy
