freedups

Comment: (by AnonymousGnome in 2005 March)

Dirvish and rsync are not smart enough to know that filesystems in different branches are similar, but freedups is: http://www.stearns.org/freedups/ . If you have many duplicate files on different hosts that are backed up to the same volume (not necessarily even the same vault), you can save a lot of disk space by running freedups after dirvish. Example:

freedups.pl --actuallylink --minsize 512 --cachefile=$bank/freedups.cache \
            --maxfiles=400 $bank
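
For readers unfamiliar with the general technique, here is a minimal Python sketch of hard-linking duplicate files. It is not freedups' actual implementation: the function names, the SHA-256 comparison, and the `min_size` and `dry_run` parameters are assumptions made for illustration. It also ignores ownership, permissions, and timestamps, which matter for backups, and it assumes everything under `root` lives on one filesystem.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def link_duplicates(root, min_size=512, dry_run=True):
    """Hard-link files under root that have identical size and content.

    Returns a list of (kept_path, replaced_path) pairs. With dry_run=True
    nothing on disk is changed. Hypothetical helper, not part of freedups.
    """
    # Group candidate files by size; only equal-sized files can be duplicates.
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size >= min_size:
                by_size[size].append(path)

    merged = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        first_seen = {}  # content digest -> first path seen with that content
        for path in sorted(paths):
            keeper = first_seen.setdefault(file_digest(path), path)
            # Skip the keeper itself and files that are already hard-linked.
            if keeper == path or os.path.samefile(keeper, path):
                continue
            if not dry_run:
                # Link under a temporary name, then rename over the duplicate,
                # so the path never disappears part-way through.
                tmp = path + ".linktmp"
                os.link(keeper, tmp)
                os.replace(tmp, path)
            merged.append((keeper, path))
    return merged
```

Calling `link_duplicates(bank)` with the default `dry_run=True` only reports what would be merged, which is a sensible first step on a backup volume.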

Response: (by KeithLofstrom on 2005 March 9)

JasonBoxman tried freedups; unfortunately, it chokes badly on the enormous vaults produced by dirvish, whose directories contain millions of file links. A simple two-way merge by a program that is aware of the dirvish directory structure might work. Even for that simple task, though, the table of filenames and inodes will probably have to be stored on disk in a .db file; otherwise, a pointer file will not fit. In any case, it will likely be VERY slow.
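
As a rough illustration of keeping the filename and inode table on disk rather than in memory, the following Python sketch uses the standard `sqlite3` module. The schema, the function names, and the choice of SQLite are assumptions for illustration; no existing tool is claimed to work this way. It assumes a single filesystem, so inode numbers alone identify a file.

```python
import os
import sqlite3
import stat


def build_inode_table(root, db_path):
    """Record device, inode, size and path for every regular file under root.

    The table lives in the SQLite file db_path, so it can grow beyond RAM.
    Hypothetical helper; keep db_path outside root so it is not scanned.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " dev INTEGER, ino INTEGER, size INTEGER, path TEXT PRIMARY KEY)"
    )
    with conn:  # one transaction for the whole walk keeps inserts fast
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode):
                    continue  # skip symlinks, devices, sockets, ...
                conn.execute(
                    "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                    (st.st_dev, st.st_ino, st.st_size, path),
                )
    conn.execute("CREATE INDEX IF NOT EXISTS files_by_size ON files (size)")
    conn.commit()
    return conn


def candidate_sizes(conn):
    """Return file sizes shared by more than one distinct inode.

    Only these sizes can contain duplicates that are not yet hard-linked.
    """
    rows = conn.execute(
        "SELECT size FROM files GROUP BY size"
        " HAVING COUNT(DISTINCT ino) > 1 ORDER BY size"
    )
    return [row[0] for row in rows]
```

Files that dirvish has already hard-linked between images share an inode, so they count once and drop out of the candidate list. That matters on vaults with millions of links.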

JTMoree has been using the Perl version of freedups with success on discremental-lite archives of multiple terabytes. I (JTMoree) have had to fix a few bugs and have not gotten responses from the author when sending patches. I posted the patches to http://pcxperience.org/freedups

PerMarkerMortensen has had good experience with dupmerge and the faster-dupemerge variant. faster-dupemerge can be found here:

http://www.furryterror.org/~zblaxell/dupemerge/dupemerge.html

It runs well even on big dirvish setups, with lots of vaults and files. It starts with the largest files, and therefore saves space quite fast.
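
To see why working from the largest files down reclaims space quickly, here is a small Python sketch. It is not taken from faster-dupemerge; the function name and the `(size, copies)` input format are assumptions for illustration. Merging a group of identical files into one inode frees `size * (copies - 1)` bytes.

```python
def largest_first(groups):
    """Order duplicate groups so the biggest files are merged first.

    groups: iterable of (size_in_bytes, copies), where copies is the number
    of separate on-disk copies of the same content.
    Returns a list of (size_in_bytes, copies, bytes_saved).
    """
    ranked = []
    for size, copies in groups:
        if copies < 2:
            continue  # a single copy has nothing to merge with
        ranked.append((size, copies, size * (copies - 1)))
    ranked.sort(key=lambda group: group[0], reverse=True)
    return ranked
```

For example, `largest_first([(10, 3), (1000, 2), (500, 1)])` puts the 1000-byte pair first. That one merge frees 1000 bytes, while merging the three 10-byte copies frees only 20.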

FreeDups (last edited 2011-01-24 05:05:50 by KeithLofstrom)