freedups
Comment: (by AnonymousGnome in 2005 March)
Dirvish and rsync are not smart enough to know that filesystems in different branches are similar, but freedups is: http://www.stearns.org/freedups/ . If you have many duplicate files on different hosts that are backed up to the same volume (not necessarily even the same vault), you can save a lot of disk space by running freedups after dirvish. Example:
freedups.pl --actuallylink --minsize 512 --cachefile=$bank/freedups.cache \
--maxfiles=400 $bank
Response: (by KeithLofstrom on 2005 March 9)
JasonBoxman tried freedups; unfortunately, it chokes badly on the enormous vaults produced by dirvish. These directories have millions of file links in them. A simple two-way merge by a program that was aware of the dirvish directory structure might work. Even for that simple task, though, the table that contains filenames and inodes will probably have to be stored on disk in a .db file. Otherwise, a pointer file will not fit. In any case, it will likely be VERY slow.
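To illustrate the idea of keeping the filename and inode table on disk rather than in memory, here is a minimal sketch in Python. It is not how freedups works and it knows nothing about the dirvish layout; the function names (index_tree, candidate_sizes), the table layout, and the choice of SQLite as the .db format are all assumptions made for the example.

```python
import os
import sqlite3
import stat


def index_tree(db_path, root):
    """Record one row per distinct inode under root in an on-disk SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inodes ("
        " key TEXT PRIMARY KEY,"   # "device:inode", stored as text (hypothetical layout)
        " size INTEGER,"
        " path TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, devices, sockets
            key = "%d:%d" % (st.st_dev, st.st_ino)
            # INSERT OR IGNORE keeps only the first path seen for each inode,
            # so files that are already hard-linked together cost one row.
            conn.execute(
                "INSERT OR IGNORE INTO inodes VALUES (?, ?, ?)",
                (key, st.st_size, path),
            )
    conn.commit()
    return conn


def candidate_sizes(conn):
    """Return file sizes shared by more than one distinct inode, largest first."""
    rows = conn.execute(
        "SELECT size FROM inodes GROUP BY size"
        " HAVING COUNT(*) > 1 ORDER BY size DESC"
    )
    return [row[0] for row in rows]
```

Only sizes returned by candidate_sizes need a content comparison. With millions of links the walk itself still takes a long time, so this addresses the memory problem, not the speed problem.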
JTMoree has been using the perl version of freedups with success on discremental-lite archives of multiple terabytes. I have had to fix a few bugs and have not gotten responses from the author when sending patches. I posted the patches at http://pcxperience.org/freedups
PerMarkerMortensen has good experience with dupmerge or the faster-dupemerge variant. faster-dupemerge can be found here:
http://www.furryterror.org/~zblaxell/dupemerge/dupemerge.html
It runs well even on big dirvish setups with lots of vaults and files. It starts with the largest files, and therefore saves space quite fast.
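The largest-first strategy can be sketched roughly as follows. This is an illustrative Python example, not the faster-dupemerge code: the function name merge_duplicates, the min_size parameter, and the temporary-name suffix are made up for the sketch, and it compares file contents only. A real tool would also need to decide what to do about differing ownership, permissions, and timestamps before linking.

```python
import filecmp
import os
import stat
from collections import defaultdict


def merge_duplicates(root, min_size=1):
    """Hard-link identical regular files under root, largest sizes first.

    Returns the number of paths that were re-linked.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and st.st_size >= min_size:
                by_size[st.st_size].append(path)

    relinked = 0
    # Largest sizes first, so the biggest savings happen early in the run.
    for size in sorted(by_size, reverse=True):
        keepers = []  # one representative path per distinct content
        for path in sorted(by_size[size]):
            for keeper in keepers:
                if os.path.samefile(keeper, path):
                    break  # already the same inode, nothing to do
                if filecmp.cmp(keeper, path, shallow=False):
                    # Link under a temporary name, then rename over the
                    # duplicate, so the path never disappears.
                    tmp = path + ".merge-tmp"  # hypothetical suffix
                    os.link(keeper, tmp)
                    os.replace(tmp, path)
                    relinked += 1
                    break
            else:
                keepers.append(path)
    return relinked
```

Files can only be hard-linked within one filesystem, so root should be a single volume, for example the bank that holds the vaults.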
