Mar 4, 2011
Saving disk space by eliminating duplicate files
A friend of mine, matou, wrote a script that searches a file tree recursively, hashes every file and outputs duplicate files.
I forked his script and added several features. Among them is a flag -s, which makes the script keep just one occurrence of each duplicate file and delete the others. Hardlinks are then created at the locations of the deleted files, pointing to the one copy that remains.
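For readers curious about the approach, here is a minimal sketch of the idea rather than the actual script. The function names, the choice of SHA-256, and the symlink handling are my own assumptions; the real script may differ in all of these details.

import hashlib
import os
import sys

def file_hash(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Walk the tree and group regular files by content hash."""
    groups = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks, only hash regular files
            groups.setdefault(file_hash(path), []).append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

def link_duplicates(duplicates):
    """Keep the first copy of each group, replace the rest with hardlinks."""
    for paths in duplicates.values():
        original, rest = paths[0], paths[1:]
        for dup in rest:
            os.remove(dup)
            os.link(original, dup)  # the old path now points at the surviving copy

if __name__ == "__main__":
    dups = find_duplicates(sys.argv[1])
    link_duplicates(dups)

Note that hardlinks only work within a single file system, and a production tool would link to a temporary name and rename it into place rather than deleting first.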
$ python duplicatefiles.py -s /foo/
In fact, the file system should look exactly the same to the operating system after execution, apart from using less space. I used the script to clean my media folder and saved nearly 1 GB of disk space.
You can also do a dry run:
$ python duplicatefiles.py -l=error -c ./foo
these files are the same:
.//music/foo/bar.mp3
.//music/bar/foo.mp3
-------------------------------
Duplicate files, total: 15
Estimated space freed after deleting duplicates: ca. 40 MiB
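The dry-run summary boils down to summing the sizes of every copy beyond the first in each group. A rough sketch of how such an estimate could be computed, reusing the hypothetical find_duplicates() from the example above:

import os

def report(duplicates):
    """Dry-run summary: count duplicates and estimate the space a -s run would free."""
    total = 0
    freed = 0
    for paths in duplicates.values():
        extra = paths[1:]  # everything beyond the first copy is redundant
        total += len(extra)
        freed += sum(os.path.getsize(p) for p in extra)
    print("Duplicate files, total: %d" % total)
    print("Estimated space freed after deleting duplicates: ca. %d MiB" % (freed // (1024 * 1024)))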
The script is available under a BSD-like license on GitHub.
I tested it under Mac OS X, but it should work on other UNIX systems as well.
I would recommend making a backup before executing the script.