Saving disk space by eliminating duplicate files

A friend of mine, matou, wrote a script that recursively searches a file tree, hashes every file, and reports duplicates.
I forked his script and added several features. Among them is a flag, -s, which makes the script keep just one occurrence of each duplicated file and delete the remaining copies. Hardlinks are then created at the locations of the deleted files, pointing to the one occurrence that still exists.

$ python duplicatefiles.py -s /foo/
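
Under the hood the idea is roughly the following. This is a simplified sketch, not the actual script; the function names and details are my own:

import hashlib
import os
import sys

def file_hash(path, chunk_size=65536):
    # Hash in chunks so large media files do not exhaust memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    # Group every regular file under root by its content hash.
    groups = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                groups.setdefault(file_hash(path), []).append(path)
    return {h: p for h, p in groups.items() if len(p) > 1}

def hardlink_duplicates(root):
    # Keep the first copy of each group, replace the rest with hardlinks.
    for paths in find_duplicates(root).values():
        original, *duplicates = paths
        for dup in duplicates:
            os.remove(dup)
            os.link(original, dup)  # dup now points to original's inode

if __name__ == "__main__":
    hardlink_duplicates(sys.argv[1])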

In fact, the file system should look exactly the same to the operating system after execution, apart from using less space. I used the script to clean up my media folder and saved nearly 1 GB of disk space.
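
You can check that two paths really share their data after hardlinking: both resolve to the same inode, so the content is stored only once. A quick way to verify this (the paths here are just examples):

import os

a = os.stat("music/foo/bar.mp3")
b = os.stat("music/bar/foo.mp3")

# Hardlinked paths share one inode; the content exists on disk once.
print(a.st_ino == b.st_ino)   # True after hardlinking
print(a.st_nlink)             # link count, at least 2 here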

You can also do a dry run:

$ python duplicatefiles.py -l=error -c ./foo
these files are the same:
.//music/foo/bar.mp3
.//music/bar/foo.mp3
-------------------------------
Duplicate files, total:  15
Estimated space freed after deleting duplicates: ca. 40 MiB
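
The estimate is presumably just the size of every copy beyond the first in each duplicate group. In terms of the sketch above it could be computed like this:

import os

def estimated_savings(duplicate_groups):
    # duplicate_groups: mapping of content hash -> list of paths,
    # as returned by find_duplicates() in the sketch above.
    total = 0
    for paths in duplicate_groups.values():
        size = os.path.getsize(paths[0])
        total += size * (len(paths) - 1)  # all but one copy get freed
    return total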

The script is available under a BSD-like license on GitHub.
I tested it under Mac OS X, but it should work on other UNIX systems as well.

I would recommend making a backup before running the script.

About Me

I am a 32-year-old techno-creative enthusiast who lives and works in Berlin. In a previous life I studied computer science (more specifically, Media Informatics) at Ulm University in Germany.

I care about exploring ideas and developing new things. I like creating great stuff that I am passionate about.

License

All content is licensed under CC BY 4.0 International (if not explicitly noted otherwise).
 
I would be happy to hear if my work gets used! Just drop me a mail.
 
The CC license above applies to all content on this site created by me. It does not apply to linked and sourced material.
 
http://www.mymailproject.de