GitHub - AndyA/fdmf: Find Duplicate Music Files

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 132 Commits
proto		proto
t		t
tools		tools
.gitignore		.gitignore
CHANGES		CHANGES
COPYING		COPYING
INSTALL		INSTALL
Makefile		Makefile
README		README
README.andy		README.andy
TODO		TODO
bm-par.sh		bm-par.sh
bm-results.pl		bm-results.pl
bm.sh		bm.sh
fdmf		fdmf
fdmf_bench		fdmf_bench
fdmf_correlator.c		fdmf_correlator.c
fdmf_dump		fdmf_dump
fdmf_prune		fdmf_prune
fdmf_report		fdmf_report
fdmf_run		fdmf_run
fdmf_sonic_reducer.c		fdmf_sonic_reducer.c
optparam		optparam
sfx.pl		sfx.pl
use-ogle.sh		use-ogle.sh
Repository files navigation

###########  FDMF - find duplicate music files  ###############

Installation instructions are in the file INSTALL.

THANKS: John Jolly, Lawrence Wade, Paul Russo, Paul Forgey, Luis Magaña,
	David Parker, Alain Deschenes, and Andre Pool.

This program is designed to detect music files that contain the
same music, even if the files are differently named and contain
different or nonexistent metainformation such as ID3 tags.

USAGE: 

	fdmf <music_dir>     
	fdmf_vector_pairs

USAGE EXAMPLE:

	./fdmf /home/bob/music/
	./fdmf_vector_pairs

fdmf_vector_pairs will print on stdout a list of pairs of files that may
contain the same music.  

NOTE: The first time you run this program it will do a lot of processing.  
It takes 10 minutes to process a music_dir containing 140 mp3 files on 
my Apple G4.  It will attempt to cache the result so that subsequent runs 
are faster.  You can interrupt it and your progress will not be lost.

ANOTHER NOTE: On my current development box, an IBM T20, mplayer is the
fastest mp3 decoder.  mpg123 is also quite fast, much faster than mpg321.  
Some Linux distros come with mpg123 being a symbolic link to mpg321 since
mpg123 is not under GPL.  Bottom line: you can use any program for decoding
if it can take the name of a music file as a command line arg and write the
raw decoded 44100/stereo/16-bit stream on stdout.  You might want to take 
some measurements on your system to see which decoder is fastest.

##################################################
TUNING:

There are important threshold parameters that affect the 
the detection of similar files by fdmf_vector_pairs.  If you don't
give fdmf_vector_pairs any command line arguments, then it will use
the default parameters, which should hopefully work OK.  Or you
can tune these parameters to improve the performance.  

##################################################
HOW IT WORKS:

1) Build a list of all music files in music_dir.

FOR EACH FILE:
	2) Decode/decompress file to raw binary data.
	3) Calculate the energy in 4 frequency bands for each 250 msec chunk.
	3.3) Calculate the sum of the 4 bands for each chunk.
	3.5) Calculate the (b2 + b3):(b0 + b1) ratio for each chunk.
	3.7) Calculate the (b0 + b2):(b1 + b3) ratio for each chunk.
	4) Calculate the power spectrum of steps 3.3, 3.5, and 3.7.
	5) Spline fit power spectra to a fixed set of frequency points.
	6) Quantize the result of step 5 to one bit using median as threshold. 
	6.5) Store the result of step 6 in a database.

FOR EACH POSSIBLE PAIR OF FILES:
7) Count the number of bits that are the same in the two spectra.
8) If the result of 7 exceeds a threshold, and both files still exist,
	print the filenames. 


fdmf keeps a cache of the results of step 6.  This cache persists from
one invocation of fdmf to the next in the form of a database file in the
user's home directory. (~/.fdmf)
If a music file has its step 6 result already in the cache, then step 2
through step 6 will be skipped for that file.  That speeds things up by about
two orders of magnitude.
If a music file is identical (data has the same MD5 hash) to another
file that has already been indexed, then the step 6 result will be copied
from the existing index entry to the new index entry.  This skips steps 
2 through 6 and speeds up processing of that file by about two orders of
magnitude.  In the worst case (no identical files) this slows things down
by about 1%.

After finding that two files in the database seem to match, fdmf_vector_pairs
checks to make sure that both files exist.  If they don't exist, it will
suppress the output regarding that pair.  The goal of this is to only
print messages about actual duplicate files, but to keep the records for
all files that have been analyzed.  If sometime later you get the exact
same file (based on MD5 sum) then fdmf will just reuse the summary
that it already has.  The summaries are small, so for most people,
it is worth it to cache these "ghost" records.

Steps 1 through 6.5 are implemented in fdmf.
Steps 7 and 8 are implemented in fdmf_vector_pairs in C.
################################################

DEVELOPMENT GOALS:

It would be really nice to synchronize/merge two music collections
by exchanging tables of results from step 6 of the algorithm.  This
reminds me of rsync.   I think that it might make sense to qualify
the filenames in the database with the host that they are on.  Right
now it is just a UNIX path.  Maybe better would be an NFS-style name
like foobox:/home/john/music/funsong.mp3.

It should be easy to extend the program to support all types of music
files.  It currently supports mp3, ogg, m4a, wma, wav, and realaudio
using mplayer.  But any format that has a command line player that can 
output a raw stream should be easy to support. 

It would be really good to have a GNU configure script.  If somebody
wants to write one, that would be great.

The FFTW fourier transform library that fdmf_sonic_reducer  uses has 
a "plan" structure that controls how the FFT will be performed.  
It is possible to cache this plan from one invocation to the next in 
a file.  This might speed things up.  But currently about 90% of the 
CPU time required to index music files is the time it takes to decode 
the music file (mpg123, for example), it is questionable whether it
is worth worrying about optimizing fdmf_sonic_reducer.  It might be more 
beneficial to see whether there is something that can be done to speed 
up the decoding.  Skipping frames in the decoding is possible in mpg123,
but in my early testing it seemed to increase the number of classification
errors.
##############################################

KNOWN BUGS:

##############################################

HACKING:

If you want to hack around, you might find useful the stderr output
of fdmf_vector_pairs.  It is four numbers in the following format:

<THRESH 0> <THRESH 1> <THRESH 2> <NUM TYPE1 ERRS> <NUM TYPE2 ERRS>

This can be used to print out the number of classification
errors that the program is making.  It will tell you how many times 
fdmf_vector_pairs said that the pair of files contained the same music when 
they really didn't, and how many times fdmf_vector_pairs said that the pair 
of files did not contain the same music when they actually do.  For 
the program to do this, you must give it a big hint in the naming 
of the files.  It looks at the basenames.

Here is how to use this feature:

	1) Create a directory inside your music_dir.
	2) Recode your music files and put the results in the new directory.

When you recode your files, the basename of the file should be the same
as the original.  Just the directory is different.  After you recode,
the contents will be different, even though the files my sound the same 
when you play them.  A shell command for recoding is

	mpg123 -s foo.mp3 | bladeenc STDIN new_dir/foo.mp3

When fdmf_vector_pairs runs, it checks to see whether the basenames of the 
filenames are the same.  If the classification algorithm judged the contents 
to be different but the files have the same basename, it increments the 
type2 error counter.  If the classification algorithm judged them to have
the same contents but basenames are different, the type1 error counter
is incremented.  These two counters are printed to stderr at the end of the
run.  

When generating this synthetic test data, you can also add some silence to
the beginning of the files.  You can do this with dd if=/dev/zero.  
This helps to make the tests more realistic.
Also, it is good to use different encoders.  I use blade and lame for
MP3 encoding.  Anything that you can do to simulate real-world variations
in the representation of the music will make the tests more meaningful.

FDMF_BENCH:

A little script called fdmf_bench is included in the fdmf tarball.  It will 
run fdmf_vector_pairs over a range of parameter values.  It will output a file
called gpd.  gpd output line format:

<p1>	<p2>	<p3>	<t1 + t2>	<t1>	<t2>

To use this, you must have a set of music files that were prepared 
as described above.  It will be totally meaningless to run this for 
a regular (non-synthetic) set of files.
To find good parameter values for a set of synthetic data, you can
use something like this:

	sort -n -k 4 gpd | less


OPTPARAMS:

Another little script is included that uses fdmf_bench to search the 
entire parameter space.  It is useful for finding reasonable parameters
in a reasonable amount of time.  Usually this isn't necessary because 
the default parameters defined in fdmf_vector_pairs.h should be OK, but when
somebody makes a change to the fdmf_sonic_reducer algorithm, it is necessary 
to find new parameter values.

AUTOMATIC DUPLICATE HANDLING (cleanup_dups):

The fdmf_vector_pairs program takes and optional -0 (that's a zero) argument.
This has the same purpose as the -print0 option for the UNIX find program.
It causes the output of the program to be delimited by null bytes instead 
of the usual human readable delimiters, tab and newline.  This way, the
output can be fed to another program and the other program can unambiguously
parse the stream into separate filenames. 

With the -0 option, the output of fdmf_vector_pairs can be piped to another 
program which takes in the list of matching pairs of music files and
takes some action.  All kinds of logic can be employed to decide what 
to do.  Maybe you want to move duplicate files to some other place.  
Or maybe you want to delete them.  Or maybe you want to delete one
and replace it with a link to the other.  Or maybe you want to have 
some user interaction.    

A program called cleanup_dups is included as an example of this possibility.
The program is just an example, and it just finds which file of each pair
is smaller and copies that file to a directory in /tmp.  This is documented
at the top of the file cleanup_dups.  You are encouraged to hack 
cleanup_dups to add your own logic.  Please be careful.  Your music
collection is precious.  Back it up to CDROM/DVD/whatever.  Have fun.