Similarity Comparison with SDHASH (fuzzy hashing)


Being a fan of ssdeep for fuzzy hashing, I was interested in this article comparing ssdeep to sdhash.

As the article says, ssdeep basically breaks a file into pieces, hashes each piece, and concatenates the piece hashes into a signature for the whole file. Because each small chunk is hashed independently, two files that share many chunks will produce similar signatures, and the comparison measures that overlap.

sdhash, on the other hand, uses probabilistic feature extraction to find features that are unlikely to occur randomly. The extracted features are hashed and put into a Bloom filter. To compare two files, their Bloom filters are compared using a Hamming distance measure. The more similar the filters, the lower the Hamming distance.
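To get an intuition for the Hamming-distance step, here is a toy sketch. The 16-bit strings below are made up for illustration and are far smaller than real sdhash Bloom filters; the idea is simply counting the bit positions where two filters disagree:

```shell
# Two toy 16-bit "Bloom filters" (hypothetical values, not real sdhash output)
a=1011001010001101
b=1011011010001001

# Count the positions where the bits differ (the Hamming distance)
hamming=$(awk -v x="$a" -v y="$b" 'BEGIN {
  d = 0
  for (i = 1; i <= length(x); i++)
    if (substr(x, i, 1) != substr(y, i, 1)) d++
  print d
}')
echo "$hamming"   # here the strings differ in 2 positions
```

The fewer positions differ, the more likely the two underlying files share extracted features.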

The article An Evaluation of Forensic Similarity Hashes demonstrated that, with these different approaches, sdhash appears to outperform ssdeep in many instances. Because of this, I took a closer look at sdhash.

One of the more interesting features of sdhash is the ability to tell the command how many threads you want to spawn. In Roussev's paper he talks about parallelization: his test machine had 12 processors with 24 cores, and in many of his tests he was spawning 24 threads.

So, wanting to get a feel for sdhash, I built it on a 2.4 GHz Intel Core 2 Duo machine. Building was straightforward on OS X: make && sudo make install
$ sdhash --version
sdhash 2.3alpha by Vassil Roussev, Candice Quates, August 2012, rev 476
       http://sdhash.org, license Apache v2.0
When doing a comparison you first need to create a list of SDBFs (similarity digest Bloom filters) for the files of interest. The switches I was most interested in were:

  -r [ --deep ]            generate SDBFs from directories and files
  -g [ --gen-compare ]     generate SDBFs and compare all pairs
  -c [ --compare ]         compare all pairs in SDBF file, or
                           compare two SDBF files to each other

So, let's say I had a corpus of illegal images in a folder (and potentially many sub-folders) named "evidence". I could create a hash of each image using the command:
$ sdhash -r evidence/ > illegal_images.sdbf
In this case I am outputting the results to the file "illegal_images.sdbf" using a standard >, but you can also use -o to specify the output file.

If I then want to compare all the hashes in illegal_images.sdbf to hashes created from a new case, I would first create an SDBF file (test_images.sdbf) for the new machine's files, and then compare:
$ sdhash -c illegal_images.sdbf test_images.sdbf
The output will be the names of the two compared files plus a similarity score, all separated by pipes ( | ). For an explanation of the scoring system, please see Content triage with similarity digests: the M57 case study:
warnings.py|warnings.py|100
opera8-2png1txt/file2.png|flask.ui/static/jquery.js|005
opera8-2png1txt/file2.png|flask.ui/venv/bin/python|009
opera8-2png1txt/file2.png|opera8-2png1txt/request.txt|066
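Since the output is pipe-delimited, it is easy to post-process. As a sketch, assuming comparison output like the above was saved to a file (scores.txt is my own name for it), awk can keep only the strong matches:

```shell
# Hypothetical saved comparison output (the sample scores from above)
cat > scores.txt <<'EOF'
warnings.py|warnings.py|100
opera8-2png1txt/file2.png|flask.ui/static/jquery.js|005
opera8-2png1txt/file2.png|flask.ui/venv/bin/python|009
opera8-2png1txt/file2.png|opera8-2png1txt/request.txt|066
EOF

# Keep only pairs scoring 50 or higher; $3 + 0 forces "066" to the number 66
awk -F'|' '$3 + 0 >= 50' scores.txt
```

In a triage scenario, this would surface only the pairs worth a manual look.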
As Roussev says in his papers, the score is a measure of similarity, not identity. In other words, even if the score is 100, the files may not be exactly the same.

To test this, I have a 3.3K text file named "Makefile". First, I create an MD5 and fuzzy hash of the file:
$ sdhash Makefile > make.sdbf
Result: MD5 (Makefile) = 4fe15b4cf1591cdfe92b7efd65d291ec
sdbf:03:8:Makefile:3417:sha1:256:5:7ff:160:1:50:AICBAiDMAAAAAoEACMAQAAAICA5gACQIgBAAQYAAAAACAhAAAFAEACIAIAoEAACCAAICAAAAAEgEEMCgQAACBAAACAAQAABCCCBSQFgACAAAABLAAiBCciCAAgAIAACIBCQAAgGAAAASJhAgAAAUAAgAACQAAIEAEAAIAAAIAIAAAAaQCAAQEgABAWABAAAAEAAAAAAik6CgAQEAAAAABRQUxwAAAEEGAAAACAAQAEAAABAAAAA8CAiAAAEpgIAkAAQGAAICCAYSBsIFEQCoiAAAQIYCAAAIAASAAAAMSCABOCAAAARAAAQAIAAEFIABAgACgAACAEAIDAIAAACEEg== 
Next, I modify "Makefile" to remove the first "# " (hash and a space), then create an MD5 and fuzzy hash again and compare.

Result: MD5 (Makefile) = d7e9182ee1d8cb7c6ad41157637c7d62
sdbf:03:8:Makefile:3415:sha1:256:5:7ff:160:1:50:AICBAiDMAAAAAoEACMAQAAAICA5gACQIgBAAQYAAAAACAhAAAFAEACIAIAoEAACCAAICAAAAAEgEEMCgQAACBAAACAAQAABCCCBSQFgACAAAABLAAiBCciCAAgAIAACIBCQAAgGAAAASJhAgAAAUAAgAACQAAIEAEAAIAAAIAIAAAAaQCAAQEgABAWABAAAAEAAAAAAik6CgAQEAAAAABRQUxwAAAEEGAAAACAAQAEAAABAAAAA8CAiAAAEpgIAkAAQGAAICCAYSBsIFEQCoiAAAQIYCAAAIAASAAAAMSCABOCAAAARAAAQAIAAEFIABAgACgAACAEAIDAIAAACEEg==
Comparison of fuzzy hashes:
$ sdhash -c make.sdbf make_change.sdbf
Resulting in:
Makefile|Makefile|100 
So from this we see that the MD5 hashes (4fe15b4cf1591cdfe92b7efd65d291ec and d7e9182ee1d8cb7c6ad41157637c7d62) changed dramatically - as expected. sdhash, however, still returned a score of 100, meaning the files are very similar even though the content differs slightly. I may run more experiments to see how much a file can change before the score drops, but that is for a later day.

Another feature I am interested in is the specification of the number of threads.
  -p [ --threads ] arg (=1)             compute threads to use
Using this, we can (hopefully) easily utilize multi-core machines or even server farms to make hash generation and comparison faster.
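As a hedged sketch of the server-farm idea: besides sdhash's internal -p threading, you could shard the file list yourself and run one hashing process per file in parallel with xargs -P. Here md5sum stands in for sdhash so the example runs anywhere, and corpus/ and its file names are made up:

```shell
# Build a tiny throwaway corpus (stand-in for a real evidence folder)
mkdir -p corpus
printf 'aaa' > corpus/f1
printf 'bbb' > corpus/f2
printf 'ccc' > corpus/f3

# Hash files with up to 2 parallel worker processes (-P 2), one file each (-n 1).
# With sdhash installed, md5sum could be swapped for a per-shard sdhash run.
find corpus -type f | xargs -P 2 -n 1 md5sum > corpus.hashes

wc -l < corpus.hashes   # one digest line per file
```

The same pattern extends to distributing shards of the file list across machines.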

My test machine has 1 processor with 2 cores. Run without the -p flag, sdhash used 1 thread and around 80% - 90% of the CPU. When 'sdhash -p 2' was specified, 2 threads were spawned and 170% - 190% of the processor was used (both cores). So using multiple cores is quite easy; however, it may not always be the best option. In one of his papers, Roussev notes that running some threads on multiple cores of the same processor may not result in increased performance, but instead a competition for the processor's time. This appeared to be the case in my brief experiment:

$ time sdhash -r . > test.hashes
real 2m20.365s
user 0m57.093s
sys 1m11.767s
$ time sdhash -p 2 -r . > test.hashes
real 2m38.744s
user 2m20.538s
sys 2m41.520s
So, in other words, you may want to test to make sure the way you are processing is the most efficient. I was hashing only a few hundred files and not doing any comparison; if it were thousands, a lot of time might have been wasted.

Roussev suggested creating hashes at the same time as imaging, by directing the output of the suspect device both to an image file and to sdhash as input. He suggested dcfldd. I've not tried it, but if you could hash and image at the same time, there could be some definite time savings depending on your forensic process.
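A sketch of the idea, using tee to split one read of the device into both an image file and a hashing tool. /tmp/fake_device stands in for a real suspect device, and md5sum stands in for sdhash, since I have not verified whether sdhash reads from stdin:

```shell
# Stand-in for the suspect device (a real run would read e.g. /dev/sdb)
printf 'pretend this is a disk' > /tmp/fake_device

# One pass over the data: tee writes the forensic image while the same
# stream is piped on to the hashing step
dd if=/tmp/fake_device 2>/dev/null | tee /tmp/evidence.dd | md5sum > /tmp/evidence.md5

# Sanity check: the image must be byte-identical to the source
cmp /tmp/fake_device /tmp/evidence.dd && echo "image matches source"
```

The saving comes from reading the (slow) source media once instead of twice.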

Overall, fuzzy hashing appears interesting for finding similarity; however, I would like to test this method on full-size images vs. thumbnails. Since this method is NOT content analysis, but instead feature selection over raw data, I would be surprised if it found much similarity between an image and a corresponding thumbnail. If anyone has tested this, please leave a comment with your results.
