Hunting Hashes

I’ve been working on a thing that looks and acts like a hash, but it isn’t designed to uniquely identify a set of bytes. It’s designed to represent a group of similar files: like a bucket, it can contain many drops of water.

The requirements I hope to meet are as follows:

  • Changing a few bits is OK; the identifier should stay the same
  • Changing too many bits should change the identifier by a similar magnitude
  • Identifiers should allow for bit-wise distance measurement
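
To make the last requirement concrete, here is a minimal sketch of a bit-wise distance, assuming identifiers are stored as non-negative integers. The Hamming distance used here is one common choice and not necessarily the measure HuntingHash uses; the function name and the two sample identifiers are made up for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two non-negative integer identifiers differ."""
    # XOR leaves a 1 in every position where the two values disagree.
    return bin(a ^ b).count("1")


# Hypothetical 16-bit identifiers, invented for this example.
id_a = 0b1011_0010_1110_0001
id_b = 0b1011_0110_1110_0101

print(hamming_distance(id_a, id_b))  # the two values differ in 2 bit positions
```

A small distance would suggest two files fall in or near the same bucket; where to draw the line is a tuning decision this sketch does not address.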

My initial objective was to identify malicious files, email spam, and websites, and eventually to measure plasmid similarity.

I thought statistics was difficult in college, but once a customer asked me about the rate of false positives, my mind broke. Type I and Type II errors (false positives and false negatives) occur when providing context to the thing being observed.

The algorithm is visual and 2-D, which is handy if there are lots of GPUs available on eBay. While many algorithms used in security or crypto finance leverage entropy, in this context we attempt to diminish entropy.

Procedure

Take any block of bytes. We will only consider 8-bit bytes, regardless of the content. We will make an exception for plasmids later, so for now we just talk about bytes. The algorithm I’ve settled on has a propensity for amplifying floating-point rounding errors. I can only explain my strange attraction to this algo by its obscure nature.

Lanczos resampling is used to make thumbnail images that look better than those produced by other image resampling techniques. I decided to couple it with the Hilbert curve. We do two simple but computationally expensive operations: map the bytes onto a Hilbert curve large enough to contain the file, then shrink the result to a few pixels. Sounds stupid, right? Simple statistics show that it works well enough to match plasmids, PDFs, or binaries.
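
The two steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code from the repo: the function names are made up, unused cells are padded with zeros, and the shrink step uses plain block averaging as a stand-in for Lanczos resampling to keep the example dependency-free.

```python
def d2xy(order: int, d: int) -> tuple[int, int]:
    """Convert distance d along a Hilbert curve of side 2**order to (x, y)."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before rotating it.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def bytes_to_grid(data: bytes) -> list[list[int]]:
    """Lay bytes along the smallest Hilbert curve that can hold them."""
    order = 1
    while (1 << (2 * order)) < len(data):
        order += 1
    n = 1 << order
    grid = [[0] * n for _ in range(n)]  # assumption: pad unused cells with 0
    for d, value in enumerate(data):
        x, y = d2xy(order, d)
        grid[y][x] = value
    return grid


def shrink(grid: list[list[int]], size: int) -> list[list[int]]:
    """Reduce a square grid to size x size by block averaging.

    Stand-in for Lanczos resampling; size must divide the grid side evenly.
    """
    block = len(grid) // size
    out = []
    for by in range(size):
        row = []
        for bx in range(size):
            total = sum(
                grid[by * block + j][bx * block + i]
                for j in range(block)
                for i in range(block)
            )
            row.append(total // (block * block))
        out.append(row)
    return out


# Usage: a 16-byte input fills a 4x4 grid, which is shrunk to 2x2 "pixels".
thumbnail = shrink(bytes_to_grid(bytes(range(16))), 2)
print(thumbnail)
```

The Hilbert curve is a natural fit here because bytes that are close together in the file land close together in the image, so local edits tend to disturb only a small region before the shrink step smooths them out.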

After playing with this for a while, I hope the community of people who search for things (hunters) will enjoy giving it a try. If there is interest, I’ll add a module to YARA that will assist in identifying threats.

The repo is at https://github.com/wessorh/HuntingHash

If your first thought is that an integer might make a good signature, you are not alone.



