When I started this blog, I wanted to write about computer security. It turns out I preferred to write about my distractions from computer security, so now I’m working on writing more about Icewater. Several years ago I started a project that I thought could help with visualizing binary code, but as with any research, I had no clue whether it would actually be helpful. Like most ideas, it wasn’t originally mine, but after lots of research, programming, analytics and time, I can call the improvements to prior art mine, or so the USPTO says.
I am fascinated with patterns in nature and the natural world. I noticed a pattern called space-filling curves in nature, and I saw it used over and over in computer science too. Once I realized that it also shows up in DNA, I decided that put it well above the noise and began to take the idea more seriously.
About 2.4 billion years ago we (as in “life on Earth”) got a software upgrade. The first branch of the Tree of Life leveraged a space-filling curve to organize the DNA inside the nucleus of a cell. When I learned this I thought: that is a good indicator of a successful algorithm. I wondered whether using it would help in understanding the internal structure of code without running it.
Programs are executed mostly in a linear way. We first started writing them for single CPUs and then scaled them to multi-core CPUs, all the while networking those together with glue called “network protocols”. I like how nature simplifies things to scale them. Think “slow down to go faster.” I realized that I could perform scalable analysis on many things if I could leverage a new kind of programming called parallel coding which requires a different kind of thinking.
This description is an oversimplification: Take some data (a program) and make a picture out of it. There are many ways to do this and mapping the bytes of a program to pixels is simplified if we can do it using an algorithm that works well on a GPU.
I wanted to find similar executable code without running code. I call this the “Stopping Problem”, after Turing’s Halting Problem. Finding new bad stuff is harder if you need to run the code. Since I was poor and couldn’t run the code I had to figure out a way to “look” at the code and understand what code would run without actually running it.
Today it takes about five minutes (worst case) to run a sample in a sandbox. I wanted to bring that down by several orders of magnitude. In half a millisecond per core with a GPU, roughly 600,000 times faster than those five minutes, Icewater can give you a really good hint of whether you should sandbox the sample, or whether the sample is enough like something you have already analyzed. Everyone loves binary decisions.
By folding the code into a two-dimensional Hilbert curve (which is super fast), it’s easy to generate a viewable image of the code. The rest of the process should be intuitive to people who are good at interpreting pictures. If you are interested in more details, look at some of the patents.
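To make the folding idea concrete, here is a minimal sketch of laying bytes along a two-dimensional Hilbert curve. It is not Icewater’s implementation, which is described only in the patents. The function names `hilbert_d2xy` and `bytes_to_hilbert_image` are made up for illustration, and it is plain Python on a CPU rather than GPU code. It uses the standard index-to-coordinate conversion for a Hilbert curve on a square grid whose side is a power of two.

```python
def hilbert_d2xy(side, d):
    """Convert position d along the curve to (x, y) on a side x side grid.

    side must be a power of two; d runs from 0 to side*side - 1.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so the sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def bytes_to_hilbert_image(data, order):
    """Place each byte at its Hilbert-curve position in a grayscale grid.

    The grid is 2**order pixels on a side. Extra bytes are dropped and
    unused pixels stay 0 (an arbitrary choice for this sketch).
    """
    side = 2 ** order
    image = [[0] * side for _ in range(side)]
    for d, value in enumerate(data[: side * side]):
        x, y = hilbert_d2xy(side, d)
        image[y][x] = value
    return image


if __name__ == "__main__":
    # Sixteen bytes folded into a 4x4 grid, one row per line.
    for row in bytes_to_hilbert_image(bytes(range(16)), order=2):
        print(row)
```

Each pixel’s position depends only on its own index `d`, so the loop body has no dependence on earlier iterations. That independence is what makes this kind of mapping a natural fit for one-thread-per-byte execution on a GPU. Bytes that are neighbors in the file also end up as neighboring pixels, which helps local structure in the code show up as visible texture.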
Over the past few years I have indexed something like 700 million pieces of malware. I’m working on exposing this stuff to developers, and I’m looking for folks from the computer security field to give me ideas on how they might be able to leverage it.
I digest about 400K samples a day, find the “interesting samples” and sandbox those. After a few years’ worth of sandbox reports, the world of good and bad begins to resolve itself. Average analysis time is one-half millisecond using COTS servers and a mid-range GPU. It’s no more magical than life. (What would you expect from a 2 billion year old algorithm?)
If you are thirsty for solutions that approach the problem of computer security and file safety, reach out to firstname.lastname@example.org because I’d like to see if Icewater could help you. I have customers using this stuff, but they won’t let me tell you who they are or how they use it.