Validating Malware Classification with Genomic Data

The problem I’d like to describe here is the lack of a well-defined corpus of cybersecurity threat intelligence. Without such a corpus, we cannot measure the utility of the various solutions. Many databases of malware samples are available, but each sample is labeled differently by each company that “detects” it. There are also collections of signatures that detect malware and give it a label, but a signature detects only a specific lineage, not every sample that carries the same label.

As with any annoyance, we set out to fix it: we attempted to solve the labeling issue by creating a clustering technique, one that can also be applied to genomic data. I chose genomic data because there are large, well-documented databases of sequences. Each sequence comes with a publication, authors, and citations. Collections of malware come with nothing, or at best several labels pointing to a marketing document. So we will first review the samples contained in the NIH GenBank.

The NIH supports a data bank of genomic data called GenBank. I’ve chosen release 259.0, released on December 22, 2023, which contains about 6.7Tb of uncompressed sequence data covering millions of life forms, from viruses and bacteria to humans. It’s a deep dive into the code of life.

This database version has 249,060,436 sequences and 2,570,711,588,044 bases for traditional GenBank records, and 3,711,386,807 sequences and 25,371,955,930,639 bases for set-based (WGS/TSA/TLS) records.

We removed every sequence smaller than 3,072 bp (1,024 amino acids) and clustered the rest using the same pipeline we use for clustering malware.
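The length filter itself is simple enough to sketch. The following is a minimal illustration, not the actual pipeline: it assumes the GenBank records have already been exported to FASTA text, and the names `read_fasta`, `keep_long_sequences`, and `MIN_BASES` are hypothetical. It keeps sequences of exactly 3,072 bases, which is one reading of "smaller than" above.

```python
from typing import Iterable, Iterator, List, Tuple

# Threshold taken from the text: 3,072 bases (1,024 codons).
MIN_BASES = 3072


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_long_sequences(
    records: Iterable[Tuple[str, str]], min_bases: int = MIN_BASES
) -> List[Tuple[str, str]]:
    """Drop every record whose sequence is shorter than min_bases."""
    return [(h, s) for h, s in records if len(s) >= min_bases]


if __name__ == "__main__":
    demo = [">short", "ACGT", ">long", "A" * 2000, "C" * 1072]
    for header, seq in keep_long_sequences(read_fasta(demo)):
        print(header, len(seq))
```

Reading the records lazily and filtering before clustering keeps the short sequences from ever reaching the expensive step.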

Of the 10.6 million sequences with more than 3,072 bases, the pipeline produced 2,125,295 clusters. We then looked into each cluster to determine how many of its sequences had the same locus. The locus is like the title of the sequence in the NIH/GenBank data. Of those 2+ million clusters, 92% contained a single locus.

Put simply, we cluster the sequences, label each one with its locus (the GenBank title), and then ask whether every sequence in a cluster has the same title. If so, we call that a 100% labeled cluster. 92% of the clusters had samples that all carried the same title.
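The "100% labeled" check can be expressed in a few lines. This is a sketch under assumptions, not the code that produced the 92% figure: it supposes the clustering output is available as (cluster id, locus) pairs, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple


def locus_sets_by_cluster(
    assignments: Iterable[Tuple[Hashable, str]]
) -> Dict[Hashable, Set[str]]:
    """Collect the distinct loci seen in each cluster."""
    loci: Dict[Hashable, Set[str]] = defaultdict(set)
    for cluster_id, locus in assignments:
        loci[cluster_id].add(locus)
    return dict(loci)


def fraction_fully_labeled(assignments: Iterable[Tuple[Hashable, str]]) -> float:
    """Share of clusters whose members all carry the same locus."""
    loci = locus_sets_by_cluster(assignments)
    if not loci:
        return 0.0
    pure = sum(1 for names in loci.values() if len(names) == 1)
    return pure / len(loci)


if __name__ == "__main__":
    toy = [(0, "a"), (0, "a"), (1, "b"), (1, "c"), (2, "d"), (3, "e")]
    print(fraction_fully_labeled(toy))
```

Note that this counts a single-member cluster as fully labeled, since it cannot disagree with itself; whether such clusters should be included is a choice the text does not specify.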

Of the 8% of clusters that had multiple loci, many looked like this (sample count, then the set of loci):

28390 {‘acute respiratory syndrome coronavirus 2’, ‘acute respiratory syndrome coronavirus 2 (SARS-CoV-2)’, ‘acute respiratory syndrome-related coronavirus’}

The above demonstrates that we need to normalize the locus. This cluster held 28,390 samples; they were all the same virus but had differing loci.
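What normalization might look like is an open question. As a first, deliberately naive sketch (the function `normalize_locus` and its rules are my assumptions, not part of the pipeline), one could lowercase the title, drop parenthetical suffixes, and collapse whitespace:

```python
import re

# Matches a parenthetical such as " (SARS-CoV-2)" together with leading spaces.
_PAREN = re.compile(r"\s*\([^)]*\)")


def normalize_locus(title: str) -> str:
    """Lowercase a locus title, strip parentheticals, collapse whitespace."""
    without_parens = _PAREN.sub("", title)
    return " ".join(without_parens.lower().split())


if __name__ == "__main__":
    cluster = {
        "acute respiratory syndrome coronavirus 2",
        "acute respiratory syndrome coronavirus 2 (SARS-CoV-2)",
        "acute respiratory syndrome-related coronavirus",
    }
    print(sorted({normalize_locus(t) for t in cluster}))
```

On the cluster above this merges the first two titles but still leaves the "syndrome-related" variant separate, which suggests that string cleanup alone is not enough and that some mapping to a shared taxonomy would probably be needed.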

4355 {‘acute respiratory syndrome coronavirus 2’, ‘acute respiratory syndrome coronavirus 2 (SARS-CoV-2)’, ‘construct’}

The above cluster deserves further investigation to understand which samples are a “construct” and when they were published. Since 2019 the database has added some 6 million SARS samples, so many of these examples will be about SARS.

5 {‘Hu/GII.4/New Orleans1805/2009/USA’, ‘GII’, ‘Hu/GII.4/C00007934/2010/UK’, ‘GII/Hu/NL/2012/GII.4/Nijmegen02’, ‘GII/Hu/NL/2012/GII.4/Nijmegen01’}

The above cluster seems to contain the same virus under similar names. I’m not a biologist, but I’m confident that these too are similar.

Next in this series, I’ll explain how the malware pipeline works.



