12.11
Grouping malware with similar binary structure saves time and effort. For a standalone part-time researcher, such a productivity gain is invaluable. As you collect malware, you will accumulate many samples over time, perhaps 2000 of them. Processing all of them is costly. To save time and effort, we want to set aside duplicates and near-duplicates of the same family. What can one do?
For this problem, we assume all the files are malicious as honeypots do not collect innocent software.
One way is to use virus scanners to scan and classify the files. After a scan, group together all the files that are detected as “Conficker.B”, for example. As the Conficker family is quite prevalent, identifying such duplicates can save a lot of time and effort: analyzing just one or two of them is sufficient. The drawback is that all the undetected samples are left as one big group that you must analyze one by one.
Extract of a clamscan result…
/tmp/4c71b97435a24ffb8fd7fedd1b1790e1: OK
/tmp/82dd3a3d386d4ea09870dcee4a75a531: OK
/tmp/72bdd3bd37a0b5d1dd5f1be80cb29639.bin: OK
/tmp/24bd1722b994f7daa193458348108bfc.bin: OK
/tmp/39960c5ff1922466ded71a4a2799c295: Trojan.VanBot-366 FOUND
/tmp/33f5f14c33bf2f71556204705407a885: W32.Virut-54 FOUND
/tmp/880ce6df69aaeb1d3c57e756f53dd158.bin: Trojan.Delf-911 FOUND
/tmp/7e0ce66bb299370010016f4522152969: Trojan.VanBot-366 FOUND
/tmp/4f2d9f8129e7d7fd9b37f700aacdc9aa.bin: Trojan.Hupigon-25647 FOUND
/tmp/5b69ff6f331ece36558516f66306f969: Trojan.Small-4287 FOUND
/tmp/078aedb8630339487cf39d028b0156bd.bin: OK
/tmp/417bdef0688996a845701da9dcf1b145: Trojan.VanBot-366 FOUND
/tmp/eda3b7766c23dfffc0b85d0ba546b0c1: W32.Virut-54 FOUND
/tmp/86f22ff53382dbb54e2c22560a3db373: Trojan.VanBot-366 FOUND
/tmp/a4a41d2122c4d3552e3d59315f42d4e3: W32.Virut-54 FOUND
In the above, without signatures, how can you tell whether 4c71b97435a24ffb8fd7fedd1b1790e1 and 82dd3a3d386d4ea09870dcee4a75a531 belong to the same family? How can you tell which malware is unique? You have to analyze them. Now scale the problem to perhaps 600 samples, with only yourself to do the work.
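The scanner-based grouping described earlier can be sketched in a few lines of Python. This is only an illustration: the function name `group_by_detection` and the `UNDETECTED` label are made up for this example, and the parser assumes the `path: NAME FOUND` / `path: OK` line format seen in the extract.

```python
from collections import defaultdict

def group_by_detection(scan_lines):
    """Map each detection name to the files reported under it.

    Files reported as OK are collected under the made-up label
    "UNDETECTED"; any other kind of line is ignored.
    """
    groups = defaultdict(list)
    for line in scan_lines:
        line = line.strip()
        if ": " not in line:
            continue
        path, verdict = line.rsplit(": ", 1)
        if verdict == "OK":
            groups["UNDETECTED"].append(path)
        elif verdict.endswith(" FOUND"):
            groups[verdict[: -len(" FOUND")]].append(path)
    return dict(groups)

# A few lines taken from the clamscan extract.
sample = """\
/tmp/4c71b97435a24ffb8fd7fedd1b1790e1: OK
/tmp/39960c5ff1922466ded71a4a2799c295: Trojan.VanBot-366 FOUND
/tmp/33f5f14c33bf2f71556204705407a885: W32.Virut-54 FOUND
/tmp/7e0ce66bb299370010016f4522152969: Trojan.VanBot-366 FOUND
""".splitlines()

groups = group_by_detection(sample)
for name, files in sorted(groups.items()):
    print(name, len(files))
```

Everything that lands in the `UNDETECTED` group is what still has to be sorted out by other means.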
The other way is to use ssdeep, a fuzzy hashing tool. It is used to match inputs that are similar but differ in some bytes or in length. It produces a hash signature like md5 but, unlike md5, a single changed byte does not produce a wildly different signature. The concept of ssdeep is to chop the file into many sections and calculate a hash for each section.
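To make the "hash each section" idea concrete, here is a toy sketch in Python. It is not ssdeep's algorithm: ssdeep picks section boundaries from the content itself (context-triggered piecewise hashing), while this simplification cuts at fixed offsets. The function name, the 64-byte block size and the stand-in data are all assumptions for illustration.

```python
import hashlib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def toy_piecewise_hash(data, block_size=64):
    """Hash each fixed-size block and keep one character per block."""
    chars = []
    for start in range(0, len(data), block_size):
        digest = hashlib.md5(data[start:start + block_size]).digest()
        chars.append(ALPHABET[digest[0] % 64])
    return "".join(chars)

original = bytes(i % 251 for i in range(1000))  # stand-in for a real file
modified = original + b"1"                      # one extra byte at the end

sig_a = toy_piecewise_hash(original)
sig_b = toy_piecewise_hash(modified)

# The whole-file md5 values have nothing in common...
print(hashlib.md5(original).hexdigest())
print(hashlib.md5(modified).hexdigest())
# ...while the piecewise signatures can only differ in the last block.
print(sig_a)
print(sig_b)
```

With fixed offsets, a byte inserted at the start of the file would shift every block and change the whole signature, which is one reason a real tool chooses its boundaries from the content instead.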
Below I take a sample exe file (“file1.exe”), copy it, append a little data to the copy (“file2.exe”; note that echo 1 adds the character “1” plus a newline), and compute the md5 sum of the two files.
$ cp file1.exe file2.exe
$ echo 1 >> file2.exe
$ md5sum file1.exe file2.exe
72bdd3bd37a0b5d1dd5f1be80cb29639 file1.exe
a626b78fa6ba13fdd9cfddb9f55ee7c6 file2.exe
Just a tiny difference at the end of the file, and the md5 hash is completely different. Now let us compute the ssdeep hashes of the two files.
(broken into lines for clarity)
$ ssdeep -b file1.exe file2.exe
ssdeep,1.0--blocksize:hash:hash,filename
768:my+qxlsz7yiV0+7YUaFhLFAtVI0xbM
LvzEg1B1Ki8nJ78:R+qxlsHvGhLFyI0l8tC5J78,"file1.exe"
768:my+qxlsz7yiV0+7YUaFhLFAtVI0xbM
LvzEg1B1Ki8nJ7V:R+qxlsHvGhLFyI0l8tC5J7V,"file2.exe"
Separated by colons, the first field (768) is the block size, followed by two ssdeep hashes (my+qxlsz7yiV0+7YUaFhLFAtVI0xbMLvzEg1B1Ki8nJ7V and R+qxlsHvGhLFyI0l8tC5J7V), and finally the file name (“file2.exe”). The main point is the two hashes, which form the signature of the file. The hashes of the two files are identical except for the last character (“8” vs “V”).
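The field layout just described can be pulled apart with a small Python helper. This is a sketch under assumptions: `Signature` and `parse_ssdeep_line` are names invented here, and the helper expects one unbroken signature line (not the header line, and not the line-wrapped form shown for readability).

```python
from typing import NamedTuple

class Signature(NamedTuple):
    block_size: int
    hash1: str
    hash2: str
    filename: str

def parse_ssdeep_line(line):
    """Split 'blocksize:hash:hash,"filename"' into its four fields."""
    # The hashes contain no commas, so the first comma ends the signature.
    sig, _, name = line.strip().partition(",")
    block_size, hash1, hash2 = sig.split(":", 2)
    return Signature(int(block_size), hash1, hash2, name.strip('"'))

line = ('768:my+qxlsz7yiV0+7YUaFhLFAtVI0xbMLvzEg1B1Ki8nJ7V'
        ':R+qxlsHvGhLFyI0l8tC5J7V,"file2.exe"')
sig = parse_ssdeep_line(line)
print(sig.block_size, sig.filename)
print(sig.hash1)
print(sig.hash2)
```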
If you have a large number of unidentified malware samples, antivirus scanners will not help to classify them, but ssdeep can try. Below is an extract of the file-matching output from ssdeep. Each file name is the md5 of the file itself.
$ ssdeep -dr .
…
/tmp/72bdd3bd37a0b5d1dd5f1be80cb29639.bin matches /tmp/fa7c91b738e763eccf69676bd393925e.bin (88)
/tmp/72bdd3bd37a0b5d1dd5f1be80cb29639.bin matches /tmp/ae142ce3b35cc04f5648a0c17c37ea30.bin (82)
/tmp/72bdd3bd37a0b5d1dd5f1be80cb29639.bin matches /tmp/794b74fc4e833d245eb005e078dc21da.bin (82)
/tmp/72bdd3bd37a0b5d1dd5f1be80cb29639.bin matches /tmp/46fb9678675df8dc83d38761a76c7950.bin (99)
/tmp/72bdd3bd37a0b5d1dd5f1be80cb29639.bin matches /tmp/f412d41aacb4b16ded7b158b89fd3552.bin (90)
/tmp/72bdd3bd37a0b5d1dd5f1be80cb29639.bin matches /tmp/4bfba885ed3dc4ba800446df49051af0.bin (82)
/tmp/72bdd3bd37a0b5d1dd5f1be80cb29639.bin matches /tmp/13776c2b604290906305a56c4e7c61e5.bin (99)
/tmp/72bdd3bd37a0b5d1dd5f1be80cb29639.bin matches /tmp/5a8424f4e1504b5823ca8742e2b1ce8d.bin (82)
…
In the above, all of the files are undetected malware with wildly different md5 signatures. Yet ssdeep can relate them; the number in parentheses is the match score, where a higher score means more similar. A file that matches no other file can be assumed to be unique in your collection, and you should pay more attention to it. Moreover, even packed executables (tested on UPX) can still be matched: packers are just compressors, so similar code is compressed into a similar binary pattern.
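Turning such match lines into groups is a small bookkeeping exercise. The Python sketch below merges files that match each other, directly or through a chain of matches, using a union-find structure. The function name `cluster_matches`, the cut-off score of 80 and the short file names in the sample are assumptions made for this example; only the `A matches B (score)` line format comes from the output above.

```python
import re
from collections import defaultdict

MATCH_RE = re.compile(r"^(?P<a>.+) matches (?P<b>.+) \((?P<score>\d+)\)$")

def cluster_matches(lines, threshold=80):
    """Group files linked by matches scoring at least `threshold`."""
    parent = {}

    def find(x):
        # Follow parent links to the group's representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for line in lines:
        m = MATCH_RE.match(line.strip())
        if not m or int(m.group("score")) < threshold:
            continue
        root_a, root_b = find(m.group("a")), find(m.group("b"))
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for path in list(parent):
        clusters[find(path)].add(path)
    return sorted(clusters.values(), key=len, reverse=True)

# Hypothetical file names, same line format as the ssdeep output.
sample = [
    "/tmp/a.bin matches /tmp/b.bin (88)",
    "/tmp/a.bin matches /tmp/c.bin (99)",
    "/tmp/d.bin matches /tmp/e.bin (82)",
    "/tmp/d.bin matches /tmp/a.bin (40)",
]
clusters = cluster_matches(sample)
for group in clusters:
    print(sorted(group))
```

Files that never appear in a qualifying match line are not listed at all, so the unique samples are whatever remains in the directory after the clustered ones are subtracted.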
There are a few caveats. First, remember that ssdeep just computes mini-hashes of sections. If bytes vary a little throughout the file (through obfuscation such as inserted no-ops, e.g. a 1-byte change every 100 bytes), ssdeep will fail to identify the match. Second, when you are identifying botnet credentials, similar files could contain very different login credentials and be wrongly discarded because of their highly similar binary structure. However, you can analyze the access control logic through such duplicated samples and then generalize the login credentials.
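The first caveat is easy to reproduce with a toy model in Python. Again, this is not ssdeep itself: it hashes fixed 128-byte blocks, and the helper names, block size and stand-in data are assumptions. It only illustrates why one localized change leaves most section hashes intact while small changes spread through the file touch all of them.

```python
import hashlib

def block_digests(data, block_size=128):
    """Return the md5 of every fixed-size block of `data`."""
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def changed_blocks(a, b, block_size=128):
    """Count the positions where the block hashes of a and b differ."""
    return sum(x != y for x, y in zip(block_digests(a, block_size),
                                      block_digests(b, block_size)))

original = bytes(i % 251 for i in range(4096))  # stand-in for a real file

# Localized change: flip the very last byte.
local = bytearray(original)
local[-1] ^= 0xFF

# Scattered change: flip one byte every 100 bytes.
scattered = bytearray(original)
for offset in range(0, len(scattered), 100):
    scattered[offset] ^= 0xFF

total = len(block_digests(original))
print("localized:", changed_blocks(original, bytes(local)), "of", total)
print("scattered:", changed_blocks(original, bytes(scattered)), "of", total)
```

Because every 128-byte block contains at least one of the flipped bytes in the scattered case, no block hash survives, and a signature built from those hashes has nothing left to match on.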
With ssdeep, you can now sort duplicated undetected malware into groups for more efficient analysis.
===
ssdeep – http://ssdeep.sourceforge.net/
UPX – http://upx.sourceforge.net/