Part 1 Of Using Edit Distance For Detection – Charles Leaver

Written By Jesse Sampson And Presented By Charles Leaver CEO Ziften


Why do attackers use the same techniques over and over? The simple answer is that they still work. For instance, Cisco’s 2017 Cybersecurity Report tells us that after years of decline, spam email with malicious attachments is once again on the rise. In that conventional attack vector, malware authors usually conceal their activities by using a filename similar to that of a common system process.

There is not necessarily a relationship between a file’s path name and its contents: anyone who has tried to hide sensitive information by giving it a boring name like “taxes”, or changed the extension of a file attachment to get around email rules, knows this. Malware authors know this too, and will often name malware to resemble common system processes. For instance, “iexplore.exe” is Internet Explorer, but “iexplorer.exe” with an extra “r” could be anything. It’s easy even for experts to overlook this small difference.

The opposite problem, known .exe files running in unusual locations, is easy to solve using SQL sets and string functions.
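
To give a feel for the idea outside of SQL, here is a minimal Python sketch of the same set-and-string logic. The allowlist `EXPECTED_DIRS`, the helper `unusual_locations`, and the sample paths are all hypothetical and chosen only for illustration; they are not taken from Ziften's product or data.

```python
import ntpath  # Windows path handling that works on any platform

# Hypothetical allowlist: executable name -> directories where it is expected.
EXPECTED_DIRS = {
    "svchost.exe": {r"c:\windows\system32", r"c:\windows\syswow64"},
    "explorer.exe": {r"c:\windows"},
}


def unusual_locations(paths):
    """Return paths whose file name is known but whose directory is not expected."""
    flagged = []
    for path in paths:
        # Lowercase first so the comparison is case-insensitive, as on Windows.
        directory, name = ntpath.split(path.lower())
        expected = EXPECTED_DIRS.get(name)
        if expected is not None and directory not in expected:
            flagged.append(path)
    return flagged


# Illustrative sample paths, not real observations.
observed = [
    r"C:\Windows\System32\svchost.exe",
    r"C:\Users\Public\svchost.exe",
    r"C:\Windows\explorer.exe",
]
print(unusual_locations(observed))
```

Names that are not in the allowlist are simply ignored here; catching near-miss names is the separate problem discussed next.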


What about the other case, finding near matches to the executable name? Most people start their search for near string matches by sorting data and visually scanning for discrepancies. This usually works well for a small data set, perhaps even a single system. Finding these patterns at scale, however, requires an algorithmic approach. One established technique for “fuzzy matching” is to use edit distance: the number of single-character edits needed to turn one string into another.
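
For readers who have not met edit distance before, the following is a small Python sketch of one common variant, Levenshtein distance, which counts insertions, deletions, and substitutions. This is an assumption on our part for illustration: the post does not say which variant is used in production, and the function name `edit_distance` is ours.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,          # delete ca
                current[j - 1] + 1,       # insert cb
                previous[j - 1] + cost,   # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(edit_distance("iexplorer.exe", "iexplore.exe"))
print(edit_distance("cvshost.exe", "svchost.exe"))
```

The dynamic-programming table is kept one row at a time, so memory grows with the length of the shorter loop only, which matters when comparing many paths.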

What is the best way to calculate edit distance? For Ziften, our technology stack includes HP Vertica, which makes this task easy. The internet is full of data scientists and data engineers singing Vertica’s praises, so it will suffice to say that Vertica makes it easy to build custom functions that take full advantage of its power, from C++ power tools to statistical modeling scalpels in R and Java.

This Git repo is maintained by Vertica enthusiasts working in industry. It’s not an official offering, but the Vertica team is certainly aware of it, and moreover is thinking every day about how to make Vertica better for data scientists, which makes it a good space to watch. Best of all, it includes a function to calculate edit distance! There are also other tools here for natural language processing, such as word stemmers and tokenizers.

By using edit distance on the top executable paths, we can quickly find the closest match to each of our top hits. This is an interesting data set because we can sort by distance to find the closest matches across the whole data set, or sort by frequency of the top path to see the nearest match to our most commonly used processes. This data can also be surfaced on contextual “report card” pages to show, e.g., the top five closest strings for a given path. Below is a toy example to give a sense of usage, based on real data ZiftenLabs observed in a customer environment.


Setting a threshold of 0.2 seems to give good results in our experience, but the point is that these can be adjusted to fit individual use cases. Did we find any malware? We notice that “teamviewer_.exe” (should be just “teamviewer.exe”), “iexplorer.exe” (should be “iexplore.exe”), and “cvshost.exe” (should be “svchost.exe”, unless perhaps you work for CVS pharmacy…) all look odd. Since we’re already in our database, it’s also trivial to pull the associated MD5 hashes, Ziften suspicion scores, and other attributes for a deeper dive.
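
As a rough illustration of how such a threshold might be applied, the Python sketch below normalizes Levenshtein distance by the length of the longer string and keeps only inexact matches at or below 0.2. The normalization, the helper names (`levenshtein`, `normalized_distance`, `near_matches`), and the short list of known names are assumptions made for this example; the post does not spell out how the 0.2 score is computed.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        new_row = [i]
        for j, cb in enumerate(b, 1):
            new_row.append(min(row[j] + 1,                # deletion
                               new_row[j - 1] + 1,        # insertion
                               row[j - 1] + (ca != cb)))  # substitution or match
        row = new_row
    return row[-1]


def normalized_distance(a, b):
    """Edit distance scaled to 0..1 by the longer string (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def near_matches(candidates, known_names, threshold=0.2):
    """Pair each candidate with its closest known name; keep close but inexact hits."""
    hits = []
    for name in candidates:
        best = min(known_names, key=lambda k: normalized_distance(name, k))
        score = normalized_distance(name, best)
        if 0 < score <= threshold:  # a score of 0 is an exact match, so not suspicious
            hits.append((name, best, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical inputs built from the names mentioned in this post.
known = ["teamviewer.exe", "iexplore.exe", "svchost.exe"]
seen = ["teamviewer_.exe", "iexplorer.exe", "cvshost.exe", "svchost.exe", "notepad.exe"]
for name, closest, score in near_matches(seen, known):
    print(name, "->", closest, score)
```

Dividing by the longer length keeps a one-character change in a long name from scoring the same as a one-character change in a very short one, which is one reason a relative threshold can be easier to tune than a raw edit count.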


In this particular real-life environment, it turned out that teamviewer_.exe and iexplorer.exe were portable applications, not known malware. We helped the customer investigate the user and system where we observed the portable applications, since use of portable apps on a USB drive could be evidence of improper activity. The more troubling find was cvshost.exe. Ziften’s intelligence feeds indicate that this is a suspect file. Looking up the MD5 hash for this file on VirusTotal confirms the Ziften data, indicating that this is a potentially serious Trojan that may be part of a botnet or doing something even more malicious. Once the malware was found, however, it was easy to resolve the problem and make sure it stays resolved using Ziften’s ability to kill and persistently block processes by MD5 hash.

Even as we develop advanced predictive analytics to detect malicious patterns, it is important that we continue to improve our ability to hunt for known patterns and old tricks. Just because new threats emerge doesn’t mean the old ones go away!

If you enjoyed this post, keep watching this space for part 2 of this series, where we will apply this approach to hostnames to detect malware droppers and other malicious sites.
