Written By Jesse Sampson And Presented By Charles Leaver CEO Ziften
In the very first about edit distance, we looked at searching for harmful executables with edit distance (i.e., how many character modifications it takes to make two matching text strings). Now let’s take a look at how we can use edit distance to look for malicious domains, and how we can build edit distance functions that can be integrated with other domain name features to pinpoint suspect activity.
Here is the Background
What are bad actors doing with harmful domains? It might be merely using a close spelling of a typical domain name to fool negligent users into looking at ads or getting adware. Legitimate websites are slowly catching onto this technique, often called typo squatting.
Other harmful domain names are the result of domain generation algorithms, which could be used to do all types of dubious things like avert countermeasures that obstruct recognized compromised websites, or overwhelm domain servers in a dispersed DoS attack. Older variations use randomly generated strings, while further advanced ones add techniques like injecting typical words, further confusing protectors.
Edit distance can assist with both use cases: here we will find out how. First, we’ll leave out typical domains, because these are usually safe. And, a list of regular domains supplies a standard for discovering abnormalities. One great source is Quantcast. For this discussion, we will stick to domain names and prevent subdomains (e.g. ziften.com, not www.ziften.com).
After data cleaning, we compare each candidate domain (input data observed in the wild by Ziften) to its prospective neighbors in the exact same top-level domain (the last part of a domain name – classically.com,. org, and so on but now can be nearly anything). The basic task is to discover the nearest neighbor in regards to edit distance. By finding domains that are one step removed from their closest neighbor, we can easily identify typo-ed domain names. By discovering domains far from their next-door neighbor (the stabilized edit distance we introduced in Part 1 is useful here), we can also discover anomalous domain names in the edit distance area.
What were the Outcomes?
Let’s take a look at how these outcomes appear in reality. Use caution when browsing to these domain names considering that they might contain harmful content!
Here are a few potential typos. Typo squatters target well known domains given that there are more possibilities somebody will visit. Several of these are suspicious in accordance with our risk feed partners, however there are some false positives as well with charming names like “wikipedal”.
Here are some strange looking domain names far from their next-door neighbors.
So now we have produced 2 useful edit distance metrics for hunting. Not just that, we have 3 functions to possibly add to a machine-learning design: rank of nearest neighbor, range from neighbor, and edit distance 1 from next-door neighbor, showing a threat of typo shenanigans. Other functions that could be used well with these include other lexical functions like word and n-gram distributions, entropy, and the length of the string – and network functions like the total count of failed DNS requests.
Streamlined Code that you can Play Around with
Here is a streamlined variation of the code to play with! Created on HP Vertica, but this SQL should run with the majority of innovative databases. Note the Vertica editDistance function might differ in other implementations (e.g. levenshtein in Postgres or UTL_MATCH. EDIT_DISTANCE in Oracle).