Malware classification found a new ally in machine learning

From agriculture and banking to telecommunications and healthcare, many industries are finding that machine learning (ML) is a powerful ally in their daily activities and beyond. So it was only a matter of time before cybersecurity companies turned to this potent subset of artificial intelligence (AI) to enhance their security solutions.

Why wouldn’t they? Hackers are already using it, so it was a natural move. Besides, ML applications can process vast amounts of data, learn to detect suspicious patterns, anticipate potential attacks, and get better with time. Read on to learn more about why cybersecurity companies are teaming up with offshore software development outsourcing companies to revamp their security suites and their malware modules.

How malware recognition modules work

Most malware detection modules included in the popular cybersecurity suites on the market identify a threat based on the data they collect about it. The data can be gathered in two main stages:

  • Pre-execution stage: the suspicious file hasn’t been executed yet, but malware modules can already collect data on it that might lead them to flag it as a threat. From file format and code descriptions to text strings and information gathered through code emulation, any file provides a lot of indicators when it is analyzed by a malware module.
  • Post-execution stage: once a file is executed, its behavior and everything it tries to do are examined to decide whether it should be considered a threat. Depending on the events its processes trigger in the operating system, many activities can be suspicious, like altering files or blocking certain features from working.
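As a rough illustration of the pre-execution stage, the sketch below pulls a few static indicators out of a file’s raw bytes without running it. The feature names, the URL check, and the sample bytes are assumptions made for this example, not what any particular security suite collects.

```python
import math
import re
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte (0 to 8); values close to 8 can hint at
    packed or encrypted content."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Runs of printable ASCII at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


def static_features(data: bytes) -> dict:
    """Collect a few indicators without executing the file (hypothetical set)."""
    strings = printable_strings(data)
    return {
        "size": len(data),
        "entropy": shannon_entropy(data),
        "has_mz_header": data[:2] == b"MZ",  # Windows executables start with "MZ"
        "num_strings": len(strings),
        "mentions_url": any("http://" in s or "https://" in s for s in strings),
    }


# Made-up bytes standing in for a suspicious file.
sample = b"MZ\x00\x01\x02http://example.test/payload\x00\xff\xfeabc\x00"
features = static_features(sample)
```

A real module would extract far more than this, but the resulting dictionary is the kind of input a classifier could be trained on.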

Since malware has evolved a lot in recent years, creating rules by hand or gathering information from users to update databases isn’t efficient anymore. New threats are getting stronger: they can be polymorphic, change their behavior on the fly, and use highly sophisticated techniques to camouflage themselves. That’s precisely what made the field ripe for using machine learning in cybersecurity systems.

Machine learning basics

To understand how ML can help in malware classification, it’s important to grasp a couple of basic definitions first. The most obvious one is what we mean by ML: a set of algorithms capable of processing large data sets, analyzing them, and “understanding” the identifiable patterns, so that they learn to spot those patterns in new samples over time. The best thing about this is that, once the algorithms have been trained initially, they can keep improving as they are fed new data.

Then, there are the ML models, that is, the approaches that define how an algorithm learns from data. In layman’s terms, it’s how they are programmed to act. This article groups them into 3 major models:

  • Unsupervised learning: as its name implies, this model is fed data and left alone to figure out what it all means. The algorithms using this model have to discover the structure of the data and the rules that govern it on their own.
  • Supervised learning: in this model, the algorithm isn’t just fed the data but is also told what to do with it. This is done through a two-step process. First, the model is trained with known data sets and rules. Then, the model is fed new data samples and left to apply what it learned in the training stage. The results are used to refine the model, which is then trained again with new data samples.
  • Deep learning: this is the most complex model, as it uses an approach loosely inspired by how the human mind works. Deep learning infers valuable abstractions from low-level data, and for this reason, it is used for speech recognition, natural language processing, and computer vision.
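To make the difference between the first two models concrete, here is a minimal sketch in plain Python. It uses a nearest-centroid classifier for the supervised case and a simple greedy grouping for the unsupervised case; both techniques, as well as the two-number feature vectors (entropy, count of suspicious strings) and the labels, are assumptions chosen for illustration only.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def train_nearest_centroid(samples, labels):
    """Supervised, step 1: learn one centroid per known label."""
    by_label = {}
    for features, label in zip(samples, labels):
        by_label.setdefault(label, []).append(features)
    return {label: centroid(group) for label, group in by_label.items()}


def predict(model, features):
    """Supervised, step 2: give a new sample the label of the closest centroid."""
    return min(model, key=lambda label: distance(model[label], features))


def cluster(samples, radius):
    """Unsupervised: no labels. A sample joins the first cluster whose first
    member lies within `radius`; otherwise it starts a new cluster."""
    clusters = []
    for features in samples:
        for members in clusters:
            if distance(members[0], features) <= radius:
                members.append(features)
                break
        else:
            clusters.append([features])
    return clusters


# Hypothetical feature vectors: [entropy, number of suspicious strings]
training = [[7.8, 12], [7.5, 9], [4.1, 1], [3.9, 0]]
labels = ["malicious", "malicious", "benign", "benign"]

model = train_nearest_centroid(training, labels)
verdict = predict(model, [7.2, 8])
groups = cluster(training, radius=4.0)
```

The supervised path needs the `labels` list to learn anything, while `cluster` only ever sees the raw vectors and has to find the groups by itself.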

Finally, there are two other things needed to understand how cybersecurity solutions tackle the problem. The first one is that security suites try to detect as many malicious files and potential threats as possible during the pre-execution stage. Though they don’t succeed 100% of the time, it’s important that detection happens before the threat can infect or damage the system.

The second one is that security tools use a combination of the three models and of the data collected in the two main stages. Doing that enhances the final product: all the known threats are fed into the algorithms to refine the supervised learning models, while the unsupervised and deep learning models try to cover unknown or rare threats.

How machine learning detects malware

The above indicates that detecting malware is a highly challenging task. Combining the different approaches can be quite tricky, especially because doing so requires constant refining, adaptation, and feedback. This doesn’t mean it’s impossible, though. Kaspersky’s work with ML is a great example of how this technology can be used for malware detection. Let’s see what they do.

The first thing used in their products is the detection of new malware in the pre-execution stage through similarity hashing. The decision to do so had to do with the spread of polymorphism in malware. Through this technique, malicious files gained the ability to morph with each new infection. By slightly changing with each attack, they avoided detection, as antivirus software included similar files in their databases but not precisely those files – until they were manually included in the database, that is.

In light of that, ML came to the rescue, combined with hash functions known as locality-sensitive hashes (LSH). The LSH values of almost identical files (such as the variants produced by polymorphic malware) are very similar, so those files can be classified as part of the same group (here called a binary bucket). The LSH values can then be fed to an unsupervised model, which defines the different binary buckets on its own.
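The idea can be sketched with SimHash, one well-known locality-sensitive scheme. This is not a description of Kaspersky’s actual hash function, which the article doesn’t detail; the n-gram size, the 64-bit width, the distance threshold, and the greedy bucketing are all assumptions for illustration, and the “files” are pseudo-random bytes generated on the spot.

```python
import hashlib


def simhash(data: bytes, ngram: int = 4, bits: int = 64) -> int:
    """SimHash fingerprint over overlapping byte n-grams: similar inputs
    tend to produce fingerprints that differ in only a few bits."""
    weights = [0] * bits
    for i in range(max(len(data) - ngram + 1, 1)):
        chunk = data[i:i + ngram]
        digest = hashlib.blake2b(chunk, digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def bucket(fingerprints: dict, max_distance: int = 12) -> list:
    """Greedy grouping: a file joins the first bucket whose representative
    fingerprint is within max_distance bits, else it opens a new bucket."""
    buckets = []  # each entry: (representative fingerprint, [file names])
    for name, fingerprint in fingerprints.items():
        for representative, members in buckets:
            if hamming(representative, fingerprint) <= max_distance:
                members.append(name)
                break
        else:
            buckets.append((fingerprint, [name]))
    return [members for _, members in buckets]


# Stand-ins for real files: 2048 pseudo-random bytes each.
base = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
variant = bytearray(base)
variant[100] ^= 0xFF   # a "polymorphic" copy: two bytes changed
variant[1500] ^= 0xFF
unrelated = b"".join(hashlib.sha256(bytes([i, 1])).digest() for i in range(64))

fingerprints = {
    "base": simhash(base),
    "variant": simhash(bytes(variant)),
    "unrelated": simhash(unrelated),
}
groups = bucket(fingerprints)
```

With inputs like these, the variant is expected to sit only a few bits away from the original and share its bucket, while the unrelated file should differ in roughly half of its bits and open a bucket of its own. A cryptographic hash such as SHA-256 would behave in the opposite way, changing completely after a one-byte edit.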

Combining this with a supervised model, built on a similarity hashing approach and trained to tell similar objects from non-similar ones, makes the whole security software stronger. That’s because, on one hand, it can detect threats through LSH comparison and deduction, while on the other, it can identify known threats through the rules derived from the supervised model.

This can be done through a two-stage analysis that’s carried out in the pre-execution phase. First, there’s a pre-detect stage that allows a file to be analyzed without much impact on the system’s computational resources. In other words, the ML algorithm doesn’t “unpack” the file but rather analyzes it “from the outside.” This examination, which uses similarity hash mapping, is frequently enough to decide whether a file is benign or not. If pre-detection is inconclusive, the software moves to the second stage, where more features are analyzed in what can be seen as a traditional detection stage.
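The control flow of such a two-stage check can be sketched as follows. The fingerprints are plain integers, and the function names, the distance threshold, and the `full_analysis` callback are hypothetical; the point is only that the expensive stage runs when the cheap one can’t decide.

```python
from typing import Callable, Optional, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def pre_detect(fingerprint: int, known: dict, max_distance: int) -> Optional[str]:
    """Cheap stage: return the verdict of the closest known fingerprint
    if it is close enough, otherwise None (inconclusive)."""
    closest = min(known, key=lambda k: hamming(k, fingerprint), default=None)
    if closest is not None and hamming(closest, fingerprint) <= max_distance:
        return known[closest]
    return None


def classify(
    fingerprint: int,
    known: dict,
    full_analysis: Callable[[], str],
    max_distance: int = 1,
) -> Tuple[str, str]:
    """Run the cheap stage first; call the expensive one only when needed."""
    verdict = pre_detect(fingerprint, known, max_distance)
    if verdict is not None:
        return verdict, "pre-detect"
    return full_analysis(), "full analysis"


# Hypothetical 8-bit fingerprints with known verdicts.
known = {0b11110000: "malicious", 0b00001111: "benign"}
calls = []


def full_analysis() -> str:
    """Placeholder for the heavier, feature-rich second stage."""
    calls.append("ran")
    return "benign"


near_match = classify(0b11110001, known, full_analysis)
no_match = classify(0b10101010, known, full_analysis)
```

Passing the second stage as a callback keeps it lazy: in this example it runs once, for the fingerprint that isn’t close to anything already known.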

All of that is elevated by deep learning, which is used for rare attacks or unknown threats. These occur more frequently in enterprise contexts in the form of high-profile targeted attacks. Since they are so rare and so unique, these attacks need to be tackled in a different way. Using the various layers of a deep-learning-based model, a security solution can “reason” about whether a certain file or group of files is potentially dangerous by considering the discriminative features of several types of known malware.

Since deep learning can infer certain conclusions from scarce data, it can also be used during the post-execution stage through behavior detection. This becomes essential because not every piece of malware can be identified in the pre-execution stage. By analyzing the unstructured data coming from the logs of security solutions, providers can use deep learning to quickly detect behavioral patterns that point to malicious activity. Then, a real-time update can be issued through cloud computing, making the solution available to all users in the shortest time possible.

A challenging future

Machine learning is an exciting technology that can do wonders for security experts. It can bring the power of its three models to create stronger security suites that can act before a malicious file is executed and pinpoint the dangers with high accuracy.

Unfortunately, the use of ML by hackers poses a bigger threat for the near future, as sophisticated algorithms with opposing objectives will be pitted against each other. A variation of generative adversarial networks could be beneficial here, as the race for cybersecurity in the times of AI rages on.

