NorthSec 2025 - François Labrèche - How not to do ML

Machine learning has been used extensively for the prediction of cyber security threats for a number of years. More specifically, building predictive models for the exploitation of security vulnerabilities and the publication of vulnerability exploits is essential in anticipating threats in the cyber security landscape.

Many published approaches train ML models using publicly available data, be it online discussions or vulnerability details available through the publication of CVEs. Unfortunately, many challenges arise when encoding this data to predict exploitation. More importantly, many of these challenges do not affect the model's performance on historical data, but instead result in poor performance when the model is used live in a real environment.

In this talk, we will demonstrate our implementation and deployment of several of these methods. We show that these models perform worse in a live environment than their historical evaluation suggests. Vulnerability and threat information evolves over time, and is often not available on the day of a vulnerability's publication. We identify four incorrect ways to encode and evaluate features for the prediction of exploits, each of which causes the model to mispredict exploits when used in a day-to-day live system.
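As a rough illustration of the point about information arriving after publication, the sketch below contrasts encoding features from only what was known on a given day with encoding them from the latest snapshot of a record. It is not taken from the talk: the `Observation` type, the `encode_as_of` and `encode_latest` helpers, the field names, and the dates are all hypothetical, chosen only to show how a historical evaluation can see values a live model would not yet have.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One piece of information about a vulnerability and the day it became known."""
    field_name: str  # hypothetical feature name
    value: float
    seen_on: date


def encode_as_of(observations, as_of):
    """Build a feature dict using only observations known on or before `as_of`."""
    features = {}
    for obs in sorted(observations, key=lambda o: o.seen_on):
        if obs.seen_on <= as_of:
            # A later observation of the same field replaces the earlier one.
            features[obs.field_name] = obs.value
    return features


def encode_latest(observations):
    """Leaky variant: uses everything ever recorded, regardless of when."""
    return encode_as_of(observations, date.max)


# Hypothetical history of one vulnerability published on 2024-03-01.
published = date(2024, 3, 1)
history = [
    Observation("reference_count", 1.0, date(2024, 3, 1)),
    Observation("cvss_base_score", 9.8, date(2024, 3, 9)),
    Observation("reference_count", 7.0, date(2024, 3, 20)),
]

live_view = encode_as_of(history, published)  # what a live model sees on day 0
leaky_view = encode_latest(history)           # what a naive historical dataset sees

print(live_view)
print(leaky_view)
```

In this toy case the live view has a single reference and no score yet, while the leaky view has both a score and a larger reference count. A model trained and evaluated on the leaky view would be tested against inputs it never receives in production.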

Ultimately, we show how a model with lower performance on its historical evaluation can better predict the publication of exploits in a live setting when its features are encoded correctly.
