Effective Machine Learning Practices Part II – A Data Scientist's Perspective
At Star Lab, we’ve seen a recent increase in the number of defense industry research projects calling for the use of Machine Learning (ML) in innovative and unique ways. As we continue to work on these projects, we have noticed the need for better collaboration between software developers and data scientists. Better communication and a comprehensive understanding of what the other field needs, and why, have proven necessary for efficient and effective incorporation of ML elements into traditional defense industry projects.
This blog captures lessons learned from each of the two perspectives. While Part I focused on the software developer—from basic ML topics that should be kept in mind to more specific “gotcha” moments that can become more time consuming than anticipated—Part II discusses the data scientist’s perspective and key takeaways.
Models from 30,000 Feet
At a high level, most machine learning (ML) models, regardless of industry or application, take data and return a prediction. Most models fall into two categories: regression models and classification models. Regression models return a single value as a prediction. For example, Zillow uses square footage, zip code, number of bedrooms, and more to predict the value of a house. Classification models return a set of probabilities, one for each class. For example, detecting the difference between anomalous and normal network behavior is a two-class problem. The nature of the models and the predictions they return is determined by the data on which the model was originally trained. All models use statistics to identify patterns and correlations between the data they are trained on (training data) and the corresponding labels or values (training labels). They then apply that logic to new, unseen data (test data) to predict new labels or values. No magic involved.
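To make the distinction concrete, here is a minimal sketch using scikit-learn on synthetic data. The features, labels, and linear relationship are invented purely for illustration, not drawn from any real project.

```python
# A toy contrast between regression and classification with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Regression: predict a single continuous value (e.g., a house price).
X_reg = rng.normal(size=(100, 3))          # 100 samples, 3 features
y_reg = X_reg @ [50.0, 20.0, 5.0] + 300.0  # hypothetical price relationship
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:1]))              # -> one value per sample

# Classification: predict a probability per class (e.g., normal vs. anomalous).
X_clf = rng.normal(size=(100, 3))
y_clf = (X_clf[:, 0] > 0).astype(int)      # hypothetical binary label
clf = LogisticRegression().fit(X_clf, y_clf)
print(clf.predict_proba(X_clf[:1]))        # -> one probability per class
```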
The Data (Part II)
Most models can accept only numerical data (or the numerical representation of Boolean data). Text is generally translated into numerical embeddings or encodings. There are four types of data:
Nominal: mutually exclusive labels like male or female. They are different, and one label is not bigger or better than another.
Ordinal: labels where the order matters, but the difference between labels is unknown. This is often found in survey data. Consider, for example, respondents asked to rate their satisfaction from one to five. Say that two is somewhat dissatisfied, three is neutral, and four is somewhat satisfied. The order is known and may be valuable information, but the difference between two and three may not be the same as the difference between three and four. Furthermore, assuming that four is twice as good as two is a risky leap in logic.
Interval: numerical scales where order and the difference between values are known, but division is not possible. For example, the differences between 60 and 70 degrees and between 70 and 80 degrees Fahrenheit are meaningful and consistent, but 40 degrees Fahrenheit is not half of 80 degrees Fahrenheit.
Ratio: numerical scales where division is possible. Converting the interval data in the previous example to ratio data would entail using the Kelvin scale instead of Fahrenheit. Division between values is possible because 0 kelvin is absolute zero.
Models cannot reliably learn from nominal or ordinal data; the statistical assumptions most modeling frameworks apply are valid only for interval or ratio data. The solution is to convert nominal or ordinal variables into “dummy variables” by creating a new column of ones and zeros (a one-hot encoding) for each unique value. This solution can be problematic if there are many categories, as it can require many preprocessing steps and can quickly swell the size of a dataset. Consider the anomalous behavior detection scenario where a process involves touching a set of files. If there are hundreds or thousands of possible files, the dataset may need to grow by hundreds or thousands of columns. But it is not likely that all files will be connected to anomalous behavior, so some cuts can be made without harming modeling performance. These considerations are all core to the data scientist’s role, and it is helpful for engineers to understand these problems at a high level.
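As a rough illustration of the dummy-variable approach, the snippet below one-hot encodes a hypothetical nominal column with pandas. The column names and file values are placeholders, not part of any real pipeline.

```python
# Convert a nominal "file touched" feature into one 0/1 column per unique value.
import pandas as pd

df = pd.DataFrame({
    "bytes_written": [1024, 0, 4096],
    "file_touched":  ["passwd", "hosts", "passwd"],  # nominal: no order or magnitude
})

# One new 0/1 column per unique file; the original nominal column is dropped.
encoded = pd.get_dummies(df, columns=["file_touched"], prefix="file", dtype=int)
print(encoded.columns.tolist())
# ['bytes_written', 'file_hosts', 'file_passwd']
```

Note how even three rows and two distinct files already add columns; hundreds of possible files would swell the dataset accordingly.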
The Terminology and Schema
Ensuring a consistent and reliable schema to build models and a data pipeline is an essential responsibility that, in practice, is shared by both engineers and data scientists. Understanding how ML models work at a high level can help engineers work more seamlessly with data scientists to provide consistent and properly formatted data.
Most models take data in rows and are trained on many rows, or a matrix of data. Most data scientists would prefer both static (for training and analysis) and pipeline (for real-time predictions) data that can be formatted as a Python array, as Python has an extensive and user-friendly collection of ML tools. Generally, each row corresponds to one “sample,” or data point. Each column is a “feature,” or variable. In the anomalous behavior detection example, each feature might correspond to a sensor whose output can be used to detect a threat. Each sample might correspond to a point in time, and each column’s value for that row would correspond to each sensor’s output at that time. This is an increasingly common application; many defense-related ML projects involve some sort of anomaly detection. Engineers should also note that many data scientists consider it good practice to throw out samples with missing values for some features.
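A small sketch of that layout, with hypothetical sensor names and readings:

```python
# Rows are samples (points in time); columns are features (sensor outputs).
import numpy as np

feature_names = ["sensor_cpu", "sensor_disk_io", "sensor_net_conns"]

X = np.array([
    [0.42, 1200.0, 17.0],   # t = 0
    [0.39, 1150.0, 21.0],   # t = 1
    [0.95, 8900.0, 310.0],  # t = 2
])
print(X.shape)  # (3, 3): 3 samples by 3 features
```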
Models from 10,000 Feet
As explained previously, an ML model simply applies statistical assumptions to data to produce a prediction. But how does it do that? All models apply a weight (or combination of weights) to each feature within each data point and perform some calculations with the data and weights to make a prediction. Some model architectures don’t literally use weights. They may involve logical rules or more abstract statistical analyses, but all have some sort of abstraction of weights. To “train” a model, data scientists take a dataset (training data) and its corresponding labels (training labels) and then optimize the weights to minimize the errors the model makes when making predictions on the training data against the training labels.
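The toy example below shows what that training step looks like in practice with scikit-learn. The synthetic data and labeling rule are assumptions made purely for illustration; the learned coefficients stand in for the “weights” described above.

```python
# "Training" = optimizing weights against training data and labels,
# then applying them to unseen test data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # training data: 500 samples, 4 features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # training labels (hypothetical rule)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # optimize the weights
print(model.coef_)                                  # one learned weight per feature
print(model.score(X_test, y_test))                  # accuracy on unseen data
```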
Common Data Issues
Understanding the concepts of weights and the training process brings to light four common data and modeling issues that the data scientist and engineer will have to work together to avoid. Consider the anomaly detection example where each sample is a point in time and features correspond to sensor output data.
1) Feature Mismatch
If the order of features is not always consistent within the data, the model will attempt to learn or apply the wrong weights to each feature. For example, if Sensor X’s output is mapped to the 20th column in the training dataset, it always needs to be the 20th column in every iteration. If Sensor X’s data is somehow shifted to column 19 in the pipeline, the model will attempt to apply the wrong weights to that data, disrupting its calculations and possibly leading to incorrect predictions. Consider the context: with a pipeline of hundreds or thousands of features, thousands of samples per second, and many preprocessing steps, ensuring consistent mapping of features can be a significant engineering challenge.
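One way to guard against this, sketched below with pandas, is to persist the training-time column order and reorder every pipeline batch against it before prediction. The column names and validation logic here are hypothetical.

```python
# Pin column order so pipeline data always matches the training schema.
import pandas as pd

# Saved once at training time (e.g., alongside the serialized model).
TRAINING_COLUMNS = ["sensor_a", "sensor_b", "sensor_x"]

def to_model_input(batch: pd.DataFrame) -> pd.DataFrame:
    """Reorder (and validate) incoming pipeline data against the training schema."""
    missing = set(TRAINING_COLUMNS) - set(batch.columns)
    if missing:
        raise ValueError(f"pipeline batch is missing features: {missing}")
    # reindex guarantees each sensor stays in the column position it held in training
    return batch.reindex(columns=TRAINING_COLUMNS)

batch = pd.DataFrame({"sensor_x": [3.1], "sensor_a": [0.2], "sensor_b": [7.0]})
print(to_model_input(batch).columns.tolist())  # ['sensor_a', 'sensor_b', 'sensor_x']
```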
2) Label and Data Mismatch
Consider how a model learns: it finds patterns in the training data and then attaches weights based on the corresponding training labels. A surefire way to build a weak model is to have a mismatch between the training data and the training labels. This may sound obvious. But, for example, if anomalous behavior unfolds over time and spans many rows of data, there may be some rows of data with an anomalous behavior training label that don’t contain a pattern that can or should be linked to anomalous behavior, or vice versa. If there is a mismatch between labels and the data, a model will attempt to learn and then apply incorrect weights because it cannot identify the correct patterns within the data to predict accurately. Limiting this mismatch between data and labels is easier said than done; it may require extensive preprocessing of the data before modeling to make features time-independent, as well as more sophisticated model architectures that are robust to noisy data. More broadly, it is important to consider what patterns are actually in the data that a model could identify and how they may or may not match up with the labels. For hundreds or thousands of features or sensors, this can be quite time-consuming.
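As one illustration of that kind of preprocessing, the sketch below aggregates consecutive rows into windows so each training sample carries enough context for its label to be meaningful. The window size, column names, and labeling rule are all assumptions made for the example.

```python
# Collapse consecutive rows into windows so labels better match the data they describe.
import pandas as pd

readings = pd.DataFrame({
    "sensor_a": [0.1, 0.2, 0.9, 1.1, 0.2, 0.1],
    "label":    [0, 0, 1, 1, 0, 0],   # the anomalous behavior spans several rows
})

# Every 3 consecutive rows become one sample; a window is labeled anomalous
# if any row inside it was labeled anomalous.
window_id = readings.index // 3
windows = readings.groupby(window_id).agg(
    sensor_a_mean=("sensor_a", "mean"),
    sensor_a_max=("sensor_a", "max"),
    label=("label", "max"),
)
print(windows)
```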
3) Incorrect Data Types
If a model attempts to apply weights to (or learn weights for) a nominal or ordinal variable, its underlying statistical assumptions are likely invalid. For example, if one sensor’s output refers to the file touched by an operation and is coded 1–10 for 10 different files, the model will attempt to learn that file 10 being touched is 10x bigger or better or more likely to indicate anomalous behavior than file 1, not that file 10 is simply different from file 1. The model is then attempting to learn from the possibly random order of the files’ numerical encodings rather than from any actual value in the data. If an operation touching file 5 is anomalous behavior, but touching files 4 and 6 is not, a model may not capture that pattern. One solution, as mentioned previously, is to add preprocessing steps that convert variables that are not interval or ratio data into one-hot encodings, one column for each possible category. Again, determining what preprocessing steps need to be taken to address these challenges requires a deep understanding of the data and the patterns each feature may contain that can indicate anomalous behavior.
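A minimal sketch of that preprocessing step uses scikit-learn’s OneHotEncoder inside a pipeline, so the file code is never treated as a magnitude. The column layout, file codes, and labels are hypothetical.

```python
# One-hot encode the nominal file-code column before it reaches the model.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Column 0: file code (nominal, 1-10); column 1: bytes written (ratio scale).
X = [[5, 1024.0], [4, 10.0], [6, 12.0], [5, 2048.0]]
y = [1, 0, 0, 1]  # 1 = anomalous (hypothetical labels)

preprocess = ColumnTransformer(
    [("file_code", OneHotEncoder(handle_unknown="ignore"), [0])],
    remainder="passthrough",   # leave the ratio-scale column untouched
)
model = Pipeline([("encode", preprocess), ("clf", LogisticRegression())]).fit(X, y)
print(model.predict([[5, 900.0]]))
```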
4) Boiling the Ocean
Assume a model has been trained to detect one type of anomalous activity reliably. Anomalous activity can take many forms, and this model may not recognize a different type of activity as anomalous. Partial solutions may include training the same model on multiple datasets with multiple variants of anomalous behavior, or training multiple models to each handle a different type of anomalous behavior. But a well-designed model will likely predict that a sample contains anomalous behavior if the patterns in that sample’s data are more similar to what the model has learned is anomalous behavior than to what it has learned is normal behavior. In this example, introducing the model to every conceivable type of anomalous behavior is not necessary. A few well-chosen models, trained on data that captures patterns common across most or all anomalous behavior, can likely detect anomalous behavior of many stripes without attempting to boil the ocean.
Training and tuning the models falls on the data scientist, but incorporating multiple models into a pipeline may require work from the software engineer. Both sides must thoroughly understand the data to select the optimal combinations of anomalous behavior to train on.
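One possible shape for that pipeline stage is sketched below: several trained detectors score each sample, and a sample is flagged if any of them crosses a threshold. The models, synthetic data, and decision rule are invented for illustration; a real system might weight or calibrate the votes differently.

```python
# Combine several detectors: flag a sample if any model scores it as anomalous.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))

# Two detectors, each trained on labels for a different variant of anomalous behavior.
model_a = LogisticRegression().fit(X, (X[:, 0] > 1).astype(int))
model_b = LogisticRegression().fit(X, (X[:, 1] > 1).astype(int))

def flag_anomalies(models, batch, threshold=0.5):
    """Flag a sample if any detector's anomalous-class probability exceeds the threshold."""
    scores = np.column_stack([m.predict_proba(batch)[:, 1] for m in models])
    return (scores >= threshold).any(axis=1)

print(flag_anomalies([model_a, model_b], X[:5]))
```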
Conclusion
ML models are garbage in, garbage out. Numerous potential issues with data, schemata, and methodology can cause models to be ineffective. Mitigating these risks involves clear communication between data scientists and engineers, from understanding and preprocessing the data to training models.
While these can be labor-intensive challenges, the payoffs of designing an effective ML model are immense. A solution that effectively detects a wide range of anomalous behavior is achievable, one that can protect mission-critical systems and reduce the need for ongoing, labor-intensive analysis. Well-tailored ML solutions can solve a host of potentially costly problems, so long as there’s good teamwork between the data scientists and engineers.