Effective Machine Learning Practices Part I – A Software Engineer’s Perspective

At Star Lab, we’ve seen a recent increase in the number of defense industry research projects calling for the use of Machine Learning (ML) in innovative and unique ways. As we continue to work on these projects, we have noticed the need for better collaboration between software developers and data scientists. Better communication and a comprehensive understanding of what the other field needs, and why, have proven necessary for efficiently and effectively incorporating ML elements into traditional defense industry projects.

This blog captures lessons learned from each of the two perspectives. Part I focuses on the software developer, from basic ML topics to keep in mind to more specific “gotcha” moments that can become more time-consuming than anticipated. Part II will discuss the data scientist’s perspective and key takeaways.

The Context


While incorporating ML components into our more traditional security-focused work, Star Lab has identified three main challenges that define the architecture of a defense industry ML research project.

  1. The project must be high impact. Given the importance of the data being analyzed and the implications of false predictions, the ML components must achieve the highest accuracy possible. For example, a false negative could result in a missed aircraft maintenance requirement or an undetected anomalous event.

  2. The architecture must have low latency. An accurate prediction becomes worthless if it is generated only after the event it is meant to mitigate has already occurred. If a project’s goal is to detect anomalous behavior on a network and prevent further harm by restricting the perpetrator’s access, the prediction must be made in a matter of seconds, or even fractions of a second; otherwise, the damage is already done.

  3. Defense industry research requires a level of specificity that only becomes clear once the project is actively under way. Take the previous example of anomalous network behavior, but now limit it to the anomalous behavior of a single node on the network. If that node is expected to receive only heartbeat signals from other nodes to determine infrastructure status, then the transfer of a file over SSH to or from the system is considered anomalous behavior.

The Environment


The ML infrastructure will be largely dictated by its environment. This includes anything from hardware limitations to the architecture of the nodes themselves to the network access available while generating predictions. If the project requires an ML framework like Google’s open-source TensorFlow, the available architecture and operating system requirements can limit the version of TensorFlow that can be used, which in turn limits the available features and even the accuracy of its documentation. In some cases, sufficiently old limitations may even require modifying the ML framework’s source code directly and building it locally for the target infrastructure. This, of course, falls solidly within the software engineer’s tasking, with some consultation with the data scientist about available tooling alternatives and preferences where needed.
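As a simple illustration, a runtime guard can make such a version constraint explicit. This is only a minimal sketch: SUPPORTED_TF is a hypothetical ceiling that would, in practice, come from the target operating system and hardware requirements, not from any value stated here.

# Minimal sketch: fail fast if the installed TensorFlow version exceeds what
# the target environment can support. SUPPORTED_TF is a hypothetical ceiling.
import tensorflow as tf
from packaging import version

SUPPORTED_TF = "2.4.0"  # assumed limit dictated by the deployment OS/architecture

if version.parse(tf.__version__) > version.parse(SUPPORTED_TF):
    raise RuntimeError(
        f"TensorFlow {tf.__version__} is newer than the supported "
        f"version {SUPPORTED_TF} for this environment."
    )

Catching a mismatch like this at startup is far cheaper than discovering it only after a model has been trained against features that do not exist in the deployed version of the framework.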

The Data


Data is the most important part of the ML pipeline. It defines which modeling algorithm to use and how the data should be processed, or in other words, reformatted and repositioned before being presented to the predictive model. Data representation is highly guided by the data science perspective but will rely on the software engineer’s data generation approach. Both perspectives are equally important for an effective model and require a good amount of cooperation between the software engineer and the data scientist.

From the software engineer’s perspective, the data should be easy enough to reformat that it can support multiple iterations of updates to the model and to how the data is represented. It also requires enough transparency for the data scientist to gather meaning from the data and make informed decisions about how the model should represent and interpret it. There is also the question of what type of data should be generated and how important each of its components is. Consider, for example, data on a file.

To determine the file’s importance level, one might initially want information on the file location, its required permissions, its last updated date, and its creation date. After some analysis, the creation date is determined to be irrelevant, so the architecture is updated to either exclude that information from data generation or remove it during a data processing step. However, an extra processing step could introduce more latency in the long run. This is where the ability to easily toggle data generation components is crucial. Now assume instead that the creation date is simply being represented improperly (more on how this can occur in Part II: The Data Scientist) and, with some tweaking, becomes crucial to the model. Those tweaks are informed by the data scientist and become infinitely easier when he or she can interpret the data in its raw as well as its processed form.
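One possible way to keep data generation flexible is a configuration object with per-attribute toggles, so a component like the creation date can be switched on or off without adding a downstream processing step. The sketch below is illustrative only; the class and field names are hypothetical and not taken from any particular project.

# Minimal sketch: toggleable file-metadata generation (names are hypothetical).
from dataclasses import dataclass
from pathlib import Path

@dataclass
class FileFeatureConfig:
    include_location: bool = True
    include_permissions: bool = True
    include_modified_date: bool = True
    include_creation_date: bool = False  # easy to re-enable if its representation is fixed

def generate_file_record(path: Path, cfg: FileFeatureConfig) -> dict:
    """Collect only the attributes the current model iteration needs."""
    stat = path.stat()
    record = {}
    if cfg.include_location:
        record["location"] = str(path.resolve())
    if cfg.include_permissions:
        record["permissions"] = oct(stat.st_mode & 0o777)
    if cfg.include_modified_date:
        record["modified"] = stat.st_mtime
    if cfg.include_creation_date:
        record["created"] = stat.st_ctime  # platform-dependent meaning
    return record

Keeping the raw record as a plain dictionary also preserves the transparency the data scientist needs to inspect values both before and after processing.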

The Pipeline


From the design perspective, pipelines come in two main forms: static and real-time. The static pipeline operates independently of when the data is generated: data is generated within the problem scope, stored offline, and pushed through a model. The main benefits are easily reproducible results and access to the full dataset for informed analysis and preprocessing, which makes it great for training models. Alternatively, real-time pipelines push data through a model as it is generated. This requires a well-defined data representation, including the handling of data outliers, and a pre-determined processing pipeline.

Ultimately, any defense industry project will likely require a combined use of both pipelines for separate purposes. The static pipeline is ideally suited to training the model and modifying data representations to achieve the highest possible accuracy. Once the model is generated, it is dropped into the real-time pipeline for the lowest possible latency from data generation to prediction. The biggest caveat to remember is that every processing step completed in the static pipeline has to be mimicked exactly in the real-time pipeline, or the predictions will be inaccurate. The best way to accomplish this is to leverage the same processing modules in both pipelines from the start of the project, as sketched below.
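One way to enforce that symmetry is to funnel both pipelines through a single preprocessing routine. The sketch below assumes hypothetical feature names and scaling constants (FEATURES, SCALES); the point is simply that the training code and the real-time code call the exact same function.

# Minimal sketch: one preprocessing routine shared by the static (training)
# and real-time (prediction) pipelines. FEATURES and SCALES are hypothetical.
import numpy as np

FEATURES = ["bytes_sent", "bytes_received", "interval_ms"]
SCALES = {"bytes_sent": 1e6, "bytes_received": 1e6, "interval_ms": 1e3}

def preprocess(record: dict) -> np.ndarray:
    """Apply the identical transformation in both pipelines."""
    return np.array([record[name] / SCALES[name] for name in FEATURES],
                    dtype=np.float32)

# Static pipeline: build the full training matrix from stored records.
#   train_matrix = np.stack([preprocess(r) for r in offline_records])
# Real-time pipeline: the same call on each incoming record before prediction.
#   prediction = model.predict(preprocess(incoming)[np.newaxis, :])

Because any change to the shared routine affects training and prediction alike, the two pipelines cannot silently drift apart.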

Takeaways

Work on an ML-related defense industry project requires the utmost cooperation between the software engineers and data scientists on the project. It is the software engineer’s job to remain well informed about the data scientist’s requirements and implement the data generation components within the technical restrictions of the project. Points to keep in mind include:

  • Project restrictions will inform what type and version of an ML framework to use.

  • Project context will definitely inform the model that is used and must be clearly relayed to the data scientist.

  • Data will always have to be modified according to model requirements.

  • Data transparency is crucial to making informed modifications.

  • The ML pipeline will be a combination of both static and real-time for different purposes.

In Part II: A Data Scientist’s Perspective, the lessons learned here will find their better half. Share this page with your data scientist friends and check back in two weeks for the full story including:

  • data schema and terminology

  • ML model calculations

  • training dos and don’ts