I attended a security conference recently. To my surprise, a great portion of the accepted papers use some kind of Machine Learning technique to achieve certain security targets. While I am NOT going to criticize any of the papers, this “Machine Learning” hotness did remind me of the old days, when I was not this old and spent years on AI and Machine Learning. In this post, I try to explain some things people should know before using this powerful tool. Please note that if you already have a solid background in Machine Learning, just stop here and open your Spotify!
0. What is Machine Learning (ML)?
The basic idea of ML is to construct a model from data. This model reveals the inner pattern (or rules, knowledge) of the process (or procedure, behavior) that generates the data. Usually, there are three tasks in ML – classification, regression, and clustering – which, though they differ in their targets, are closely connected. Early ML originated from AI, which focused on logic; Decision Trees (DT) and Artificial Neural Networks (ANN) are famous examples from that era. Modern ML enjoys the help of statistics and is also called statistical learning; Support Vector Machines (SVM) and Probabilistic Graphical Models (PGM) are typical examples. Deep Learning, which takes the study of ANN structure and Back Propagation (BP) to another level, became popular because of its success in image recognition.
1. Distinguish between ML research and ML applications
ML research tries to improve existing ML algorithms (in their computational requirements, parameter tuning, or structure learning) or to create new ones (a HARD problem). Theoretical work usually analyzes the mathematical bounds or properties of these algorithms.
ML applications usually refer to applying a certain ML algorithm to some data, in the hope of finding something interesting and useful. ML applications also involve parameter tuning once the model is determined.
2. Why does ML work?
Whatever ML algorithm is applied, it needs data, and there are two requirements for that data. First, the data should be independent and identically distributed (aka i.i.d.). Second, the data set should be large enough. Imagine a case where we are sampling an analog signal. We should guarantee that the sampling frequency is fixed, regardless of the previously sampled values. We should also be careful that we are not sampling the noise. Moreover, the sampling frequency should be high enough if we want our digital signal to get close enough to the original analog one.
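The sampling analogy can be made concrete with a few lines of plain Python (the 5 Hz/6 Hz numbers below are my own toy choice, not from any real system): when the sampling frequency is too low, the data simply cannot represent the process that produced it.

```python
import math

# A 5 Hz sine sampled at only 6 Hz (below the 10 Hz Nyquist rate)
# produces exactly the same samples as a 1 Hz sine of opposite sign:
# from the data alone, the two signals are indistinguishable.
fs = 6.0  # sampling frequency in Hz -- too low for a 5 Hz signal
for k in range(12):
    t = k / fs
    high = math.sin(2 * math.pi * 5 * t)    # the real 5 Hz signal
    alias = -math.sin(2 * math.pi * 1 * t)  # what the samples look like
    assert math.isclose(high, alias, abs_tol=1e-9)
print("a 5 Hz signal sampled at 6 Hz looks exactly like a 1 Hz signal")
```

The same failure mode hits ML data collection: too few (or badly collected) samples, and the model learns an alias of the real process.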
3. Why does ML not work?
When the data does not satisfy the previous requirements, ML may or may not work, depending on how lucky you are that day. In most cases we can still get good precision on the training data (we are really good at this!). However, when the system is put into practical use, it suddenly does not work that well. This is, and will always be, the Achilles' heel of ML – overfitting (or over-learning). Once it happens, it is telling you the truth: the data is either NOT i.i.d. or NOT enough.
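Overfitting is easy to demonstrate without any ML library. The sketch below (my own toy data: noisy samples of the line y = x) uses a polynomial that passes through every training point exactly – zero training error – yet misses badly between those points.

```python
def lagrange(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Noisy samples of the line y = x (noise values fixed by hand).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
noise = [0.3, -0.4, 0.5, -0.3, 0.4]
ys = [x + e for x, e in zip(xs, noise)]

# The degree-4 "model" hits every training point exactly ...
train_err = max(abs(lagrange(xs, ys, x) - y) for x, y in zip(xs, ys))

# ... but on unseen points from the same line, the error exceeds
# even the noise that was baked into the training labels.
test_xs = [0.5, 1.5, 2.5, 3.5]
test_err = max(abs(lagrange(xs, ys, x) - x) for x in test_xs)
print(train_err, test_err)  # train_err is ~0, test_err is not
```

The model memorized the noise instead of the process – exactly the "great on training data, bad in practice" symptom described above.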
4. Which ML algorithm to use?
I do believe that ML is actually Human Learning. When you are clear about your target (the ML task and the things you want to do), it is very likely that you already have some feeling for what kind of data you need – feature selection! Once you start to collect different data (features), you start to feel how these data could work together to generate the result. Then you apply a certain ML algorithm that looks reasonable given your imagination of the whole process (behavior). If it works, then congratulations – your imagination fits the facts. If not, try tuning the parameters of this algorithm first. If that does not work, try some other ML algorithms. There should exist an algorithm better than the others, and if the precision is still not acceptable, think about the data again – does it satisfy the two requirements?
Actually, most ML algorithms are designed without caring about the data. This means that as long as you have the “perfect” data, it does not matter THAT much which algorithm you choose. It is the same as in programming, where the data structure determines the algorithm. In ML, data is king. The reason to learn different kinds of ML algorithms/models is to understand the pattern (rules/knowledge) behind the process (behavior) you are trying to learn.
However, this does NOT mean that different ML algorithms would perform the same given the same data set. Eventually, different ML algorithms have different use cases. As long as the dimension of the data (the number of features) is not that large, we can always start with DT, which may be the oldest technique but is still useful. SVM with kernel methods is designed for cases where you do not have enough data and the correlations among features are not linear. Both ANN and PGM have a network structure, which could be suitable for modeling time-driven or event-driven systems.
Unfortunately, most people, including myself, would just try different ML algorithms on the same data set and pick the one with the best precision. This assumption – that precision determines the quality of the ML system in general – was meaningful when precision was under 80%. When there are two ML systems with precisions of 91% and 92%, the assumption is not valid anymore. As there is no metric (yet) for generality, the only way to tell which ML algorithm/model is better is to put the ML system BACK into the real system, run it for a while, and compare the precision again.
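One way to see why a one-point precision difference means little is to evaluate the same model on several held-out folds. The sketch below is entirely my own toy construction (20 points, two hand-placed label errors, a fixed threshold rule standing in for a trained model); the fold-to-fold spread dwarfs the 91%-vs-92% gap discussed above.

```python
# Toy data: one feature, true rule "label 1 iff x >= 10",
# with two hand-placed label errors (indices 3 and 12).
data = [(x, int(x >= 10)) for x in range(20)]
data[3] = (3, 1)    # noisy label
data[12] = (12, 0)  # noisy label

def predict(x):
    """A fixed threshold rule, standing in for a trained model."""
    return int(x >= 10)

# 4-fold evaluation: each block of 5 points is held out in turn.
fold_accs = []
for f in range(4):
    fold = data[f * 5:(f + 1) * 5]
    acc = sum(predict(x) == y for x, y in fold) / len(fold)
    fold_accs.append(acc)

print(fold_accs)  # -> [0.8, 1.0, 0.8, 1.0]
```

A 20-point swing between folds makes it clear that a single validation number cannot separate two systems that differ by 1%.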
5. A general procedure for conducting ML applications
Before anything starts, please think again about the target you want to achieve. Specifically, please ask the following questions again. The only thing we do NOT need to consider at this stage is performance!
What is the target?
Why do I need ML?
What kind of features do I need?
What kind of ML algorithm/model am I gonna use?
What results would I expect?
A. Collect the real data!
Unless you are working on ML research, collect as much real data as possible. By real data, I mean NOT simulation data, which should be used to evaluate the ML algorithm rather than to construct the ML system/model. All data – training, testing, validation, and future use – should come from the same place, under the same environment.
B. Filter the noise.
Unless you are trying to find outliers, filter the noise. This is especially important when the size of the data set is limited (yes, sometimes we are not able to get enough data, e.g., from a survey), since every single sample can easily impact the model. For example, we can use clustering to filter the noise.
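As a minimal sketch of "use clustering to filter the noise": treat the data as one cluster and drop the points farthest from its centroid. This is my own simplified stand-in; real work would more likely use k-means or a density-based method like DBSCAN, which labels noise points explicitly.

```python
import math

def filter_noise(points, keep_ratio=0.8):
    """Keep the points closest to the centroid; drop the rest.
    A one-cluster sketch of noise filtering via clustering."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    by_dist = sorted(points, key=lambda p: math.dist(p, (cx, cy)))
    return by_dist[:int(len(points) * keep_ratio)]

# Nine points near the origin plus one far-away noise sample.
pts = [(0.1 * i, 0.1 * i) for i in range(9)] + [(50.0, 50.0)]
clean = filter_noise(pts)
print((50.0, 50.0) in clean)  # -> False: the outlier is gone
```

Note the trade-off the section warns about: if outliers are what you are hunting for, this step would throw away exactly the samples you care about.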
C. Choose an ML algorithm that fits your imagination – start simple!
As I have mentioned, before the ML task starts, you should have a general feeling for how things interact with each other. If a linear classifier is good enough, there is no reason to use a non-linear one. The key point here is to K.I.S.S. Ask yourself: Do I actually need ML? If yes, does DT work? Do NOT jump to SVM with a Gaussian kernel as the first step.
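"Does DT work?" can be checked with the simplest possible tree: a depth-1 decision stump. The sketch below (my own toy data, linearly separable on purpose) learns a single threshold; if something this simple already separates your data, a Gaussian-kernel SVM buys you nothing.

```python
def fit_stump(data):
    """Learn the best single-threshold rule (a depth-1 decision tree)."""
    best = None
    for t in sorted({x for x, _ in data}):
        acc = sum((x >= t) == bool(y) for x, y in data) / len(data)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best

# Toy data: label 1 iff the feature is >= 5 (cleanly separable).
data = [(x, int(x >= 5)) for x in range(10)]
threshold, train_acc = fit_stump(data)
print(threshold, train_acc)  # -> 5 1.0
```

When the stump's accuracy is already acceptable, K.I.S.S. says stop there; when it is not, you have at least learned that the problem is not a single-threshold one.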
D. Understand your model!
Once you have a working model, do NOT jump to conclusions immediately. Try to analyze and understand the process (behavior) you have just learned from the model. DT tells you how different features are weighted; SVM with a kernel method tells you about the non-linear interactions among features; PGM shows the structure of how features entangle. Give a reasonable explanation of why the whole thing works!
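A toy version of "DT tells you how different features are weighted" (again my own construction, not a real DT library): compare the best single-feature split each feature can achieve. The feature that splits well is the one the process actually depends on.

```python
def stump_accuracy(data, feat):
    """Best single-threshold accuracy using only feature `feat`."""
    best = 0.0
    for t in sorted({row[feat] for row, _ in data}):
        acc = sum((row[feat] >= t) == bool(y) for row, y in data) / len(data)
        best = max(best, acc)
    return best

# Feature 0 determines the label; feature 1 just cycles 0..3.
data = [((x, x % 4), int(x >= 6)) for x in range(12)]
scores = [stump_accuracy(data, f) for f in (0, 1)]
print(scores)  # feature 0 separates perfectly; feature 1 barely beats chance
```

If your model's "important" feature has no plausible causal story in your system, that is exactly the moment to stop and rethink rather than publish.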
E. Put it into practical use.
The only way to tell whether an ML system works is through practice. Validation data is useful for publishing a paper but useless for proving that an ML system actually works. If you intend to make it work in a practical system, use it and collect data at the same time. If your ML system still works, congratulations! Otherwise, go back to A!
F. Performance!
Frankly, there is not too much we can do here, unless we want our own implementation in C/C++ rather than the existing ML libraries or implementations. From the application perspective, one thing we can think about is parallel processing.
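One embarrassingly parallel piece of an ML application is batch scoring: split the inputs into chunks and score the chunks concurrently. The sketch below uses a trivial stand-in for a trained model and a thread pool; note that for pure-Python CPU-bound scoring, Python's GIL means real speedups need processes or a C-backed library, so treat this as a structural sketch only.

```python
from concurrent.futures import ThreadPoolExecutor

def predict(x):
    """Stand-in for a trained model's scoring function."""
    return int(x >= 5)

def predict_batch(batch):
    return [predict(x) for x in batch]

inputs = list(range(100))
chunks = [inputs[i:i + 25] for i in range(0, len(inputs), 25)]

# Score the four chunks concurrently; map() returns results in order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = [y for part in pool.map(predict_batch, chunks) for y in part]

print(parallel == [predict(x) for x in inputs])  # -> True
```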
6. Why are a lot of ML application papers garbage, from my point of view?
Q. Why do you want to use ML?
A. Coz I wanna use ML.
Q. Where does your data come from?
A. I have got “big” simulation data.
Q. Why do you choose algorithm COOL?
A. Because it is powerful and looks cool.
Q. How does that work?
A. The precision is 99% using the validation data.
Q. This 99% looks a little bit like overfitting!
A. What?!
Q. How does that work in practice?
A. What do you mean by practice? The precision is 99% using the validation data!
7. Why is using ML hard for systems?
Most ML algorithms we are talking about here are supervised, offline learning, as this is how actual learning works for human beings. Professors give lectures; students work on homework after class. One could imagine getting an A+ in the final exam after understanding all the questions from the homework. However, when there is no professor, a student would not get an A+ until he had read through the whole textbook and done all the homework. Things get worse when there is no professor, no textbook, and no homework, and the student is asked to take the exam anyway – that is unsupervised, online learning.
Unfortunately, the most reasonable ML setting for a running system would be unsupervised, online learning, as the state of the system changes every second. It is just like the weather forecast: we could collect TBs of data and use them to predict sunshine for tomorrow. But then it rains, and no one is surprised. The reason is simple: just like the weather, there are so many factors interacting with each other to impact the final behavior of the system. We simply can NOT predict the result using a certain set of features when other factors/parameters within the system start to change.
Are we doomed? Not really! The scientific and traditional saying about the weather report is something like – “Any forecast beyond one hour is BS.” On one hand, this is right because of the Butterfly Effect; on the other hand, it reveals the fact that as long as the system is in a “stable” state, the forecast – or the ML system modeling this state – SHOULD work! In a word, we can use ML techniques in a system even if the algorithm itself is supervised, offline learning. As long as we can guarantee that all the data used to train the ML model comes from the real system where we will plug it in, as long as the training data set is large enough, as long as the ML algorithm works well on the training and validation data, and as long as we can guarantee that the system will keep the same state for a period, our ML system should also work well during that period.
Then how do we tell if the system state has changed? This is a really HARD question. Though every change in the factors/parameters within the system could lead to a system state change, NOT all of these changes would impact our ML system. Imagine we are using some ML technique to implement a network-based IDS. We collect data from NIC statistics while the average CPU usage is 30% and build our model accordingly. Does this ML model still work when the average CPU usage reaches 90% (an overloaded web server)? I would vote “no” here.
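A crude way to notice such a state change is to compare a recent window of some monitored metric against its training-time distribution. The sketch below (my own toy CPU-usage numbers, echoing the 30%-to-90% scenario above) flags a change when the recent mean drifts several training standard deviations away; real deployments might use proper drift detectors such as CUSUM or ADWIN instead.

```python
def state_changed(train_vals, recent_vals, k=3.0):
    """Flag a state change when the recent mean is more than
    k training standard deviations from the training mean."""
    n = len(train_vals)
    mean = sum(train_vals) / n
    var = sum((v - mean) ** 2 for v in train_vals) / n
    std = var ** 0.5
    recent_mean = sum(recent_vals) / len(recent_vals)
    return abs(recent_mean - mean) > k * std

# Training: CPU usage hovering around 30%; two live windows.
train = [28, 31, 30, 29, 32, 30, 31, 29]
print(state_changed(train, [30, 29, 31, 30]))  # -> False (same state)
print(state_changed(train, [88, 92, 90, 91]))  # -> True (overloaded server)
```

When the detector fires, the honest conclusion is the one in the text: the model was built for a state the system is no longer in.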
The other question is what to do if the system state has changed and our ML system does not work well anymore. In practice, a real ML system should also collect data once it is put into a production system. This “more recent” and “live” data can be used to tune the parameters of the ML model if there is a slight change in the system. After the ML system has run for a while, some human being with a fairly good salary and some knowledge of ML (someone called a Data Scientist…) should jump in and determine whether a new ML algorithm/model is better/needed.
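The "tune the parameters with live data" idea, for slight changes, can be as simple as an exponentially weighted update of a model parameter. In this sketch (my own toy numbers: a baseline learned offline, then nudged by live samples), each new observation moves the parameter a small step, so the model tracks a slowly drifting system without full retraining.

```python
def ewma_update(old, new_value, alpha=0.1):
    """Fold one live sample into a model parameter via an
    exponentially weighted moving average (step size alpha)."""
    return (1 - alpha) * old + alpha * new_value

# A baseline learned offline, then updated with live samples.
baseline = 30.0
for sample in [32, 31, 33, 34, 33, 35]:
    baseline = ewma_update(baseline, sample)
print(round(baseline, 2))  # the baseline has drifted toward the live data
```

Picking alpha is the judgment call the section leaves to the well-paid Data Scientist: too small and the model lags the system, too large and it chases noise.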