How Linux kernel works – in 4 sentences

I found this in “Understanding the Linux Kernel” (ULK). I have seen a lot of analogies in computer science, but this one is “Simply the Best”. I am not going to put anything here except the original analogy, in four sentences. Take a few seconds and see whether this could be the best analogy you have ever seen.

  1. If a boss calls while the waiter is idle, the waiter starts servicing the boss.
  2. If a boss calls while the waiter is servicing a customer, the waiter stops servicing the customer and starts servicing the boss.
  3. If a boss calls while the waiter is servicing another boss, the waiter stops servicing the first boss and starts servicing the second one. When he finishes servicing the new boss, he resumes servicing the former one.
  4. One of the bosses may induce the waiter to leave the customer being currently serviced. After servicing the last request from the bosses, the waiter may decide to drop temporarily his customer and to pick up a new one.

Cross compile user-space applications using Yocto for Gumstix Overo

The official way to add a software package with Yocto is to add a new layer if no existing recipe covers the package; baking the whole image then includes it. However, that is not what this post is about. Instead, we will look at how to get the Wi-Fi working and/or cross-compile code on the development server.

1. Compile locally

One would expect gcc to be installed in the rootfs (console-image), but it is not. As usual, the corresponding recipe could be added to the image. However, if the Gumstix is equipped with a Wi-Fi adapter and a package manager (e.g., opkg) is already installed, bringing the board online and installing packages that way may be a good idea as well.

This link (http://wiki.gumstix.org/index.php?title=Overo_Wifi) provides all the information needed to configure the Wi-Fi connection. Note that the wpa_supplicant example there uses WPA2. I tried WPA-EAP, the only mode supported by my Wi-Fi environment, but it failed with a wpa_supplicant parsing error. I was using a Gumstix Overo and Yocto 1.6 (Daisy), and it is not clear whether a newer Yocto Gumstix layer supports WPA-EAP (I suspect not). The workaround is simple as long as you have a smartphone that supports a hotspot: configure the hotspot to be WPA2 only and you are done.
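For reference, a minimal wpa_supplicant network block for the WPA2-PSK hotspot workaround might look like the sketch below; the SSID and passphrase are placeholders, and your interface setup may differ.

# /etc/wpa_supplicant.conf - minimal WPA2-PSK sketch (placeholder SSID/passphrase)
ctrl_interface=/var/run/wpa_supplicant

# WPA2 (RSN) with a pre-shared key; matches the "WPA2 only" hotspot workaround
network={
    ssid="MyHotspot"
    proto=RSN
    key_mgmt=WPA-PSK
    psk="changeme-passphrase"
}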

2. Cross-compile in the development server

Believe it or not, the approach that works best for me is cross-compiling on the development server. This link (https://github.com/gumstix/yocto-manifest/wiki/Cross-Compile-with-Yocto-SDK) has everything. Note that the SDK provided by the Gumstix team is based on Yocto 1.8. If you are not using this version, build your own SDK:

bitbake gumstix-console-image -c populate_sdk

One tricky thing before starting the cross-compile is the Makefile. Once the SDK environment script is sourced, the environment is set up for the ARM gcc. This means, on one hand, that any environment variables reset in the Makefile (CC, CFLAGS, LDFLAGS, etc.) would break the cross-compile and fall back to a native compile. On the other hand, the new Makefile is much simpler.

# Makefile for provd
# using Yocto cross compile
# Sep 1, 2015
# root@davejingtian.org
# http://davejingtian.org

# NOTE:
# Do NOT change CC,CFLAGS,LDFLAGS

OBJS = provd.o nlm.o provmem.o

provd : $(OBJS)
	$(CC) $(CFLAGS) $(LDFLAGS) $(OBJS) -o $@
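
For reference, a typical cross-compile session with this Makefile might look like the following; the SDK install path and the exact name of the environment-setup script are placeholders that depend on your SDK build and target:

# one-time: install the SDK produced by populate_sdk, then in every build shell:
$ source /path/to/sdk/environment-setup-<target-tuple>   # placeholder path/name
$ make            # CC/CFLAGS/LDFLAGS now come from the SDK environment
$ file provd      # sanity check: should report an ARM executable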

Linux kernel I/O Stack Diagram 2

https://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram

[Image: Linux-storage-stack-diagram_v4.0]


Linux Kernel DSA and Provenance Release

Linux Provenance kernel (2.6.32) and tools for CentOS and RedHat Enterprise Linux

https://github.com/daveti/prov-kernel
https://github.com/daveti/prov-tools

Linux kernel crypto – DSA

https://github.com/daveti/kdsa

During development on kernel 2.6.32, we found a bug in mpi-pow.c which caused DSA to fail. The patch file has been included as well. Note that this bug should be fixed in later kernel versions. Also, for quick prototyping, this DSA implementation does not use the kernel's PKI framework or key ring. To support building it as a dynamic kernel module, some changes are needed as well.


Find your paradise – “I feel I was born to do this. It’s in my blood.”

The quote is from a truck driver who has been driving for 25 years. Check out the question on Quora and the answer from Mr. Rick Klugman: http://qr.ae/RC3jpW

After 4 days in D.C. for conference travel, an almost-cancelled flight back home, being sick the whole weekend, and pushing another paper to another damn conference until last night, I am truly exhausted. This reminds me of the old days in the company, when I was a junior developer staring at the screen 14 hours a day, trying to find the bug. Of course, things got better later as I grew technically and gained experience. However, I still became exhausted and frustrated 3 years and 8 months later, when I decided to come back to school, hoping to find something different. The most valuable thing I learned in this period is that I love machines and coding.

My first year of the Ph.D. program did not go well. The lesson learned is: do NOT come to a conclusion (either liking or disliking something) until you have been digging around for at least 6 months. Another way of saying this is a quote from a senior developer in my old company – “You do not know that thing until you do that thing!” I changed my research direction and advisor the next year and, thank God, I am still working on the same direction with the same advisor!

Do I enjoy research in general? Frankly, it is hard to tell. I hate traveling, I hate the fact that people will kill your paper without even reading it, and I hate the fact that publication means everything. But I do enjoy the things I have been doing, whether in the name of research or not. They cheer me up, give me headaches, and will continue to fill my next 3 years as a Ph.D. student. I still love machines and coding, and now I am playing with machines and coding a lot. This is the thing I want.

David Evans talks about “how to live in paradise”, and academia is the paradise in his opinion. It is not important whether we agree with his opinion or not (FYI, I do not!). Similarly, the truck is the paradise for Mr. Rick Klugman, and I believe he does not give a shit about security research. So how do we find our own paradise? Mr. Rick Klugman gave the answer – “I feel I was born to do this. It’s in my blood.” When you have such a feeling, that is your paradise.

Find your paradise and enjoy~


Linux kernel I/O Stack Diagram

Found this figure online. The title of this diagram may be a bit misleading, as it is not the complete I/O stack but a concentration on block devices with a focus on the SCSI layer. Some important missing pieces, in my opinion, include MTD and USB. But still, this is a good diagram worth keeping here :)

[Image: linux-io-stack-diagram_v1-01]


Trustworthy Whole-System Provenance for the Linux Kernel

Our paper “Trustworthy Whole-System Provenance for the Linux Kernel” has been accepted by USENIX Security 2015. While the details can be found in the paper (link below), in this post I would like to talk about some background on LPM (a.k.a. the Linux Provenance Module, our main contribution in the paper), its current status, and some of my personal thoughts. Please note that this post does NOT represent the opinion of the whole LPM team, but only my itch to write as a developer and co-author.

0. The paper (linked from Adam’s website)

http://cise.ufl.edu/~bates/documents/Bates_Security15a.pdf

1. LSM

The most successful security feature in the Linux kernel may well be LSM (Linux Security Modules). I am sure people would argue about how I define “the most successful”. My rule is simple – unlike grsecurity, LSM is in the mainline, and most of the security code under the security subdirectory of the Linux kernel source consists of LSMs, e.g., SELinux and AppArmor. Basically, LSM is a set of kernel hooks distributed almost everywhere in the kernel source, wherever the operation of certain APIs is security sensitive and needs MAC (Mandatory Access Control). To enforce a policy, one writes a module that implements these hooks. Like it or not, major Linux distros usually ship some LSM implementation: RH/Fedora uses SELinux while Ubuntu loves AppArmor. Even Android has something called SE Android, which in essence is a kind of LSM. Even though LSM was designed for mandatory access control and most implementations are indeed MAC, a simple question is: what else could we do using LSM?

2. Provenance

When we talk about provenance, we are talking about data provenance and/or the lineage of data. Check out Wikipedia for a quick understanding – https://en.wikipedia.org/wiki/Provenance. My personal interpretation of provenance is “extra logging”, which can be used to track either the information flow or the workflow of a program or of data, and which provides a trusted basis for forensic analysis if something goes wrong. Most previous provenance work targeted certain applications, making them prov’able. The assumption behind the application-level approach is trust in the application itself, which has a long history of being vulnerable. The question here is: could we have better provenance for an application without trusting the application? The most inspiring work in system provenance is Hi-Fi, which places hooks within the Linux kernel and collects logs of kernel activities. The Hi-Fi system inspired our work from the very start.

3. LPM

The combination of the two questions mentioned above produced LPM – the Linux Provenance Module, which takes advantage of the LSM framework and provides system-level trustworthy provenance for both the kernel and applications.

[Image: lpm_layer – picture from an early draft of the paper, made by Adam Bates]

The good thing about using the LSM framework for LPM is that we are standing on the shoulders of giants. Tons of research, including formal methods, has been applied to LSM to make sure it satisfies the properties of a reference monitor: complete mediation, tamperproofness and verifiability. Unlike Hi-Fi, LPM is able to make these arguments as well because of the nature of the LSM framework. With the LPM framework in place, we moved forward to create the ‘real’ LPM – provmon, which collects all the kernel activities and relays the logging information to user space. For example applications, such as data loss prevention, please refer to the paper.

4. Current status

A working provenance kernel (2.6, based on RHEL) will be published soon. LPM coexists with LSM: the LPM hooks are called after the LSM hooks, acting as a second level of ‘policy’ enforcement, so it is also possible to have both SELinux and provmon working at the same time. The team is also investigating new security features to ‘stack’ LPM with LSM if possible, trying to make it available to all kernel releases in a non-intrusive way.

5. More considerations (NOTE: personal opinions)

During the development of LPM, an interesting question has been bothering us – what can LPM do for applications? While an application may be interested in its own data structures, such as a ‘record’, the kernel is, unfortunately, agnostic about this information. Instead, the kernel talks about inodes, sockets, packets, etc. In the paper, we instrumented an application to bridge the semantic gap between the application and the kernel. But could we do better? Would a new set of syscalls designed for provenance be better for application provenance with kernel support? Imagine future applications that need kernel-level provenance support simply calling a provenance syscall to get the full picture of information flow and workflow at kernel-level granularity.

Another thing I have been wondering is whether LSM is the best solution for provenance in the first place. LSM can be viewed as a horizontal layer in the kernel that provides unified access control. However, when we rethink information flow and workflow from the point of view of applications, we may realize that we need vertical layers (not just one layer) in the kernel to provide a complete picture – how a data structure in the application is mapped onto different kernel objects, and how these different objects interleave and transform into each other.


Python Levenshtein distance – Choose Python package wisely

Brad and I were working on some text-similarity computation. One of the most popular string distance functions is the Levenshtein distance, which is also called the edit distance. We use Python for its brevity and wide library support (OK, I lied. The truth is that I was persuaded by Brad and his miserable experience interacting with R…). As we would like to try different distance functions, we picked up the Python distance package (pip install distance). Surprisingly, we found its Levenshtein function pretty slow compared to the other distance functions (regardless of the complexity of the algorithm itself). The next day, Brad found another Python package – editdistance (pip install editdistance) – which is 2 orders of magnitude faster than the distance package. In this post, we will examine these 2 packages and 2 other implementations based on the Wikipedia description. K.R.K.C.

1. Run the Damn Code!

# A Levenshtein distance demo
# using distance pkg and editdistance pkg
# and yet another 2 implementations
# May 2, 2015
# root@davejingtian.org
# http://davejingtian.org
import cProfile
import editdistance
import distance

loop = 10000
dist = 0
str1 = 'kitten'
str2 = 'sitting'
script1 = 'for i in range(loop):\n\tdist = distance.levenshtein(str1, str2)' 
script2 = 'for i in range(loop):\n\tdist = editdistance.eval(str1, str2)'

# A K.I.S.S. implementation using recursion
# Ref: http://en.wikipedia.org/wiki/Levenshtein_distance
def recLev(s, lenS, t, lenT):
	if lenS == 0 or lenT == 0:
		return max(lenS, lenT)	# distance to an empty string is the other string's length
	if s[lenS-1] == t[lenT-1]:
		cost = 0
	else:
		cost = 1
	return min(recLev(s, lenS-1, t, lenT)+1, recLev(s, lenS, t, lenT-1)+1, recLev(s, lenS-1, t, lenT-1)+cost)

script3 = 'for i in range(loop):\n\tdist=recLev(str1, len(str1), str2, len(str2))'

# A fast and memory efficient implementation
# by Hjelmqvist, Sten
def fastMemLev(s, t):
	# degenerate cases
	if s == t:
		return 0
	if len(s) == 0:
		return len(t)
	if len(t) == 0:
		return len(s)
 
	# create two work vectors of integer distances
	#int[] v0 = new int[t.Length + 1];
	#int[] v1 = new int[t.Length + 1];
	v0 = []
	v1 = []
 
	# initialize v0 (the previous row of distances)
	# this row is A[0][i]: edit distance for an empty s
	# the distance is just the number of characters to delete from t
	# for (int i = 0; i < v0.Length; i++)
	# v0[i] = i;
	for i in range(len(t)+1):
		v0.append(i)
		v1.append(0)

	for i in range(len(s)): 
		# calculate v1 (current row distances) from the previous row v0
		# first element of v1 is A[i+1][0]
		# edit distance is delete (i+1) chars from s to match empty t
		v1[0] = i + 1
 
		# use formula to fill in the rest of the row
		for j in range(len(t)):
			cost = 0 if s[i] == t[j] else 1;
			v1[j + 1] = min(v1[j]+1, v0[j+1]+1, v0[j]+cost)
 
		# copy v1 (current row) to v0 (previous row) for next iteration
		for j in range(len(t)+1):
			v0[j] = v1[j]
 
	return v1[len(t)]

script4 = 'for i in range(loop):\n\tdist=fastMemLev(str1, str2)'


# Profile
cProfile.run(script1)
print('script1:', dist)
print('='*20)
cProfile.run(script2)
print('script2:', dist)
print('='*20)
cProfile.run(script3)
print('script3:', dist)
print('='*20)
cProfile.run(script4)
print('script4:', dist)

[daveti@daveti python]$ python ./leven.py
540003 function calls in 0.881 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.010 0.010 0.881 0.881 <string>:1(<module>)
10000 0.699 0.000 0.871 0.000 _levenshtein.py:6(levenshtein)
20000 0.002 0.000 0.002 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
420000 0.139 0.000 0.139 0.000 {min}
90001 0.031 0.000 0.031 0.000 {range}

('script1:', 3L)
====================
10003 function calls in 0.013 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 0.013 0.013 <string>:1(<module>)
10000 0.009 0.000 0.009 0.000 {editdistance.bycython.eval}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {range}

('script2:', 3L)
====================
396510003 function calls (99150003 primitive calls) in 220.297 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.019 0.019 220.297 220.297 <string>:1(<module>)
297370000/10000 198.277 0.000 220.276 0.022 leven.py:20(recLev)
20000 0.002 0.000 0.002 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
99120000 22.000 0.000 22.000 0.000 {min}
1 0.000 0.000 0.000 0.000 {range}

('script3:', 3)
====================
900003 function calls in 0.629 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.007 0.007 0.629 0.629 <string>:1(<module>)
10000 0.461 0.000 0.622 0.000 leven.py:33(fastMemLev)
170000 0.014 0.000 0.014 0.000 {len}
160000 0.015 0.000 0.015 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
420000 0.097 0.000 0.097 0.000 {min}
140001 0.035 0.000 0.035 0.000 {range}

('script4:', 3)
[daveti@daveti python]$

2. Analysis

As one can tell, editdistance is much faster than distance for Levenshtein. This is not a big surprise once you realize that editdistance is a C++ implementation while distance is pure Python. Actually, both of them use the iterative method with a matrix rather than the straightforward recursive implementation, which is the slowest but has the fewest lines of code. In a word, there is nothing wrong with the distance implementation. A fast and memory-efficient algorithm from Sten Hjelmqvist, ported to Python, is also examined. It is faster than the distance implementation, which shows again that the distance implementation is not that evil. Still, there should be room for the distance implementation to get closer to fastMemLev. From a quick glance at the distance implementation, it uses a numpy array, which could be the bottleneck.

3. Be wise

Let those who claim that Python is fast go to hell. It all depends on what Python is being compared to. There is no doubt that a Python implementation will be slower than the original C/C++ implementation. If performance is all that matters, try to find Python packages that are implemented in C/C++, or write the Python interface yourself, wrapping your favorite C/C++ implementation. Be wise about your choice~
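
For example, a minimal sketch of the wrapping approach using ctypes might look like the following; the shared library name (libleven.so) and the exported C function (lev) are hypothetical placeholders for whatever C/C++ implementation you pick.

# leven_ctypes.py - hypothetical sketch of wrapping a C Levenshtein implementation
# assumes a shared library built beforehand, e.g.:
#   gcc -O2 -shared -fPIC -o libleven.so leven.c   (exporting: int lev(const char *, const char *))
import ctypes

lib = ctypes.CDLL("./libleven.so")   # load the (hypothetical) shared library
lib.lev.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
lib.lev.restype = ctypes.c_int

def levenshtein(s, t):
	# encode to bytes for the C API
	return lib.lev(s.encode(), t.encode())

print(levenshtein('kitten', 'sitting'))   # expect 3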


More Guidelines Than Rules: CSRF Vulnerabilities from Noncompliant OAuth 2.0 Implementations

Our paper, as titled, has been accepted by DIMVA 2015 – Milano, Italy. While the final paper will not be released until July, this post gives a brief summary of what we have done. The other focus here is the implementation of the web crawler we designed for OAuth Detection (OAD), which is now open source. K.R.K.C.

0. OAuth 2.0

While there is an argument that OAuth 1.0 is more gracefully designed (and more secure) compared to OAuth 2.0, the latter has become the dominant deployment. All the flagship IT companies, such as Google, Facebook and Twitter, provide OAuth 2.0 authentication services for third-party applications, which allow users to use a service without exposing personal information (e.g., a password) to the application itself and without registering for each service separately. For example, we can use a Facebook account to log in to Spotify without exposing the password to Spotify or registering with Spotify.

1. Cross-Site Request Forgery (CSRF)

“A CSRF is a common form of confused deputy attack where a user’s previous session data is used by an attacker to make a malicious request on behalf of the victim.” [our paper] The most common countermeasure for this attack is to include a random token in the URL/request to implement a challenge-response mechanism. Like other HTTP/HTTPS-based protocols, OAuth 2.0 is also vulnerable to CSRF attacks. The good news is that this attack is well documented and discussed in the OAuth 2.0 RFC. According to the RFC, a “state” parameter is recommended as the implementation of the random value and should be added to the authorization request URL. As long as an OAuth 2.0 implementation handles the “state” parameter correctly, CSRF attacks should be defeated.
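
As a minimal sketch (not code from the paper), the client side of the “state” defense might look like this; the endpoint URL, client_id, redirect_uri and the session dict are hypothetical placeholders.

# oauth_state_demo.py - minimal sketch of the OAuth 2.0 "state" CSRF defense
# (endpoint URL, client_id, redirect_uri and the session dict are placeholders)
import secrets
import urllib.parse

session = {}   # stand-in for the per-user server-side session

def build_authorization_url():
	# bind a fresh random value to the user's session...
	state = secrets.token_urlsafe(32)
	session['oauth_state'] = state
	params = {
		'response_type': 'code',
		'client_id': 'EXAMPLE_CLIENT_ID',
		'redirect_uri': 'https://client.example.com/callback',
		'state': state,
	}
	return 'https://idp.example.com/oauth/authorize?' + urllib.parse.urlencode(params)

def handle_callback(query_string):
	# ...and reject the callback unless the state round-trips unchanged
	params = urllib.parse.parse_qs(query_string)
	if params.get('state', [None])[0] != session.get('oauth_state'):
		raise ValueError('possible CSRF: state mismatch')
	return params['code'][0]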

2. So, what is wrong here?

In this paper, we are interested in what happens in real-world OAuth 2.0 implementations and deployments. Specifically, we are interested in CSRF vulnerabilities in the wild. We analyzed the Alexa Top 10,000 domains using our OAD crawler, which crawled and analyzed more than 5 million URLs in total. Surprisingly (or maybe not), 25% of the websites using OAuth 2.0 appear vulnerable to CSRF attacks. The root cause is simple: either the Identity Provider (IdP, providing the OAuth authentication service) does not enforce the “state” parameter, or the application developer does not implement it at all. This again reminds me of the giant gap between theory and practice – in theory, there is no big difference between theory and practice; in practice, there is!

3. OAuth Detector (OAD)

While more details can be found in the paper, here we focus on the design and implementation of our web crawler – OAD. Scrapy may be the most popular and convenient Python web-crawler framework one can think of, and personally I do support using Scrapy for quick, general-purpose crawlers. At least, that is what it is designed for. Our concerns regarding existing general web crawlers such as Scrapy are:

a. They seem to be heavy-weighted.
b. What if we need low-level control/debugging, such as threading related?
c. What about the performance considering a long run?

Because of these considerations, we implemented our own web crawler, OAD, which is a simple, lightweight, multi-threaded, depth-based crawler specifically designed for OAuth URL detection and analysis. OAD uses basic Python libraries to avoid the extra overhead introduced by the frameworks mentioned above. For example, Python processes are used for parallelism and BeautifulSoup is used for HTML parsing. The Selenium WebDriver is also integrated for JavaScript execution. Check out the code below. Please note that OAD is still under development by Aaron Lamb in another private repo (thanks Aaron!). We will release a new version soon, with support for a Bloom filter and parallel JavaScript execution.
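
To give a flavor of the kind of check OAD performs, here is a simplified sketch; this is not the actual OAD code, and the response_type/client_id heuristic is an assumption about what marks an authorization URL.

# oauth_check_demo.py - simplified sketch: find OAuth-looking authorization URLs
# on a page and flag those missing a "state" parameter (not the actual OAD code)
import urllib.parse

import requests
from bs4 import BeautifulSoup

def find_oauth_links(page_url):
	html = requests.get(page_url, timeout=10).text
	soup = BeautifulSoup(html, 'html.parser')
	for a in soup.find_all('a', href=True):
		url = urllib.parse.urljoin(page_url, a['href'])
		qs = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
		# heuristic: an authorization request carries response_type and client_id
		if 'response_type' in qs and 'client_id' in qs:
			yield url, 'state' in qs

# usage: list OAuth-looking links and whether they carry a state parameter
# for url, has_state in find_oauth_links('https://example.com'):
#	print('OK' if has_state else 'VULN?', url)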

4. Code

https://github.com/daveti/oad


Human Learning – some thoughts about Machine Learning and its applications

I attended a security conference recently. To my surprise, a great portion of the accepted papers use some kind of machine learning technique to achieve certain security goals. While I am NOT going to criticize any of the papers, this “machine learning” hotness did remind me of the old days, when I spent years on AI and machine learning and was not this old. In this post, I try to explain some things people should know before using this powerful tool. Please note that if you already have a solid understanding of and background in machine learning, just stop here and open your Spotify!

0. What is Machine Learning (ML)?

The basic idea of ML is to construct a model from data. This model reveals the inner pattern (or rules, knowledge) of the process (or procedure, behavior) that generated the data. Usually, there are 3 tasks in ML – classification, regression and clustering – which, though they differ in their targets, are closely connected. Early ML originated from AI and focused on logic; Decision Trees (DT) and Artificial Neural Networks (ANN) are the famous examples of early ML. Modern ML relies on statistics and is also called statistical learning; Support Vector Machines (SVM) and Probabilistic Graphical Models (PGM) are the typical examples. Deep Learning, which takes the study of ANN structure and Back Propagation (BP) to another level, has become popular because of its success in image recognition.

1. Distinguishing between ML research and ML applications

ML research tries to improve existing ML algorithms (in their computational requirements, parameter tuning or structure learning) and to create new ML algorithms (HARD problems). Theoretical work usually analyzes the mathematical bounds or properties of these algorithms.

ML applications usually refer to applying a certain ML algorithm to some data, with the hope of finding something interesting and useful. ML applications also involve parameter tuning once the model is determined.

2. Why does ML work?

Whatever ML algorithm is applied, it needs data, and there are 2 requirements on the data. First, the samples should be independent and identically distributed (a.k.a. i.i.d.). Second, the data set should be large enough. Imagine we are sampling an analog signal: the sampling frequency should be fixed and should not depend on previously sampled values, and we should be careful not to sample the noise. Moreover, the sampling frequency must be high enough if we want the digital signal to be close enough to the original analog one.

3. Why does ML not work?

When the data does not satisfy these requirements, ML may or may not work, depending on how lucky you are that day. In most cases, we can get good precision on the training data (we are really good at this!). However, once the system is put into practical use, it suddenly does not work that well. This is, and will always be, the Achilles heel of ML – overfitting (or over-learning). Once it happens, it is telling you that the data is either NOT i.i.d. or NOT enough.
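
A minimal sketch of what overfitting looks like in practice is below; the data set is synthetic and the numbers are illustrative only.

# overfit_demo.py - illustrate overfitting: near-perfect precision on training data,
# noticeably worse precision on held-out data (synthetic data, illustrative only)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# small, noisy data set: few samples, label noise via flip_y
X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)   # unconstrained depth
print('train accuracy:', clf.score(X_tr, y_tr))   # typically close to 1.0
print('test accuracy:', clf.score(X_te, y_te))    # typically noticeably lower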

4. Which ML algorithm to use?

I do believe that ML is actually Human Learning. When you are clear about your target (the ML task and the thing you want to do), it is very likely that you already have some feeling about what kind of data you need – feature selection! Once you start collecting the different data (features), you start to feel how they could work together to generate the result. Then you apply a certain ML algorithm that looks reasonable given your imagination of the whole process (behavior). If it works, then congratulations – your imagination fits the facts. If not, first try to tune the parameters of this algorithm. If that does not work, try some other ML algorithms. Some algorithm should perform better than the others, and if the precision is still not acceptable, think about the data again – does it satisfy the 2 requirements?

Actually, most ML algorithms are designed independently of the data. This means that, as long as you have the “perfect” data, it does not matter THAT much which algorithm you choose. It is the same as programming, where the data structure determines the algorithm. In ML, data is the king. The reason to learn different kinds of ML algorithms/models is to understand the pattern (rules/knowledge) behind the process (behavior) you are trying to learn.

However, this does NOT mean that different ML algorithms perform the same given the same data set. Ultimately, different ML algorithms have different use cases. As long as the dimension of the data (the number of features) is not that large, we can always start with DT, which may be the oldest but is still useful. SVM with kernel methods is designed for the case where you do not have enough data and the correlations among features are not linear. Both ANN and PGM have a network structure, which can be suitable for modeling time-driven or event-driven systems.

Unfortunately, most people, including myself, just try different ML algorithms on the same data set and pick the one with the best precision. The assumption that precision determines the quality of the ML system in general is meaningful when the precision is under 80%. When there are 2 ML systems with precisions of 91% and 92%, the assumption is not valid anymore. As there is no metric (yet) for generality, the only way to tell which ML algorithm/model is better is to put the ML system BACK into the real system, run it for a while and compare the precision again.

5. A general procedure for conducting ML applications

Before everything starts, please think again about the target you want to achieve. Specifically, please ask the following questions again. The only thing we do NOT need to consider here is performance!

What is the target?
Why do I need ML?
What kind of features do I need?
What kind of ML algorithm/model am I gonna use?
What results would I expect?

A. Collect the real data!
Unless you are working on ML research, collect as much real data as possible. By real data, I mean NOT simulation data, which should be used to evaluate an ML algorithm rather than to construct an ML system/model. All data – training, testing, validation and future production data – should come from the same place, under the same environment.

B. Filter the noise.
Unless you are trying to find outliers, filter the noise. This is especially important when the size of the data set is limited (yes, sometimes we cannot get enough data, e.g., from a survey) and each data sample can easily affect the model. For example, we can use clustering to filter the noise.

C. Choose an ML algorithm fitting your imagination – start simple!
As mentioned above, before the ML task is started you should have a general feeling of how things interact with each other. If a linear classifier is fairly good, there is no reason to use a non-linear classifier. The key point here is to K.I.S.S. Ask yourself – do I actually need ML? If yes, does a DT work? Do NOT jump to SVM with a Gaussian kernel as the first step (see the sketch after this list).

D. Understand your model!
Once you have a working model, do NOT jump to conclusions immediately. Try to analyze and understand the process (behavior) you have just learned from the model. A DT tells you how different features are weighted; SVM with a kernel method captures the non-linear interactions among features; a PGM shows the structure of how features entangle. Give a reasonable explanation of why the whole thing works!

E. Put it into practical use.
The only way to tell whether an ML system works or not is practice. The validation data is useful for publishing a paper but useless for proving that an ML system actually works. If you intend to make it work in a practical system, use it there and collect data at the same time. If your ML system still works, congratulations! Otherwise, go back to A!

F. Performance!
Frankly, there is not much we can do here unless we want our own implementation in C/C++ rather than the existing ML libraries or implementations. From the application perspective, one thing we can think about is parallel processing.
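
As an illustration of step C, here is a minimal sketch: try a simple decision tree first and only reach for an SVM with an RBF (Gaussian) kernel if it clearly helps. The data is synthetic and the scores are placeholders for whatever your real data gives you.

# start_simple_demo.py - step C in practice: compare a simple decision tree against
# an SVM with an RBF (Gaussian) kernel via cross-validation (synthetic data only)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

models = [
	('decision tree', DecisionTreeClassifier(max_depth=5, random_state=0)),
	('SVM (RBF kernel)', SVC(kernel='rbf')),
]

for name, clf in models:
	scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
	print(name, 'mean accuracy:', round(scores.mean(), 3))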

6. Why are a lot of ML application papers garbage, from my point of view?

Q. Why do you want to use ML?
A. Coz I wanna use ML.
Q. Where does your data come from?
A. I have got “big” simulation data.
Q. Why do you choose algorithm COOL?
A. Because it is powerful and looks cool.
Q. How does that work?
A. The precision is 99% using the validation data.
Q. This 99% looks a little bit like overfitting!
A. What?!
Q. How does that work in practice?
A. What do you mean by practice? The precision is 99% using the validation data!

7. Why is using ML hard for systems?

Most of the ML algorithms we are talking about here are supervised, offline learning, as this is how actual learning works for human beings. Professors give lectures; students work on homework after class. One can imagine an A+ on the final exam if the student understands all the questions from the homework. However, when there is no professor, a student will not get an A+ until he has read through the whole textbook and done all the homework. Things get worse when there is no professor, no textbook and no homework, and the student is asked to take the exam anyway – that is unsupervised, online learning.

Unfortunately, the most reasonable ML setting for a running system would be unsupervised, online learning, as the state of the system changes every second. Just like the weather forecast, we could collect TBs of data and use them to predict that tomorrow will be sunny. But then it rains, and no one is surprised. The reason is simple: just like the weather, there are so many factors interacting with each other to affect the final behavior of the system. We simply canNOT predict the result using a certain set of features when other factors/parameters within the system start to change.

Are we doomed? Not really! The scientific and traditional saying about the weather report is something like “any forecast beyond one hour is BS”. On one hand, this is right because of the Butterfly Effect; on the other hand, it reveals the fact that as long as the system is in a “stable” state, the forecast – or the ML system modeling this state – SHOULD work. In a word, we can use ML techniques in a system even if the algorithm itself is supervised, offline learning: as long as all the data used to train the ML model comes from the real system into which we will plug our ML system, as long as the training data set is large enough, as long as the ML algorithm works well on the training and validation data, and as long as the system keeps the same state for a period, our ML system should also work well during that period.

Then how do we tell if the system state has changed? This is a really HARD question. Although every change of the factors/parameters within the system could lead to a system state change, NOT all of these changes would impact our ML system. Imagine we are using some ML technique to implement a network-based IDS. We collect data from NIC statistics while the average CPU usage is 30% and build our model accordingly. Does this ML model still work when the average CPU usage goes up to 90% (an overloaded web server)? I would vote “no” here.

The other question is what to do if the system state has changed and our ML system no longer works well. In practice, a real ML system would also collect data once it is put into a production system. This more recent, “live” data can be used to tune the parameters of the ML model if there is only a slight change in the system. After the ML system has run for a while, some human being with a fairly good salary and some knowledge of ML (someone called a Data Scientist…) would jump in and try to determine whether a new ML algorithm/model is better/needed.
