Saturday 12 March 2016

The importance of data in Machine Learning


Data for Machine Learning (ML) algorithms is like fuel for a car: no fuel, no movement. If you do not have access to enough data, it is better to consider a rule-based system rather than an ML system. I've seen many ML projects fail not because of the algorithm or the implementation but mostly because of:

1- Not having enough data to train the system
2- A shallow understanding of the data

You can only expect an ML system to understand and respond properly based on the given data and algorithm, nothing more! I always wonder why people expect ML systems to work perfectly based on the tiny dataset they provide for them.

If you as a human recognize a pattern in a second, say a lemon among oranges in a black-and-white picture, it is because you have seen and observed thousands of these fruits before. Understanding the data is also a necessity; otherwise, you should expect to get it wrong again.

In the lemon example, if you can recognize a lemon from any side, it is because you have seen it before from many directions and angles. Looking at the lemon from different angles is like providing different training datasets. That can be a big problem when you analyze a given training dataset; it may be just one of the many angles from which you can see a lemon or an orange.

In physics, we have something called a "Frame of Reference": the point from which you make observations or measurements, and you know better than I what happens when we change the frame of reference. So even for similar training datasets, changing the "Reference Frame" may modify the results dramatically, especially when you have limited training sets. (We will talk about the effect of the "Frame of Reference" in training datasets later.)

What I am going to talk about in this post, and perhaps the next two posts, is how to find a data source, understand it, and use ML to analyze it.

Port usage
Honestly, it took me a week to find a good source of data available to almost everyone for running these tests. Before exploring it, let me say that I have used my Mac with IPv4 to gather the data, so the scripts I introduce here should work on a Mac and even on a Linux machine. However, to use them on Windows or a system with IPv6, you may need some modifications.

Run the following command from a terminal; it simply shows the established connections from your computer to the Internet.

$ netstat -an | grep "ESTABLISHED"
tcp4       0      0  192.168.0.11.60692     66.185.84.44.443       ESTABLISHED
tcp4       0      0  192.168.0.11.60691     54.90.148.22.443       ESTABLISHED
tcp4      98      0  192.168.0.11.60690     216.58.211.35.443      ESTABLISHED
tcp4      98      0  192.168.0.11.60689     216.58.211.35.443      ESTABLISHED
tcp4      98      0  192.168.0.11.60688     216.58.211.35.443      ESTABLISHED
tcp4      98      0  192.168.0.11.60687     216.58.211.35.443      ESTABLISHED
tcp4      98      0  192.168.0.11.60686     216.58.211.35.443      ESTABLISHED
tcp4       0      0  192.168.0.11.60685     216.58.211.35.443      ESTABLISHED
tcp4       0      0  192.168.0.11.60684     172.217.2.137.443      ESTABLISHED
...


The above list is a good, easy-to-access source of data for everyone who reads this post. The data can be interpreted as the current network-usage fingerprint of your computer. To keep the calculation simple, I am going to use only the remote-side port of each connection, so we just need to retrieve the last part of the fifth column, like the following (on Linux you just have to change the "-F." to "-F:"):

$ netstat -an | grep "ESTABLISHED" | awk '{print $5}'  | awk -F. '{print $NF}'
443
443
443
443
443
443
443
443
5228
5223
5223
...
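
If you prefer to do this extraction in code rather than in the shell, here is a small Python sketch of the same idea; it takes one netstat line and pulls out the remote port, handling both the macOS dot-separated and the Linux colon-separated address formats. The function name remote_port is my own choice, and the sample line is taken from the output above.

```python
import re

def remote_port(netstat_line):
    """Extract the remote port from one ESTABLISHED netstat line.

    Works for both the macOS format (address.port, dot-separated)
    and the Linux format (address:port, colon-separated).
    """
    fields = netstat_line.split()
    remote = fields[4]  # fifth column: the remote address
    # The port is whatever follows the last '.' (macOS) or ':' (Linux).
    return int(re.split(r"[.:]", remote)[-1])

line = "tcp4  0  0  192.168.0.11.60692  66.185.84.44.443  ESTABLISHED"
print(remote_port(line))  # 443
```
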


Now, to have a compact version of the data, we can use the following command, which shows the remote ports and their corresponding numbers of established connections:

$ netstat -an | grep "ESTABLISHED" | awk '{print $5}'  | awk -F. '{print $NF}' |  sort -n | uniq -c |  awk '{print $2 ":" $1}'
443:29
5223:2
5228:1


Alternatively, this one shows all of them in comma-separated form on a single line:

$ netstat -an | grep "ESTABLISHED" | awk '{print $5}'  | awk -F. '{print $NF}' |  sort -n | uniq -c | awk '{print $2 ":" $1}'|  tr '\n' ','
443:31,5223:2,5228:1,


So this command gives us the number of established connections for each port on the remote side of the connection. This will be our data to analyze.
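
When we later feed this data to a program, we will want it in a structured form. A minimal Python sketch (the function name parse_fingerprint is my own choice) that turns the comma-separated line above into a port-to-count dictionary could look like this:

```python
def parse_fingerprint(text):
    """Turn '443:31,5223:2,5228:1,' into {443: 31, 5223: 2, 5228: 1}."""
    counts = {}
    for pair in text.split(","):
        if not pair.strip():
            continue  # skip the trailing empty entry left by tr
        port, count = pair.split(":")
        counts[int(port)] = int(count)
    return counts

print(parse_fingerprint("443:31,5223:2,5228:1,"))
# {443: 31, 5223: 2, 5228: 1}
```
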

Now close all the open applications that may use the Internet connection, such as browsers, torrent applications, etc. Let the system clean up and close the network connections. Then run the command, open an application like Safari or Chrome, and execute the command again. Every time you want to get the fingerprint, just wait a minute and let the system close the connections. Here are some of the results on my computer:

No application
4244:1 , 5223:2

Safari with no page
80:6 , 443:12 ,  5223:2 , 4244:1

Chrome with no page
443:33 , 4244:1 , 5223:2 , 5228:1

Chrome with www.google.com
443:21 , 4244:1 , 5223:2 , 5228:1

Chrome with bing
80:35 , 443:21 ,  4244:1 , 5223:2 ,  5228:1

Chrome with a single youtube video
443:45 , 4244:1 , 5223:2 , 5228:1

Chrome with two youtube videos
443:52 ,  5223:2 , 5228:1

Chrome with three youtube videos
443:64 , 4244:1 , 5223:2 , 5228:1

No browser, just a torrent application
5223:2 , 10091:1 , 15870:1 , 19335:1 , 47187:1 , 54706:1 , 54707:1 , 57508:1 , 58343:1 , 65108:1


You can see that there seems to be a relation between your Internet usage and this port information. It is hard to write a rule that determines what you are doing, but I think it is not that difficult to build an ML algorithm that learns the way you use the system and recognizes what you are doing, especially when you consider how easily you can gather the required data. Before ending this post, let me bring a simplified definition of a learning program given by Tom Mitchell:
A computer program is said to learn from experience E when its performance P at task T improves with experience E.
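
To make the idea concrete, here is a deliberately simple sketch, not a full ML system: a nearest-neighbor classifier over the fingerprints measured above, using cosine similarity between the port-count vectors. The labels, the subset of fingerprints used, and the choice of similarity measure are my own illustration, not a definitive design.

```python
from math import sqrt

# Labeled fingerprints taken from the measurements above.
TRAINING = {
    "idle":    {4244: 1, 5223: 2},
    "safari":  {80: 6, 443: 12, 5223: 2, 4244: 1},
    "chrome":  {443: 33, 4244: 1, 5223: 2, 5228: 1},
    "torrent": {5223: 2, 10091: 1, 15870: 1, 19335: 1, 47187: 1},
}

def cosine(a, b):
    """Cosine similarity between two sparse port:count vectors."""
    dot = sum(a[p] * b[p] for p in a if p in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(fingerprint):
    """Return the training label whose fingerprint is most similar."""
    return max(TRAINING, key=lambda label: cosine(TRAINING[label], fingerprint))

# A new observation dominated by port-443 traffic looks like Chrome.
print(classify({443: 52, 5223: 2, 5228: 1}))  # chrome
```

In Mitchell's terms, the labeled fingerprints are the experience E, recognizing the activity is the task T, and the fraction of fingerprints labeled correctly would be the performance P; collecting more fingerprints over time should improve it.
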
