sleptons: March 2016

Sunday, 27 March 2016

Information Gain as a measure of change in port usage distribution

We saw that we could use Bayes' Theorem and Machine Learning methods to catch changes in your computer's port usage. There is another, complementary way to detect such changes. I say complementary because you cannot conclude that something is wrong without enough evidence, so the idea is to gather sufficient information to get a sense of the usage pattern.

Port usage distribution

Remember the post "Visualizing port usage data", where we drew a graph of a computer's destination ports? Here are three separate samples I captured from my computer with the tool provided in that post:

Three different port usage spectra

The question is: how can we define a measure that shows the difference between the spectrum snapshots (or distributions) captured at different steps?
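One common way to put a number on such a difference is the Kullback-Leibler divergence, which is often called information gain. The following is only a minimal sketch under that assumption; the snapshot values, the function names and the small smoothing constant `eps` are illustrative and are not taken from the original tool.

```python
import math

def to_distribution(counts, ports, eps=1e-9):
    """Turn raw port counts into probabilities over a shared list of ports."""
    total = sum(counts.get(p, 0) for p in ports)
    # eps keeps every probability non-zero so the logarithm stays defined
    return [(counts.get(p, 0) + eps) / (total + eps * len(ports)) for p in ports]

def information_gain(before, after):
    """Kullback-Leibler divergence D(after || before), measured in bits."""
    ports = sorted(set(before) | set(after))
    p = to_distribution(after, ports)
    q = to_distribution(before, ports)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

# Hypothetical snapshots: port -> number of established connections
snapshot_a = {443: 47, 5223: 2, 5228: 1}
snapshot_b = {443: 20, 5223: 2, 80: 15}

print(information_gain(snapshot_a, snapshot_a))  # identical snapshots give 0.0
print(information_gain(snapshot_a, snapshot_b))  # a larger value means a bigger change
```

The measure is zero when the two snapshots have the same distribution and grows as they drift apart. Note that it is not symmetric, so the order of the two snapshots matters.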

A tool to monitor your computer's real-time port usage

Screen snapshot of the tool

In the last five posts, we talked about how important data is in Machine Learning algorithms and introduced a source of data that every one of us who uses a computer and the Internet has access to: the port information of our computer's connections to the Internet or the attached network. That is a good source of data because:

The port information does not contain data about the targets or hosts you work with, so you do not give us information about your hosts.
It is steady and always available.

We also introduced a simple script which gathers the information and sends it to the server. This script is the base of our data collection, and you can run it on Linux or Mac OS (on Linux you just have to change the "-F." to "-F:"; it may need some further changes to work on Windows):

netstat -an | grep "ESTABLISHED" | awk '{print $5}' | awk -F. '{print $NF}' | sort -n | uniq -c | awk '{print $2 ":" $1}'
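For readers who prefer to see the steps of that pipeline spelled out, here is a rough Python equivalent. It is only a sketch: the function name and the sample lines are made up for illustration, and it accepts both the "." separator (Mac OS) and the ":" separator (Linux) in front of the port.

```python
from collections import Counter

def count_remote_ports(netstat_output):
    """Count ESTABLISHED connections per remote port from `netstat -an` text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        if "ESTABLISHED" not in line:
            continue
        fields = line.split()
        if len(fields) < 5:
            continue
        remote = fields[4]  # same column as awk's $5
        # The port follows the last "." (Mac OS) or the last ":" (Linux)
        port = remote.replace(":", ".").rsplit(".", 1)[-1]
        if port.isdigit():
            counts[int(port)] += 1
    return dict(sorted(counts.items()))

# Hypothetical Mac OS style lines, for illustration only
sample = """\
tcp4  0  0  192.168.1.5.52321  203.0.113.10.5223  ESTABLISHED
tcp4  0  0  192.168.1.5.52400  198.51.100.7.443   ESTABLISHED
tcp4  0  0  192.168.1.5.52401  198.51.100.7.443   ESTABLISHED
tcp4  0  0  *.22               *.*                LISTEN
"""

for port, count in count_remote_ports(sample).items():
    print(f"{port}:{count}")
```

The output uses the same "port:count" form as the shell command, one pair per line.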

We also showed how you can visualize this information and use some classification methods to learn the way your computer uses ports (spectrum & pattern).

Visualizing the current pattern of your computer's port usage

The application

I set up a web application that gives you an idea of how the applications you are working with on your computer use TCP connection ports. You can go to the following link and download the script.

Blog Tools: Port usage class visualizer

Go to the above link and download the script. The script is based on what we have talked about in the last four posts; all it sends to the server is the remote port of each established connection and the corresponding count. Give the script execution permission (chmod +x train.sh), and it is ready to run on Mac OS and Linux. If you have Windows (or IPv6), you need to modify the script. By the way, the resolution for portId is 1000 and for valueRange is 20; this means ports 52321 and 52890 are both shown as L52, and a current count between 0 and 19 is shown as 1, 20 to 39 as 2, and so on.
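The two resolutions can be illustrated with a few lines of Python. This is a sketch of the arithmetic described above, not the application's actual code; the function and constant names are invented here, and any special handling of well-known ports is left out.

```python
PORT_RESOLUTION = 1000  # from the text: ports are grouped in blocks of 1000
VALUE_RANGE = 20        # from the text: counts are grouped in blocks of 20

def port_label(port):
    """Map a port to its block label, e.g. 52321 -> 'L52'."""
    return "L" + str(port // PORT_RESOLUTION)

def count_level(count):
    """Map a connection count to its level: 0-19 -> 1, 20-39 -> 2, ..."""
    return count // VALUE_RANGE + 1

print(port_label(52321), port_label(52890))  # L52 L52
print(count_level(0), count_level(19), count_level(20), count_level(39))  # 1 1 2 2
```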

Open a terminal and execute the following command to send 20 training samples to the server; the script then opens a web page to show the graph. The command amounts to training the first class; the application only lets you have two classes.

Modeling the port usage data

In the previous posts, we saw that the raw data from your computer comes to our classification program as below:

443:47,5223:2,5228:1, ...

Then we apply the following function to get the portId:

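// knownPorts is assumed to be an array of well-known port numbers defined elsewhere
function getPortId(port) {
  if (knownPorts.includes(port)) {
    // Known ports keep their exact number, prefixed with "R"
    return "R" + String(port);
  } else {
    // Other ports are grouped in blocks of 100, prefixed with "L"
    return "L" + String(Math.floor(port / 100));
  }
}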
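To show how the raw sample and the portId function fit together, here is a small Python sketch. The contents of `KNOWN_PORTS` and the decision to add up the counts of ports that land on the same portId are assumptions made for this example; the post does not state either of them.

```python
KNOWN_PORTS = {80, 443}  # assumption: the real list of known ports is not given here

def get_port_id(port, known_ports=KNOWN_PORTS, resolution=100):
    """Python counterpart of getPortId: 'R<port>' for known ports, 'L<block>' otherwise."""
    if port in known_ports:
        return "R" + str(port)
    return "L" + str(port // resolution)

def parse_sample(raw):
    """Parse 'port:count,port:count' text into a {portId: count} dictionary."""
    result = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        port_text, count_text = item.split(":")
        port_id = get_port_id(int(port_text))
        # Assumption: counts of ports sharing a portId are summed
        result[port_id] = result.get(port_id, 0) + int(count_text)
    return result

print(parse_sample("443:47,5223:2,5228:1"))  # {'R443': 47, 'L52': 3}
```

With a resolution of 100, ports 5223 and 5228 fall in the same block, which is why they end up under the single label L52.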

Mapping and changing the resolution of dataset

The human eye can process 10-15 images per second; it can handle higher rates too, but with lower precision.

Imagine you are sitting in a moving car and looking through the window. The best you can do is process 10-15 images per second, so you unquestionably miss much information and cannot take in many of the things that happen out there.

However, your brain does not let you feel the information gap; it constructs the best reality it can from the given information and the things it has learned through your past life.

That is partly why reality differs from one observer's point of view to another's. Someone may say, "I saw a tree at the corner of the street," while another may claim to have seen a wooden street light.

If we believe that, through evolution, the human brain does the best it can, we need to (or, at least, can) do the same in Machine Learning (ML) algorithms too, because the idea of ML itself comes from the way our brain works. When you see a bus on the road or a traffic sign, your brain does not process the given information as a raw bitmap; it works with edges, curves, colors, etc. It uses the features it thinks are best to understand reality better and faster.

Visualizing port usage data

Visual perception is the power of processing information we typically get from our eyes. Why typically? Because even if we cover our eyes, we can still get visual perception by touching objects, which shows how loosely this power depends on our eyes as sensors. In fact, as I have said before, we see with our brain, not with our eyes, and since the process that visualizes our environment is trained continuously throughout our life, it is one of our most powerful tools for understanding.

OK, to continue our previous talk, run the "netstat" command; it should show something like this:

$ netstat -an | grep "ESTABLISHED" | awk '{print $5}' | awk -F. '{print $NF}' | sort -n | uniq -c | awk '{print $2 ":" $1}'
80:1
443:24
5223:2
5223:1
5228:2

The importance of data in Machine Learning

Data for Machine Learning (ML) algorithms is like fuel for a car: no fuel, no movement. If you do not have access to enough data, it is better to think about using rule-based systems rather than ML systems. I've seen many ML projects fail not because of the algorithm or the implementation but mostly because of:

1- Not having enough data to train the system
2- Shallow understanding of the data

You can only expect an ML system to understand and respond properly based on the given data and algorithm, nothing more! I always wonder why people expect ML systems to work perfectly based on the tiny dataset they provide.

Python script to walk on Markov Chain

Suppose you have a simple system which has only four different states, and your observation shows the system changes its state like the following Markov Chain:

Sample Markov Chain to walk on

We want to write an application to simulate the system's state changes based on the above Markov Chain. Here you can also find an answer to one of the frequent questions, "Why do many people use Python in Artificial Intelligence?": you are going to see how easily we can do this job with just basic knowledge of Python, while writing the same application in Java or C++ could be a problem.
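A walk on a four-state chain can be sketched in a few lines. The transition probabilities below are hypothetical, because the real ones are only given in the figure; the state names A to D and the function name `walk` are also invented for this illustration.

```python
import random

# Hypothetical transition probabilities: state -> {next state: probability}
TRANSITIONS = {
    "A": {"A": 0.1, "B": 0.6, "C": 0.3},
    "B": {"C": 0.5, "D": 0.5},
    "C": {"A": 0.4, "D": 0.6},
    "D": {"A": 1.0},
}

def walk(start, steps, rng=random):
    """Return the list of states visited, beginning with the start state."""
    state = start
    path = [state]
    for _ in range(steps):
        next_states = list(TRANSITIONS[state])
        weights = list(TRANSITIONS[state].values())
        # Pick the next state according to the outgoing probabilities
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path

print(" -> ".join(walk("A", 10)))
```

Passing a seeded generator, for example `walk("A", 10, random.Random(42))`, makes a run repeatable, which helps when comparing the simulated frequencies with the chain.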

People do sin, no matter who they are!

If you check the meaning of the word "sin" in a dictionary, it is something like "transgression of the law of God". I am going to rephrase it as "violation of the law accepted by society" to talk about the nature of our decision making, or the way our brain works. It is obvious why I'm using the rephrased version, isn't it?

Scenario
Consider a small city in which the wealthiest family invites all the people to a party in their house. People know the family always keeps a vast amount of money and jewellery in one of the open rooms of the house. There are rumours that they do not even know how much money or jewellery is in that room, so if you take something from it, they are not going to find out. There is no CCTV or alarm system either. Mr X is our subject, who suddenly finds himself at the door of that room, and now we want to analyze his decision-making process.

Look at the following simple network, which shows the current state of Mr X and the options he can take or think about at the door.

Data Network of Mr X's status and the actions he can choose from.

Random walk on graph and collapse of Bayesian Network

Random Walk
Consider the following Bayesian Network or Markov Chain, which shows your walking direction. Every time you want to take a step, the network says you have an equal chance of walking towards the north (n) or the south (s).

Two-state Bayesian Network with equal probabilities for any state change

I've prepared a simple graph based on d3.js for 10,000 steps; you can see it by clicking on the following link. The horizontal axis can be considered as time and the vertical one as the direction to the north or the south. For the given link, the probabilities of taking a step toward the north or the south are equal. Try the redraw button to see what happens each time it redraws.
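If you would rather reproduce the walk yourself than use the d3.js page, the following Python sketch generates the same kind of data. The function name and the `p_north` parameter are illustrative; the linked page may be implemented differently.

```python
import random

def random_walk(steps, p_north=0.5, rng=random):
    """Return the position after every step; +1 is north, -1 is south."""
    position = 0
    positions = [position]
    for _ in range(steps):
        position += 1 if rng.random() < p_north else -1
        positions.append(position)
    return positions

positions = random_walk(10_000)
print("final position:", positions[-1])
print("furthest north:", max(positions), "furthest south:", min(positions))
```

Running it several times plays the role of the redraw button: each run gives a different path even though the probabilities never change.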

10,000 Random Walk / 50% up - 50% down, 100% forward ...

A sample result of the 50/50 random walk.

sleptons

Sunday, 27 March 2016

Information Gain as a measure of change in port usage distribution

Wednesday, 23 March 2016

A tool to monitor your computer's real-time port usage

Monday, 21 March 2016

Visualizing the current pattern of your computer's port usage

Friday, 18 March 2016

Modeling the port usage data

Wednesday, 16 March 2016

Mapping and changing the resolution of dataset

Monday, 14 March 2016

Visualizing port usage data

Saturday, 12 March 2016

The importance of data in Machine Learning

Monday, 7 March 2016

Python script to walk on Markov Chain

Saturday, 5 March 2016

People do sin, no matter who they are!

Wednesday, 2 March 2016

Random walk on graph and collapse of Bayesian Network