Friday, 7 April 2017

Time Series Anomaly Detection White Paper

Here is a white paper on time series anomaly detection, that is, on finding anomalies in the time series we encounter in almost every system. I usually keep notes when I work on projects, and this paper is based on my experience and the notes I took while working on anomaly detection systems. It is a practical guide to detecting anomalies in time series using artificial intelligence concepts.


Thursday, 21 April 2016

Free Port Monitoring Service

A friend and I have built a service based on this blog's recent posts on visualizing a computer's port distribution and monitoring its usage. It is basically like the tools we already had here, but with a better UI, easier to use, and with some more information. I do not know why, but the service is named "Puffin".

Clear as crystal
One of the problems with installing a client that sends data to a server in the cloud is that you usually don't know what information it sends. The same issue arises even with software that is not supposed to use the Internet at all. In Puffin, we use simple shell scripts to send data to Puffin's back-end service, so you can open the script in any text editor and check exactly what information it sends to the server.

Dashboard page of the service

Who is this service for?
Everybody who is curious about what his/her computer does while connected to a network or the Internet: mostly computer students or geeks, network administrators, technical support staff, or those who do not trust installed software and want to know what it is doing with the Internet or network connection. If you are a computer, software, network, ... geek, you don't need to read the rest of the post; test it here:

Wednesday, 13 April 2016

Bias and Variance in modeling

It is always important to remember why we do classification. We do it because we want to build a general model of our problem, not to model only the given training dataset. Sometimes when you finish training the system and look at your model, you see that not all of your training data fits the model; that does not necessarily mean that your model is wrong. You can also find many cases in which the model fits the training data very well but not the test data.

Which one of these three models describes the pattern of the given training set better?

Besides, never forget that we model the data because we do not know exactly what happens in the system. We do it because we cannot write a scientific, mathematical formula that describes the system we are observing. So we should not expect our model to describe the system completely. Why? Because we have modeled the system from just a small fraction of the dataset space.
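To make the trade-off concrete, here is a minimal sketch that fits three models to the same training set and compares their errors on the training data and on a second sample. Everything in it is an assumption made up for illustration: the "true" system (y = 2x + 1), the noise pattern, and the three models (a constant, a straight line, and a model that simply memorizes the training points). It is not the data or the models from the figure above.

```python
# Toy illustration of under-fitting, a reasonable fit, and over-fitting.
# The "true" system (y = 2x + 1) and the noise pattern are invented for this sketch.

def mse(model, points):
    """Mean squared error of model(x) against the observed y values."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

def fit_constant(points):
    """High bias: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y

def fit_line(points):
    """Ordinary least-squares straight line (closed form)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(points):
    """High variance: return the target of the nearest training point."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

# Training sample: the true line plus alternating noise of +2 / -2.
noise = [2 if i % 2 == 0 else -2 for i in range(10)]
train = [(x, 2 * x + 1 + noise[x]) for x in range(10)]
# Second sample from the same system, with the noise signs flipped.
test = [(x, 2 * x + 1 - noise[x]) for x in range(10)]

models = {
    "constant": fit_constant(train),
    "line": fit_line(train),
    "memorizer": fit_memorizer(train),
}
errors = {name: (mse(m, train), mse(m, test)) for name, m in models.items()}

for name, (train_err, test_err) in errors.items():
    print(f"{name:10s} train MSE = {train_err:6.2f}   test MSE = {test_err:6.2f}")
```

The memorizer fits every training point exactly yet does worse than the straight line on the second sample, while the constant model is poor on both. The straight line does not pass through every training point, and that is not a sign that it is wrong.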

Friday, 8 April 2016

Entropy pattern of network traffic usage

I am working on a new project; it is about recognizing usage patterns in network traffic from as little data as possible. Usually, you need to do deep packet inspection (DPI) or look for signatures, ... These techniques all work well, but they require access to low-level traffic (not everybody likes to hand over low-level traffic for analysis) and are also very CPU intensive, especially when you are dealing with a large volume of traffic. I think using these techniques is like recognizing your friend by his fingerprint or DNA! However, you, as a human being, can recognize your friend even after 10 or 20 years without these techniques!

Entropy distribution of Port and IP of the TCP connections

We never make a precise and accurate measurement to recognize a friend, a pattern, or an object. We usually do some rough calculation or comparison, but over many features. DNA analysis to identify people is a precise and accurate measurement, while looking at your friend's face, body, hair, and even his clothing style is just a set of simple comparisons whose results you summarize to get the information you need: is he your friend or not?
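As a rough illustration of the kind of cheap, coarse feature this project relies on, the sketch below computes the Shannon entropy of the remote ports seen in a set of connections. The function name and the two sample port lists are hypothetical; the same function could be applied to remote IPs. How the project itself builds its entropy distributions is not shown in this excerpt, so treat this only as one plausible reading.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    # p * log2(1/p), summed over every distinct value
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical samples of remote ports for established TCP connections.
single_service_ports = [443] * 8                              # everything goes to one port
mixed_ports = [443, 80, 22, 53, 993, 5222, 8080, 3306]        # eight different ports

print(shannon_entropy(single_service_ports))  # lowest possible: no uncertainty
print(shannon_entropy(mixed_ports))           # highest possible for eight connections
```

Low entropy means the traffic concentrates on a few ports, high entropy means it is spread over many. Only connection metadata is needed, not packet contents.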

Sunday, 27 March 2016

Information Gain as a measure of change in port usage distribution

We saw that we could use Bayes' Theorem and Machine Learning methods to catch changes in your computer's port usage. There is another, complementary way to find changes in port usage. I say complementary because you cannot firmly conclude that something wrong is going on if you do not have enough evidence. So the idea is to gather sufficient information to help us get a sense of the usage pattern.

Port usage distribution
Remember the post "Visualizing port usage data" and how we drew a graph of the computer's destination ports. Here are three separate samples I captured from my computer with the tool provided in that post:

Three different port usage spectra

The question is: how can we define a measure that shows the difference between the spectrum snapshots, or distributions, we capture at different steps?
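One possible measure, sketched here under assumptions, is the Kullback-Leibler divergence between two snapshots, which is a common way to define information gain. Whether this is exactly the formula the full post uses is an assumption; the function names, the smoothing constant, and the three sample snapshots are hypothetical and are not the captured samples mentioned above.

```python
import math

def to_distribution(counts, ports, smoothing=1e-6):
    """Turn {port: count} into probabilities over `ports`.

    A tiny smoothing value keeps unseen ports from producing log(0).
    """
    total = sum(counts.get(p, 0) + smoothing for p in ports)
    return {p: (counts.get(p, 0) + smoothing) / total for p in ports}

def kl_divergence(new_counts, old_counts):
    """KL divergence D(new || old) in bits between two port-count snapshots."""
    ports = set(new_counts) | set(old_counts)
    new = to_distribution(new_counts, ports)
    old = to_distribution(old_counts, ports)
    return sum(new[p] * math.log2(new[p] / old[p]) for p in ports)

# Hypothetical snapshots: {remote port: number of established connections}.
snapshot_a = {443: 12, 80: 3, 993: 1}
snapshot_b = {443: 11, 80: 4, 993: 1}   # nearly the same usage as A
snapshot_c = {443: 2, 6667: 14}         # a port that A never used dominates

print(kl_divergence(snapshot_b, snapshot_a))  # small: little has changed
print(kl_divergence(snapshot_c, snapshot_a))  # large: the distribution shifted
```

A value near zero says the new snapshot looks like the old one. A large value is only a hint that usage has changed, and it still has to be weighed against other evidence.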

Wednesday, 23 March 2016

A tool to monitor your computer's real-time port usage

Screen snapshot of the tool
In the last five posts, we talked about how important data is in Machine Learning algorithms and introduced a source of data that every one of us who uses a computer and the Internet has access to: the port information of our computer's connections to the Internet or the attached network. That is a good source of data because:
  • The port information does not contain data about the targets or hosts you work with, so you do not give us information about your hosts.
  • It is steady and always available.

We also introduced a simple script which gathers the information and sends it to the server. This script is the base of our data collection, and you can run it on Linux or Mac OS (on Linux you just have to change the "-F." to "-F:"; it may need some more changes to work on Windows too):

netstat -an |  grep "ESTABLISHED"  | awk '{print  $5}'  | awk -F. '{print $NF}' |  sort -n |  uniq -c | awk '{print $2 ":" $1}'
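For readers who prefer to post-process captured netstat output, the following is a rough Python equivalent of the pipeline above. It is a sketch, not the blog's script: the function name is hypothetical, the sample lines use made-up documentation addresses, and it assumes the foreign address is the fifth whitespace-separated column, as on Mac OS and Linux. It accepts both the "." and ":" port separators.

```python
from collections import Counter

def port_counts(netstat_output):
    """Count established connections per remote port from `netstat -an` text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        if "ESTABLISHED" not in line:
            continue
        fields = line.split()
        if len(fields) < 5:
            continue
        remote = fields[4]  # same column as awk's $5 (foreign address)
        # The port follows the last '.' (Mac OS) or the last ':' (Linux).
        port = remote.replace(":", ".").rsplit(".", 1)[-1]
        if port.isdigit():
            counts[int(port)] += 1
    return counts

# Made-up sample lines in Mac OS and Linux styles.
sample = """\
tcp4       0      0  192.0.2.10.52321       198.51.100.34.443      ESTABLISHED
tcp4       0      0  192.0.2.10.52322       198.51.100.34.443      ESTABLISHED
tcp        0      0 192.0.2.10:40022        203.0.113.7:22          ESTABLISHED
tcp4       0      0  *.631                  *.*                    LISTEN
"""

# Same "port:count" format the shell pipeline prints.
for port, count in sorted(port_counts(sample).items()):
    print(f"{port}:{count}")
```

Windows prints four columns for TCP lines, so the column index would need to change there.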

We also showed how you can visualize this information and use some classification methods to learn the way your computer uses ports (spectrum & pattern).

Monday, 21 March 2016

Visualizing the current pattern of your computer's port usage

The application
I set up a web application which gives you an idea of how the applications you work with on your computer use TCP connection ports. You can go to the following link and download the script.

Blog Tools: Port usage class visualizer

The script is based on what we have talked about in the last four posts; all it sends to the server is the established connections' remote ports and the corresponding counts. Give the script execution permission (chmod +x) and it is ready to run on Mac OS and Linux. If you have Windows (or IPv6), you need to modify the script. By the way, the resolution for portId is 1000 and for valueRange is 20; this means ports 52321 and 52890 are both shown as L52, and a current port count between 0 and 19 is shown as 1, 20 to 39 as 2, and so on.
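The two resolutions can be sketched as plain integer division. The constant and function names below are hypothetical (the web application's own code is not shown here); only the resolutions 1000 and 20 and the example values come from the text above.

```python
PORT_RESOLUTION = 1000   # resolution for portId, as stated above
VALUE_RESOLUTION = 20    # resolution for valueRange, as stated above

def port_label(port):
    """Group ports into bands of 1000: 52000..52999 -> 'L52'."""
    return f"L{port // PORT_RESOLUTION}"

def value_bucket(count):
    """Group counts into bands of 20: 0..19 -> 1, 20..39 -> 2, ..."""
    return count // VALUE_RESOLUTION + 1

for port in (52321, 52890, 443):
    print(port, "->", port_label(port))

for count in (0, 19, 20, 39, 40):
    print(count, "->", value_bucket(count))
```

Coarse buckets like these trade detail for stability: small fluctuations in ephemeral port numbers or connection counts map to the same label and value.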

Open a terminal and execute the following command to send 20 training samples to the server; the script then calls a web page to show the graph. The command trains the first class; the application only lets you have two classes.