## Friday, 8 April 2016

### Entropy pattern of network traffic usage

I am working on a new project about recognizing usage patterns in network traffic using as little data as possible. Usually you need to do DPI (deep packet inspection) or look for signatures. These techniques work well, but they require access to low-level traffic (not everybody likes to hand over low-level traffic for analysis) and they are very CPU intensive, especially when you are dealing with a large volume of traffic. I think using these techniques is like recognizing your friend by his fingerprint or DNA! However, you, as a human being, can recognize your friend even after 10 or 20 years without these techniques!

[Figure: Entropy distribution of Port and IP of the TCP connections]

We never make a precise and accurate measurement to recognize a friend, a pattern, or an object. We usually make rough calculations or comparisons, but over many features. DNA analysis to identify people is a precise and accurate measurement, while looking at your friend's face, body, hair, and even his clothing style is just a set of simple comparisons whose results you summarize to get the answer you need: is he your friend or not?

We can use the same idea to recognize patterns in other phenomena and complex systems too. For example, in computer networking, suppose you can somehow see the traffic passing to or from a computer. The question is, "Can you guess what the computer is doing at the moment?" Before continuing with this topic, I am going to talk about something a bit strange.

Mathematics and Boolean Logic are valid only in our minds
If you ask me, they do not exist in our real world. You can never be 100% sure about almost anything. I do not want to open this discussion here, but consider that even two Hydrogen atoms are not 100% equal, because who knows whether their electrons are at the same position relative to their protons or not. We could also bring in Heisenberg's Uncertainty Principle and so many other ideas, but let us talk about them later. From the mathematical point of view, you can always state or prove A = B, but only on a piece of paper or in your mind. In contrast, when you see someone from a distance and recognize him as your friend, how can you be 100% sure he is your friend? The closer you get to him, the surer you become of your recognition. At some point you may reach 99.9999% certainty that he is your old friend, but it never reaches 100%, even when you start talking to him about your shared past or you both go for lunch; there is still a small chance he is not your friend.

Network Traffic
So to find the pattern of some network traffic we can do the same: compare some features with already known patterns, see how closely these features match the stored patterns, and then come to a firm result. For network traffic, consider features like traffic volume, ports in use, session count, average session duration, etc. All of these features are like having information about body size, hair style, clothing style, nose size and shape, eyebrows, eyes, etc. Calculating the probability of each of these features, given (or against) the specific patterns we already know, gives the firm result we are looking for.

Entropy pattern of network traffic
Suppose you collect the remote IP address and Port of the established TCP connections every second and calculate the entropy (Shannon entropy in bits, i.e. with a base-2 logarithm) of the Port and IP distributions. It turns out that there is a clear correlation between what you are doing and the values of these entropies. Here are my test results on my MacBook, written as (Port Entropy , IP Entropy).
1. Idle situation: While no running application was using the Internet, the entropy was (0.0 , 1.0). That is correct because macOS usually has two connections to some remote hosts even when you do not use the Internet. These connections go to a single Port but different IPs, which gives (0.0 , 1.0): no uncertainty for the Port and one bit of uncertainty (a 50/50 choice between two addresses) for the IP.
2. Googling: When I opened a browser and started searching in both the web and image sections, using around 10 tabs, the average entropy was (0.9222 , 3.8173). The explanation is that the browser opens many connections to Google to support the open tabs. Google serves on Port 443, and since we already had some connections on another Port, the Port entropy should be somewhere between 0 and 1. Moreover, an IP entropy of almost 4 shows I had around 16 connections to different IPs to support this task, since 2^4 = 16.
3. YouTube: Working with YouTube, with around 5 tabs open, gives an average of (0.4235 , 4.7071).
4. Twitter: Working with Twitter and opening pages gives an average of (0.4045 , 4.6060), which is near YouTube, but if you look at the distribution you will see it is different.
5. Netflix: If you open a browser and just watch a movie on Netflix, it gives an average of (0.6481 , 3.8458).
6. Torrent: If you download just a single torrent file, you see an entirely new pattern, with an average of (2.1711 , 2.5264). It shows that Torrent uses a different Port for almost every connection, so the entropies for Port and IP are almost the same. The point, however, is that the average and the distribution pattern are dramatically different from the others.
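
To make the numbers above concrete, here is a minimal Python sketch of the calculation, assuming Shannon entropy in bits over the observed remote Ports and IPs. The function names and the sample addresses are my own placeholders (documentation-range IPs), not the data from the actual test.

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no connections at all: nothing to be uncertain about
    return sum((c / total) * log2(total / c) for c in counts.values())

def connection_entropies(connections):
    """Return (Port entropy, IP entropy) for a list of (remote_ip, remote_port)."""
    ports = [port for _, port in connections]
    ips = [ip for ip, _ in connections]
    return shannon_entropy(ports), shannon_entropy(ips)

# Hypothetical idle state: two connections, same Port, different IPs.
idle = [("192.0.2.1", 443), ("192.0.2.2", 443)]
print(connection_entropies(idle))  # (0.0, 1.0)

# Hypothetical set of three connections to three IPs, two of them on one Port.
three_hosts = [("192.0.2.1", 443), ("192.0.2.2", 443), ("192.0.2.3", 80)]
port_h, ip_h = connection_entropies(three_hosts)
print(round(port_h, 4), round(ip_h, 4))  # 0.9183 1.585

# 2 ** H is the number of equally used values that would give entropy H,
# which is how an IP entropy close to 4 translates to roughly 16 distinct IPs.
print(2 ** 4.0)  # 16.0
```

The second example happens to reproduce the pair (0.9183 , 1.5850), which shows how a handful of connections is enough to move both values away from the idle pattern.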

I ran the sampling process for about 3 minutes for each of the above scenarios and calculated the entropy of Port and IP for every sample, as I described before. As you see in the above image, the distributions are different and reasonably recognizable, so we can count on them as a feature for recognizing traffic patterns.
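
The sampling loop could look something like the sketch below. It is only one possible way to do it: it assumes the third-party `psutil` package for listing connections (on macOS `psutil.net_connections` needs root privileges), and all function names here are hypothetical. The collector is passed in as a parameter so another source, such as parsed `netstat` output, can be swapped in.

```python
import time
from collections import Counter
from math import log2
from statistics import mean

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(c / total * log2(total / c) for c in counts.values()) if total else 0.0

def established_tcp_peers():
    """Remote (ip, port) pairs of the established TCP connections.

    Assumption: relies on psutil, imported lazily so the rest of the
    sketch works without it.
    """
    import psutil
    return [(c.raddr.ip, c.raddr.port)
            for c in psutil.net_connections(kind="tcp")
            if c.status == psutil.CONN_ESTABLISHED and c.raddr]

def sample_entropies(get_peers=established_tcp_peers, samples=180, interval=1.0):
    """Take one (Port entropy, IP entropy) reading per interval."""
    readings = []
    for _ in range(samples):
        peers = get_peers()
        readings.append((entropy([port for _, port in peers]),
                         entropy([ip for ip, _ in peers])))
        time.sleep(interval)
    return readings

def average(readings):
    """Average the Port and IP entropies over all readings."""
    return mean(r[0] for r in readings), mean(r[1] for r in readings)
```

Calling `average(sample_entropies())` while doing one activity would then give a single (Port Entropy , IP Entropy) pair for a 3-minute run at one sample per second, and the raw readings can be kept to plot the distribution.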

An important question and Result
If you have ever seen the output of an Anomaly Detection System or a SIEM (Security Information and Event Management), you have usually seen many logs and alarms. It does not matter who has written the software, you, HP, or anybody else; all of them give you many logs and alarms, and people mostly complain about them.

The boss comes to the office and says, "What is this? Why does it show alarms? I do not see any problem in the network. And ..." He is right, there is no problem in the network, but the software still shows tons of alarms. The main reason these kinds of software report so many alarms is that they only look at the network (or any other system) from a narrow angle of view, so everything they report is based only on the limited data they have. (An analogy is to think of the whole recognition system as a single perceptron: some of its inputs are screaming that this is the pattern you are looking for, but since the other inputs do not agree, the activation function does not fire and the pattern does not get recognized.)

Back to our example: if we train our system with these six different patterns, we can calculate how similar the given traffic is to each of them. For example, the current traffic may look 10% like Netflix, 60% like Googling, 15% like YouTube, 20% like Twitter, and 2% like Torrent. If your software does only this analysis with the given data, it reports that "The user is mostly using Google." It may be correct or it may be wrong; it is like covering the whole body of your friend, showing you only his hands, and asking you whether this is your friend. You need more features to examine to get a better result; otherwise, you have to live with the probabilities or go and examine his fingerprint.
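
As a rough illustration of such a comparison, the sketch below scores a sample against the six averages measured above. The similarity measure (inverse Euclidean distance, normalized so the shares sum to 1) is an assumption I picked for simplicity, not a trained model, and the function name is hypothetical.

```python
import math

# Average (Port entropy, IP entropy) per activity, from the tests above.
KNOWN_PATTERNS = {
    "Idle": (0.0, 1.0),
    "Googling": (0.9222, 3.8173),
    "YouTube": (0.4235, 4.7071),
    "Twitter": (0.4045, 4.6060),
    "Netflix": (0.6481, 3.8458),
    "Torrent": (2.1711, 2.5264),
}

def similarity_scores(sample, patterns=KNOWN_PATTERNS, eps=1e-9):
    """Return {pattern name: share}, with the shares summing to 1.

    Assumption: similarity is the inverse of the Euclidean distance
    between the sample and each stored average.
    """
    weights = {name: 1.0 / (math.dist(sample, centre) + eps)
               for name, centre in patterns.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Hypothetical reading of (Port entropy, IP entropy).
scores = similarity_scores((0.9, 3.8))
for name, share in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:9s} {share:6.1%}")
```

With only these two features, a reading that falls between the YouTube and Twitter averages scores almost equally for both, which is exactly the "only his hands" situation: the scores are honest, but more features are needed to separate the two.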

Browser comparison
I also ran a simple test and found that among Chrome, Safari, Vivaldi, and Opera, only Opera does nothing suspicious when you open it: its values match the idle pattern. Look below for the entropies measured while just the browser is open; I have no extensions installed on any of them.

Opera:    (0.0000 , 1.0000)
Vivaldi:  (0.9183 , 1.5850)
Safari:   (1.5485 , 2.4056)
Chrome:   (0.4428 , 3.7868)