## Tuesday, 25 November 2014

### Entropy as a measure for anomaly detection

Let's talk a bit more about anomaly detection and then leave it for a while. There are many ways and methods we could talk about; these are just some easy, practical solutions that really do work. I would also like to talk about some practical aspects of software engineering, so this will be my final post on anomalies for a while.

OK, first of all, what is entropy? There are many definitions, and in the end they all mean more or less the same thing. Even the entropy of thermodynamics has essentially the same meaning as the entropy we have in information theory. If you go deep enough to get a real feel for what it means, you'll find that the way entropy describes the behavior of the molecules of a gas is not that different from the way we can use it to describe the behavior of, for example, the volume of network traffic.

Anyway, entropy is a measure of the amount of randomness (disorder) in a system. It can be interpreted as a measure of the number of ways the system can be arranged, or equivalently of the number of possible states we can see in the system. You may already have seen Boltzmann's entropy formula, which looks like this:

S = k · ln W

Here W is the number of states and k is Boltzmann's constant. Changing the base of the logarithm only changes the constant factor, so you can simplify the formula to a proportionality like one of the following:

S ∝ log10 W   or   S ∝ log2 W

Log2 makes more sense to programmers. If we have a closed system that can take 100 distinct states, its entropy is proportional to log2(100), which is about 6.64, or 7 when rounded up. What does this mean? If we forget about the coefficient k, it says that to identify which of 100 different states the system is in we need just 7 bits of information, and that number can be read as a measure of disorder for the system. The system gets more disordered as the entropy goes up, that is, as the number of bits increases.
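To make the arithmetic concrete, here is a minimal Python sketch of the simplified formula. The function name `state_entropy_bits` is just something I made up for illustration, and it assumes the coefficient k is dropped and every state is equally likely.

```python
import math


def state_entropy_bits(num_states: int) -> float:
    """Entropy in bits of a system with `num_states` equally likely states.

    This is the simplified log2(W) form with the coefficient k dropped.
    """
    if num_states < 1:
        raise ValueError("a system needs at least one state")
    return math.log2(num_states)


bits = state_entropy_bits(100)
print(round(bits, 2))   # 6.64
print(math.ceil(bits))  # 7 whole bits are enough to label 100 states
```

Rounding up matters only when you want a whole number of bits; for comparing entropy values over time the fractional value is fine.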

How does the entropy of a system help us find anomalies?
It is easy. It doesn't matter what number you get when you calculate the entropy of a system. What matters is that, after calculating it for a number of normal situations and finding the average entropy and the margins within which it varies, you can compute it again at any time and check whether the new value falls within the acceptable range.
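The check described above could be sketched roughly like this in Python. Note the assumptions: the post only talks about "the average and the margins", so using the mean plus or minus three standard deviations is my own choice here, and the names `acceptance_range`, `is_anomalous` and the sample values are invented for the example.

```python
from statistics import mean, stdev


def acceptance_range(baseline, width=3.0):
    """Return (low, high) bounds around the mean of the baseline entropies.

    Assumption: the margin is `width` standard deviations; any other
    definition of "margin" would work the same way.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline measurements")
    center = mean(baseline)
    margin = width * stdev(baseline)
    return center - margin, center + margin


def is_anomalous(entropy, baseline, width=3.0):
    """True if `entropy` falls outside the range learned from `baseline`."""
    low, high = acceptance_range(baseline, width)
    return not (low <= entropy <= high)


# Made-up hourly entropy values observed under normal conditions.
normal_hours = [12.9, 13.1, 13.0, 12.8, 13.2, 13.0]

print(is_anomalous(13.05, normal_hours))  # False
print(is_anomalous(5.0, normal_hours))    # True
```

The same check flags values that are too high as well as too low, since both mean the system has left its usual behavior.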

The wonderful thing is that even if the entropy is usually high and you suddenly find it low, that can be a sign of an anomaly too. For example, say you have a 100 Mbps Internet connection and every hour you compute the measure of disorder for the source IP addresses. Assume it is about 12.3 under normal conditions (13 bits when rounded up), which corresponds to around 5,000 different source IP addresses. Now if in some hour you see a value of 5, it shows that during that hour you had only around 30 source IPs (2^5 = 32), while you usually have 5,000.
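The numbers in this example can be checked by going back and forth between a count of sources and its entropy in bits. This is only a sketch under the same simplification as before, namely that every source IP is treated as equally likely; the helper names are hypothetical.

```python
import math


def bits_for_sources(num_sources: int) -> float:
    """log2 of the number of distinct sources (equally likely sources assumed)."""
    return math.log2(num_sources)


def sources_for_bits(bits: float) -> float:
    """Inverse of bits_for_sources: how many sources a given entropy implies."""
    return 2 ** bits


print(round(bits_for_sources(5000), 2))   # 12.29
print(math.ceil(bits_for_sources(5000)))  # 13
print(sources_for_bits(5))                # 32
```

So a drop from 13 to 5 is not a small change: it means the number of distinct sources shrank by a factor of more than a hundred.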

Although you cannot find the reason directly from this number alone, you can be sure that the situation may be an anomaly. The reason could be a DDoS attack if the bandwidth usage hasn't changed, a hardware problem if the bandwidth usage has dropped, or simply some maintenance that happened during that hour.

Later I'll talk about how we can calculate the entropy of the information we get from our environment, for environments that we can treat as closed systems.