Port usage distribution
Remember the post "Visualizing port usage data", where we drew a graph of a computer's destination ports? Here are three separate samples I captured from my computer with the tool provided in that post:
[Figure: three different port usage spectra]
The question is: how can we define a measure that shows how much the distribution changes between the moments we capture these spectrum snapshots?
Entropy
Entropy is a good measure of change in the dynamics of almost any system. As you know, it expresses the degree of disorder or uncertainty in the physical or information world. In the physical world, for example, the entropy of a gas is higher than that of a liquid, and the entropy of a liquid is higher than that of a solid. The same holds in the information world: if you toss a coin, the outcome is either heads or tails, so the entropy is lower than when you roll a die, which has six different outcomes. As you see, the larger the entropy, the greater the uncertainty, and vice versa.
Looking at the coin example from Shannon's point of view, the two possible outcomes give you a system with two states, which requires just one bit of information to define the state. For the die example, you need Log2(6), or about 2.58 bits, of information to specify the state of the system.
Briefly, in Shannon's information theory, if you have a system with states (or an experiment with outcomes) s1, s2, ..., sn, the following formula gives the entropy of the system:

H = - ∑ P(si) Log2(P(si))
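To make the formula concrete, here is a minimal Python sketch (not part of the original tool; the function name and the example counts are my own) that computes the entropy of a port-usage histogram:

```python
import math
from collections import Counter

def entropy(port_counts):
    """Shannon entropy (in bits) of a port-usage histogram.

    port_counts maps a destination port (or port class) to how many
    times it was observed in one snapshot.
    """
    total = sum(port_counts.values())
    h = 0.0
    for count in port_counts.values():
        if count == 0:
            continue                 # empty classes contribute nothing
        p = count / total
        h -= p * math.log2(p)
    return h

# Illustrative snapshot of destination ports
snapshot = Counter({443: 120, 80: 35, 53: 20, 22: 5})
print(entropy(snapshot))             # ≈ 1.35 bits for this example
```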
Entropy by itself is a good measure of the status of the port distribution, but what helps us more is a way to see how much the distribution changes when we get a new dataset.
Information Gain
Information gain is, in fact, nothing but the "Kullback–Leibler divergence", a measure of the difference between two distributions of a system, and that is exactly what we want. So in our problem, if we capture a set of port usages at t1 and call it PortUsage1, and then at t2 capture a new set PortUsage2, the information gain shows how much information we have lost or gained during this transition. Note that PortUsage1 and PortUsage2 are two different samples of the port usage distribution, observed at t1 and t2.
The KL divergence shows how much information we have gained if we take PortUsage1 as the prior distribution and PortUsage2 as the posterior distribution. We write it as below:
DKL(PortUsage2 ‖ PortUsage1) = ∑ PortUsage2(i) Log2(PortUsage2(i) / PortUsage1(i))
Note that in the above, i is the class indicator of the ports, and DKL is the divergence of PortUsage2 from the reference distribution PortUsage1. Whenever PortUsage2(i) is zero, we treat the whole term as zero (it is easy to prove that the limit of x·log(x) as x approaches zero is zero). Also, as the formula shows, the gain is neither symmetric nor does it satisfy the triangle inequality, so you cannot think of it as a measure of distance.
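If you want to reproduce the calculation yourself, a minimal Python sketch could look like the following. The function name and the small epsilon used for ports missing from the prior sample are my own choices, not part of the original tool; it follows the zero-term convention described above:

```python
import math

def information_gain(port_usage2, port_usage1):
    """D_KL(PortUsage2 || PortUsage1) in bits.

    Both arguments map a port class to its observed count. Terms where
    PortUsage2(i) is zero are skipped (x·log(x) -> 0 as x -> 0); a tiny
    epsilon guards ports unseen in the prior sample so the ratio stays finite.
    """
    eps = 1e-12
    total1 = sum(port_usage1.values()) or 1
    total2 = sum(port_usage2.values()) or 1
    gain = 0.0
    for port in set(port_usage1) | set(port_usage2):
        p2 = port_usage2.get(port, 0) / total2
        if p2 == 0:
            continue                          # whole term treated as zero
        p1 = port_usage1.get(port, 0) / total1 + eps
        gain += p2 * math.log2(p2 / p1)
    return gain
```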
The beautiful thing about DKL is that it is strictly positive unless PortUsage2 is exactly the same as PortUsage1, in which case it is zero. That is because when you get a new sample or dataset, it either tells you nothing, so the gain is zero, or it improves your understanding of the distribution.
Visualizing the Information Gain
For our previous project, I added a new module that calculates the instant information gain, that is, the divergence of the distribution each time a new dataset arrives. You can test the previous port monitoring application and see how the Information Gain graph catches your instant port usage changes. To test it, go to the page "Online port usage analysis" and run the script. Then, while watching the Information Gain graph, use your browser and other Internet applications, or stop using them.
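For a rough picture of what such a module does, here is a hedged sketch of the loop, reusing the information_gain function sketched above. It assumes psutil for capturing destination ports; the capture code in the original script may work differently, so treat this only as an illustration:

```python
import time
from collections import Counter

import psutil

def capture_port_counts():
    """Count destination ports of the current outbound connections (illustrative)."""
    counts = Counter()
    for conn in psutil.net_connections(kind="inet"):
        if conn.raddr:                        # skip sockets with no remote end
            counts[conn.raddr.port] += 1
    return counts

def monitor(interval_seconds=10):
    previous = capture_port_counts()          # first snapshot acts as the prior
    while True:
        time.sleep(interval_seconds)
        current = capture_port_counts()       # the new snapshot is the posterior
        gain = information_gain(current, previous)
        print(f"instant information gain: {gain:.4f} bits")
        previous = current                    # the posterior becomes the next prior
```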