## Friday, 18 March 2016

### Modeling the port usage data

In the previous posts, we saw that the raw data from your computer arrives at our classification program as below:

443:47,5223:2,5228:1, ...

Then we apply the following function to get the portId:

function String getPortId(port) {
if (knownPorts.contains(port)) {
return "R" + String(port);
} else {
return "L" + String(Integer(port/100));
}
}

So we get the following result:

R443:47,L52:2,L52:1, ...

Now we can sum the counts for each established portId and get the aggregation below:

R443:47,L52:3, ...
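The two steps above can be sketched in Python as follows. This is only a rough illustration: the real list of well-known ports is not shown in this post, so `KNOWN_PORTS` is an assumption (80 and 443 are picked because R80 and R443 appear in the examples), and the names `get_port_id` and `aggregate` are made up for the sketch.

```python
from collections import Counter

# Assumption: the actual set of well-known ports is not given in the post.
KNOWN_PORTS = {80, 443}

def get_port_id(port: int) -> str:
    """Return 'R<port>' for a well-known port, else the group id 'L<port // 100>'."""
    if port in KNOWN_PORTS:
        return f"R{port}"
    return f"L{port // 100}"

def aggregate(raw: str) -> Counter:
    """Parse 'port:count,port:count' and sum the counts per portId."""
    totals = Counter()
    for item in raw.split(","):
        port, count = item.split(":")
        totals[get_port_id(int(port))] += int(count)
    return totals

totals = aggregate("443:47,5223:2,5228:1")
print(dict(totals))  # {'R443': 47, 'L52': 3}
```

Ports 5223 and 5228 both fall into the L52 group, which is why their counts are added together.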

As we discussed in one of the previous three posts, a count of 47 and a count of 45 for a logical port mean practically the same thing, so we apply another transformation, shown below, to the latest transformed data:

function String getPortCount(count) {
return floor(count / 10) + 1;
}

to get the summary result below:

R443:5,L52:1, ...
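As a quick check of the bucketing step, here is a small Python sketch. It assumes the aggregated counts are already held in a dictionary; the name `get_port_count` simply mirrors the function above.

```python
def get_port_count(count: int) -> int:
    """Bucket a raw count into a range value: 0-9 -> 1, 10-19 -> 2, and so on."""
    return count // 10 + 1

aggregated = {"R443": 47, "L52": 3}  # the aggregation shown earlier
summary = {port_id: get_port_count(c) for port_id, c in aggregated.items()}
print(summary)  # {'R443': 5, 'L52': 1}
```

Counts of 45 and 47 both land in range 5, which is exactly the grouping we want.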

Now we have our data ready enough to build the model.

### The Model

The model is based on the probabilities of observed results, so we are going to build a probability model for each class from the training datasets you send for it. It is like what we do for a "bag of words", but here things are a bit more complicated. We do not count every portId occurrence individually. Just as a portId may represent a group of ports (L52 is [5200..5299]), the value itself represents a group of values. In fact, for each portId, we are going to calculate the probabilities of its different value ranges appearing in the training sets. To understand it better, look at the following image:

[Image: Observation matrix for each class]

We simply build a score matrix that shows which value ranges each portId can have for a given class. For the class above, the table shows that a portId like R80 most likely has a value in range 2, then range 1, and never range 3, 4, 5, or 6. This means that if a test dataset has 4 as the range value for the R80 portId, it is something we have not observed in our training datasets, so it should lower the total probability that the test dataset belongs to this class. However, if its R80 value is 2, it increases the likelihood of belonging to this class.

So all we need to do is build the above matrix for every class and increment the proper counters in the boxes while parsing the training datasets. After the training process finishes, these matrices model our classes, and we can use Bayes' theorem to calculate the required probabilities.

Suppose we have two classes, C1 and C2, and training datasets like the following:

D1 = { (d11, C1), (d12, C1), (d13, C1), ... , (d1m, C1) }
D2 = { (d21, C2), (d22, C2), (d23, C2), ... , (d2n, C2) }

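in which the d elements are our transformed training datasets, each one a set of pairs; here p and v are the portId and the range value:

d1i = { (p, v) | p is a portId seen in the i-th training dataset of C1 and v is its range value },  1 <= i <= m
d2i = { (p, v) | p is a portId seen in the i-th training dataset of C2 and v is its range value },  1 <= i <= n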

Now, based on what we discussed, we can define the observation matrices, where row i stands for the portId pi and column j for the range value vj:

OM1 = { aij | aij = number of times the pair (pi, vj) occurs over all datasets d1k in D1 }
OM2 = { aij | aij = number of times the pair (pi, vj) occurs over all datasets d2k in D2 }

Alternatively, we can write each matrix in terms of a single function that accepts a portId and a range value and returns the count:

OM1 = { aij | aij = om1count(pi, vj), counted over all pairs in all d1k in D1 }
OM2 = { aij | aij = om2count(pi, vj), counted over all pairs in all d2k in D2 }
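To make the counting concrete, here is a minimal Python sketch of building one observation matrix. It assumes each transformed training dataset is a dictionary from portId to range value; the training data `D1` and the function name `build_observation_matrix` are hypothetical and only serve the illustration.

```python
from collections import Counter

def build_observation_matrix(datasets):
    """Count how often each (portId, rangeValue) pair occurs across one class's datasets."""
    om = Counter()
    for dataset in datasets:
        for port_id, range_value in dataset.items():
            om[(port_id, range_value)] += 1
    return om

# Hypothetical transformed training datasets for class C1.
D1 = [
    {"R443": 5, "R80": 2},
    {"R443": 5, "R80": 1, "L52": 1},
    {"R443": 4, "R80": 2},
]

om1 = build_observation_matrix(D1)
print(om1[("R80", 2)])  # 2
print(om1[("R80", 3)])  # 0, never observed
```

A `Counter` keyed by the pair plays the role of the matrix here: cells that were never incremented simply read as zero.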

And we can have the probability matrices as below:

PM1 = { aij | aij = pm1(pi, vj) = om1count(pi, vj) / (sum of all elements in OM1) }
PM2 = { aij | aij = pm2(pi, vj) = om2count(pi, vj) / (sum of all elements in OM2) }
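Turning counts into probabilities is a single division per cell. The sketch below assumes an observation matrix stored as a dictionary keyed by (portId, rangeValue); the counts in `om1` are invented for the example.

```python
def build_probability_matrix(om):
    """Divide every count by the sum of all counts in the observation matrix."""
    total = sum(om.values())
    return {pair: count / total for pair, count in om.items()}

# Hypothetical observation matrix for class C1 (the counts sum to 7).
om1 = {
    ("R443", 5): 2,
    ("R443", 4): 1,
    ("R80", 2): 2,
    ("R80", 1): 1,
    ("L52", 1): 1,
}

pm1 = build_probability_matrix(om1)
print(pm1[("R80", 2)])  # 2/7, about 0.2857
```

Because every cell is divided by the same total, the entries of each probability matrix add up to one.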

So now we have our observation and probability matrices. Assume the test dataset is:

t = { (tp1,tv1), (tp2,tv2), (tp3,tv3), ... (tpq,tvq) }

All we need to do is look up the probability of each of the above pairs, which describe your current port usage, and compute the overall probability of t belonging to C1 or to C2, as below:

P(C1 | t ) = P(C1) . [P((tp1,tv1) | C1) . P((tp2,tv2)  | C1) ... P((tpq,tvq) | C1) ]
P(C2 | t ) = P(C2) . [P((tp1,tv1) | C2) . P((tp2,tv2)  | C2) ... P((tpq,tvq) | C2) ]

or:

P(C1 | t) = [m/(m+n)] x [ pm1(tp1,tv1) . pm1(tp2,tv2) . ... pm1(tpq,tvq)]             (1)
P(C2 | t) = [n/(m+n)] x [ pm2(tp1,tv1) . pm2(tp2,tv2) . ... pm2(tpq,tvq)]              (2)
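Formulas (1) and (2) translate almost directly into code. The following Python sketch uses made-up probability matrices and made-up dataset counts (m = 3, n = 1), and the helper `class_score` is hypothetical. A pair missing from a matrix is treated as probability zero here, which wipes out the whole product.

```python
import math

def class_score(prior, pm, test_pairs):
    """Prior times the product of the per-pair probabilities, as in (1) and (2)."""
    return prior * math.prod(pm.get(pair, 0.0) for pair in test_pairs)

# Hypothetical probability matrices for the two classes.
pm1 = {("R443", 5): 0.5, ("R80", 2): 0.5}
pm2 = {("R443", 5): 0.25, ("R80", 2): 0.25, ("R80", 1): 0.5}

m, n = 3, 1  # assumed number of training datasets for C1 and C2
t = [("R443", 5), ("R80", 2)]

score_c1 = class_score(m / (m + n), pm1, t)
score_c2 = class_score(n / (m + n), pm2, t)
best = "C1" if score_c1 > score_c2 else "C2"
print(score_c1, score_c2, best)  # 0.1875 0.015625 C1
```

With many pairs the product gets very small, so a real implementation may prefer to sum logarithms instead; that is a design choice outside what this post covers.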

It may look as if we used the Naive Bayes model to derive formulas (1) and (2). The point is that we are calculating the probability of the two classes given the same test data, so the denominator that Bayes' theorem requires, P(t), is equal for both (1) and (2), and when you compare them it has no effect on the result. That is why it is left out. So technically you can consider that we have used Bayes' theorem with its denominator dropped; the product of the per-pair probabilities is the part that treats the pairs as independent, as Naive Bayes does.

Now we can compare the values of P(C1 | t) and P(C2 | t) and choose the class t most likely belongs to. There is only one problem: what if our test dataset contains a pair like (R80, 3)? As you can see in the matrix above, our training sets only show values for (R80, 1) and (R80, 2). We cannot use zero, because a single zero factor turns the whole product into zero. It is like seeing your friend in a strange new shirt and your brain saying "since I have never seen him wear this shirt before, he is not your friend." That is ridiculous; all we need is to assume a small probability for this type of pair. You can use Laplace smoothing, but I prefer the following idea.

What is the first possible probability after zero for the (R80, 3) pair? It is 1 over (the sum of the counts plus one). This probability is greater than zero and less than that of a pair with a count of 1, which is 1 over the sum of the counts. That is it; I have described almost everything. Now I am going to build a simple application to give you a more intuitive sense.
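Here is a small Python sketch of that fallback, reading "1 over the sum of the counts plus one" as 1 / (S + 1), where S is the sum of all counts in the class's observation matrix. The matrix `om1` and the helper `pair_probability` are hypothetical.

```python
def pair_probability(om, pair):
    """Return count / S for a seen pair and 1 / (S + 1) for an unseen one."""
    total = sum(om.values())
    count = om.get(pair, 0)
    if count == 0:
        return 1 / (total + 1)
    return count / total

# Hypothetical observation matrix for class C1 (the counts sum to 7).
om1 = {
    ("R443", 5): 2,
    ("R443", 4): 1,
    ("R80", 2): 2,
    ("R80", 1): 1,
    ("L52", 1): 1,
}

print(pair_probability(om1, ("R80", 3)))  # unseen pair: 1/8 = 0.125
print(pair_probability(om1, ("R80", 1)))  # seen once: 1/7, about 0.1429
```

Note that with this fallback the values no longer sum to exactly one, which does not matter when we only compare the scores of the classes.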