Human eyes can process roughly 10-15 images per second; they can handle higher rates too, but with lower precision.
However, your brain does not let you feel the information gap; it constructs the best reality it can from the available information and from what it has learned over your lifetime.
That is part of the reason reality looks different from different observers' points of view: one person may say they saw a tree at the corner of the street, while another claims it was a wooden street light.
So, back to our port usage information project (you may need to check out the following links): as I mentioned in the first three paragraphs, there are some facts we need to consider to get a better trade-off between prediction performance and cost.
Visualizing port usage data
I mean facts like this: some ports are important and others are not. For example, ports in the range 49152 through 65535 do not mean much individually, so seeing 52301, 52302, or 52321 in different datasets carries almost the same meaning. The same is obviously not true for ports like 22, 80, 443, and so on.
From another point of view, having six connections to a particular port versus seven may not be that different, or at least we assume it is not a big deal, but six versus sixty is. That matters because we humans do not pay much attention to high-resolution data as it passes through time; we just get a feeling from it.
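Just to illustrate that intuition with code (this is only my own sketch of the idea, not something the project itself defines), connection counts could be bucketed on a rough logarithmic scale so that 6 and 7 land in the same bucket while 6 and 60 do not:

import math

def coarseCount(n):
    # Hypothetical helper: collapse a connection count onto a coarse log scale.
    # Counts 1-9 map to buckets 1-3, 10-99 to buckets 4-6, and so on.
    if n <= 0:
        return 0
    return int(math.log10(n) * 3) + 1

# coarseCount(6) == coarseCount(7) == 3, while coarseCount(60) == 6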
We are going to build a list of known ports, as below, to process their information individually. These are the ones I think are worth handling separately: protocols like FTP, SSH, Telnet, SMTP, DNS, DHCP, TFTP, HTTP, POP3, SFTP, BGP, LDAP, HTTPS, LDAPS, ...
knownPorts = [20, 21, 22, 23, 25, 53, 67, 68, 69, 80, 110, 115, 143, 161, 162, 179, 389, 443, 636, 546, 547, 989, 990]
For all other ports, we simply divide by 100 and use the integer result as the port number. Since scaled-down ports could collide with known ports, we add the prefix "R" (for real) to known ports and "L" (for logical) to scaled ports. So the following function maps our ports:
def getPortId(port):
    # Known ports keep their real number, prefixed with "R" (real).
    if port in knownPorts:
        return "R" + str(port)
    # Every other port is scaled down by 100 and prefixed with "L" (logical).
    return "L" + str(port // 100)
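A few quick examples of what this mapping produces (assuming the knownPorts list and the getPortId function above):

getPortId(80)      # -> "R80"   (known port, kept as is)
getPortId(52301)   # -> "L523"
getPortId(52302)   # -> "L523"  (same identifier as 52301, which is exactly what we want)

# Sanity check on the number of distinct identifiers:
len({getPortId(p) for p in range(65536)})   # -> 679 (656 "L" buckets + 23 "R" ports)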
Now, instead of having 65536 ports, we have only 656 + 23 = 679 port identifiers (logical buckets L0 through L655, plus the 23 known ports), which works well enough for our project.
OK, now we can redefine our project to get a better sense of what we are going to do. We have datasets containing different counts of objects drawn from a set of 679 possible objects, like the following:
{ R80:10, R443:2, L0:14 }, { R80:2, R443:23, L0:12, L524:10 }, { L0:23, L21:33, L554:10 }, ...
Each of them shows the remote port usage of the established connections on your computer. We want to train a system with labeled datasets, then analyze some test datasets and see how similar the given test data is to the already recognized datasets.
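As a small sketch of how one of these datasets could be built (the buildDataset name and the plain list-of-ports input are my own assumptions; how you collect the remote ports of your established connections is out of scope here):

from collections import Counter

def buildDataset(remotePorts):
    # remotePorts: remote port numbers of the established connections,
    # however they were collected (netstat output, a packet capture, ...).
    return Counter(getPortId(p) for p in remotePorts)

buildDataset([80] * 10 + [443, 443] + [43] * 14)
# -> counts equivalent to the first example above: { R80:10, R443:2, L0:14 }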
Simple Solution
Mapping the 65536 numeric ports down to 679 symbols virtually lets us consider a solution like the "Bag of Words" model we already talked about in the "Dealing with extremely small probabilities in Bayesian Modeling" post. As a matter of fact, this model works even better here than it did for document classification, because in this project the order of ports does not matter, while in document classification the order of words is important.
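To make that concrete, here is a minimal sketch of what a bag-of-words style (multinomial, add-one smoothed) classifier could look like over these port-count dictionaries. The function names and the smoothing choice are my own assumptions for illustration, not necessarily the exact model from that post:

import math
from collections import Counter, defaultdict

def trainBagOfPorts(labelledDatasets):
    # labelledDatasets: list of (dataset, label) pairs,
    # where each dataset is a dict like { "R80": 10, "L0": 14 }.
    classCounts = defaultdict(Counter)   # label -> summed port counts
    classDocs = Counter()                # label -> number of training datasets
    for dataset, label in labelledDatasets:
        classCounts[label].update(dataset)
        classDocs[label] += 1
    return classCounts, classDocs

def classify(dataset, classCounts, classDocs, vocabularySize=679):
    # Pick the label with the highest log-probability; add-one smoothing keeps
    # ports unseen during training from zeroing out a whole class.
    totalDocs = sum(classDocs.values())
    bestLabel, bestScore = None, float("-inf")
    for label, counts in classCounts.items():
        score = math.log(classDocs[label] / totalDocs)        # log prior
        denominator = sum(counts.values()) + vocabularySize   # smoothed total
        for portId, n in dataset.items():
            score += n * math.log((counts[portId] + 1) / denominator)
        if score > bestScore:
            bestLabel, bestScore = label, score
    return bestLabel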
However, in the next post we are going to use another technique to see how similar a given dataset is to the trained classes.