Wednesday, 16 March 2016

Mapping and changing the resolution of a dataset

Human eyes can process 10-15 images per second.
They can handle higher rates too, but with lower precision.
Imagine you are sitting in a moving car and looking out of the window. The best you can do is process 10-15 images per second, so you unquestionably miss a lot of information about what is happening out there.

However, your brain does not let you feel the information gap; it constructs the best reality it can from the information it is given and from what it has learned through your past experience.

That is partly why reality looks different from different observers' points of view. One person may say, "I saw a tree on the corner of the street," while another may claim to have seen a wooden street light.

If we believe that, through evolution, the human brain does the best it can, then we need to (or, at least, can) do the same in Machine Learning (ML) algorithms too, because the idea of ML itself comes from the way our brain works. When you see a bus on the road or a traffic sign, your brain does not process the incoming information as a raw bitmap of a picture; it works with edges, curves, colors, etc. It uses the features it thinks are best for understanding reality better and faster.

So, back to our port usage information project (you may need to check out the following links): the facts I mentioned in the first three paragraphs are things we need to consider to get a better trade-off between prediction performance and cost.

The importance of data in Machine Learning
Visualizing port usage data

I mean facts like this: some ports are important and others are not. For example, ports in the range 49152 through 65535 carry little meaning on their own, so seeing 52301, 52302, or 52321 in different datasets means almost the same thing. Obviously, the same is not true for ports like 22, 80, 443, and so on.

From another point of view, having six connections to a particular port versus seven may not be much different (at least, we assume it is not a big deal), but six versus sixty is. That matters because we humans do not pay much attention to high-resolution data as it passes through time; we just get a feeling from the data.
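As a rough illustration of lowering the resolution of counts, here is a minimal sketch. The post does not prescribe a scheme, so the logarithmic scale and the helper name `coarseCount` are assumptions made only for this example.

```javascript
// Hypothetical helper: reduce the resolution of a connection count by
// rounding its base-2 logarithm. The log scale is an assumption chosen
// for illustration, not a method taken from the post.
function coarseCount(count) {
  if (count <= 0) {
    return 0;
  }
  // Adding 1 keeps a single connection distinguishable from none.
  return Math.round(Math.log2(count + 1));
}

// Six and seven connections fall into the same coarse level,
// while sixty connections land in a clearly different one.
console.log(coarseCount(6), coarseCount(7), coarseCount(60));
```

Any bucketing like this has boundaries where two neighbouring counts end up in different levels, so the base of the logarithm would need tuning against real data.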

Pause a movie on your computer and compare the picture quality while it is playing with the quality when it is paused. While the movie is playing, your brain cannot process that much information, so it somehow guesses that the quality must be good. When you pause the film, there is less information to process and your brain has enough time to process it properly, so it notices that the quality is not that good.

This idea is crucial when you have an enormous amount of data to process, and we are going to use it in our port project. There is another related idea, which we may talk about in later posts: using more features to get information from an observation, instead of measuring fewer features with higher precision.

Mapping known ports
We are going to build a list of known ports, as below, and process their information individually. Here are some that I think are worth treating this way: the ports of protocols like FTP, SSH, Telnet, SMTP, DNS, DHCP, TFTP, HTTP, POP3, SFTP, BGP, LDAP, HTTPS, LDAPS, ...

knownPorts = [20, 21, 22, 23, 25, 53, 67, 68, 69, 80, 110, 115, 143, 161, 162, 179, 389, 443, 636, 546, 547, 989, 990]

For all other ports, we just divide the port number by 100 and use the integer part of the result as the port number. Since scaled-down ports could collide with known ports, we add the prefix "R" (real) to known ports and "L" (logical) to scaled ports. So the following function maps our ports:

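function getPortId(port) {
      // Known ports keep their own identity, marked "R" (real).
      if (knownPorts.includes(port)) {
            return "R" + String(port);
      } else {
            // All other ports are scaled down by 100, marked "L" (logical).
            return "L" + String(Math.floor(port / 100));
      }
}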


Now, instead of 65536 ports, we have only 656 + 23 = 679 symbols (logical buckets L0 through L655 plus the 23 known ports), which works well enough for our project.
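The size of the symbol set can be checked directly by mapping every port number and collecting the distinct results. The sketch below is a JavaScript rendering of the mapping described above, assuming ports run from 0 through 65535; the variable name `symbols` is introduced only for this example.

```javascript
// The known ports listed in the post.
const knownPorts = [20, 21, 22, 23, 25, 53, 67, 68, 69, 80, 110, 115, 143,
                    161, 162, 179, 389, 443, 636, 546, 547, 989, 990];

// Same mapping as in the post: "R" for known ports, "L" for scaled-down ones.
function getPortId(port) {
  if (knownPorts.includes(port)) {
    return "R" + String(port);
  }
  return "L" + String(Math.floor(port / 100));
}

// Neighbouring high ports collapse into one logical symbol.
console.log(getPortId(52301), getPortId(52302), getPortId(52321)); // L523 L523 L523
console.log(getPortId(443)); // R443

// Collect the distinct symbols produced over the whole port range.
const symbols = new Set();
for (let port = 0; port <= 65535; port++) {
  symbols.add(getPortId(port));
}
console.log(symbols.size);
```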

Redefinition of the problem
OK, now we can redefine our project to get a better sense of what we are going to do. We have datasets containing different counts of objects drawn from a set of roughly 680 possible objects, like the following:

{ R80:10, R443:2, L0:14 }, { R80:2, R443:23, L0:12, L524:10 }, { L0:23, L21:33, L554:10 }, ...

Each of them shows the remote port usage of the established connections on your computer. We want to train a system with labeled datasets, then analyze some test datasets and see how similar the given test data is to the already recognized datasets.
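To make the shape of such a dataset concrete, here is a minimal sketch of how one could be built from a list of remote ports, one entry per established connection. The helper `buildDataset` and the sample input are hypothetical and made up for this example; how the port list is actually collected is outside its scope.

```javascript
// The known ports listed in the post.
const knownPorts = [20, 21, 22, 23, 25, 53, 67, 68, 69, 80, 110, 115, 143,
                    161, 162, 179, 389, 443, 636, 546, 547, 989, 990];

// Same mapping as in the post: "R" for known ports, "L" for scaled-down ones.
function getPortId(port) {
  if (knownPorts.includes(port)) {
    return "R" + String(port);
  }
  return "L" + String(Math.floor(port / 100));
}

// Hypothetical helper: count how often each port symbol occurs
// in a list of remote ports.
function buildDataset(remotePorts) {
  const counts = {};
  for (const port of remotePorts) {
    const id = getPortId(port);
    counts[id] = (counts[id] || 0) + 1;
  }
  return counts;
}

// Made-up sample: remote ports of six established connections.
const sample = [80, 80, 443, 52301, 52321, 8080];
console.log(buildDataset(sample));
```

Because only the counts are kept, shuffling the input list produces the same dataset.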

Simple Solution
Mapping the 65536 numeric ports to roughly 680 symbols lets us consider a solution like the "Bag of Words" model we already talked about in the "Dealing with extremely small probabilities in Bayesian Modeling" post. As a matter of fact, this model fits better here than it did for document classification, because in this project the order of ports does not matter, while in document classification the order of words is important.

However, in the next post we are going to use another technique to see how similar the given dataset is to the trained classes.
