## Monday, 24 November 2014

### Calculating a reference & acceptable distance for anomaly detection

We haven't talked about anomaly detection yet, but we did talk about calculating the distance between a sample X and a reference R. We also talked about a threshold, or margin of acceptable distance (M). In this post, we introduce a very simple but effective way of finding both the reference and this acceptable distance.

Suppose we have a verified sample set S, which is defined as:

S = { X | X is a vector (X1, X2, X3, ..., Xn), X is not an anomalous sample }

Since these samples are all valid, we can define R as the point in the middle of all the X in S:

R = (R1, R2, R3, ..., Rn), where Ri = (∑j Xij) / m, j loops over all m samples in S, and Xij is the i-th coordinate of the j-th sample

This is nothing but calculating the average of each dimension over the samples and using it as the corresponding coordinate of R. In other words, R is the centroid of the samples. If it is hard to accept that R sits in the middle of the points in n-dimensional space, try to picture it in 1, 2 or 3-dimensional space.
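The averaging step can be sketched in a few lines of Python. This is only an illustration: the function name `reference_point` and the sample values are made up for the example, and samples are assumed to be equal-length tuples of numbers.

```python
def reference_point(samples):
    """Return the per-dimension mean (centroid) of equal-length vectors.

    `reference_point` is a hypothetical helper name, not something
    defined in the post.
    """
    if not samples:
        raise ValueError("need at least one sample")
    count = len(samples)  # number of samples in S, not the dimension n
    # zip(*samples) groups the i-th coordinate of every sample together.
    return tuple(sum(column) / count for column in zip(*samples))


# Made-up 2-dimensional samples, purely for illustration.
S = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
R = reference_point(S)
print(R)  # (3.0, 4.0)
```

Note that the divisor is the number of samples, while n is only the length of each vector.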

Now we have R, but we still need the acceptable distance we talked about. If we calculate the distance of each sample (X ∈ S) from the calculated R, we get a set of numbers showing how far each vector is from the reference. So if we have a function DF that calculates the distance between two points, we have:

D = { d | d = DF(X, R), X ∈ S }

Here D is the set of calculated distances. We can compute the standard deviation of this set and call it sd, so sd is the standard deviation of the distances of the samples from the reference point. With it, we can easily find the margin we are looking for.
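As a rough sketch of this step, the snippet below builds D and its summary statistics. The post does not fix a particular DF, so Euclidean distance (`math.dist`) is assumed here; the helper name `distance_stats`, the sample values, and the choice of the population standard deviation (`statistics.pstdev`) rather than the sample one are also assumptions made for illustration.

```python
import math
import statistics


def distance_stats(samples, reference, df=math.dist):
    """Return (D, mean of D, standard deviation of D).

    `df` plays the role of DF; Euclidean distance is an assumed default.
    """
    distances = [df(x, reference) for x in samples]
    mu = statistics.mean(distances)
    # Population standard deviation; statistics.stdev would be the
    # sample version. Which one to use is a judgment call.
    sd = statistics.pstdev(distances)
    return distances, mu, sd


# Made-up values; R is given directly here to keep the numbers simple.
S = [(3.0, 4.0), (6.0, 8.0)]
R = (0.0, 0.0)
D, mu, sd = distance_stats(S, R)
print(D, mu, sd)
```

Any other distance function with the same two-argument shape could be passed as `df`.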

What is it going to be? It can be one sd or two sd, depending on how strict we want to be. I think two sd is a good choice: if the distances are roughly normally distributed, two sd covers about 95% of the samples around the average distance.

Now if we call the average of the distances in D μ, any given sample whose distance from R lies between μ-sd and μ+sd (or between μ-2·sd and μ+2·sd for the wider margin) is acceptable, and distances outside this range are anomalies:

μ - sd ≤ DF(X, R) ≤ μ + sd   or   μ - 2·sd ≤ DF(X, R) ≤ μ + 2·sd
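Putting the pieces together, a minimal end-to-end sketch might look like the following. The names `fit` and `is_anomaly`, the parameter `k` (the number of standard deviations), the Euclidean distance, the population standard deviation, and the sample values are all assumptions for illustration, not part of the method as stated.

```python
import math
import statistics


def fit(samples, k=2.0):
    """Compute the reference R and the acceptable distance range.

    Returns (R, low, high) with low = mu - k*sd and high = mu + k*sd.
    """
    count = len(samples)
    reference = tuple(sum(col) / count for col in zip(*samples))
    distances = [math.dist(x, reference) for x in samples]
    mu = statistics.mean(distances)
    sd = statistics.pstdev(distances)
    return reference, mu - k * sd, mu + k * sd


def is_anomaly(x, reference, low, high):
    """True when DF(x, R) falls outside [low, high]."""
    d = math.dist(x, reference)
    return not (low <= d <= high)


# Made-up verified samples scattered around the origin.
S = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0),
     (2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
R, low, high = fit(S)

print(is_anomaly((1.5, 0.0), R, low, high))    # False
print(is_anomaly((10.0, 10.0), R, low, high))  # True
```

Because the rule is two-sided, a sample that sits very close to R can also fall below the lower bound when μ - k·sd is positive, so it is worth checking whether that behaviour is wanted for a given data set.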