Suppose we have a verified sample set S, defined as:
S = { X | X is a vector (X1, X2, X3, ..., Xn), X is not an anomalous sample }
Since all of these samples are valid, we can define R as the point in the middle of all the X in S, that is, their component-wise mean:
R = (R1, R2, R3, ..., Rn), where Ri = (∑ Xji) / m, j runs over the m samples in S, and Xji is the i-th component of the j-th sample
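As a minimal sketch of this step, the reference point can be computed as a component-wise mean. The function name `reference_point` and the toy data are illustrative assumptions, not part of the original description.

```python
def reference_point(samples):
    """Return the component-wise mean of a list of equal-length vectors."""
    m = len(samples)      # number of samples in S
    n = len(samples[0])   # dimension of each vector
    # For each coordinate i, average that coordinate over all samples.
    return [sum(x[i] for x in samples) / m for i in range(n)]


# Hypothetical verified sample set S with three 2-dimensional vectors.
S = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
R = reference_point(S)
print(R)  # [3.0, 4.0]
```

Note that the sum is divided by the number of samples, not by the vector dimension.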
Now we have R, but we still need the acceptable distance we talked about. If we calculate the distance of each sample (X ∈ S) from the calculated R, we get a set of numbers showing how far each of these vectors is from the reference. So, given a function DF that calculates the distance between two points, we have:
D = { d | d = DF(X, R), X ∈ S }
Here D is the set of calculated distances. We can compute the standard deviation of this set; call it sd. It measures how much the distances of the samples from the reference point vary, and from it we can easily find the margin we are looking for.
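The distance set D and its standard deviation could be computed along these lines. The text leaves DF open, so this sketch assumes Euclidean distance (`math.dist`, Python 3.8+) and the population standard deviation (`statistics.pstdev`); both are assumptions, and the sample values are made up for illustration.

```python
import math
import statistics


def distances_to_reference(samples, reference, df=math.dist):
    """Return DF(X, R) for every X in samples; df is assumed Euclidean by default."""
    return [df(x, reference) for x in samples]


# Hypothetical samples and a reference point chosen only to give round numbers.
S = [(0.0, 0.0), (6.0, 8.0), (3.0, 4.0)]
R = (3.0, 4.0)

D = distances_to_reference(S, R)
print(D)  # [5.0, 5.0, 0.0]

mu = statistics.mean(D)    # average distance from R
sd = statistics.pstdev(D)  # spread of the distances around mu
print(mu, sd)
```

Any other distance function with the same two-argument shape can be passed as `df`.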
How wide should the margin be? It can be one sd or two sd, depending on how strict we want to be. I think two sd is good: if the distances are roughly normally distributed, about 95% of the samples have a distance within two sd of the average distance (about 68% for one sd).
Now, if we call the average of the distances in D μ, any given sample whose distance from R lies between μ - sd and μ + sd is acceptable, and distances outside this range are anomalies. That is, a sample X is acceptable when:
μ - sd ≤ DF(X, R) ≤ μ + sd   or, with the wider margin,   μ - 2·sd ≤ DF(X, R) ≤ μ + 2·sd
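Putting the pieces together, the whole check might look like the sketch below. It is only an illustration under stated assumptions: DF is taken to be Euclidean distance, sd is the population standard deviation, and the names `fit`, `is_anomaly`, and the parameter `k` (the number of sd used as the margin) are hypothetical.

```python
import math
import statistics


def fit(samples):
    """Compute the reference point R, the mean distance mu, and sd from verified samples."""
    m = len(samples)
    n = len(samples[0])
    reference = [sum(x[i] for x in samples) / m for i in range(n)]
    dists = [math.dist(x, reference) for x in samples]  # assumed Euclidean DF
    return reference, statistics.mean(dists), statistics.pstdev(dists)


def is_anomaly(x, reference, mu, sd, k=2.0):
    """Return True when DF(x, R) falls outside [mu - k*sd, mu + k*sd]."""
    d = math.dist(x, reference)
    return not (mu - k * sd <= d <= mu + k * sd)


# Hypothetical verified samples scattered around the origin.
S = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0), (2.0, 0.0), (-2.0, 0.0)]
R, mu, sd = fit(S)

print(is_anomaly((1.5, 0.0), R, mu, sd))    # False
print(is_anomaly((10.0, 10.0), R, mu, sd))  # True
```

Because the rule is two-sided, a point whose distance from R is much smaller than μ is also flagged; whether that is desirable depends on the data.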