## Wednesday, 13 April 2016

### Bias and Variance in modeling

It is always important to remember why we do classification. We do it because we want to build a general model for our problem, not just to model the given training dataset. Sometimes, when you finish training the system and look at your model, you see that not all of your training data fits it; that does not necessarily mean your model is wrong. You can also find many cases where the model fits the training data very well but not the test data. Which one of these three models describes the pattern of the given training set better?

Besides, never forget that we model the data because we do not know exactly what happens in the system. We do it because we cannot write down a scientific or mathematical formula that describes the system we are observing. So we should not expect our model to describe the system completely. Why? Because we have modeled the system from just a small fraction of the data space.

That is what happens in our brains every day: sometimes the outcome of the patterns or models we hold in our minds is not what we expect. Even for a single outcome of an experiment, people's expectations are not the same. That means when a system is complex (read: we can extract many features from it), there are many models you can build for it.

Look at the picture above; there we have tried to construct a model that shows the pattern of these 11 training points. If you do some approximate calculation, the average distance between the red points and the blue line (an underfit model) is larger than their average distance to the green line, while the orange line (an overfit model) passes through all the points exactly. However, does that mean the orange line is a perfect model?

Overfitting usually happens when your prediction model is too complex and, as in our example, describes the given training data exactly. We have to keep in mind that real-world data always contains noise, and when we create an overfit model (orange line), we have modeled the noise too, noise that is unlikely to repeat in the same way. Overfitting a model is like memorizing as many of your friend's features as possible, when a few basic facial features would be enough to recognize him or her.

Overfit models usually give a higher error on new training or test datasets (high variance). Suppose you have a training dataset D1 and you have built an overfit model M1 from it. Then we have:

if M1 is an overfit model and Error(M1 over D1) = ε
then for a new dataset Di we typically have: Error(M1 over Di) >> ε

Underfit models also give a high error on new datasets, although their error on D1 is already high. But if you consider M2 as a model somewhere between overfitting and underfitting, we have:

if Error(M2 over D1) = ε
then for any other given Di we have: Error(M2 over Di) ≈ ε
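The two statements above can be checked with a small experiment. This is only a sketch under my own assumptions: the "system" is a sine curve with Gaussian noise, the models are polynomials of degree 1 (underfit), 3 (in between) and 10 (overfit), and names such as `true_f` and `NOISE_STD` are made up for the illustration. The 11 training points mirror the picture.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
NOISE_STD = 0.2  # assumed noise level, not taken from the post


def true_f(x):
    # Hypothetical "real system" that we pretend not to know.
    return np.sin(np.pi * x)


def noisy(x):
    # Observations = system output + noise.
    return true_f(x) + rng.normal(0.0, NOISE_STD, size=x.shape)


# D1: 11 training points. Di: a fresh dataset from the same system.
x_train = np.linspace(-1.0, 1.0, 11)
y_train = noisy(x_train)
x_new = rng.uniform(-1.0, 1.0, size=200)
y_new = noisy(x_new)


def mse(model, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return float(np.mean((y - model(x)) ** 2))


train_err, new_err = {}, {}
for degree in (1, 3, 10):
    model = Polynomial.fit(x_train, y_train, degree)
    train_err[degree] = mse(model, x_train, y_train)
    new_err[degree] = mse(model, x_new, y_new)
    print(f"degree {degree:2d}: "
          f"Error over D1 = {train_err[degree]:.4f}, "
          f"Error over Di = {new_err[degree]:.4f}")
```

The degree-10 polynomial goes through all 11 training points, so its error over D1 is practically zero, yet its error over the new dataset cannot be, because the new points carry their own noise.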

#### Bias vs Variance

The error we talked about can be decomposed into two components, bias and variance, plus a noise term. Take the mean squared error as the error function, so for every given dataset Di, if we denote the desired class by Ci and the output of the classifier by H(Di), we have:

Error = E[ (Ci - H(Di))² ]

[Figure: Bias-Variance tradeoff]

Here E is the mathematical expectation, so E(x²) is simply the mean square of x. A calculation shows that we can decompose the Error function as below:

Error = Variance(H(Di)) + (Bias(H(Di)))² + ν

or in plain English, as below:

Error = (Variance Error) + (Bias Error) + (Variance Noise Error)
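For readers who want to see the calculation, here is one common way to get this decomposition. It is a sketch under assumptions that the post does not state: I write the desired output as C = f + η, where f is the noise-free target, η is noise with zero mean and variance ν that is independent of the model output H, and H̄ = E[H] is the average prediction over datasets. The symbols f, η and H̄ are introduced only for this illustration.

```latex
\begin{aligned}
E\big[(C - H)^2\big]
  &= E\big[(f + \eta - H)^2\big] \\
  &= E\big[(f - H)^2\big] + 2\,E[\eta]\,E[f - H] + E[\eta^2] \\
  &= E\big[(f - H)^2\big] + \nu, \\[4pt]
E\big[(f - H)^2\big]
  &= E\big[(f - \bar{H} + \bar{H} - H)^2\big] \\
  &= (f - \bar{H})^2 + 2\,(f - \bar{H})\,E[\bar{H} - H] + E\big[(H - \bar{H})^2\big] \\
  &= \mathrm{Bias}(H)^2 + \mathrm{Variance}(H).
\end{aligned}
```

Both cross terms vanish, the first because E[η] = 0 and the second because E[H̄ - H] = 0, which leaves Error = Variance(H) + Bias(H)² + ν.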

"Variance Noise Error" is something we cannot get rid of; it stays there all the time. However, we can aim for a model with the lowest possible values of the other two terms (look at the image above). The "Bias Error" indicates how far the model's output is, on average, from the desired output, and the "Variance Error", as its name implies, indicates how much the model's predictions vary around their average.

You can see these two parameters clearly in the blue and orange lines. The blue line has a lower variance error than the orange one, while the orange line has a lower bias error than the blue one.
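You can also estimate the two terms numerically instead of reading them off a picture. The sketch below makes its own assumptions: a sine curve plays the unknown system, a degree-1 polynomial stands in for the blue (underfit) line and a degree-10 polynomial for the orange (overfit) line, and the names `true_f`, `NOISE_STD` and `bias_variance` are invented for the example. It refits each model on many noisy copies of the same 11 points and measures how the predictions behave.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
NOISE_STD = 0.2  # assumed noise level
x_train = np.linspace(-1.0, 1.0, 11)   # 11 training inputs, as in the picture
x_eval = np.linspace(-1.0, 1.0, 101)   # points where predictions are compared


def true_f(x):
    # Hypothetical noise-free system, unknown in a real problem.
    return np.sin(np.pi * x)


def bias_variance(degree, n_datasets=500):
    # Fit one model per simulated dataset and collect its predictions.
    preds = np.empty((n_datasets, x_eval.size))
    for i in range(n_datasets):
        y_train = true_f(x_train) + rng.normal(0.0, NOISE_STD, size=x_train.size)
        preds[i] = Polynomial.fit(x_train, y_train, degree)(x_eval)
    mean_pred = preds.mean(axis=0)
    # Squared bias: how far the average prediction is from the true output.
    bias_sq = float(np.mean((mean_pred - true_f(x_eval)) ** 2))
    # Variance: how much predictions scatter around their own average.
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance


results = {degree: bias_variance(degree) for degree in (1, 10)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

In this setup the straight line should show the larger squared bias and the degree-10 polynomial the larger variance, matching the blue and orange lines.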

Now you can always ask: where is the point at which these two parameters are both as low as possible? You build a model, and people ask why it does not fit well at this point or that point, and so on. Don't listen to them, and don't try to convince them that the model should not fit every given dataset well; it is not your job to teach them machine learning or mathematics.

Generally speaking, most of the time it is impossible to find a model that fits both the training and the test datasets well. Theoretically, you need to solve some equations to find the point at which the bias curve crosses the variance curve; this is the point of desired complexity. However, in practice you can build a range of models from low to high complexity and find the point at which the total error is at its minimum, like what we saw in the image above.
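The practical recipe in the last sentence can be sketched as follows. Again this is only an illustration under my own assumptions: polynomial degree stands in for "complexity", a sine curve with Gaussian noise stands in for the system, and a held-out validation set is used to measure the total error. None of these choices come from the post.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)
NOISE_STD = 0.2  # assumed noise level


def true_f(x):
    # Hypothetical system; in a real problem this is unknown.
    return np.sin(np.pi * x)


def noisy(x):
    return true_f(x) + rng.normal(0.0, NOISE_STD, size=x.shape)


x_train = np.linspace(-1.0, 1.0, 11)
y_train = noisy(x_train)
x_val = rng.uniform(-1.0, 1.0, size=200)   # held-out data, never used for fitting
y_val = noisy(x_val)

train_err, val_err = {}, {}
for degree in range(0, 11):                # from low to high complexity
    model = Polynomial.fit(x_train, y_train, degree)
    train_err[degree] = float(np.mean((y_train - model(x_train)) ** 2))
    val_err[degree] = float(np.mean((y_val - model(x_val)) ** 2))

# The chosen complexity is the one with the lowest error on held-out data.
best_degree = min(val_err, key=val_err.get)

for degree in val_err:
    marker = "  <- minimum" if degree == best_degree else ""
    print(f"degree {degree:2d}: train = {train_err[degree]:.4f}, "
          f"validation = {val_err[degree]:.4f}{marker}")
```

The training error can only go down as the degree grows, so it cannot be used to pick the complexity; the validation error is the one to watch.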