## Tuesday, 9 February 2016

### Expected Value and Learning Strategy

Suppose you need to train a machine on (or learn from) an online stream of data, and the data is not tagged, so you have no idea whether the most recent data matches your classification or not. To keep the discussion concrete, let's consider a simple scenario in which we monitor only one variable of a system: for example, the temperature of a room, the traffic volume of a network, or the return rate of a stock.

To build a simple model for our problem, we start at time zero with the value V(0); the next sample, at time one, is V(1), and so on, as in the sequence below. Don't forget that the data is not labeled, so all we have at any moment is the current value:

V(0), V(1), V(2), ... V(n)

Studying and learning a phenomenon is about gathering information and building a model in order to have a better expectation of its future. Even when you keep a hard-copy history of something, you do it only so that you can study it whenever you like and improve your model and the accuracy of your expectation. In our simple single-variable example, learning means finding the value that best represents what we expect the variable to be, under one condition: you are only allowed to keep or store the expected value, nothing else.

#### Why only one memory location?

The idea comes from how we process expectations in our minds. We never keep all the information about something or someone; we just update our expectation of them whenever we collect some new data. This is the reason we sometimes say:

"I don't know why, but I don't trust them!"

Things would be easier if we were allowed to keep one more parameter. Just by also storing the number of samples received so far, we could compute the average (the mean, or mathematical expectation) from the following formula:

Mean(n) = [ Mean(n-1) * (n-1) + V(n) ] / n

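Here the argument passed to Mean or V is the time index, and n is also the number of samples received so far (counting the first sample as n = 1). As you can see, by storing just two parameters, the previous mean and the sample count, you could calculate the mathematical expectation every time a new sample arrives. But in our problem you are not allowed to do that.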
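To make the two-parameter idea concrete, here is a minimal Python sketch of the running-mean formula. The function name `update_mean` and the sample values are made up for illustration, and the sketch assumes the sample count starts at 1 for the first value.

```python
def update_mean(prev_mean, count, new_value):
    """Return the mean of `count` samples, given the mean of the first count-1.

    Implements Mean(n) = [Mean(n-1) * (n-1) + V(n)] / n with n = count.
    """
    return (prev_mean * (count - 1) + new_value) / count


# Hypothetical stream of readings (illustrative values only).
samples = [200.0, 220.0, 210.0, 230.0]

mean = 0.0  # ignored for the first sample because it is multiplied by (1 - 1)
for count, value in enumerate(samples, start=1):
    mean = update_mean(mean, count, value)

print(mean)  # 215.0
```

Note that this needs two stored numbers (`mean` and `count`), which is exactly what the single-memory rule forbids.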

#### Learning Strategies

Now consider the following simple formula, in which LF (the learning factor) is a constant between zero and one that determines how much each new value of the variable changes our single memory location.

ExpectedValue(n) = (1-LF) * ExpectedValue(n-1) + LF * V(n),   with ExpectedValue(0) = V(0)

If you set LF=1, the formula simplifies to ExpectedValue(n) = V(n), which means you always store the newest value of V and treat it as the expected value of the variable. If you set LF=0, the formula simplifies to ExpectedValue(n) = V(0), which means you never update the first value you saw. A good analogy for these two boundaries is judging people only by their first or their latest behaviour, which is not a good way to get to know people, so let's forget these boundary values.
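The single-memory update can be sketched in a few lines of Python. This sketch assumes that each update blends the previously stored expectation with the new value, which is what the LF=0 case (the value stays at V(0) forever) implies. The names `update_expectation` and `learn` and the sample values are hypothetical.

```python
def update_expectation(expected, new_value, lf):
    """Blend the stored expectation with the new value using learning factor lf."""
    if not 0.0 <= lf <= 1.0:
        raise ValueError("lf must be between 0 and 1")
    return (1.0 - lf) * expected + lf * new_value


def learn(values, lf):
    """Run the update over a stream, keeping only one stored number."""
    expected = values[0]  # the first sample initialises the memory location
    for value in values[1:]:
        expected = update_expectation(expected, value, lf)
    return expected


# Hypothetical stream of readings (illustrative values only).
values = [200.0, 220.0, 210.0, 230.0]

print(learn(values, 1.0))  # 230.0: always the newest value
print(learn(values, 0.0))  # 200.0: never moves from the first value
print(learn(values, 0.5))  # 220.0: something in between
```

The two boundary calls reproduce the "latest behaviour" and "first behaviour" strategies, while any LF strictly between 0 and 1 mixes old and new information.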

So neither of the above approaches is good enough to learn the expected value of the variable. What happens if we choose something between 0 and 1 for LF? Look at the trends in the following graph:

[Figure: The effect of the learning factor on a variable when it changes its value]

The scenario is that the variable has a value of 200.00 at time=0, and after that its value changes to 220.00. The question is: how much can we rely on this new value? Is it going to keep this value for a while, or is it going back to its previous one soon? Who knows?

The trends show how our expectation approaches the new value of the variable for various LF values. As you can see, choosing LF close to one forces the expectation to reach the new value faster, while choosing it closer to zero gives the expectation more time before it adopts the new value, which is 220.00.
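The trends for the 200.00 to 220.00 scenario can be reproduced with a short Python sketch. It assumes each update blends the stored expectation with the new value; the function name `step_response`, the number of steps, and the three LF values are illustrative choices rather than the ones used for the original graph.

```python
def step_response(lf, start=200.0, new_value=220.0, steps=10):
    """Return the expectation over time after the variable jumps to new_value."""
    expected = start
    trend = [expected]
    for _ in range(steps):
        # Single-memory update: only `expected` is carried between steps.
        expected = (1.0 - lf) * expected + lf * new_value
        trend.append(expected)
    return trend


for lf in (0.1, 0.5, 0.9):
    trend = step_response(lf)
    print(f"LF={lf}: " + ", ".join(f"{x:.2f}" for x in trend[:6]))
```

At every step the remaining gap to 220.00 shrinks by a factor of (1-LF), so after one step the expectation is 202.00 for LF=0.1, 210.00 for LF=0.5, and 218.00 for LF=0.9.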

We saw that both LF=1 and LF=0 give us poor strategies for getting to know the variable, but any value strictly between zero and one works and lets the expectation follow the changes, over either the short or the long run. If you choose LF=0.9, the expectation follows the change rapidly, while for LF=0.1 it follows the change slowly. Which one gives us a better expectation of the variable? It depends on the business we are modelling. LF=0.9 learns and forgets faster, while LF=0.1 learns and forgets more slowly. But I think the second strategy works better for most applications, in the same way that you only get to know a real friend with the passing of time.

#### The alien story!

To understand why, imagine an alien who comes to Earth and wants to know whether Earthians are angry or calm. If he visits countries and streets for a week or a month and sees people fighting each other, then with a fast-learning strategy he would report back to his boss that "Earthians are angry". But if he spends more time and keeps monitoring countries and streets for a year or two, he'd say "Earthians are calm". The reason is that the dominant behaviour of human beings is peace, calmness, and coolness, not anger (I hope), and you can only find this out if you gather enough information over the long run.

#### The broken dice

You have a broken dice: the probability of rolling a 6 is 0.25 and the probability of each of the other sides is 0.15, while for a normal one it should be about 0.167 for every side. The question is: how can you determine whether the dice is fair? You can't just toss it 10 or 20 times; you need to roll it something like 1000 times or even more to see whether the probability of rolling a 6 is 0.167 or not. That is a basic probability rule which we use in our learning strategy: since the data is not tagged, we can only rely on the results if we test them over the long run. (This is exactly like trusting people.)
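As a rough illustration, the Python sketch below simulates the broken dice and tracks the probability of a 6 in a single memory location, once with a slow learner and once with a fast one. It assumes the same blend-with-the-stored-expectation update as before; the function names, the LF values (0.001 and 0.5), the number of rolls, and the fixed random seed are all arbitrary choices made for this example.

```python
import random


def roll_broken_dice(rng):
    """Roll a dice where 6 has probability 0.25 and every other side 0.15."""
    return rng.choices([1, 2, 3, 4, 5, 6], weights=[0.15] * 5 + [0.25])[0]


def estimate_six_probability(lf, rolls, seed=0):
    """Track P(6) in one memory location; return the estimate after every roll."""
    rng = random.Random(seed)  # fixed seed so both learners see the same rolls
    expected = 1.0 / 6.0       # start by assuming the dice is fair
    history = []
    for _ in range(rolls):
        is_six = 1.0 if roll_broken_dice(rng) == 6 else 0.0
        expected = (1.0 - lf) * expected + lf * is_six
        history.append(expected)
    return history


slow = estimate_six_probability(lf=0.001, rolls=20000)
fast = estimate_six_probability(lf=0.5, rolls=20000)

print(f"slow learner ends at {slow[-1]:.3f}")
print(f"fast learner swings between {min(fast[-1000:]):.3f} "
      f"and {max(fast[-1000:]):.3f} over the last 1000 rolls")
```

The slow learner should settle close to 0.25 and expose the dice as unfair, while the fast learner keeps jumping with every roll and never gives a stable answer.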