Tuesday 23 February 2016

Dealing with extremely small probabilities in Bayesian Modeling

If you have ever tried to model a complex system with a Bayesian network, or have attempted to calculate the joint probability of many outcomes, you know what I'm talking about: multiplying many small numbers eventually gives you nothing but zero! To get a better idea of what the problem is, let us consider a classic text classification problem.

The source code of this post is located here:
Download the document classification Java code, using Bayes Theorem

Document Classification
Suppose you have some already classified documents and want to build a classification program. Here is what your training dataset would look like:

(d11,c1), (d12,c1), (d13,c1), ...
(d21,c2), (d22,c2), (d23,c2), ...
.
.
.
(dn1,cn), (dn2,cn), (dn3,cn), ...

We give the labeled documents to our program as training data; it learns from them and should then be able to determine the class of any given document.


Let's see how we can use Bayes Theorem to solve this problem. The idea is to calculate the probability that the given document belongs to each of the trained classes and see which one has the highest probability; that class will be the best-fitting class for the given document. So we have divided our training material into n different classes. Bayes Theorem says the probability that the test document belongs to ci can be derived from the formula below:

P(ci | d) = P(d | ci) . P(ci) / P(d)         [1]

The predicted class of the document can then be defined by the following formula, where the "argmax" notation returns the class that maximizes the probability:

Selected Class = argmax  P(ci | d) = argmax  P(d | ci) . P(ci) / P(d)   ;   i ∈ [1..n]         [2]

So to find the "Selected Class" we need to calculate these n probabilities via Bayes Theorem and pick the maximum. But because P(d) appears in every one of these formulas and is the same for every class, it does not change which class wins, so it can be ignored. That gives us the following simpler formula:

Selected Class = argmax  P(d | ci) . P(ci)    ;   i ∈ [1..n]         [3]

Here in [3], P(ci) is nothing but the prior probability of the class, which can be derived from the following formula:

P(ci) = (Number of training documents for class ci) / (Total number of training documents)  [4]
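As a quick illustration, a minimal sketch of formula [4] in Java could look like the following. The map name trainingCountPerClass and the counts in it are hypothetical placeholders, not part of the downloadable code:

import java.util.HashMap;
import java.util.Map;

public class ClassPriors {
    public static void main(String[] args) {
        // Hypothetical counts: number of training documents per class
        Map<String, Integer> trainingCountPerClass = new HashMap<>();
        trainingCountPerClass.put("apple", 40);
        trainingCountPerClass.put("uber", 25);
        trainingCountPerClass.put("microsoft", 35);

        int totalDocuments = trainingCountPerClass.values().stream()
                .mapToInt(Integer::intValue).sum();

        // Formula [4]: P(ci) = (documents of class ci) / (total documents)
        for (Map.Entry<String, Integer> entry : trainingCountPerClass.entrySet()) {
            double prior = (double) entry.getValue() / totalDocuments;
            System.out.println("P(" + entry.getKey() + ") = " + prior);
        }
    }
}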

But the problem arises when we want to calculate the P(d | ci) term. To make it tractable, we need some simplifying assumptions, and these assumptions shape the topology of the Bayesian network (a sketch of what they imply follows the list):
  1. The order of the words in the training and test documents is not important.
  2. The occurrence of any word in the text does not depend on the other words.
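Under these two assumptions a document is treated as a simple bag of words: only which words occur (and how often) matters. A minimal sketch of that representation, with hypothetical names and not taken from the downloadable code, could be:

import java.util.HashMap;
import java.util.Map;

public class BagOfWords {
    // Turn raw text into word counts; word order and word dependencies are ignored
    static Map<String, Integer> toBagOfWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints something like {watch=1, new=1, apple=2, unveils=1, the=1} (iteration order may vary)
        System.out.println(toBagOfWords("Apple unveils the new Apple Watch"));
    }
}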

Now we can easily calculate P(d | ci) from the following formula, which is the joint probability of seeing the words of the given test document in a class ci:

P(d | ci) = P(w1 | ci) . P(w2 | ci) . P(w3 | ci) ...         [5]

Formula [5] works well, but sometimes there are words in the test document that don't exist in the training set. Those words have zero probability, which drives the total probability to zero and ruins the comparison; this is one of the root causes of our problem. You can use smoothing methods like "Laplace Smoothing" to get rid of these zero probabilities, as sketched below.
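For instance, add-one Laplace smoothing estimates the per-word likelihood as (count + 1) / (total words in class + vocabulary size). Here is a minimal sketch of that idea; the method name and the sample numbers are hypothetical, not taken from the downloadable code:

public class LaplaceSmoothing {
    // Add-one (Laplace) smoothed estimate of P(w | ci).
    // countOfWordInClass: occurrences of word w in the training documents of class ci
    // totalWordsInClass:  total word occurrences in the training documents of class ci
    // vocabularySize:     number of distinct words in the whole training set
    static double smoothedLikelihood(int countOfWordInClass,
                                     int totalWordsInClass,
                                     int vocabularySize) {
        // Adding 1 to the count (and the vocabulary size to the denominator)
        // keeps the estimate strictly positive, even for unseen words.
        return (countOfWordInClass + 1.0) / (totalWordsInClass + vocabularySize);
    }

    public static void main(String[] args) {
        // An unseen word (count 0) still gets a small non-zero probability
        System.out.println(smoothedLikelihood(0, 5000, 12000));
        System.out.println(smoothedLikelihood(37, 5000, 12000));
    }
}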


The problem is here 
Again, the problem happens when you want to calculate formula [5]. Why? The test document might contain many words with small probabilities, and their product will be a number so small that your computer rounds it down to zero. But the logarithm function is strictly increasing, so, just as we dropped the denominator when simplifying [2] to [3], we can apply a logarithm transformation and still compare the classes to find the one with the maximum probability, as in [6]. So the result of [3] can be derived from the following formula as well, and you will not have the zero problem anymore:

Considering A > 0 and B > 0, we can say: A > B <=> log(A) > log(B), so

Selected Class 
   = argmax  log(P(d | ci) . P(ci))
   = argmax  log(P(d | ci)) + log(P(ci))
   = argmax  log(P(ci)) + log(P(w1 | ci)) + log(P(w2 | ci)) + log(P(w3 | ci)) + ...       [6]
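To make this concrete, here is a minimal sketch of scoring with formula [6] in Java. The method, the likelihood maps, and all the numbers below are hypothetical illustrations; they are not the API of the downloadable code, and the likelihoods are assumed to be already smoothed so they are never zero:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LogSpaceScoring {
    // Formula [6]: log(P(ci)) + sum over the document's words of log(P(w | ci))
    static double logScore(double classPrior,
                           Map<String, Double> wordLikelihoods,
                           List<String> documentWords) {
        double score = Math.log(classPrior);
        for (String word : documentWords) {
            // The tiny fallback only guards against missing entries in this sketch
            score += Math.log(wordLikelihoods.getOrDefault(word, 1e-9));
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> document = Arrays.asList("iphone", "store", "iphone", "launch");

        Map<String, Double> appleLikelihoods = new HashMap<>();
        appleLikelihoods.put("iphone", 0.02);
        appleLikelihoods.put("store", 0.01);
        appleLikelihoods.put("launch", 0.005);

        Map<String, Double> uberLikelihoods = new HashMap<>();
        uberLikelihoods.put("driver", 0.03);
        uberLikelihoods.put("ride", 0.02);

        double appleScore = logScore(0.4, appleLikelihoods, document);
        double uberScore = logScore(0.25, uberLikelihoods, document);

        // argmax: pick the class with the highest log score
        System.out.println(appleScore > uberScore ? "apple" : "uber");
    }
}

Multiplying the raw probabilities of [5] would underflow a double to exactly 0.0 for a long enough document, while the log scores above just become moderately large negative numbers, so the comparison keeps working.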


Complete Document Classification Source Code
I wrote a very simple, tutorial-style Java program that does what we talked about in this post; considering its size and simple code, it works pretty well. Here is how you can use it:

DocumentClassifer documentClassifier1 = new DocumentClassifer();

String apple = "apple";
String uber = "uber";
String microsoft = "microsoft";

documentClassifier1.trainDocument("/data/a1.txt", apple);
documentClassifier1.trainDocument("/data/a2.txt", apple);
...

documentClassifier1.trainDocument("/data/u1.txt", uber);
documentClassifier1.trainDocument("/data/u2.txt", uber);
...

documentClassifier1.trainDocument("/data/m1.txt", microsoft);
documentClassifier1.trainDocument("/data/m2.txt", microsoft);
...

documentClassifier1.learn();

System.out.println(documentClassifier1.testDocument("/data/aTest.txt"));
System.out.println(documentClassifier1.testDocument("/data/u1.txt"));
System.out.println(documentClassifier1.testDocument("/data/mTest.txt"));

Here we have given the "DocumentClassifer" instance some documents of class "apple", some documents of class "uber", and some documents of class "microsoft". After telling it to learn them, you can ask it to classify some test documents; it performs the calculations we talked about in this post and returns the result. You can download the source code and sample files from the following address:

Download the document classification Java code, using Bayes Theorem
