Lecture Notes 6: Naïve Bayes Classification
Section Topic
1 Bayes Rule
2 Bayesian Statistics
3 Anderson’s Iris Data
4 Introduction to the Naive Bayes Classification Algorithm
5 NB Implementation using R Package e1071
6 Cross-Validation using R Packages “klaR” and “caret”
7 Two Kinds of Independence
8 Sources
Exercises
Section 1: Bayes Rule
Suppose the rate of fetal spina bifida is 1/1000. A blood test (on the mother) can detect this disease early in pregnancy. The test has a false positive rate of 3% and a false negative rate of 1%. A woman decides to have the test and it comes back positive. What is the probability that spina bifida is present?
Consider a population of 100,000 pregnancies. On average there will be 100 cases of spina bifida in that population. Since the correct positive rate is 99% (100% - 1%), those 100 actual positives yield, on average,
0.99 x 100 = 99 correct positive test results,
so only one of the 100 disease cases will be missed, on average.
There remain
99,900 = 100,000 - 100
cases without spina bifida. Of those, on average 3% will test positive even though the disease is not present, so we will get, on average,
0.03 x 99,900 = 2997 incorrect positive test results.
Thus, the total number of positive tests will be on average
99 + 2997 = 3096.
Of those, 99 are actual cases of spina bifida. Therefore, the probability that the disease is present given a positive test result is
99 / 3096 = 0.03198 = 3.198%.
This is approximately 1/33, a surprising and disturbing result.
What we have done, above, is an example of Bayes Rule. This rule is named for the Reverend Bayes (1701-1761), a skilled logician and mathematician.
We started with the conditional probability of a negative test given the disease was present – the false negative rate – and the conditional probability of a positive test given the disease was not present – the false positive rate. Along with the postulated disease rate of 1/1000, we were then able to use a very careful application of common sense to compute the conditional probability that the disease was present given a positive test result.
If we do not wish to use common sense, we can use Bayes Rule instead. This formalism states
P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A’)P(A’)],
where A’ is “not A”, or the complement of A.
In this case, A denotes the event that the disease is present, and B denotes a positive test. We have
P(A) = 1/1000 = 0.001
P(A’) = 0.999
P(B|A) = probability of a positive test given disease is present = 0.99 = 1 – 0.01
P(B|A’) = probability of a positive test given disease is not present = 0.03,
so,
P(A|B) = 0.99*0.001 / [0.99*0.001 + 0.03*0.999]
= 0.00099 / [0.00099 + 0.02997]
= 0.00099 / 0.03096
= 0.03198
In my experience, if I find agreement when using both methods, then the chances I have done the computation correctly are very high!
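As a quick check, here is a small R sketch (the variable names are my own) that carries out both computations for the spina bifida example:

# Method 1: expected counts in a population of 100,000 pregnancies
n         <- 100000
diseased  <- n * 0.001                 # 100 expected cases of spina bifida
true_pos  <- diseased * 0.99           # 99 correct positive tests
false_pos <- (n - diseased) * 0.03     # 2997 incorrect positive tests
true_pos / (true_pos + false_pos)      # 0.0319767

# Method 2: Bayes Rule applied directly
pA      <- 0.001                       # P(disease present)
pB_A    <- 0.99                        # P(positive test | disease present)
pB_notA <- 0.03                        # P(positive test | disease absent)
pB_A * pA / (pB_A * pA + pB_notA * (1 - pA))   # 0.0319767, the same answer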
A pamphlet provided at the midwife clinic my wife attended summarized the issue this way: “For every 33 positive tests, there will be on average only one actual case of spina bifida. The decision of whether or not to have the test is up to you, but we do not recommend it. We recommend you wait and have an amniocentesis test, later, which is far more reliable.”
It is not obvious, but Bayes Rule can also be understood in this way:
P(A|B) = P(B and A) / P(B),
where
P(B and A) = P(B|A)P(A)
and
P(B) = P(A and B) + P(A’ and B)
= P(B|A)P(A) + P(B|A’)P(A’),
which is not obvious!
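With the spina bifida numbers, for example, P(B and A) = 0.99*0.001 = 0.00099 and P(B) = 0.00099 + 0.03*0.999 = 0.03096, so P(A|B) = 0.00099 / 0.03096 = 0.03198, exactly as before.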
It is vital to understand that Bayes Rule is not limited to such simple cases. We could have far more possible outcomes to deal with, and we can even consider cases in which the random variables are continuous.
Section 2: Bayesian Statistics
The problem that was actually studied by Thomas Bayes is called “inverse probability,” which we would today call “statistical inference.” What Bayes realized was that it was possible to use conditional probability arguments to estimate parameters. Recall that in the Normal distribution, μ and σ are parameters. Likewise, for a binomial distribution, the parameters are n and π, where π is the probability of a success.
However, this method requires some courage on the part of the user. It requires that we view the parameter to be estimated as a random variable! For example, consider a real coin. We do not know if it is a fair coin or not. That is, letting π denote the probability a toss will result in heads, we do not know the value of π. To use Bayes rule, here, we will need to put forward a probability distribution for π. This is called the prior distribution. Then, we toss the coin n times and count the number of heads. Making the reasonable assumption that the number of heads follows a binomial distribution conditional on π, we can then employ Bayes Rule to get an updated distribution on π, called the posterior distribution.
Let us propose a prior distribution for the probability of heads as follows:
Prior Probability Distribution for π
Value of π   Probability
4/10         ¼
5/10         ½
6/10         ¼
To make this situation concrete, let us translate the problem from a coin to a question of the number of black marbles out of 10 marbles in a bag. That is, you are given a bag containing 10 marbles, some black and some white. In fact, you are told that there are either 4, 5, or 6 black marbles, but you don’t know which it is. As a prior distribution, suppose you personally feel the chances are ¼ that there are 4 black marbles, the chances are ½ that there are 5 black marbles, and that the chances are ¼ that there are 6 black marbles.
Now, note that if there are 4 black marbles, then the probability of drawing a black marble is 4/10 = 0.4. Recall that a few paragraphs back, we let π denote the prior probability of getting heads. Here, let π denote the prior probability of getting a black marble. According to our prior probabilities, the probability is ¼ that π = 4/10, or
P(π=4/10) = ¼.
Study the above sentences carefully – they contain probability at two different levels: one is the probability of drawing a black marble; the other is the probability that the bag contains 4 black marbles! [The key to making progress in a field like this is to come back again and again to wrestle with the above paragraphs until you get the meaning of that cryptic statement “P(π=4/10) = ¼.” Do not sit back and figure that understanding this will be done by somebody else!]
In summary, our prior distribution for π is:
P(π=0.4) = ¼, P(π=0.5) = ½, and P(π=0.6)=1/4.
Next, we want to get some data with which to update our prior distribution. Let us do this by drawing five marbles (with replacement). That means that each time we draw a marble we put it back and shake the bag before drawing again. From a practical point of view, sampling with replacement is inefficient, but it results in a much simpler probability problem than sampling without replacement. Based on the number of black and white marbles in the sample, we will update our prior distribution using Bayes Rule! Sampling with replacement in this setting, the number of black marbles in the sample “conditional on π” follows a binomial probability distribution, one of the simplest useful distributions.
Suppose we find a total of 3 black and 2 white marbles. How do we digest that information? The key step is to evaluate the probability of getting 3 black marbles out of 5 for each of our three potential values of π. If π=0.4, what is the probability of getting 3 black marbles? We can use the binomial probability distribution to find out. (See any online source of information on the Binomial distribution or any intro stat text.)
P(k=3|π=0.4) = 5C3 (0.4)^3 (0.6)^2 = 0.2304.
That term is the P(B|A) that occurs in the numerator of the right-hand side of Bayes Rule. We need to multiply it by P(π=.4), which is given by the prior distribution as ¼. The probability of getting 3 black marbles in 5 and having π=0.4 is computed as
P(k=3 & π=0.4) = P(k=3|π=0.4) P(π=.4) = (0.2304) ( ¼ ) = 0.0576.
We now need the denominator. That is a little more complicated than in the spina bifida case because there are three possible values of π that were postulated by the prior distribution.
Here, our denominator will look like this:
P(k=3|π=0.4) P(π=.4) + P(k=3|π=0.5) P(π=.5) + P(k=3|π=0.6) P(π=.6)
= 5C3 (0.4)^3 (0.6)^2 ( ¼ ) + 5C3 (0.5)^3 (0.5)^2 ( ½ ) + 5C3 (0.6)^3 (0.4)^2 ( ¼ )
= 0.0576 + 0.15625 + 0.0864
= 0.30025.
Putting all this together, we have an updated, posterior probability of
P(π=0.4|k=3)
= P(k=3|π=0.4)P(π=.4) / [P(k=3|π=0.4)P(π=.4)+P(k=3|π=0.5)P(π=.5)+P(k=3|π=0.6)P(π=.6)]
= 0.0576 / [0.0576 + 0.15625 + 0.0864]
= 0.0576 / 0.30025
= 0.1918.
Note that this posterior probability that π=0.4 has been reduced from the prior probability of ¼. In turn, this means that greater weight will be given to other values of π by the posterior distribution.
We will now compute the posterior probability for π=0.5. This is relatively simple because the denominator remains the same. We have
P(π=0.5|k=3) P(k=3) = P(π=0.5 and k=3),
or,
P(π=0.5|k=3) = P(π=0.5 and k=3) / P(k=3),
where
P(k=3) = P(π=0.4 and k=3) + P(π=0.5 and k=3) + P(π=0.6 and k=3) = 0.30025.
Therefore,
P(π=0.5|k=3) = 0.15625 / 0.30025 = 0.5204,
which is larger than the prior probability of ½. The same denominator gives P(π=0.6|k=3) = 0.0864 / 0.30025 = 0.2878, and the three posterior probabilities sum to one, as they must.
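The entire posterior distribution can be computed in a few lines of R (a sketch; the object names are my own):

pi_values  <- c(0.4, 0.5, 0.6)          # the three candidate values of π
prior      <- c(0.25, 0.50, 0.25)       # prior probabilities
likelihood <- dbinom(3, 5, pi_values)   # P(k=3 | π) for each candidate value
joint      <- likelihood * prior        # P(k=3 and π): 0.0576, 0.15625, 0.0864
posterior  <- joint / sum(joint)        # divide by P(k=3) = 0.30025
round(posterior, 4)                     # 0.1918 0.5204 0.2878

Note that the posterior moves probability away from π = 0.4 and toward π = 0.5 and π = 0.6, which is what we should expect after seeing 3 black marbles in 5 draws.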
Section 3: Anderson’s Iris Data
We will be working with a famous data set, “Anderson’s Iris data,” which is found in R under the name of iris. The Iris data contains lengths and widths of the petals and sepals of 150 specimens of iris flowers. In the R data set, those four measurements are labeled Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. There are three species present: Versicolor, Virginica, and Setosa.
The Iris data seems to be used in every multivariate statistics book, perhaps because all reasonable classification methods correctly classify almost all of the 150 flowers. Despite the small size of the iris data, it is also used in almost all texts that contain the phrase “data mining” in the title.
When I studied multivariate statistics, I was not particularly excited by the Iris data set – what did I care? In the back of my mind, I think I believed I could easily tell which specimens belonged to which species by visual inspection. With the aid of photos on the Internet, I now see I was presumptuous: judging from photographs alone, this is not a trivial classification problem. However, as we will see, the species are fairly easy to distinguish by the relative sizes of the flower sepals and flower petals.
The sepal is the larger portion of the flower, while the petal is smaller, whitish and lies on top of the sepal.
Running the command
>dim(iris)
we find the iris data set contains 150 rows and 5 columns. As it happens, there are 50 of each type of iris flower. I wonder how that was arranged and why.
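A few more commands show the layout of the data (the output below is from R's built-in iris data frame):

dim(iris)              # 150 rows, 5 columns
names(iris)            # Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
table(iris$Species)    # 50 setosa, 50 versicolor, 50 virginica
head(iris, 3)          # the first three rows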