Maximum Likelihood

A simple example of maximum likelihood estimation

Consider the simple procedure of tossing a coin with the goal of estimating the probability of heads for the coin. The probability of heads for a fair coin is 0.5. However, for this example we will assume that the probability of heads is unknown (maybe the coin is strange in some way or we are testing whether or not the coin is fair). The act of tossing the coin n times forms an experiment--a procedure that, in theory, can be repeated an infinite number of times and has a well-defined set of possible outcomes. When flipping a coin n times, there are 2^n possible sample outcomes, ω. The set of possible outcomes forms the sample space, Ω. For n = 3 tosses of a coin, the sample space is

\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}

where H denotes heads and T denotes tails. An event is a collection of sample outcomes and is said to occur if the outcome of a particular experiment is a member of the event. Often, it is useful to think of a function whose domain is the sample space, Ω. Such a function is known as a random variable. Examples of random variables for the coin flip experiment are (i) the number of times heads appears, (ii) the number of times tails appears, and (iii) the number of flips until a head appears. Random variables are usually denoted by uppercase letters such as X, Y, and Z.

We now consider the coin toss example in the context of likelihood estimation. The three main components of the statistical approach are (i) the data, (ii) a model describing the probability of observing the data, and (iii) a criterion that allows us to move from the data and model to an estimate of the parameters of the model.

Data: Assume that we have actually performed the coin flip experiment, tossing a coin n = 10 times. We observe that the sequence of heads and tails was H, H, H, T, H, T, T, H, T, H. We will denote heads by 1 and tails by 0; hence, our data will be coded as

x = {1, 1, 1, 0, 1, 0, 0, 1, 0, 1}

where x is a vector with elements x_1 = 1, x_2 = 1, x_3 = 1, x_4 = 0, ..., x_10 = 1. In tossing the coin, we note that heads appeared 6 times and tails appeared 4 times.
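This coding is easy to reproduce. A minimal Python sketch (the variable names are mine, not from the text) that stores the observations and tallies heads and tails:

# Observed sequence of coin flips, coded as 1 = heads, 0 = tails
x = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

n = len(x)              # number of tosses (10)
heads = sum(x)          # number of heads (6)
tails = n - heads       # number of tails (4)
print(n, heads, tails)  # -> 10 6 4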

Model: An appropriate model that describes the probability of observing heads for any single flip of the coin is the Bernoulli distribution. The Bernoulli distribution has the following form:

f(x_i | p) = p^{x_i} (1 - p)^{1 - x_i}

where p is the probability of heads and x_i is the experimental outcome of the ith coin flip (i.e., heads or tails). The vertical line in the function means "given". Note that if p = 0.6, the function returns 0.6 if x_i = 1 and 0.4 if x_i = 0. The number of heads in n independent Bernoulli trials follows a binomial distribution.
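As a quick illustration of the model, the Bernoulli probability function can be written as a short Python function (a sketch; the function name bernoulli is my own choice):

def bernoulli(x_i, p):
    """Probability of observing outcome x_i (1 = heads, 0 = tails)
    given the probability of heads p."""
    return p ** x_i * (1 - p) ** (1 - x_i)

# With p = 0.6 the function returns 0.6 for heads and 0.4 for tails,
# as noted in the text.
print(bernoulli(1, 0.6))   # -> 0.6
print(bernoulli(0, 0.6))   # -> 0.4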

Criterion: Now that we have specified the data, x, and the model, f(x_i | p), we need a criterion to estimate the parameters of the model. Here, the parameter of interest (the parameter to be estimated) is p--the probability that heads appears for any single toss of the coin. Several methods could be applied at this point, including the method of moments, least squares, and Bayesian estimation. However, we will consider only the method of maximum likelihood here.

The method of maximum likelihood was first proposed by the English statistician and population geneticist R. A. Fisher. The maximum likelihood method finds the estimate of a parameter that maximizes the probability of observing the data given a specific model for the data.

The likelihood function is defined as

L(p | x) = \prod_{i=1}^{n} f(x_i | p)

The likelihood function is simply the joint probability of observing the data. The capital pi (\prod) denotes a product: the likelihood function is obtained by multiplying the probability function for each toss of the coin. In the case of the coin flip experiment, where we are assuming a Bernoulli distribution for each coin flip, the likelihood function becomes

L(p | x) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i} = p^{\sum_{i=1}^{n} x_i} (1 - p)^{n - \sum_{i=1}^{n} x_i}

Taking the logarithm of the likelihood function does not change the value of p for which the likelihood is maximized. After taking the logarithm of both sides of the equation, this becomes

\log L(p | x) = \left( \sum_{i=1}^{n} x_i \right) \log p + \left( n - \sum_{i=1}^{n} x_i \right) \log (1 - p)

The following figures show plots of likelihood, L, as a function of p for several different possible outcomes of n = 10 flips of a coin. Note that for the case in which 3 heads and 7 tails were the outcome of the experiment, the likelihood appears to be maximized at p = 0.3. Similarly, p = 0.5 for the case of 5 heads and 5 tails, p = 0.8 for the case of 8 heads and 2 tails, and p = 0.9 for the case of 9 heads and 1 tail. The likelihood appears to be maximized when p is the proportion of the time that heads appeared in our experiment. This illustrates the "brute force" way to find the maximum likelihood estimate of p.
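The same "brute force" search can be carried out numerically rather than graphically. The following minimal Python sketch (reusing the data vector x defined above; the grid resolution of 0.001 is an arbitrary choice) evaluates the log-likelihood over a grid of p values and reports the value of p that maximizes it:

import math

x = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]   # 6 heads, 4 tails

def log_likelihood(p, data):
    """Log-likelihood of p under independent Bernoulli coin flips."""
    heads = sum(data)
    n = len(data)
    return heads * math.log(p) + (n - heads) * math.log(1 - p)

# Evaluate the log-likelihood on a grid of p values (excluding 0 and 1,
# where the logarithms are undefined) and keep the best one.
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=lambda p: log_likelihood(p, x))
print(best_p)   # -> 0.6, the proportion of heads observed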

Instead of determining the maximum likelihood value of p graphically, we could also find the maximum likelihood estimate of p analytically. We can do this by taking the derivative of the log-likelihood function with respect to p and finding where the slope is zero. When this is done, the maximum is found at \hat{p} = (\sum_{i=1}^{n} x_i)/n = 6/10 = 0.6. The estimate of p (the probability of heads) is just the proportion of heads that we observed in our experiment.
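Written out explicitly (in LaTeX notation, using the log-likelihood derived above), setting the slope to zero gives

\frac{d}{dp} \log L(p | x) = \frac{\sum_{i=1}^{n} x_i}{p} - \frac{n - \sum_{i=1}^{n} x_i}{1 - p} = 0
\quad \Longrightarrow \quad
\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n}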

Hypothesis testing in a likelihood framework

On the basis of the data, a plausible measure of the relative tenability of two competing hypotheses is the ratio of their likelihoods. Consider the case in which H0 may specify that θ ∈ ω_0, where ω_0 is a subset of the possible values of θ, and H1 may specify that θ ∈ ω_1, where ω_1 is disjoint from ω_0. Here θ represents the parameters of the model describing the data (for example, p from the Bernoulli distribution). The likelihood ratio is

\Lambda = \frac{\max_{\theta \in \omega_0} L(\theta | x)}{\max_{\theta \in \omega_1} L(\theta | x)}

Here, the likelihood calculated under the null hypothesis is in the numerator and the likelihood calculated under the alternative hypothesis is in the denominator. When Λ is small, the null hypothesis is discredited, whereas when Λ is large, the alternative hypothesis is discredited.

One interesting fact concerns the case in which one hypothesis is a subset (or special case) of another, more general hypothesis. When the null hypothesis is a subset (or special case) of the alternative hypothesis, -2 log Λ is distributed according to a χ² distribution with q degrees of freedom under the null hypothesis, where q is the difference in the number of free parameters between the general and restricted hypotheses.

As an example, consider the coin toss. Two hypotheses will be considered. H0 will denote the restricted (null) hypothesis, whereas H1 will denote the unrestricted (or general) hypothesis. We want to test the null hypothesis that the coin is fair. Hence, under H0, p = 0.5. Under H1, p takes the value between 0 and 1 that maximizes the likelihood function. Under H0, the likelihood is

L_0 = 0.5^{\sum_{i=1}^{n} x_i} (1 - 0.5)^{n - \sum_{i=1}^{n} x_i} = 0.5^{n} = 0.5^{10}

Under H1, the likelihood is

L_1 = \hat{p}^{\sum_{i=1}^{n} x_i} (1 - \hat{p})^{n - \sum_{i=1}^{n} x_i} = 0.6^{6} \times 0.4^{4}

The likelihood ratio for a test of the null hypothesis that p = 0.5 is

\Lambda = \frac{L_0}{L_1} = \frac{0.5^{10}}{0.6^{6} \times 0.4^{4}}

To calculate the likelihood under the null hypothesis, one simply substitutes 0.5 for p in the likelihood function. We already discussed how to calculate the likelihood under the alternative (unrestricted) hypothesis. The likelihood under the alternative hypothesis is maximized by substituting

\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{6}{10} = 0.6

for p in the likelihood function. The two hypotheses differ by 1 free parameter; the parameter p is fixed under H0 and free to vary under H1. Hence, -2 log Λ can be compared to a χ² distribution with 1 degree of freedom. If -2 log Λ is greater than 3.84, then the null hypothesis can be rejected with a level of Type I error (falsely rejecting the null hypothesis) of 5%.
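These numbers are easy to check directly. A minimal Python sketch for the 10-toss example (scipy is assumed to be available for the χ² critical value):

import math
from scipy.stats import chi2

n, heads = 10, 6
p_null = 0.5
p_hat = heads / n            # maximum likelihood estimate, 0.6

log_L0 = heads * math.log(p_null) + (n - heads) * math.log(1 - p_null)
log_L1 = heads * math.log(p_hat) + (n - heads) * math.log(1 - p_hat)

lrt = -2 * (log_L0 - log_L1)          # -2 log Lambda
critical = chi2.ppf(0.95, df=1)       # about 3.84
print(round(lrt, 3), round(critical, 2))   # -> 0.403 3.84

With only 10 tosses, the statistic falls well below 3.84, so the null hypothesis that the coin is fair cannot be rejected for these data.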

Instead of using the χ² distribution to test the significance of the likelihood ratio, the null hypothesis can be tested in another way -- through simulation under the null hypothesis. This approach is called Monte Carlo simulation or parametric bootstrapping. To illustrate this approach, we will again consider the coin toss example. Let us assume that we have tossed a coin 1000 times and have noted that heads appears 450 times. Once again, we want to consider as a null hypothesis that the coin is fair (p = 0.5). Under the alternative hypothesis, the likelihood will again be calculated with an unrestricted p.

The likelihood ratio test statistic for this example is -2 log Λ = 10.018. Is this value greater than we would expect if p = 0.5? A computer program was written that simulates coin tosses under the assumption that p = 0.5. For every replicate, we generate 1000 coin tosses and calculate the likelihood ratio test statistic (-2 log Λ). After this procedure is repeated many thousands or millions of times, a histogram of the frequency with which different -2 log Λ values appear under the null hypothesis is constructed. The following figure shows the results of such a simulation.
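Such a simulation might be sketched as follows in Python (numpy is assumed; the replicate count and random seed are arbitrary choices of mine):

import numpy as np

rng = np.random.default_rng(0)
n, heads_obs, p_null = 1000, 450, 0.5

def neg2_log_lambda(heads, n, p_null):
    """-2 log Lambda for the observed number of heads versus p_null."""
    p_hat = heads / n
    log_L0 = heads * np.log(p_null) + (n - heads) * np.log(1 - p_null)
    log_L1 = heads * np.log(p_hat) + (n - heads) * np.log(1 - p_hat)
    return -2 * (log_L0 - log_L1)

observed = neg2_log_lambda(heads_obs, n, p_null)   # about 10.018

# Simulate many data sets under the null hypothesis (p = 0.5) and
# recompute the test statistic for each replicate.
reps = 100_000
sim_heads = rng.binomial(n, p_null, size=reps)
sim_stats = np.array([neg2_log_lambda(h, n, p_null) for h in sim_heads])

print(np.quantile(sim_stats, 0.95))    # about 3.8, close to the chi-square cutoff
print(np.mean(sim_stats >= observed))  # approximate p-value, well below 0.005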

The 95% cutoff for this simulated distribution is about 3.8, which closely corresponds to what we would expect from the χ² distribution (3.84). In fact, we can directly compare the simulated distribution to a χ² distribution with 1 degree of freedom. The two distributions are very similar.

In this case, where we had 450 heads out of 1000 coin tosses, we can reject the null hypothesis that p = 0.5 with greater than 99.5% confidence; we would not expect to observe only 450 heads out of 1000 tosses if the coin were fair. The coin appears to be biased toward tails.