**A simple example of maximum likelihood estimation**

Consider the simple procedure of tossing a coin with the goal
of estimating the probability of heads for the coin. The probability
of heads for a fair coin is 0.5. However, for this example we
will assume that the probability of heads is unknown (maybe the
coin is strange in some way or we are testing whether or
not the coin is fair). The act of tossing the coin *n* times
forms an experiment--a procedure that, in theory, can be repeated
an infinite number of times and has a well-defined set of possible
outcomes. When flipping a coin *n* times, there are 2^{n}
possible sample outcomes, ω. The set
of possible outcomes forms the sample space, Ω.
For *n* = 3 tosses of a coin, the sample space is

Ω = {*HHH*, *HHT*, *HTH*, *HTT*, *THH*, *THT*, *TTH*, *TTT*},
where *H* denotes heads and *T* denotes tails. An event
is a collection of sample outcomes and is said to occur if the
outcome of a particular experiment is a member of the event.
Often, it is useful to think of a function whose domain is the
sample space, Ω. Such a function is
known as a random variable. Examples of random variables for
the coin flip experiment are (i) the number of times heads appears,
(ii) the number of times tails appears, and (iii) the number of
flips until a head appears. Random variables are often denoted
by uppercase letters, such as *X*, *Y*, and *Z*.
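To make these definitions concrete, the sample space and a random variable for *n* = 3 tosses can be enumerated in a few lines of code (a minimal sketch; the function and variable names are mine):

```python
from itertools import product

def sample_space(n):
    """Enumerate all 2**n outcomes of n coin tosses as strings of H and T."""
    return ["".join(tosses) for tosses in product("HT", repeat=n)]

omega = sample_space(3)
print(len(omega))           # 8 outcomes, i.e. 2**3
print(omega[0], omega[-1])  # HHH TTT

# A random variable maps each outcome to a number,
# e.g. X = the number of times heads appears:
num_heads = {w: w.count("H") for w in omega}
print(num_heads["HTT"])  # 1
```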

We now consider the coin toss example in the context of likelihood estimation. The three main components of the statistical approach are (i) the data, (ii) a model describing the probability of observing the data, and (iii) a criterion that allows us to move from the data and model to an estimate of the parameters of the model.

**Data:** Assume that we have actually performed
the coin flip experiment, tossing a coin *n* = 10 times.
We observe that the sequence of heads and tails was *H, H,
H, T, H, T, T, H, T, H*. We will denote heads by 1 and tails
by 0; hence, our data will be coded as

**x** = (1, 1, 1, 0, 1, 0, 0, 1, 0, 1),

where **x** is a vector with elements
*x*_{1} = 1, *x*_{2} = 1, *x*_{3} = 1, *x*_{4} = 0, ..., *x*_{10} = 1.
In tossing the coin, we note that heads appeared
6 times and tails appeared 4 times.
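In code, the coded data vector and its summary counts look like this (a small sketch; the variable names are mine):

```python
# Observed sequence H, H, H, T, H, T, T, H, T, H coded with 1 = heads, 0 = tails.
x = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

n = len(x)         # 10 tosses
heads = sum(x)     # 6 heads
tails = n - heads  # 4 tails
print(n, heads, tails)  # 10 6 4
```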

**Model:** An appropriate model that describes the
probability of observing heads for any single flip of the coin
is the Bernoulli distribution. The Bernoulli distribution has
the following form:

*f*(*x*_{i} | *p*) = *p*^{*x*_{i}} (1 - *p*)^{1 - *x*_{i}},

where *p* is the probability of heads and *x*_{i}
is the experimental outcome of the *i*th toss of the coin
(1 for heads, 0 for tails).

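The Bernoulli probability function can be expressed as a one-line helper (a sketch assuming the form above; the function name is mine):

```python
def bernoulli(x_i, p):
    """Bernoulli probability f(x_i | p) = p**x_i * (1 - p)**(1 - x_i),
    where x_i is 1 for heads and 0 for tails."""
    return p ** x_i * (1 - p) ** (1 - x_i)

# For a coin with heads-probability p = 0.6:
print(bernoulli(1, 0.6))  # probability of heads, 0.6
print(bernoulli(0, 0.6))  # probability of tails, 0.4
```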
**Criterion:** Now that we have specified the
data, **x**, and the model,
*f*(*x*_{i} | *p*), we need a criterion for choosing
an estimate of the model parameter *p* from the data. The
criterion used here is maximum likelihood.

The method of maximum likelihood was first proposed by the English statistician and population geneticist R. A. Fisher. The maximum likelihood method finds the estimate of a parameter that maximizes the probability of observing the data given a specific model for the data.

The likelihood function is defined as

*L*(*p* | **x**) = Π_{i=1}^{n} *f*(*x*_{i} | *p*).

The likelihood function is simply the joint probability of observing
the data. The large Π means "product";
the likelihood function is obtained by multiplying the probability
function for each toss of the coin. In the case of the coin flip
experiment, where we are assuming a Bernoulli distribution for
each coin flip, the likelihood function becomes

*L*(*p* | **x**) = Π_{i=1}^{n} *p*^{*x*_{i}} (1 - *p*)^{1 - *x*_{i}} = *p*^{Σ*x*_{i}} (1 - *p*)^{*n* - Σ*x*_{i}}.

Taking the logarithm of the likelihood function does not change the
value of *p* for which the likelihood is maximized. After
taking the logarithm of both sides of the equation, this becomes

log *L*(*p* | **x**) = Σ*x*_{i} log *p* + (*n* - Σ*x*_{i}) log(1 - *p*).
The following figures show plots of likelihood, *L*, as a
function of *p* for several different possible outcomes of
*n* = 10 flips of a coin. Note that for the case in which
3 heads and 7 tails were the outcome of the experiment, the
likelihood appears to be maximized at *p* = 0.3. Similarly,
*p* = 0.5 for the case of 5 heads and 5 tails, *p* =
0.8 for the case of 8 heads and 2 tails, and *p* = 0.9 for
the case of 9 heads and 1 tail. The likelihood appears to be
maximized when *p* is the proportion of the time that heads
appeared in our experiment. This illustrates the "brute
force" way to find the maximum likelihood estimate of *p*.
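This brute-force search is easy to reproduce: evaluate the likelihood on a fine grid of *p* values and keep the maximizer (a sketch; the grid resolution and names are my choices):

```python
def likelihood(p, heads, n):
    """Bernoulli likelihood L(p | x) = p**heads * (1 - p)**(n - heads)."""
    return p ** heads * (1 - p) ** (n - heads)

# Grid of candidate p values in [0, 1].
grid = [i / 1000 for i in range(1001)]

# For 6 heads in 10 tosses, the grid maximizer is the observed proportion:
best_p = max(grid, key=lambda p: likelihood(p, heads=6, n=10))
print(best_p)  # 0.6
```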

Instead of determining the maximum likelihood value of *p* graphically,
we could also find the maximum likelihood estimate of *p*
analytically. We can do this by taking the derivative of the
likelihood function with respect to *p* and finding where
the slope is zero. When this is done, the maximum is found at
*p* = Σ*x*_{i}/*n* = 6/10 = 0.6. The estimate of *p* (the probability
of heads) is just the proportion of heads that we observed in our
experiment.
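Written out, the derivative step looks like this:

```latex
\frac{d}{dp}\log L(p \mid \mathbf{x})
  = \frac{\sum_{i=1}^{n} x_i}{p} - \frac{n - \sum_{i=1}^{n} x_i}{1 - p} = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{6}{10} = 0.6
```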

**Hypothesis testing in a likelihood framework**

On the basis of the data, a plausible measure of the relative tenability
of two competing hypotheses is the ratio of their likelihoods.
Consider the case in which
H_{0} specifies that θ ∈ ω_{0},
where ω_{0} is a subset of the possible values of θ,
and H_{1} specifies that θ ∈ ω_{1},
where ω_{1} is disjoint from ω_{0}.
Here θ represents the parameters of the
model describing the data (for example, *p* from the Bernoulli
distribution). The likelihood ratio is

Λ = max_{θ ∈ ω_{0}} *L*(θ | **x**) / max_{θ ∈ ω_{1}} *L*(θ | **x**).

Here, the likelihood calculated under the null hypothesis is in the numerator and the likelihood calculated under the alternative hypothesis is in the denominator. When Λ is small, the null hypothesis is discredited, whereas when Λ is large, the alternative hypothesis is discredited.

One interesting fact concerns the case in which one hypothesis
is a subset (or special case) of another, more general hypothesis.
When the null hypothesis is a subset (or special case) of the
alternative hypothesis, -2 log Λ is asymptotically distributed
according to a χ^{2}
distribution with *q* degrees of freedom under the null hypothesis,
where *q* is the difference in the number of free parameters
between the general and restricted hypotheses.

As an example, consider the coin toss.
Two hypotheses will be considered.
H_{0}
will denote the restricted (null) hypothesis whereas
H_{1}
will denote the
unrestricted (or general) hypothesis.
We want to test the null hypothesis that the coin is fair.
Hence, under
H_{0},
*p* = 0.5.
Under
H_{1},
*p* takes
the value between 0 and 1 that maximizes the likelihood function.
Under H_{0}, the likelihood is

*L*(0.5 | **x**) = 0.5^{Σ*x*_{i}} (1 - 0.5)^{*n* - Σ*x*_{i}} = 0.5^{*n*}.

Under H_{1},
the likelihood is

*L*(*p̂* | **x**) = *p̂*^{Σ*x*_{i}} (1 - *p̂*)^{*n* - Σ*x*_{i}}, where *p̂* = Σ*x*_{i}/*n* is the maximum likelihood estimate.

The likelihood ratio for a test of the null hypothesis that *p* = 0.5 is

Λ = *L*(0.5 | **x**) / *L*(*p̂* | **x**).

To calculate the likelihood under the null hypothesis, one simply
substitutes 0.5 for *p* in the likelihood function. We
already discussed how to calculate the likelihood under the alternative
(unrestricted) hypothesis. The likelihood under the alternative
hypothesis is maximized by substituting
Σ*x*_{i}/*n* (the observed proportion of heads)
for *p* in the likelihood function. The two hypotheses differ
by 1 free parameter; the parameter *p* is fixed under
H_{0}
and free to vary under
H_{1}.
Hence, -2 log Λ
can be compared to a
χ^{2} distribution
with 1 degree of freedom. If
-2 log Λ
is greater than 3.84, then the null hypothesis can be rejected with a level of Type I error (falsely rejecting
the null hypothesis) of 5%.
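For the 10-toss example with 6 heads, the test statistic can be computed directly (a sketch; the function names are mine):

```python
from math import log

def log_likelihood(p, heads, n):
    """Bernoulli log-likelihood: heads*log(p) + (n - heads)*log(1 - p)."""
    return heads * log(p) + (n - heads) * log(1 - p)

heads, n = 6, 10
p_hat = heads / n  # maximum likelihood estimate under H1

# -2 log(Lambda) = 2 * [log L(p_hat | x) - log L(0.5 | x)]
stat = 2 * (log_likelihood(p_hat, heads, n) - log_likelihood(0.5, heads, n))
print(round(stat, 3))  # 0.403 -- well below 3.84, so H0 is not rejected
```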

Instead of using the
χ^{2}
distribution to test the significance of the likelihood
ratio, the null hypothesis can be tested in another way -- through
simulation under the null hypothesis. This approach is called
Monte Carlo simulation or parametric bootstrapping. To illustrate
this approach, we will again consider the coin toss example.
Let us assume that we have tossed a coin 1000 times and have
observed that heads appeared 450 times. Once again, we want to consider
as a null hypothesis that the coin is fair (*p* = 0.5).
The alternative hypothesis will again be the likelihood calculated
with an unrestricted *p*.

The likelihood ratio test statistic for this example is
-2 log Λ = 10.017.
Is this value greater than we would expect if *p* = 0.5?
To find out, a computer program was written that simulates tosses
under the assumption that *p* = 0.5: for every
computer replicate, we generate 1000 coin tosses and calculate
the likelihood ratio test statistic (-2 log Λ).
After this procedure is repeated many thousands or millions of
times, a histogram of the frequencies with which different
-2 log Λ
values appear under the null hypothesis is constructed. The following
figure shows the results of such a simulation:

The 95% cutoff for this simulated distribution is about 3.8,
which closely corresponds to what we would expect from the
χ^{2}
distribution (3.84). In fact, we can directly compare the simulated
distribution to a
χ^{2}
distribution with 1 degree of freedom. The two distributions
are very similar.
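Such a simulation is straightforward to sketch (the seed, replicate count, and function names are my choices; a real analysis would use many more replicates):

```python
import random
from math import log

def lrt_statistic(heads, n):
    """-2 log(Lambda) for H0: p = 0.5 versus an unrestricted p."""
    if heads == 0 or heads == n:
        return 2 * n * log(2.0)  # boundary case: the MLE is 0 or 1
    p_hat = heads / n
    log_l1 = heads * log(p_hat) + (n - heads) * log(1 - p_hat)
    log_l0 = n * log(0.5)
    return 2 * (log_l1 - log_l0)

random.seed(0)
n, replicates = 1000, 2000
null_stats = sorted(
    lrt_statistic(sum(random.random() < 0.5 for _ in range(n)), n)
    for _ in range(replicates)
)

# Empirical 95% cutoff; should be close to the chi-square(1) value of 3.84.
cutoff = null_stats[int(0.95 * replicates)]
print(round(cutoff, 2))

# Observed statistic for 450 heads in 1000 tosses:
print(round(lrt_statistic(450, 1000), 3))  # 10.017
```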

In this case, where we had 450 heads out of 1000 coin tosses,
we can reject the null hypothesis that *p* = 0.5 at well below
the 0.005 significance level; we would not expect to observe 450 heads
out of 1000 tosses if the coin were fair. The coin is biased toward
tails.