Friday, October 20, 2006

Statistics Primer. Part 5: Constructing a Confidence Interval for the Sample Proportion



At this point it's probably not possible just to jump in and read the posts separately unless you've had statistics already. If you are a statistical virgin, go and read the four earlier posts in this series first. The links to them are at the bottom of this post.

As a reminder, our goal is to learn to understand confidence intervals.* For example, political polls might say that Bush's approval is 36%, give or take 3%. Or a poll might tell us that 52% of people prefer candidate A over B, give or take 6%. Where do these numbers come from?

The previous post started the necessary explanations. The 52% or 36% figure is simply the proportion of people in the poll (the sample) who expressed a certain opinion. The other stuff, about giving or taking three or six percent, has to do with the sampling distribution of this sample proportion, and its size depends on the sample size (how many people were asked) and the inherent variability in the population (and in the sample).

Put another way, the 36% Bush approval rate, give or take 3%, is really an interval ranging from a low value of 33% to a high value of 39%. It's centered on the sample proportion and extends a certain distance in each direction from it. Like this, visually:

I------sample proportion------I

Or like this, for a smaller interval:

I---sample proportion---I

We call this interval a confidence interval when its length is derived from the sampling distribution for the sample proportion. Think of it this way: You are a blindfolded archer shooting funny kinds of arrows at a dartboard. The tip of each arrow looks like a staple or like the confidence interval I have drawn above, and you are to shoot lots of these staples at the dartboard. After you finish, you can take your blindfold off and go and check how many of your staples actually cover the bull's eye on the board. If you used really, really wide staples (say, a mile wide), you will cover the bull's eye every time, but you haven't really shown any precision at all. On the other hand, if you use very narrow staples you are going to miss the bull's eye a lot. The very wide staples allow us to be very confident in the knowledge that we have hit the bull's eye. The narrow staples would give us a lot more precision but less confidence. Some sort of a compromise is needed to get both.

The compromise statisticians use is like tying the staple length to the size of the dartboard, very roughly. In our case the dartboard is the sampling distribution for the sample proportion, and our knowledge of that sampling distribution allows us to specify a level of confidence first and then to make a staple of the necessary length to get that level of confidence.

The task is made much simpler by one remarkable result in statistics, called the Central Limit Theorem (CLT). This theorem shows that the sampling distribution of the sample mean, and the sample proportion, too, has one particular form if the sample we use is large enough**. Thus, if we learn the probabilities of this one distribution we can apply them to a whole lot of problems. Not all problems, mind you. But a large chunk of the more common problems.

This probability distribution is called the normal distribution or the Gaussian distribution (after the mathematician Carl Friedrich Gauss) or the Bell Curve. The last name comes from its shape, as the distribution looks like a bell viewed from the side. It has all sorts of neat characteristics.*** For example, the mean, the mode and the median are all identical and fall right below the peak of the bell. The two sides of the distribution are mirror images of each other. Here is a picture of the normal distribution****:





Ignore the blackened area for the time being. It's not part of the picture but a way of looking at probability calculations.

Note that there is an infinite number of these distributions, one for each possible value of the mean and the standard deviation, so some will be further to the right or to the left, and some will be fatter than others. To make things easier, statisticians usually show one specific normal distribution, called the standard normal distribution (as I have done above), and calculate the probabilities from that. The standard normal distribution has zero as its mean and one as the length of its standard deviation. So the horizontal axis here can be viewed as measuring the variable (say the sample proportion) in units equal to the length of the standard deviation of the sampling distribution. We can apply the standard normal distribution probabilities to any normally distributed variable by using linear transformations, but it's not necessary to go there right now.

The normal distribution is a probability distribution. For any range of values of the variable on the horizontal axis (say, the sample mean or the sample proportion), the area under the curve above that range shows the corresponding probability. That's what the blackened area in the above picture demonstrates: the probability that this variable has a value between the mean and half a standard deviation above it. Note that the total area under the curve but above the horizontal axis equals one or 100% because it's certain that something happens and the values here cover all feasible ones.
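If you would like to check an area like the blackened one yourself, here is a minimal sketch in Python. It assumes Python 3.8 or later, where the standard library offers `statistics.NormalDist`; the helper name `area_between` is made up for this illustration.

```python
from statistics import NormalDist

# The standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_between(low, high):
    """Area under the standard normal curve between low and high.

    Both arguments are measured in standard deviations from the mean.
    """
    return standard_normal.cdf(high) - standard_normal.cdf(low)


# The area between the mean and half a standard deviation above it.
print(round(area_between(0.0, 0.5), 4))  # 0.1915
```

So a normally distributed variable lands between its mean and half a standard deviation above it a bit over 19% of the time.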

Now, the normal distribution has certain very nice characteristics, shown in the picture below:





(This picture uses the general Greek letters mu and sigma for the mean and standard deviation of whatever distribution we are looking at. Mu is that thing which looks like the letter u with a long tail in the front. Sigma is the letter which looks like the number six. If this bothers you, just pretend that the mu is zero and the sigma one.)

The percentage areas marked in the picture are probabilities. For any normally distributed variable the following is true: If we move two standard deviations up from the mean and two standard deviations down from the mean, the interval we have created corresponds to the probability of 0.954 (add up all the areas under the curve between the two chosen values on the horizontal axis). Likewise, if we move three standard deviations up from the mean and three standard deviations down from the mean, the interval we have created corresponds to the probability of 0.997. (What about trying to find an interval corresponding to probability 1 or certainty? Here we meet a slight snag as the normal distribution has tails which go on to infinity, so such an interval would range from minus infinity to plus infinity.)
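These areas can be recomputed with nothing but Python's `math` module. The sketch below relies on a standard identity: for a normal variable, the probability of landing within k standard deviations of the mean equals erf(k / sqrt(2)). The function name `prob_within` is just an illustrative choice.

```python
import math


def prob_within(k):
    """Probability that a normal variable falls within k standard
    deviations of its mean, on either side."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The figures are shown to four decimals, so they may differ in the last digit from the rounded percentages printed in a picture.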

See what I'm getting to here? Suppose that I wanted to find an interval length in the sampling distribution of a sample proportion which corresponds to the probability of 0.95. How many standard deviations would I need to go out from the mean towards each tail to get an area of that size under the curve? We already know that two standard deviations gives us a fairly good rough approximation of it, but if I want to find the exact value I can get it from precalculated tables in statistics books and even on the net. The value we need is 1.96 standard deviations. In a similar manner the number of standard deviations each side of the mean that would give us the probability of 0.99 is 2.576.
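Instead of a printed table, the same lookup can be done in code. This is a small sketch assuming Python 3.8 or later (for `statistics.NormalDist`); `z_for_confidence` is a hypothetical helper name, not something from a statistics package.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """How many standard deviations to go out on each side of the mean
    so that the central area under the curve equals `confidence`."""
    tail = (1.0 - confidence) / 2.0  # area left over in each tail
    return NormalDist().inv_cdf(1.0 - tail)


print(round(z_for_confidence(0.95), 3))  # 1.96
print(round(z_for_confidence(0.99), 3))  # 2.576
```

The trick is that a central area of 0.95 leaves 0.025 in each tail, so we ask for the point that has 0.975 of the area to its left.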

Ok. I'm skipping quite a lot of mathematics here and some statistics, too, but I hope that the basic idea is clear, and that idea is that we can create the confidence interval (the staple in the arrow example) using this way of thinking:

First, calculate the sample proportion.

Second, calculate the standard deviation of the sample proportion. It is found as follows: Multiply the sample proportion by its complement and divide this product by the sample size. Then take the square root of the result.*****

Third, multiply this standard deviation of the sample proportion by the numerical value corresponding to a given confidence level in the standard normal distribution. For example, if we use 0.95 as the confidence level, then we multiply the standard deviation of the sample proportion by 1.96.

Fourth, add the product from the third step to the sample proportion to get the upper limit of the interval. Subtract it from the sample proportion to get the lower limit of the interval.
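The four steps can be sketched as one short Python function. Treat it as an illustration under assumptions rather than a finished tool: it assumes Python 3.8 or later, the function name is made up, and the sample of 1000 people in the usage line is a hypothetical number chosen to mimic the 36% approval example.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a sample proportion."""
    # Step 1: the sample proportion.
    p_hat = successes / n
    # Step 2: the standard deviation of the sample proportion.
    std_dev = math.sqrt(p_hat * (1.0 - p_hat) / n)
    # Step 3: multiply by the value matching the confidence level
    # (1.96 for 0.95, 2.576 for 0.99).
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * std_dev
    # Step 4: subtract and add to get the lower and upper limits.
    return p_hat - half_width, p_hat + half_width


# Hypothetical poll: 360 of 1000 respondents approve.
low, high = proportion_confidence_interval(360, 1000)
print(round(low, 2), round(high, 2))  # 0.33 0.39
```

With these made-up numbers the interval comes out as roughly 33% to 39%, which is the "36%, give or take 3%" pattern.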

The traditional choices for confidence levels are 0.90, 0.95 or 0.99, or the same in percentages (90%, 95% or 99%). What do these mean? Return to the arrow-shooting example. Suppose that you are blindfolded and shoot a staple-tipped arrow a hundred times at the dartboard. You then take the blindfold off and check the results. If you used a 0.95 staple, you should find, on average, about five staples totally missing the bull's eye. If you used a 0.99 staple, about one out of the hundred arrows should have missed the bull's eye. In any single run of a hundred shots the count of misses can be a little higher or lower; the 95% and 99% are long-run rates.

Translated into statistics, what this means in the case of a 0.95 confidence level is that if the same study design was used a hundred times to draw samples of identical size, and if a confidence interval was created from each sample proportion, on average about five of these confidence intervals would not include the true unknown population proportion.
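The repeated-sampling idea is easy to try out by simulation. The sketch below invents a population in which the true proportion is 0.36, draws many samples from it, builds a 0.95 interval from each, and counts how often the interval covers the truth. Every number in it (the true proportion, the sample size of 500, the 2000 repetitions, the seed) is an assumption chosen for illustration.

```python
import math
import random


def coverage_rate(true_p, n, repetitions, z=1.96, seed=0):
    """Share of simulated confidence intervals that contain true_p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(repetitions):
        # Draw one sample of n yes/no answers from the population.
        yes_answers = sum(1 for _ in range(n) if rng.random() < true_p)
        p_hat = yes_answers / n
        half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
        # Did this staple cover the bull's eye?
        if p_hat - half_width <= true_p <= p_hat + half_width:
            hits += 1
    return hits / repetitions


print(coverage_rate(true_p=0.36, n=500, repetitions=2000))
```

The printed share should land close to 0.95 but will rarely equal it exactly, which is the point: the confidence level describes the long-run hit rate of the procedure, not a guarantee about any one batch of samples.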

Note that the interval is wider if we want to be more confident. It is also wider if the standard deviation of the sample itself is large or if the sample size is small. All this makes sense, as we should have a wider interval when there is more variation in the sample and/or when the sample is smaller.

We are going to finish this post with an example. But before that I need to talk about a concept which you often see when polls are discussed: The margin of error or MOE. For example, a poll might tell you that the margin of error is plus/minus three percent. If the poll consisted of only one single question with yes-no type answers, the MOE would equal half the length of the confidence interval as we have calculated it here, i.e., it would give us the amount to add to and subtract from the sample proportion.

But most polls have several questions. Clearly, the confidence intervals can't all be the exact same length. So what is the MOE in these cases?

It tends to be based on the longest possible confidence interval we could get in a particular study. Usually the studies use the 0.95 level of confidence for getting this. Think about the calculations I outlined above. If we have fixed the confidence level at 0.95 and if we have a sample of a given size, what could we change to make the interval as long as possible? Suppose that we ignore the actual value of the sample proportion we get in any particular question and just replace it by 0.5. It turns out that this value makes the standard deviation of the sample proportion as large as it ever could be. So using this trick allows pollsters to give us just one margin of error for the whole study, when in reality each question has a different confidence interval. But this MOE is most likely an overstatement. Viewed from a different angle, the only real information it gives us is the sample size (as both the confidence level and the proportion used in the calculations are set to equal certain constants). The sample size is usually provided separately in any case.
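Here is a minimal Python sketch of that trick. The sample size of 1000 is a hypothetical round number, and `margin_of_error` is an illustrative name; the loop simply shows that no other proportion gives a bigger answer than 0.5 does.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Amount to add to and subtract from a sample proportion at the
    0.95 level; with p left at 0.5 this is the worst case."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Worst-case MOE for a hypothetical poll of 1000 people.
print(round(margin_of_error(1000), 3))  # 0.031

# The product p * (1 - p) peaks at p = 0.5, so the MOE does too.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(margin_of_error(1000, p=p), 4))
```

Since z and p are held fixed, the function depends on nothing but n, which is why a reported MOE tells you little beyond the sample size.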

Time for an example. It's from a recent poll, to be found here. The MOE for the whole study is given as plus or minus 3.1%. Let's pick one specific question from this poll, the one where the 1006 respondents are asked if they approve or disapprove of the Congress. The percentage numbers are as follows:
Approve 16%
Disapprove 75%
Not sure 9%

I started by taking out the people who were not sure. What polls do about this lot varies, but it's usually best to omit them from the analyses. That left me with 915 answers and the approval rate within this sample of 915 of 17.6%. The disapproval percentage is now 82.4%. I want to make a confidence interval for the disapproval rate among those respondents who expressed an opinion. So the first step is to calculate the sample proportion, which is 0.824. The second stage is to calculate the standard deviation of this sample proportion. We find it by first multiplying 0.824 by 0.176, then dividing the product by 915 and finally by taking the square root of the whole thing. This gives me 0.0126. Next, I find how much to add to and subtract from the sample proportion to get an interval by multiplying 0.0126 by 1.96, the number of standard deviations that is associated with the 0.95 level of confidence. The result is 0.0247, or 2.47%. The 0.95 level confidence interval is then found by subtracting this from 0.824 and then by adding it to 0.824. We get, after rounding, an interval from 0.80 to 0.85.

A couple of comments: First, I did a lot of the calculating in my head so double-check. Second, note the difference between the 2.47% here and the MOE for the whole study which is given as 3.1%. The difference comes from the fact that the sample proportion here is quite far from 0.5, and so the standard deviation of the sample proportion is quite a bit smaller than the maximum possible value for this. Third, I claim no special significance for the choice I made about those who are not sure. You could experiment by putting them into the "approve" group or the "disapprove" group and see what happens to the results.
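For anyone who wants to do the double-checking and the experiment by computer, here is a small Python sketch. It uses only the figures quoted from the poll (1006 respondents; 16% approve, 75% disapprove, 9% not sure) and the 1.96 multiplier; the variable and function names are made up for this illustration.

```python
import math

Z_95 = 1.96  # standard deviations for the 0.95 confidence level
respondents = 1006
approve, disapprove, not_sure = 0.16, 0.75, 0.09


def half_width(p, n, z=Z_95):
    """Amount to add to and subtract from the sample proportion p."""
    return z * math.sqrt(p * (1.0 - p) / n)


# The choice made in the post: drop the people who were not sure.
n_decided = round(respondents * (1.0 - not_sure))  # 915
p_decided = disapprove / (approve + disapprove)    # about 0.824

# Disapproval proportion and sample size under each treatment.
scenarios = {
    "not sure dropped": (p_decided, n_decided),
    "not sure counted as approve": (disapprove, respondents),
    "not sure counted as disapprove": (disapprove + not_sure, respondents),
}

for label, (p, n) in scenarios.items():
    hw = half_width(p, n)
    print(f"{label}: {p:.3f} give or take {hw:.4f} "
          f"({p - hw:.3f} to {p + hw:.3f})")

# The worst-case figure for the whole study, using 0.5 and all respondents.
print("study-wide MOE:", round(half_width(0.5, respondents), 3))
```

The first scenario reproduces the 0.0247 worked out by hand, and the last line reproduces the published 3.1% for the whole study.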




----
* I talk about finding confidence intervals for the sample proportion in this post, but everything applies to the sample mean just as well, except for the exact formula of the standard deviation, which for the sample mean is the sample standard deviation divided by the square root of the sample size.
**The CLT says something more than this. Google it if you are interested. Usually a "large enough" sample is quite small, though prudent people want the sample size to be at least thirty. For binary data of the type we get with polls the requirement is that there are at least five answers counted as 1 and five answers counted as 0.
***It is a continuous distribution, though, and that means that the sampling distribution of the sample proportion is only approximately normal, because we are using a continuous distribution to approximate a discrete one. It's not very hard to find the exact sampling distribution for the sample proportion and to use that to find the probabilities needed. This is what is done with small sample studies anyway. But the extra work is unnecessary as the probabilities start equaling those we get from the normal distribution when the sample gets bigger.
****The vertical axis is marked probability density in the picture, because continuous probability functions are called probability density functions in geeky company.
*****This is the shorthand formula for the standard deviation of the sample proportion. In an earlier post I used a pedagogical aid to get the value, but this formula gives it much faster.
----
Part 1
Part 2
Part 3
Part 4
Part 5
Part 6