Modelling Sample Proportions
What is the True Proportion?
Say we take many random samples of 1000 people and find the proportion of males in each sample. Let’s collect all those proportions into a histogram. Where would the centre of the histogram be?
We don’t know. But it’s reasonable to think that it’ll be very close to the true proportion. We’ll likely never know the value of the true proportion, but it’s important so we give it a label (p).
A computer can pretend to draw random samples of 1000 people from some population of values over & over. In this way we can model the process of drawing many samples from a real population. This is a simulation, and it can help us understand how sample proportions vary due to random sampling.
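A minimal Python sketch of such a simulation, using only the standard library. The population proportion `TRUE_P = 0.5`, the number of simulated samples, and the seed are illustrative assumptions, not values from these notes.

```python
import random
import statistics

TRUE_P = 0.5        # assumed population proportion (illustrative only)
SAMPLE_SIZE = 1000  # people per sample, as in the text
NUM_SAMPLES = 2000  # how many samples to simulate (assumed)

rng = random.Random(42)  # fixed seed so the run is repeatable


def sample_proportion(p, n, rng):
    """Draw one random sample of size n and return its proportion of successes."""
    successes = sum(rng.random() < p for _ in range(n))
    return successes / n


proportions = [sample_proportion(TRUE_P, SAMPLE_SIZE, rng) for _ in range(NUM_SAMPLES)]

# The centre of the simulated proportions should sit very close to TRUE_P.
centre = statistics.fmean(proportions)
print(f"centre of simulated proportions: {centre:.4f}")
```

In a real study we would not know `TRUE_P`; the simulation only works because we chose it ourselves.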
Variability of Data & Sample Size
Does the variability in a sample proportion depend on the size of the sample? The answer is Yes, and it’s higher for small samples.
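One way to see this is to repeat the simulation for several sample sizes and compare the spread of the resulting proportions. The sketch below assumes a population proportion of 0.5 and 1000 simulated samples per size; both are arbitrary choices for illustration.

```python
import random
import statistics

P = 0.5             # assumed population proportion (illustrative only)
NUM_SAMPLES = 1000  # samples simulated per sample size (assumed)
rng = random.Random(1)


def spread_of_proportions(n):
    """Standard deviation of simulated sample proportions for samples of size n."""
    props = [sum(rng.random() < P for _ in range(n)) / n for _ in range(NUM_SAMPLES)]
    return statistics.stdev(props)


spreads = {n: spread_of_proportions(n) for n in (25, 100, 400, 1600)}
for n, sd in spreads.items():
    print(f"n = {n:5d}  spread of sample proportions = {sd:.4f}")
```

The printed spread should shrink as n grows, roughly halving each time the sample size is quadrupled.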
p & q for Sample VS. Population
We use p for the proportion of successes in the population and p̂ for the observed proportion of successes in a sample.
We also use q for the proportion of failures in the population and q̂ for its observed value in a sample.
The Sampling Distribution for Proportions
What is the Sampling Distribution?
The distribution of proportions over many independent samples from the same population is called the sampling distribution of the proportions.
To model it using the Normal distribution, we need to know its mean & standard deviation.
The Mean of the Sampling Distribution
The mean is p, the true proportion.
The Standard Deviation of the Sampling Distribution
$SD(\hat{p}) = \sqrt{\dfrac{pq}{n}}$
where n is the sample size, p̂ is the proportion of successes in the sample, p is the true proportion, and q = 1 − p.
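As a quick worked example of the formula SD(p̂) = √(pq/n), the sketch below plugs in p = 0.21 and n = 1000, the values used in the example later in these notes. The function name is our own, chosen for illustration.

```python
import math


def sd_of_sample_proportion(p, n):
    """Standard deviation of the sampling distribution of a proportion: sqrt(p*q/n)."""
    q = 1 - p
    return math.sqrt(p * q / n)


sd = sd_of_sample_proportion(0.21, 1000)
print(f"SD(p-hat) = {sd:.4f}")  # about 0.0129
```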
Why is the Sampling Distribution valuable?
Because it is known from mathematics to be a good model (& one that gets better as the sample size gets larger), we don’t need to draw many samples to see how proportions vary. Its standard deviation tells us how accurately our sample proportion represents the true population proportion.
Sampling Distribution & the Normal Approximation to the Binomial
If the true population proportion is p = 0.21, we’ve seen that the probability of getting np̂ = 202 successes out of n = 1000 (i.e., p̂ = 0.202) comes from the Normal distribution with mean p = 0.21 & standard deviation $\sqrt{pq/n} = \sqrt{(0.21)(0.79)/1000} \approx 0.0129$.
Alternatively, we could look at this as an example of the Binomial distribution. The probability of getting 202 successes out of 1000 is Binomial and can be approximated by the Normal distribution with mean np = 210 & standard deviation $\sqrt{npq} = \sqrt{(1000)(0.21)(0.79)} \approx 12.88$.
The only difference is that, in the case of the Normal approximation to the Binomial, we add or subtract 0.5 depending on whether we’re interested in the probability of getting a value below or above 202.
But our approach based on sampling distribution for proportions uses the Normal distribution, which is continuous. Thus, you don’t add or subtract 0.5 in the calculations above.
The sampling distribution for proportions can therefore be regarded as equivalent to the Normal approximation to the Binomial distribution, apart from the 0.5 adjustment. Both approaches require the same assumptions & conditions, especially np ≥ 10 and nq ≥ 10.
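The comparison can be checked numerically. The sketch below computes the probability of at most 202 successes out of 1000 with p = 0.21 in three ways: from the sampling distribution of the proportion, from the Normal approximation to the Binomial (with and without the 0.5 adjustment), and exactly from the Binomial. Choosing "at most 202" as the event is our assumption for illustration; the helper `binom_pmf` is our own and uses logarithms only to avoid overflow.

```python
import math
from statistics import NormalDist

n, p = 1000, 0.21
q = 1 - p
k = 202  # number of successes of interest


def binom_pmf(x, n, p):
    """Exact Binomial probability of x successes, computed via logs for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
               + x * math.log(p) + (n - x) * math.log(1 - p))
    return math.exp(log_pmf)


exact = sum(binom_pmf(x, n, p) for x in range(k + 1))  # P(X <= 202)

# Sampling distribution of the proportion: Normal(p, sqrt(pq/n)), no 0.5 adjustment.
prop_model = NormalDist(mu=p, sigma=math.sqrt(p * q / n))
via_proportion = prop_model.cdf(k / n)

# Normal approximation to the Binomial: Normal(np, sqrt(npq)).
count_model = NormalDist(mu=n * p, sigma=math.sqrt(n * p * q))
via_count = count_model.cdf(k)                  # same z-score as via_proportion
via_count_corrected = count_model.cdf(k + 0.5)  # with the 0.5 adjustment

print(f"proportion model:         {via_proportion:.4f}")
print(f"count model:              {via_count:.4f}")
print(f"count model, +0.5 adjust: {via_count_corrected:.4f}")
print(f"exact Binomial:           {exact:.4f}")
```

The first two results agree because both use the same z-score; only the version with the 0.5 adjustment differs.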
Sampling variability refers to the differences that are expected from one sample to another.
How Good is the Normal Model?
We’ve said that if we draw repeated random samples of the same size, n, from some population and measure the proportion, p̂, we get for each sample, then the collection of these proportions will pile up around the underlying population proportion, p, in such a way that a histogram of the sample proportions can be modelled well by a Normal model.
But there’s a catch. The sample size needs to be an appropriate size. The model becomes a better representation of the distribution of the sample proportions as the sample size gets bigger.
Assumptions & Conditions
- Independence Assumption: The sampled values must be independent of each other.
- Sample Size Assumption: The sample size, n, must be large enough.
The best we can do with assumptions is to think about whether they’re likely to be true, and we should do so. However, we can often check corresponding conditions that provide information about the assumptions.
Think about the Independence Assumption and check the following corresponding conditions before using the Normal model to model the distribution of sample proportions:
- Randomization Condition: If the data come from an experiment, subjects should have been randomly assigned to treatments. If the data come from a survey, the sample should have been selected at random.
- 10% Condition: The sample size, n, must be no larger than 10% of the population.
- Success/Failure Condition: The sample size must be big enough so that both the number of successes, np, and the number of failures, nq, are expected to be at least 10.
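The two numerical conditions can be checked mechanically, as in the sketch below. The function name, the example population size of 1,000,000, and the second example (n = 30, p = 0.1) are hypothetical. The Randomization Condition cannot be checked from numbers alone; it depends on how the data were collected.

```python
def check_proportion_conditions(n, p, population_size):
    """Return which checkable conditions hold for modelling p-hat with a Normal model."""
    q = 1 - p
    return {
        "10% condition": n <= 0.10 * population_size,
        "success/failure condition": n * p >= 10 and n * q >= 10,
    }


# n = 1000 and p = 0.21 appear earlier in these notes; the population size is assumed.
large_sample = check_proportion_conditions(n=1000, p=0.21, population_size=1_000_000)
# A small sample with a rare outcome: only 3 expected successes.
small_sample = check_proportion_conditions(n=30, p=0.1, population_size=1_000_000)

print(large_sample)
print(small_sample)
```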
Central Limit Theorem: Fundamental Theorem of Statistics
It turns out that means also have a sampling distribution that we can model with a Normal model.
Simulating the Sampling Distribution of a Mean
Here’s a simple simulation with a quantitative variable. The following histogram shows the results of 10,000 simulated dice rolls.
That’s called a uniform distribution. Now let’s roll a pair of dice & record the average of the 2.
Now, let’s take the mean of 20 dice.
The histogram of these means now looks very much like a Normal distribution.
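The dice experiment is easy to reproduce. The sketch below repeats it 10,000 times for 1, 2, and 20 dice and reports the centre and spread of the averages; it does not draw the histograms, and the seed is an arbitrary choice.

```python
import random
import statistics

rng = random.Random(0)
NUM_TRIALS = 10_000  # simulated repetitions, as in the text


def mean_of_dice(num_dice):
    """Roll num_dice fair six-sided dice and return their average."""
    return sum(rng.randint(1, 6) for _ in range(num_dice)) / num_dice


results = {}
for num_dice in (1, 2, 20):
    means = [mean_of_dice(num_dice) for _ in range(NUM_TRIALS)]
    results[num_dice] = (statistics.fmean(means), statistics.stdev(means))

for num_dice, (centre, spread) in results.items():
    print(f"{num_dice:2d} dice: centre = {centre:.3f}, spread = {spread:.3f}")
```

The centre stays near 3.5 in every case, while the spread of the averages shrinks as more dice are averaged.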
The Central Limit Theorem
The sampling distribution of almost any mean becomes Normal as the sample size grows. All we need is for the observations to be independent and collected with randomization. This is called the Central Limit Theorem (CLT).
This is true regardless of the shape of the population distribution! Even if we sample from a skewed or bimodal population, the CLT holds. However, the Normal model becomes a good fit at smaller sample sizes when the population distribution is itself close to Normal.
The Sampling Distribution of the Mean
Any Normal model is specified by its mean & standard deviation. For proportions, the sampling distribution is centred at the population proportion. For means, it’s centred at the population mean.
What about standard deviation? The Normal model for the sampling distribution of the mean has a standard deviation equal to $\sigma/\sqrt{n}$, where σ is the standard deviation of the population and n is the sample size. To emphasize that this is a standard deviation parameter of the sampling distribution model for the sample mean, ȳ, we write SD(ȳ) or σ(ȳ).
Which Sampling Distribution Model Do We Use?
We now have two closely related sampling distribution models. Which one we use depends on which kind of data we have:
- When we have categorical data, we calculate a sample proportion, p̂. Its sampling distribution follows a Normal model with a mean at the population proportion, p, and a standard deviation $SD(\hat{p}) = \sqrt{pq/n}$.
- When we have quantitative data, we calculate a sample mean, ȳ. Its sampling distribution has a Normal model with a mean at the population mean, µ, and a standard deviation $SD(\bar{y}) = \sigma/\sqrt{n}$.
Assumptions & Conditions
- Independence Assumption: The sampled values must be independent of each other.
- Randomization Condition: The data values must be sampled randomly.
- 10% Condition: The sample size, n, should be no more than 10% of the population.
- Large-Enough Sample Condition: There’s no one-size-fits-all rule. If the population is unimodal and symmetric, even a fairly small sample is okay. For highly skewed distributions, it may require samples of several hundred.
Sample Size – Diminishing Returns
The standard deviation of the sampling distribution declines only with the square root of the sample size. The mean of a random sample of four has half ($1/\sqrt{4} = 1/2$) the standard deviation of an individual data value. To cut it in half again, we’d need a sample of 16, and a sample of 64 to halve it once more. In practice, random sampling works well, and means have smaller standard deviations than the individual data values that were averaged.
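The halving pattern follows directly from SD(ȳ) = σ/√n. The sketch below tabulates it for a population standard deviation of 1.0, which is an arbitrary value chosen for illustration.

```python
import math


def sd_of_mean(sigma, n):
    """Standard deviation of the sampling distribution of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


sigma = 1.0  # assumed population standard deviation (any positive value works)
for n in (1, 4, 16, 64):
    print(f"n = {n:2d}  SD of sample mean = {sd_of_mean(sigma, n):.3f}")
```

Each halving of the standard deviation costs four times as many observations.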
If we could afford a much larger sample, we could get the standard deviation of the sampling distribution under control so that the sample mean could tell us still more about the unknown population mean. But the square root limits how much we can make a sample tell about the population. This is an example of something that’s known as the Law of Diminishing Returns.
Often we only know the observed proportion, p̂, or the observed sample standard deviation, s. So we estimate. Whenever we estimate the standard deviation of a sampling distribution, we call it a standard error (SE).
For a sample proportion, p̂, the standard error is: $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n}$
For the sample mean, ȳ, the standard error is: $SE(\bar{y}) = s/\sqrt{n}$
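Both standard errors, SE(p̂) = √(p̂q̂/n) and SE(ȳ) = s/√n, can be computed from a single sample. In the sketch below the function names and the small data set `measurements` are hypothetical; the count of 202 successes out of 1000 reuses the example from earlier in these notes.

```python
import math
import statistics


def se_proportion(successes, n):
    """Standard error of a sample proportion: sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    return math.sqrt(p_hat * q_hat / n)


def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n), with s the sample standard deviation."""
    return statistics.stdev(values) / math.sqrt(len(values))


se_p = se_proportion(202, 1000)
measurements = [4, 8, 6, 5, 3, 7]  # made-up quantitative data for illustration
se_y = se_mean(measurements)

print(f"SE(p-hat) = {se_p:.4f}")
print(f"SE(y-bar) = {se_y:.4f}")
```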
Relating the Concepts
At the heart is the idea that the statistic itself (the proportion or the mean) is a random quantity. A different random sample would have given a different result. This sample-to-sample variability is what generates the sampling distribution, the distribution of all the possible values that the statistic could have had.
We could simulate the distribution by pretending to take lots of samples. Fortunately, for the mean & proportion, the CLT tells us we can model their sampling distribution directly with a Normal model.
We start with a population model, which can have any shape. It can even be bimodal or skewed. We label the mean of this model µ and its standard deviation σ.
We draw one real sample (solid line) of size n & show its histogram and summary statistics. We imagine (or simulate) drawing many other samples (dotted lines), which have their own histograms & summary statistics.
We imagine gathering all the means into a histogram.
The CLT tells us that we can model the shape of this histogram with a Normal model. The mean of this Normal is µ, and the standard deviation is $SD(\bar{y}) = \sigma/\sqrt{n}$. When we don’t know σ, we estimate it with the standard deviation, s, of the one real sample. That gives us the standard error $SE(\bar{y}) = s/\sqrt{n}$.
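The whole chain can be simulated as a check. The sketch below assumes a skewed population (Exponential with rate 1, so µ = 1 and σ = 1), a sample size of 50, and 5000 imagined repeat samples; all three are illustrative choices, not values from these notes.

```python
import math
import random
import statistics

rng = random.Random(7)
N = 50              # sample size (assumed)
NUM_SAMPLES = 5000  # imagined repeat samples (assumed)

# Skewed population model: Exponential with rate 1, so mu = 1 and sigma = 1.
MU, SIGMA = 1.0, 1.0


def draw_sample(n):
    """Draw n independent values from the skewed population."""
    return [rng.expovariate(1.0) for _ in range(n)]


# The one real sample: its standard deviation s gives the standard error.
real_sample = draw_sample(N)
se = statistics.stdev(real_sample) / math.sqrt(N)

# The many imagined samples: gather their means.
sample_means = [statistics.fmean(draw_sample(N)) for _ in range(NUM_SAMPLES)]
centre = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)
theory = SIGMA / math.sqrt(N)

print(f"centre of sample means = {centre:.3f} (mu = {MU})")
print(f"spread of sample means = {spread:.3f} (sigma/sqrt(n) = {theory:.3f})")
print(f"standard error from the one real sample = {se:.3f}")
```

The standard error from the single sample will not match σ/√n exactly, since s is only an estimate of σ.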
What Can Go Wrong?
Don’t confuse the sampling distribution with the distribution of the sample.
The sampling distribution is an imaginary collection of the values that a statistic, such as a proportion or a mean, might have taken for all the random samples—the one you got and the ones you didn’t get. Use the sampling distribution model to make statements about how the statistic varies.
The distribution of the sample is information regarding one sample.
Beware of observations that aren’t independent.
The CLT depends on the assumption of independence. Think about how the data were gathered.
Watch out for small samples when dealing with proportions.
Check that the Success/Failure Condition and the 10% Condition are both satisfied before using the Normal model.
Watch out for small samples from skewed populations when dealing with means.
The CLT assures us the sampling distribution model is approximately Normal if n is large enough. If the population is nearly Normal, even small samples may work. If the population is very skewed, then n will have to be large before the Normal model will work well. Unfortunately, there’s no simple rule for how large n must be.