**Modelling Sample Proportions**

### What is True Proportion?

Say we take many random samples of 1000 people and find the proportion of males in each sample. Let’s collect all those proportions into a histogram. Where would the centre of the histogram be?

We don’t know. But it’s reasonable to think that it’ll be very close to the **true proportion**. We’ll likely never know the value of the true proportion, but it’s important, so we give it a label (**p**).

### Using Simulations

A computer can pretend to draw random samples of 1000 people from some population of values over & over. In this way we can *model* the process of drawing many samples from a real population. This is a **simulation**, and it can help us understand how sample proportions vary due to random sampling.
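A minimal sketch of such a simulation in Python (the true proportion `p = 0.5` and the sample counts are assumptions for illustration, not values from the text):

```python
import random

random.seed(1)
p = 0.5             # assumed true proportion (unknown in practice)
n = 1000            # people per sample
num_samples = 2000  # how many samples we pretend to draw

# Draw many samples and record each sample's proportion of "successes".
proportions = []
for _ in range(num_samples):
    successes = sum(1 for _ in range(n) if random.random() < p)
    proportions.append(successes / n)

# The simulated proportions pile up around the true proportion p.
centre = sum(proportions) / num_samples
print(round(centre, 2))
```

A histogram of `proportions` would show the bell-shaped pile-up around *p* described above.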

### Variability of Data & Sample Size

Does the variability in a sample proportion depend on the size of the sample? The answer is **yes**: it’s *higher for small samples*.
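This can be checked directly by simulation. The sketch below (with an assumed true proportion of 0.5) estimates the spread of sample proportions for a small and a large sample size:

```python
import random

random.seed(2)
p = 0.5  # assumed true proportion, for illustration only

def sd_of_proportions(n, num_samples=2000):
    """Simulate num_samples sample proportions of size n; return their SD."""
    props = [sum(random.random() < p for _ in range(n)) / n
             for _ in range(num_samples)]
    mean = sum(props) / len(props)
    return (sum((x - mean) ** 2 for x in props) / len(props)) ** 0.5

small, large = sd_of_proportions(100), sd_of_proportions(1000)
print(small > large)  # True: proportions from small samples vary more
```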

**Notation**

### p & q for Sample vs. Population

We use *p* for the proportion of successes in the population and *p̂* for the observed proportion of successes in a sample.

We also use *q* for the proportion of failures in the population and *q̂* for its observed value in a sample.

**The Sampling Distribution for Proportions**

### What is the Sampling Distribution?

The distribution of proportions over many independent samples from the same population is called the **sampling distribution** of the proportions.

To model it using the Normal distribution, we need to know its mean & **standard deviation**.

#### The Mean of the Sampling Distribution

The **mean is p**, the *true proportion*.

#### The Standard Deviation of the Sampling Distribution

**$SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{pq}{n}}$**

where *n* is the sample size and *p* is the true proportion of successes in the population.
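As a quick sketch, the formula can be coded directly (the values *p* = 0.21 and *n* = 1000 match the example later in this section):

```python
import math

def sd_phat(p, n):
    """SD of the sampling distribution of a proportion: sqrt(p*q/n)."""
    q = 1 - p
    return math.sqrt(p * q / n)

print(round(sd_phat(0.21, 1000), 4))  # 0.0129
```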

#### Why is the Sampling Distribution valuable?

First, because it is known from mathematics to be a good model (& one that gets better as the sample size gets larger), its standard deviation tells us how accurately our sample proportion represents the true population proportion.

#### Sampling Distribution & the Normal Approximation to the Binomial

If the true population proportion is *p* = 0.21, we’ve seen that the probability of getting *np̂* = 202 successes out of *n* = 1000 (i.e., *p̂* = 0.202) comes from the Normal distribution with mean *p* = 0.21 & standard deviation

$SD(\hat{p}) = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.21 \times 0.79}{1000}} = 0.0129$

$z = \frac{\hat{p} - p}{\sqrt{pq/n}} = \frac{0.202 - 0.21}{0.0129} = -0.621$

Alternatively, we could look at this as an example of the Binomial distribution. The probability of getting 202 successes out of 1000 is Binomial and can be approximated by the Normal distribution with mean *np* = 210 & standard deviation

$SD = \sqrt{npq} = \sqrt{1000 \times 0.21 \times 0.79} = 12.9$

$z = \frac{n\hat{p} \pm 0.5 - np}{\sqrt{npq}} = \frac{202 \pm 0.5 - 210}{12.9} = -0.58 \text{ or } -0.66$

The only difference is that, in the case of the Normal approximation to the Binomial, we add or subtract 0.5 depending on whether we’re interested in the probability of getting a value below or above 202.

But our approach based on the sampling distribution for proportions uses the Normal distribution, which is continuous. __Thus, you don’t add or subtract 0.5 in the calculations above.__
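The agreement of the two approaches (when no 0.5 correction is applied) can be verified numerically; this sketch uses the *p* = 0.21, *n* = 1000 example from above:

```python
import math

p, q, n = 0.21, 0.79, 1000
phat = 202 / n  # observed sample proportion

# z-score on the proportion scale (sampling distribution for proportions)
z_proportion = (phat - p) / math.sqrt(p * q / n)

# z-score on the count scale (Normal approximation to the Binomial,
# without the 0.5 continuity correction)
z_count = (n * phat - n * p) / math.sqrt(n * p * q)

print(round(z_proportion, 3), round(z_count, 3))  # both -0.621
```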

The sampling distribution for proportions can therefore be regarded as **equivalent** to the Normal approximation to the Binomial distribution. Both approaches require the same assumptions & conditions, especially *np* > 10 and *nq* > 10.

### Sampling Variability

**Sampling variability** refers to the slight differences that are expected from one sample to another.

### How Good is the Normal Model?

We’ve said that if we draw repeated random samples of the same size, *n*, from some population and measure the proportion, *p̂*, we get for each sample, then the collection of these proportions will pile up around the underlying population proportion, *p*, in such a way that a histogram of the sample proportions can be modelled well by a Normal model.

But there’s a catch. The **sample size needs to be an appropriate size**. The model becomes a better representation of the distribution of the sample proportions as the sample size gets bigger.

### Assumptions & Conditions

#### Assumptions

- *Independence Assumption:* The sampled values must be *independent* of each other.
- *Sample Size Assumption:* The sample size, *n*, must be *large* enough.

#### Conditions

The best we can do with assumptions is to think about whether they’re likely to be true, and we should do so. However, we can often check corresponding **conditions** that provide information about the assumptions.

Think about the **Independence Assumption** and check the following corresponding conditions before using the Normal model to model the distribution of sample proportions:

- *Randomization Condition:* If the data are from an experiment, subjects should’ve been randomly assigned to treatments. If the data are from a survey, the sample should be completely random.
- *10% Condition:* The sample size, *n*, must be no larger than 10% of the population.
- *Success/Failure Condition:* The sample size must be big enough so that both the number of *successes*, *np*, and the number of *failures*, *nq*, are expected to be at least 10.
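The numeric conditions are mechanical enough to check in code. A hypothetical helper (the function name and example inputs are made up for illustration):

```python
def conditions_ok(n, p, population_size):
    """Check the Success/Failure and 10% Conditions for a proportion."""
    success_failure = n * p >= 10 and n * (1 - p) >= 10
    ten_percent = n <= 0.10 * population_size
    return success_failure and ten_percent

print(conditions_ok(1000, 0.21, 30_000_000))  # True
print(conditions_ok(30, 0.21, 30_000_000))    # False: np = 6.3 < 10
```

The Randomization Condition can’t be checked numerically; it’s about how the data were collected.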

**Central Limit Theorem: Fundamental Theorem of Statistics**

It turns out that **means** also have a sampling distribution that we can model with a Normal model.

### Simulating the Sampling Distribution of a Mean

Here’s a simple simulation with a quantitative variable. The following histogram shows the results of 10,000 simulated dice rolls.

That’s called a **uniform distribution**. Now let’s roll a *pair* of dice & record the average of the 2.

Now, let’s take the mean of 20 dice. The histogram of these means looks like the Normal distribution.
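The dice experiment described above is easy to reproduce; this sketch simulates 10,000 means of 20 dice:

```python
import random

random.seed(3)

def mean_of_dice(k):
    """Roll k fair dice and return the average of the faces."""
    return sum(random.randint(1, 6) for _ in range(k)) / k

means = [mean_of_dice(20) for _ in range(10_000)]
centre = sum(means) / len(means)
print(round(centre, 1))  # close to 3.5, the mean of a single fair die
```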

### The Central Limit Theorem

**The sampling distribution of almost any mean becomes Normal as the sample size grows**. All we need is for the observations to be **independent** and collected with **randomization**. This is called the **Central Limit Theorem** (CLT).

This is true regardless of the shape of the population distribution! Even if we sample from a skewed or bimodal population, the CLT holds. However, it works better & faster if the population distribution is close to Normal to begin with.

**The Sampling Distribution of the Mean**

### Introduction

Any Normal model is specified by its **mean** & **standard deviation**. For proportions, the sampling distribution is **centred** at the population proportion. For means, it’s **centred** at the population mean.

What about **standard deviation**? The Normal model for the sampling distribution of the mean has a standard deviation equal to $SD(\bar{y}) = \frac{\sigma}{\sqrt{n}}$, where *σ* is the standard deviation of the population. To emphasize that this is a standard deviation *parameter* of the sampling distribution model for the sample mean, *ȳ*, we write *SD*(*ȳ*) or *σ*(*ȳ*).
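A sketch checking $SD(\bar{y}) = \sigma/\sqrt{n}$ by simulation, using a made-up population that is uniform on [0, 1] (so $\sigma = \sqrt{1/12}$):

```python
import math
import random

random.seed(4)
n = 25                     # sample size
sigma = math.sqrt(1 / 12)  # SD of the uniform(0, 1) population

# Simulate many sample means and measure their spread.
means = [sum(random.random() for _ in range(n)) / n for _ in range(5000)]
grand = sum(means) / len(means)
sd_sim = math.sqrt(sum((m - grand) ** 2 for m in means) / len(means))

print(round(sd_sim, 3), round(sigma / math.sqrt(n), 3))  # nearly equal
```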

### Which Sampling Distribution Model Do We Use?

We now have two closely related sampling distribution models. Which one we use depends on which kind of data we have:

- When we have categorical data, we calculate a sample proportion, *p̂*. Its sampling distribution follows a Normal model with a mean at the population proportion, *p*, and a standard deviation

$SD(\hat{p}) = \sqrt{\frac{pq}{n}} = \frac{\sqrt{pq}}{\sqrt{n}}$

- When we have quantitative data, we calculate a sample mean, *ȳ*. Its sampling distribution has a Normal model with a mean at the population mean, *µ*, and a standard deviation

$SD(\bar{y}) = \frac{\sigma}{\sqrt{n}}$

### Assumptions & Conditions

- *Independence Assumption:* The sampled values must be independent of each other.
- *Randomization Condition:* The data values must be sampled randomly.
- *10% Condition:* The sample size, *n*, should be no more than 10% of the population.
- *Large-Enough Sample Condition:* There’s no one-size-fits-all rule. If the population is unimodal and symmetric, even a fairly small sample is okay. For highly skewed distributions, it may require samples of several hundred.

### Sample Size – Diminishing Returns

The standard deviation of the sampling distribution declines only with the square root of the sample size. The mean of a random sample of four has half ($1/\sqrt{4} = 1/2$) the standard deviation of an individual data value. To cut it in half again, we’d need a sample of 16, and a sample of 64 to halve it once more. In practice, random sampling works well, and means have smaller standard deviations than the individual data values that were averaged.

If we could afford a much larger sample, we could get the standard deviation of the sampling distribution under control so that the sample mean could tell us still more about the unknown population mean. But the square root limits how much we can make a sample tell about the population. This is an example of something that’s known as the **Law of Diminishing Returns**.
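The square-root effect is easy to tabulate; here’s a sketch with an assumed population standard deviation of 1:

```python
import math

sigma = 1.0  # assumed population SD, for illustration
for n in (1, 4, 16, 64):
    # Each quadrupling of n only halves the standard deviation.
    print(n, sigma / math.sqrt(n))
```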

**Standard Error**

Often we only know the *observed* proportion, *p̂*, or the *observed* sample standard deviation, *s*. So we estimate. Whenever we estimate the standard deviation of a sampling distribution, we call it a **standard error** (SE).

For a sample proportion, *p̂*, the standard error is:

$SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$

For the sample mean, *ȳ*, the standard error is:

$SE(\bar{y}) = \frac{s}{\sqrt{n}}$
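Both standard errors can be computed from a single observed sample; the data values below are invented for illustration:

```python
import math

def se_proportion(phat, n):
    """Standard error of a sample proportion: sqrt(phat*qhat/n)."""
    return math.sqrt(phat * (1 - phat) / n)

def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return s / math.sqrt(n)

print(round(se_proportion(0.202, 1000), 4))  # 0.0127
sample = [2.0, 3.5, 4.0, 5.5, 6.0, 7.0]
print(round(se_mean(sample), 3))
```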

**Relating the Concepts**

### Chapter Summary

At the heart is the idea that the statistic itself (the proportion or the mean) is a **random quantity**. A different random sample would have given a different result. This sample-to-sample variability is what generates the **sampling distribution**, the distribution of all the possible values that the statistic could have had.

We could simulate the distribution by pretending to take lots of samples. Fortunately, for the mean & proportion, the **CLT** tells us we can model their sampling distribution directly with a Normal model.

### Summary Diagram

We start with a population model, which can have any shape. It can even be bimodal or skewed. We label the mean of this model *µ* and its standard deviation *σ*.

We draw one real sample (solid line) of size *n* & show its histogram and summary statistics. We imagine (or simulate) drawing many other samples (dotted lines), which have their own histograms & summary statistics.

We imagine gathering all the means into a histogram.

The CLT tells us that we can model the shape of this histogram with a Normal model. The mean of this Normal is *µ*, and the standard deviation is $SD(\bar{y}) = \frac{\sigma}{\sqrt{n}}$. When we don’t know *σ*, we estimate it with the standard deviation of the one real sample. That gives us the standard error $SE(\bar{y}) = \frac{s}{\sqrt{n}}$.

**What Can Go Wrong?**

#### Don’t confuse the sampling distribution with the distribution of the sample.

The **sampling distribution** is an imaginary collection of the values that a statistic, such as a proportion or a mean, might have taken for all the random samples—the one you got and the ones you didn’t get. Use the sampling distribution model to make statements about how the statistic varies.

The **distribution of the sample** is information regarding one sample.

#### Beware of observations that aren’t independent.

The CLT depends on the assumption of independence. Think about how the data were gathered.

#### Watch out for small samples when dealing with proportions.

Make sure the Success/Failure Condition is satisfied and that the sample meets the 10% Condition.

#### Watch out for small samples from skewed populations when dealing with means.

The CLT assures us the sampling distribution model is Normal if *n* is large enough. If the population is nearly Normal, even small samples may work. If the population is very skewed, then *n* will have to be large before the Normal model will work well. Unfortunately, there’s no good rule to handle this.
