7.1 The Central Limit Theorem for Sample Means (Averages) and Proportions

SAMPLING DISTRIBUTIONS

Let’s say we are interested in knowing the average incubation period of the Covid-19 virus in humans. This average is over all humans and is, therefore, a parameter, which we will label [latex]\mu[/latex]. Since it’s practically impossible to measure the incubation period for every human and average the results to get the value of [latex]\mu[/latex], we have to resort to sampling. But recall from UNIT 1 that results from random samples naturally vary from one sample to another purely due to the randomness of sampling (sampling error), which means the sample mean from any one sample will generally not equal the population mean exactly. So our question is: how much do the sample results vary due to randomness alone? To investigate that, we will repeatedly sample from the population of interest, compute the sample statistic of interest from each sample, and create a histogram or dot plot of those sample results to see the shape of the distribution of the sample statistic.

In our example of the virus incubation period, let’s say we took an initial sample of 50 individuals from all over the world and recorded the incubation periods they experienced. From those 50 observations, we calculate their average incubation period, called the sample average or the sample mean, [latex]\bar x[/latex]. Let’s say the first sample average was 4.5 days. Now we repeat the whole process under identical conditions by taking another random sample of 50 individuals and computing the sample mean for our second sample. Suppose the sample mean for the second sample was 5.6 days. If we repeat this process (take a sample, compute the sample mean) many times, we will obtain a sample mean from each of these samples:\[\begin{align*}
\bar x_1 = 4.5\\
\bar x_2 = 5.6\\
\bar x_3 = 2.8 \\
\bar x_4 = 3.4\\
\vdots
\end{align*}\]
If we created a histogram or a dot plot of these sample means, the distribution that we’d obtain is called the sampling distribution of the sample means. Each sample mean here is an estimate of the population mean. That is, looking only at our initial sample mean of 4.5 days, our best estimate for the average virus incubation period in humans is 4.5 days. However, if we instead take the average of all of the sample means, that average will be a better estimate of the population mean. In fact, as we add more and more sample results to the average of the sample means, we get closer and closer to the actual population mean, and this happens using the samples alone, without our knowing anything about the population mean. As the number of samples increases without bound, the average of the sample means, [latex]\mu_{\bar x}[/latex], will equal [latex]\mu[/latex]. Additionally, the standard deviation of these sample means, [latex]\sigma_{\bar x}[/latex], will equal [latex]\dfrac{\sigma}{\sqrt n}[/latex], where [latex]\sigma[/latex] is the population standard deviation and [latex]n[/latex] is the size of each sample. The standard deviation of the sampling distribution of the sample means is also known as the standard error of the mean (SE).
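The repeated-sampling process described here can be tried out in a short simulation. The following is only a sketch: the incubation periods are drawn from a made-up gamma distribution (its shape and scale are illustrative assumptions, not real Covid-19 data), chosen so that the population mean is 5 days and the population standard deviation is 2.5 days.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population model (an assumption for illustration only):
# gamma-distributed incubation periods with mean 5 days and sd 2.5 days.
shape, scale = 4.0, 1.25
mu = shape * scale               # population mean
sigma = shape ** 0.5 * scale     # population standard deviation

n = 50                # size of each sample, as in the text
num_samples = 5_000   # how many times the sampling is repeated

# Take a sample, compute its mean, and repeat many times.
sample_means = []
for _ in range(num_samples):
    sample = [random.gammavariate(shape, scale) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
standard_error = sigma / n ** 0.5

print(f"population mean          = {mu:.4f}")
print(f"average of sample means  = {mean_of_means:.4f}")
print(f"sd of sample means       = {sd_of_means:.4f}")
print(f"sigma / sqrt(n)          = {standard_error:.4f}")
```

With enough repeated samples, the average of the sample means should land very close to the population mean, and their standard deviation should land very close to the standard error.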

The Central Limit Theorem (CLT) gives us a simple and elegant picture of what the sampling distribution of a sample statistic (such as the sample mean or sample proportion) looks like, provided certain conditions are met. The CLT is one of the most powerful and useful ideas in all of statistics.

The Central Limit Theorem (CLT) for Sample Means
The sampling distribution of the sample means gets closer and closer to a normal distribution as the sample size [latex]n[/latex] increases, regardless of the shape of the population distribution. The sampling distribution of sample means is centered at the population mean, [latex]\mu_{\bar x}=\mu[/latex], and has a standard deviation of [latex]\sigma_{\bar x} = \dfrac{\sigma}{\sqrt n}[/latex].

Suppose a random variable has a mean of [latex]\mu_x[/latex] and a standard deviation of [latex]\sigma_x[/latex]. For samples of size [latex]n[/latex] from this population, the random variable [latex]\bar X[/latex] representing sample means increasingly approximates a normal distribution as the sample size [latex]n[/latex] increases. \[\bar X \sim N\left(\mu_x, \frac{\sigma_x}{\sqrt n}\right)\]
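The mean and standard deviation of [latex]\bar X[/latex] can be justified with a short calculation. This is a sketch that assumes the [latex]n[/latex] observations [latex]X_1, X_2, \ldots, X_n[/latex] in a sample are independent draws from the same population (an assumption not spelled out in the statement above), so that means add and, because of independence, variances add as well: \[\begin{align*}
\bar X &= \frac{1}{n}\left(X_1 + X_2 + \cdots + X_n\right)\\
\mu_{\bar x} &= \frac{1}{n}\left(\mu_x + \mu_x + \cdots + \mu_x\right) = \frac{n\mu_x}{n} = \mu_x\\
\sigma_{\bar x}^2 &= \frac{1}{n^2}\left(\sigma_x^2 + \sigma_x^2 + \cdots + \sigma_x^2\right) = \frac{n\sigma_x^2}{n^2} = \frac{\sigma_x^2}{n}\\
\sigma_{\bar x} &= \frac{\sigma_x}{\sqrt n}
\end{align*}\]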

How large should the sample size be in order for the CLT to kick in? It is generally accepted that a sample size of at least [latex]30[/latex] is sufficient if the population distribution is unknown, with larger sample sizes providing a better approximation. If the population itself is normally distributed, then the sampling distribution is normal regardless of the sample size.
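To see the effect of sample size, one option is to sample from a clearly non-normal population and watch the sample means become more symmetric as [latex]n[/latex] grows. The sketch below assumes an exponential population (strongly right-skewed) purely for illustration, and uses skewness as a rough measure of how far the shape is from symmetric; the helper names `skewness` and `sample_means` are made up for this example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

def skewness(values):
    """Average cubed z-score; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def sample_means(n, num_samples=5_000):
    """Means of num_samples random samples of size n from an exponential population."""
    means = []
    for _ in range(num_samples):
        sample = [random.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

# Skewness of the sampling distribution for several sample sizes.
results = {}
for n in (1, 5, 30, 100):
    results[n] = skewness(sample_means(n))
    print(f"n = {n:3d}: skewness of sample means = {results[n]:.3f}")
```

For [latex]n=1[/latex] the sample means are just individual observations and stay heavily skewed, while for larger [latex]n[/latex] the skewness should shrink toward zero, which is what the CLT leads us to expect.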

EXAMPLE

The amount of coffee that people drink per day is normally distributed with a mean of [latex]16[/latex] ounces and a standard deviation of [latex]6[/latex] ounces. [latex]18[/latex] randomly selected people are surveyed. Round all answers to [latex]4[/latex] decimal places where possible.

  1. What is the distribution of [latex]X[/latex]?
  2. What is the distribution of [latex]\bar X[/latex]?
  3. What is the probability that one randomly selected person drinks between [latex]15.5[/latex] and [latex]16.5[/latex] ounces of coffee per day?
  4. For the [latex]18[/latex] people, find the probability that the average coffee consumption is between [latex]15.5[/latex] and [latex]16.5[/latex] ounces of coffee per day.
  5. For part [latex](4)[/latex], is the assumption that the distribution is normal necessary?
  6. Find the IQR for the average of [latex]18[/latex] coffee drinkers.
SHOW SOLUTION

In this question, the random variable [latex]X[/latex] is the amount of coffee that people drink per day, which we’re told is normally distributed with a mean of [latex]16[/latex] ounces and a standard deviation of [latex]6[/latex] ounces.

  1. [latex]X \sim N\left(16,6\right)[/latex].
  2. [latex]\bar X[/latex] is a random variable representing sample means, and from CLT we know that the distribution of sample means is approximately normal with mean [latex]\mu_{\bar x} =\mu[/latex] and standard deviation [latex]\sigma_{\bar x} = \dfrac{\sigma}{\sqrt n}[/latex]. So the answer is: [latex]\bar X \sim N\left(16,\dfrac{6}{\sqrt {18}}\right).[/latex]
ADVICE  Since this question asks for [latex]4[/latex] decimal places and StatKey results are given to [latex]3[/latex] decimal places, we’ll focus on DESMOS for this question. However, the StatKey method is very simple, so you may want to use StatKey first to get an idea of what the answer should be and then use DESMOS for the desired level of accuracy as needed.
DESMOS CALCULATOR
3. …probability that one randomly selected person drinks between [latex]15.5[/latex] and [latex]16.5[/latex] ounces of coffee per day
In the first input box of the Desmos calculator, enter: \[X=\text{normaldist}(16,6)\] and click on the magnifying glass icon to see the normal curve. Since we want the probability/area, select the checkbox for Find Cumulative Probability (CDF) and enter 15.5 for Min and 16.5 for Max. The area is shaded and the answer is displayed. Area/Probability [latex]=0.0664135037052[/latex], or [latex]0.0664[/latex] to [latex]4[/latex] decimal places.

4. For the [latex]18[/latex] people, find the probability that the average coffee consumption is between [latex]15.5[/latex] and [latex]16.5[/latex] ounces of coffee per day.
First click on the circle graph icon to the left of the input entry in the first box to turn off the graph from the last part. In the second input box of the Desmos calculator, enter: \[Y=\text{normaldist}(16,6/\text{sqrt}(18))\] and click on the magnifying glass icon to see the normal curve. Since we want the probability/area, select the checkbox for Find Cumulative Probability (CDF) and enter 15.5 for Min and 16.5 for Max. The area is shaded and the answer is displayed. Area/Probability [latex]=0.276326390168[/latex], or [latex]0.2763[/latex] to [latex]4[/latex] decimal places.

5. YES. The assumption that the population distribution is normal is necessary here for the sampling distribution of sample means to be normal, because the sample size [latex]n=18[/latex] is less than [latex]30[/latex], which is too small to rely on the CLT alone.

6. …IQR for the average of [latex]18[/latex] coffee drinkers
Recall that to find the value corresponding to a given area, we need inversecdf, which takes the area to the left of the value.
For [latex]Q_1[/latex]: If the area to the left is 25%, then in input box #3, we enter: \[Y.\text{inversecdf}(0.25)\] which gives us the value of [latex]Q_1[/latex] as [latex]15.0461274476[/latex].

For [latex]Q_3[/latex]: The area to the left of [latex]Q_3[/latex] is 75%.
In input box #4, enter: \[Y.\text{inversecdf}(0.75)\] Answer (displayed underneath the entry) = [latex]16.9538725524[/latex].
IQR = [latex]Q_3 - Q_1 = 1.9077451048[/latex].


Show me the steps for STATKEY


3. …probability that one randomly selected person drinks between [latex]15.5[/latex] and [latex]16.5[/latex] ounces of coffee per day
Which distribution applies here: [latex]X[/latex] or [latex]\bar X[/latex] ?
We’re asked about the number of ounces of coffee one person drinks per day, which is the random variable [latex]X[/latex]. So, we’ll need to use the distribution of [latex]X[/latex] from part (1).
Probability that one randomly selected person drinks between [latex]15.5[/latex] and [latex]16.5[/latex] ounces of coffee per day [latex]= P(15.5 \le X \le 16.5)[/latex]
Between 15.5 and 16.5 is an interval of values of the random variable [latex]X[/latex]. We’ll need to find the probability, which is to say we need to find the area under the curve in between 15.5 and 16.5.
On StatKey main page, click on NORMAL under Theoretical Distributions.
Click on Edit Parameters and enter your MEAN and STANDARD DEVIATION. For this question, mean [latex]=16[/latex] and standard deviation [latex]=6[/latex].
Select both Right Tail and Left Tail checkboxes at the top left. Click on the blue box below the left tail and change that to 15.5 and hit enter. Next, click on the blue box below the right tail and change that to 16.5 and hit enter. We’re looking for the area in between. Why?
Answer: 0.066. 

4. For the [latex]18[/latex] people, find the probability that the average coffee consumption is between [latex]15.5[/latex] and [latex]16.5[/latex] ounces of coffee per day.
Again, which distribution applies here: [latex]X[/latex] or [latex]\bar X[/latex] ?
Notice that in this question, we are not asked about an individual amount, but rather an average amount of daily coffee consumption for the sample of [latex]18[/latex] people. Because we’re asked about the sample average (sample mean), we’ll need to use the sampling distribution of sample means, [latex]\bar X[/latex].
Click on Edit Parameters and enter your MEAN and STANDARD DEVIATION for the random variable [latex]\bar X[/latex], which from part (2) is: mean [latex]=16[/latex] and standard deviation =[latex]\frac{6}{\sqrt {18}}[/latex].
Select both Right Tail and Left Tail checkboxes at the top left. Click on the blue box below the left tail and change that to 15.5 and hit enter. Next, click on the blue box below the right tail and change that to 16.5 and hit enter. We’re looking for the area in between. Answer: 0.276

5. YES. The assumption that the population distribution is normal is necessary here for the sampling distribution of sample means to be normal, because the sample size [latex]n=18[/latex] is less than [latex]30[/latex], which is too small to rely on the CLT alone.

6. …IQR for the average of [latex]18[/latex] coffee drinkers
Once again, the applicable distribution is that of [latex]\bar X[/latex]. Why?
For the IQR we need [latex]Q_1[/latex] and [latex]Q_3[/latex]. The lower quartile is the location where [latex]25\%[/latex] of the values are to the left of the quartile value. So, what’s the location that has [latex]25\%[/latex] or [latex]0.25[/latex] of values to its left? We need to find that cut off point.
Deselect all checkboxes except the Left Tail checkbox, or make sure Left Tail is selected. Click on the blue box representing area (in the middle of the graph) and change the value to 0.25 if it is not so already. When you hit enter, the value below the number line will be your cut off point. [latex]Q_1[/latex]: 15.046.
For [latex]Q_3[/latex], the area to its left will be [latex]75\%[/latex] or [latex]0.75[/latex]. Alternatively, the area to the right of [latex]Q_3[/latex] will be [latex]25\%[/latex] or [latex]0.25[/latex]. So, you can select either the left tail or the right tail and supply the appropriate area to find the cut off point that will be [latex]Q_3[/latex]. [latex]Q_3[/latex]: 16.954
Answer: IQR = 16.954 – 15.046  = 1.908
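
The answers to parts 3, 4 and 6 can also be cross-checked outside Desmos and StatKey. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`, which is not one of the tools used in this guide; the variable names are chosen only for this example.

```python
from statistics import NormalDist

mu, sigma, n = 16, 6, 18

X = NormalDist(mu, sigma)                  # one person's daily coffee amount
X_bar = NormalDist(mu, sigma / n ** 0.5)   # mean amount for a sample of 18 people

# Parts 3 and 4: area between 15.5 and 16.5 under each curve.
p_individual = X.cdf(16.5) - X.cdf(15.5)
p_average = X_bar.cdf(16.5) - X_bar.cdf(15.5)

# Part 6: quartiles of the sampling distribution (inv_cdf takes area to the left).
q1 = X_bar.inv_cdf(0.25)
q3 = X_bar.inv_cdf(0.75)
iqr = q3 - q1

print(f"P(15.5 < X < 16.5)     = {p_individual:.4f}")
print(f"P(15.5 < X-bar < 16.5) = {p_average:.4f}")
print(f"Q1 = {q1:.4f}, Q3 = {q3:.4f}, IQR = {iqr:.4f}")
```

The printed values should agree with the Desmos results to 4 decimal places. Notice how much larger the probability is for the sample average than for a single person, because the distribution of [latex]\bar X[/latex] is much narrower.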

 

CLT for Sample Proportions

In our earlier example with the coronavirus, if instead we were interested in knowing the proportion of people who do not believe that Covid-19 has caused a public health crisis, [latex]p[/latex], then in our sample we would compute the sample proportion, [latex]\hat p[/latex] (pronounced p-hat). The sample proportion is calculated as [latex]\hat p=\dfrac{x}{n}[/latex], where [latex]x[/latex] is the number of people in the sample who do not believe that the virus has caused a health crisis, and [latex]n[/latex] is the sample size. If we repeat the same process of repeated sampling as with the means and draw a histogram or dot plot of the sample proportions from each sample, the distribution that we’d obtain is called the sampling distribution of the sample proportions. The average of the sample proportions from those repeated samples will get closer and closer to the actual population proportion, [latex]\mu_{\hat p} \rightarrow p[/latex], and the standard deviation of the sample proportions, [latex]\sigma_{\hat p}[/latex], will equal [latex]\sqrt {\dfrac{pq}{n}}[/latex], where [latex]q = 1-p[/latex]. The standard deviation of the sampling distribution of sample proportions, [latex]\sigma_{\hat p}[/latex], is also known as the standard error of the sample proportion (SE).

The Central Limit Theorem (CLT) for Sample Proportions
The sampling distribution of the sample proportion is approximately normal with a mean of [latex]\mu_{\hat p}=p[/latex] and a standard deviation of [latex]\sigma_{\hat p} = \sqrt {\dfrac{pq}{n}}[/latex], where [latex]q = 1-p[/latex], provided [latex]np> 5[/latex] and [latex]nq > 5[/latex].
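
As with sample means, this statement can be explored by simulation. The sketch below uses a made-up population proportion of [latex]p=0.30[/latex] and samples of size [latex]n=100[/latex] (both are assumptions chosen only for illustration), and the helper name `sample_proportion` is hypothetical.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

p = 0.30              # hypothetical population proportion (assumed)
q = 1 - p
n = 100               # size of each sample (assumed)
num_samples = 5_000   # how many times the sampling is repeated

# Check the conditions for the normal approximation.
conditions_met = n * p > 5 and n * q > 5

def sample_proportion(n, p):
    """Draw one random sample of size n and return p-hat = x / n."""
    x = sum(random.random() < p for _ in range(n))
    return x / n

p_hats = [sample_proportion(n, p) for _ in range(num_samples)]

mean_p_hat = statistics.fmean(p_hats)
sd_p_hat = statistics.pstdev(p_hats)
standard_error = (p * q / n) ** 0.5

print(f"np > 5 and nq > 5?       {conditions_met}")
print(f"average of sample p-hats = {mean_p_hat:.4f} (p = {p})")
print(f"sd of sample p-hats      = {sd_p_hat:.4f}")
print(f"sqrt(pq/n)               = {standard_error:.4f}")
```

The average of the simulated sample proportions should be close to [latex]p[/latex], and their standard deviation should be close to [latex]\sqrt{\dfrac{pq}{n}}[/latex].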

EXAMPLE

A furniture company wants to know if the percent of their customers satisfied with their furniture purchase has decreased from last year’s rating of 91%. The company surveys 152 customers to gather feedback.
1. What is the distribution of [latex]\hat P[/latex]?

2. What is the probability that more than 95% of those surveyed are satisfied with their purchase?

3. What is the probability that between 88% and 92% of those surveyed are satisfied with their purchase?

4. What is the range of satisfaction percentage rating that will place customer satisfaction in the middle 95% of the sampling distribution?

SHOW SOLUTION

1. We’re comparing with last year’s rating of 91%, so [latex]p=0.91[/latex]. The sample size is [latex]n=152[/latex]. Since [latex]np = 138.32[/latex] and [latex]nq = 13.68[/latex] are both greater than 5, we know from the CLT that the sampling distribution of sample proportions ([latex]\hat p[/latex]s) will be approximately normal with a center (mean) at [latex]\mu_{\hat p}=0.91[/latex] and a standard deviation (std. err.) of [latex]\sqrt {\dfrac{p(1-p)}{n}} = \sqrt {\dfrac{0.91(1-0.91)}{152}} = 0.0232124059389[/latex]. Since the random variable [latex]\hat P[/latex] approximately follows a normal distribution, we can write: [latex]\hat P \sim N(0.91,0.0232124059389 )[/latex]. For the rest of the questions, we’ll need to use technology.

ADVICE  Since StatKey results are given to [latex]3[/latex] decimal places and we want [latex]4[/latex] decimal places of accuracy for this question, we’ll focus on DESMOS. However, the StatKey method is very simple, so you should use StatKey first to get an idea of what the answer should be and then use DESMOS for the desired level of accuracy. See previous examples for STATKEY usage.
DESMOS CALCULATOR

In the first input box on Desmos calculator, enter: \[X=\text{normaldist}(0.91,0.0232124059389)\] and click on the magnifying glass icon to see the normal curve.

2. What is the probability that more than 95% of those surveyed are satisfied with their purchase?
Since we want the probability/area, select the checkbox for Find Cumulative Probability (CDF) and enter 0.95 for Min. Leave the Max blank. Area is shaded and the answer is displayed. Area/Probability [latex]=0.0424246940756[/latex].

3. What is the probability that between 88% and 92% of those surveyed are satisfied with their purchase?
Change the Min to 0.88 and Max to 0.92 for the input boxes under Find Cumulative Probability (CDF). Area is shaded and the answer is displayed. Area/Probability [latex]=0.568587404585[/latex].
NOTE: Leaving Min or Max entry blank will default their values to [latex]-\infty[/latex] and [latex]\infty[/latex] respectively. To enter infinity [latex](\infty)[/latex], just delete any input entered in the box and click elsewhere on the window.

4. What is the range of satisfaction percentage rating that will place customer satisfaction in the middle 95% of the sampling distribution?
Unlike in the previous two parts of the question, the 95% here refers to the area in the middle of the sampling distribution. We need to find the cut off points that separate the middle 95% of the sampling distribution of sample proportions. We use inversecdf to find cutoff points. Inversecdf takes area to the left.

If the area in the middle is 95%, then the two tail areas shaded in blue must have their areas add up to 5% which means the area of each tail is 2.5%. So the left boundary cut off point for the middle 95% has 2.5% area to its left whereas the right boundary cut off has 97.5% area to its left.

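Lower (Left) cut off = [latex]X.\text{inversecdf}(0.025)=0.864504520365[/latex]

Upper (Right) cut off = [latex]X.\text{inversecdf}(0.975)=0.955495479635[/latex]

So the middle 95% of the sampling distribution corresponds to satisfaction ratings between about 86.45% and 95.55%.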
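
The furniture example can be cross-checked in the same spirit. This is a minimal sketch using Python's standard-library `statistics.NormalDist` rather than the tools used in this guide; the variable names are chosen only for this example.

```python
from statistics import NormalDist

p, n = 0.91, 152
q = 1 - p

# Part 1: check the conditions and build the sampling distribution of p-hat.
conditions_met = n * p > 5 and n * q > 5
se = (p * q / n) ** 0.5        # standard error of the sample proportion
P_hat = NormalDist(p, se)

# Part 2: area to the right of 0.95.
p_more_than_95 = 1 - P_hat.cdf(0.95)

# Part 3: area between 0.88 and 0.92.
p_between = P_hat.cdf(0.92) - P_hat.cdf(0.88)

# Part 4: cut off points for the middle 95% (inv_cdf takes area to the left).
lower = P_hat.inv_cdf(0.025)
upper = P_hat.inv_cdf(0.975)

print(f"standard error         = {se:.6f}")
print(f"P(p-hat > 0.95)        = {p_more_than_95:.4f}")
print(f"P(0.88 < p-hat < 0.92) = {p_between:.4f}")
print(f"middle 95%: {lower:.4f} to {upper:.4f}")
```

The printed values should agree with the Desmos results once rounded to 4 decimal places.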

SUMMARY

Suppose [latex]X[/latex] is a random variable describing a population. If we repeatedly take samples of size [latex]n[/latex] from this population, then:

CLT for SAMPLE MEANS

  • The shape of the sampling distribution of sample means will be:
    1. approximately normal if the sample size [latex]n[/latex] is large enough (usually [latex]n\ge30[/latex]) OR
    2. normal if the random variable [latex]X[/latex] is normally distributed (that is, the population is normally distributed) regardless of the sample size [latex]n[/latex]
  • The sampling distribution of sample means [latex]\bar x[/latex] will have a mean [latex]\mu_{\bar x} = \mu[/latex] and standard deviation (also called the standard error of the mean) [latex]\sigma_{\bar x}=\dfrac{\sigma}{\sqrt n}[/latex]

CLT for SAMPLE PROPORTIONS

  • The shape of the sampling distribution of sample proportions will be approximately normal if [latex]np> 5[/latex] AND [latex]nq > 5[/latex] where [latex]q = 1-p[/latex].
    Note that some textbooks require at least 10 for the above.
  • The sampling distribution of sample proportions [latex]\hat p[/latex] will have a mean [latex]\mu_{\hat p} = p[/latex] and standard deviation (also called the standard error of the proportion) [latex]\sigma_{\hat p}=\sqrt {\dfrac{p(1-p)}{n}}[/latex]

 

License


Statistics Study Guide Copyright © by Ram Subedi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, except where otherwise noted.
