2.7 Measures of the Spread of the Data

Ram Subedi

Spread of the Data

Knowing the center of a data set alone does not give us enough information to paint the full picture. Two data sets can have the same measure of central tendency (such as the mean) yet contain entirely different values. We are also interested in how spread out the data values are, so in this section we formalize ways to measure the spread of the values in a data set.

The range of a data set gives a quick look at the spread of the data. It is the distance between the minimum and maximum values of the data set.

\[\text{Range} = \text{Maximum Value} - \text{Minimum Value}\]

Although the range can be calculated quickly, it has one major disadvantage: it doesn't tell us where on the number line the data values are. A data set ranging from -10 to 15 has the same range as another that ranges from 75 to 100. Knowing only that the range is 25, we still can't say much about the general shape, clusters, or center of the data set.
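As a small illustration of this limitation, the sketch below computes the range of two made-up data sets whose endpoints match the ones mentioned above. The values in between and the helper name `data_range` are assumptions chosen only for the example.

```python
# Two hypothetical data sets; only the endpoints come from the text above.
data_a = [-10, -4, 0, 3, 9, 15]
data_b = [75, 76, 77, 78, 79, 100]


def data_range(values):
    """Return the maximum value minus the minimum value."""
    return max(values) - min(values)


range_a = data_range(data_a)
range_b = data_range(data_b)

# Both ranges are 25, even though the data sit in very different places
# on the number line and are distributed differently between the endpoints.
print(range_a, range_b)
```

The identical result for both lists shows why the range alone says little about where the data are or how they cluster.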

The standard deviation is a number that measures how far data values are from their mean. It is the most common measure of variation, or spread, and it summarizes the overall variation in a data set.

The standard deviation is always positive or zero. The standard deviation is small when the data are all concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.

If all the values of a data set are the same number, what can be said about the standard deviation of the data set?

The standard deviation can be used to determine whether a data value is close to or far from the mean.
We'll look at z-scores to answer the question: How many standard deviations away from the mean?
Often we hear the term variance, which is simply the square of the standard deviation.

Also note that if the standard deviation (or variance) is computed from sample data, it is a sample statistic, whereas if the data set represents a population, the standard deviation (or variance) is called a parameter. Recall that we typically use Greek letters for parameters, so the population standard deviation is represented by [latex]\sigma[/latex] (pronounced sigma) and the sample standard deviation is represented by the letter [latex]s[/latex].
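For readers who have Python available, here is a minimal sketch of the sample/population distinction using the standard library's `statistics` module. Python is not one of the tools used in this course, and the data values are made up for illustration.

```python
import statistics

# Hypothetical data values, used only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Treat the data as a sample: these are the statistics s and s^2.
sample_sd = statistics.stdev(data)
sample_var = statistics.variance(data)

# Treat the data as an entire population: these are the parameters
# sigma and sigma^2.
pop_sd = statistics.pstdev(data)
pop_var = statistics.pvariance(data)

print(f"sample:     s     = {sample_sd:.4f}, s^2     = {sample_var:.4f}")
print(f"population: sigma = {pop_sd:.4f}, sigma^2 = {pop_var:.4f}")
```

In both cases the variance is the square of the corresponding standard deviation, and the two versions give different numbers for the same data, so it matters which one you ask the technology to compute.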

Video: Measures of Spread (You may skip the part about Coefficient of Variation)

Summary of Common Symbols

Statistic/Parameter	Population	Sample
Mean	[latex]\mu[/latex]	[latex]\bar x[/latex]
Standard Deviation	[latex]\sigma[/latex]	[latex]s[/latex]
Variance	[latex]\sigma^2[/latex]	[latex]s^2[/latex]

What is one major difference between the way standard deviation is computed for samples vs populations?

Computing standard deviations is generally done with the help of technology. In this course, you can use the SUBEDI calculators, DESMOS, StatKey, or a TI-83/84+ for calculations rather than computing everything by hand. Your focus should be on what the standard deviation tells us about the data: it is a number that measures how far the data are spread from the mean. Let a calculator or computer do the arithmetic.

USING TECHNOLOGY to Calculate Standard Deviations (You do NOT need to sort data for any of the following)
This is the same method that was used earlier for calculating the five-number summary.

SUBEDI Calculator

Go to One Variable Statistics @ rsubedi.com

Data Input

✅ RAW DATA

🔲 FREQUENCY TABLE

Data Type

✅ SAMPLE

🔲 POPULATION

Enter Data
Enter your data in the spreadsheet column shown for data entry. You can copy your data from the original source (a comma-separated or otherwise delimited list, spreadsheet data, etc.) and paste it into the table on the calculator.

CALCULATE (Results show in a panel to the right. The five-number summary will be at the bottom of the results panel.)

DESMOS

Enter data as a list. For example, type: $L=[22,35,15,26,40,28,18,20,25,34,39,42,24,22,19,27,22,34,40,20,38,28]$

For sample standard deviation, type the following in the next input box:
stddev(L) or stdev(L)

For population standard deviation, type:

stddevp(L) or stdevp(L)
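If you would like to cross-check the DESMOS results, the same list can be run through Python's standard `statistics` module. This is only a sketch for comparison; Python is not one of the course tools, and the variable names are arbitrary.

```python
import statistics

# The same list used in the DESMOS example above.
L = [22, 35, 15, 26, 40, 28, 18, 20, 25, 34, 39,
     42, 24, 22, 19, 27, 22, 34, 40, 20, 38, 28]

n = len(L)
mean = statistics.mean(L)
s = statistics.stdev(L)       # sample standard deviation
sigma = statistics.pstdev(L)  # population standard deviation

print(f"n = {n}, mean = {mean:.4f}")
print(f"sample standard deviation s = {s:.4f}")
print(f"population standard deviation sigma = {sigma:.4f}")
```

Here `statistics.stdev` plays the role of the sample command and `statistics.pstdev` the role of the population command, so the printed values should agree with what DESMOS reports up to rounding.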

STATKEY

Go to Descriptive Statistics for One Quantitative Variable from the main StatKey page.

Click on Edit Data to enter your data.

Results will show on the right under Summary Statistics.

Sample size, $n$
Mean, $\bar x$
Standard Deviation, $s$
FIVE NUMBER SUMMARY

Comparing Values from Different Data Sets: z-Scores
The standard deviation is useful when comparing data values that come from different data sets. If the data sets have different means and standard deviations, then comparing the data values directly can be misleading. But if we calculate how far (or how many standard deviations) away a given data value lies from the mean, we can then meaningfully make the comparison.

So, for a data value, the answer to the question "How many standard deviations away is this value from the mean?" is its z-score. The z-score of a particular data point is the number of standard deviations it lies from the mean of the data set; it is positive for values above the mean and negative for values below it. We can calculate z-scores using the following formula:

\[z=\frac{\text{value}-\text{mean}}{\text{standard deviation}} = \frac{x-\mu}{\sigma}.\]
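To see how the formula supports a comparison across data sets, consider the following sketch. The two exams, their means and standard deviations, and the function name `z_score` are all hypothetical.

```python
def z_score(value, mean, sd):
    """Return how many standard deviations `value` lies from `mean`."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / sd


# Hypothetical scenario: two students take different exams.
# Student A scores 85 on an exam with mean 75 and standard deviation 5.
# Student B scores 90 on an exam with mean 80 and standard deviation 10.
z_a = z_score(85, 75, 5)
z_b = z_score(90, 80, 10)

print(f"Student A: z = {z_a}")
print(f"Student B: z = {z_b}")
```

Student B has the higher raw score, but Student A's score is 2 standard deviations above that exam's mean while Student B's is only 1, so Student A did better relative to the group.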

Video: About Z-scores

The Empirical Rule: If the distribution is bell-shaped and symmetric, then approximately:

68% of the data values in the data set are within 1 standard deviation of the mean (above or below)
95% of the data values in the data set are within 2 standard deviations of the mean (above or below)
99.7% of the data values in the data set are within 3 standard deviations of the mean (above or below)

We can see that almost all of the data values are expected to be within 3 standard deviations from the mean.
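The intervals in the Empirical Rule are straightforward to compute once the mean and standard deviation are known. The sketch below assumes a bell-shaped distribution with a made-up mean of 100 and standard deviation of 15; the function name is hypothetical.

```python
def empirical_rule_intervals(mean, sd):
    """Return {k: (low, high, approximate percent)} for k = 1, 2, 3."""
    percentages = {1: 68.0, 2: 95.0, 3: 99.7}
    return {
        k: (mean - k * sd, mean + k * sd, pct)
        for k, pct in percentages.items()
    }


# Hypothetical bell-shaped data with mean 100 and standard deviation 15.
intervals = empirical_rule_intervals(100, 15)

for k, (low, high, pct) in intervals.items():
    print(f"About {pct}% of values lie between {low} and {high} "
          f"(within {k} standard deviation(s) of the mean)")
```

These percentages are approximations that apply only when the distribution is roughly bell-shaped and symmetric.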

For additional practice, visit: Empirical rule @KhanAcademy

Chebyshev’s Rule states that at least 75%, 89%, and 95% of the data lie within 2, 3, and 4.5 standard deviations of the mean, respectively, for ANY data set, regardless of the distribution of the data.
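These percentages come from Chebyshev's inequality, which guarantees that at least 1 - 1/k² of the data lie within k standard deviations of the mean for any k greater than 1. The formula is not stated in this section, so the sketch below is supplementary, and the function name is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


# The three values of k mentioned in the text.
bounds = {k: chebyshev_lower_bound(k) for k in (2, 3, 4.5)}

for k, bound in bounds.items():
    print(f"Within {k} standard deviations: at least {bound:.1%}")
```

Because the rule makes no assumption about the shape of the distribution, these are guaranteed minimums rather than estimates; bell-shaped data will usually have considerably more values inside each interval.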

Video: Empirical and Chebyshev's Rules

License

Creative Commons Attribution-ShareAlike 4.0 International License
