2.7 Measures of the Spread of the Data
Spread of the Data
Knowing the center of a data set is not enough to describe it fully. Two data sets can have the same measure of central tendency (such as the mean) yet contain entirely different values. We are therefore also interested in how spread out the data values are, and in this section we formalize a process for measuring that spread.
The range of a data set gives a quick look at the spread of the data. It is the distance between the minimum and maximum values of the data set.
Range = Maximum Value − Minimum Value
Although the range can be calculated quickly, it has one major disadvantage: it doesn’t tell us where on the number line the data values are. A data set ranging from -10 to 15 has the same range as another that ranges from 75 to 100. If all we are told is that a data set has a range of 25, we still can’t say much about its general shape, clusters, or center.
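A short Python sketch makes this limitation concrete. The two lists are made-up values chosen only to span -10 to 15 and 75 to 100, and the helper name `data_range` is a hypothetical name used for illustration.

```python
def data_range(values):
    """Return the range: maximum value minus minimum value."""
    return max(values) - min(values)

# Hypothetical data sets, invented for illustration
low_set = [-10, -4, 0, 3, 9, 15]
high_set = [75, 76, 77, 78, 99, 100]

print(data_range(low_set))   # 25
print(data_range(high_set))  # 25
```

Both data sets report a range of 25, even though they sit in different places on the number line and cluster differently.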
The standard deviation is a number that measures how far data values are from their mean. It is the most common measure of variation, or spread.
The standard deviation provides a measure of the overall variation in a data set.
The standard deviation is always positive or zero. The standard deviation is small when the data are all concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.
The standard deviation can be used to determine whether a data value is close to or far from the mean.
We’ll look at z-scores to answer the question: How many standard deviations away from the mean?
Often we hear the term variance, which is simply the square of the standard deviation.
Also note that if the standard deviation (or variance) is computed from sample data, it is a sample statistic, whereas if the data set represents a population, the standard deviation (or variance) is called a parameter. Recall that we typically use Greek letters for parameters, so the population standard deviation is represented by [latex]\sigma[/latex] (pronounced sigma) and the sample standard deviation by the letter [latex]s[/latex].
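As a rough illustration of the sample versus population distinction, the sketch below uses Python's standard `statistics` module, where `pstdev` treats the data as a whole population and `stdev` treats it as a sample (dividing by one less than the number of values). The data values are invented for illustration only.

```python
import statistics

# Hypothetical data values, invented for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Treat the data as an entire population (sigma)
sigma = statistics.pstdev(data)
# Treat the same data as a sample drawn from a larger population (s)
s = statistics.stdev(data)

print(sigma)  # 2.0
print(s)      # about 2.14, slightly larger than sigma

# The variance is the square of the standard deviation
population_variance = statistics.pvariance(data)
sample_variance = statistics.variance(data)
```

For the same values, the sample standard deviation comes out a little larger than the population standard deviation, which is why it matters which one a calculator reports.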
Video: Measures of Spread (You may skip the part about Coefficient of Variation)
Summary of Common Symbols
Measure | Population | Sample |
Mean | [latex]\mu[/latex] | [latex]\bar x[/latex] |
Standard Deviation | [latex]\sigma[/latex] | [latex]s[/latex] |
Variance | [latex]\sigma^2[/latex] | [latex]s^2[/latex] |
USING TECHNOLOGY to Calculate Standard Deviations (You do NOT need to sort data for any of the following)
This is the same method that was used earlier for calculating the five-number summary.
SUBEDI Calculator
Go to One Variable Statistics @ rsubedi.com
Data Input
Data Type
Enter Data
Enter your data in the spreadsheet column shown for data entry. You can copy your data from the original source (a comma-separated list, spreadsheet data, etc.) and paste it into the table on the calculator.
CALCULATE (Results show in a panel to the right. The five-number summary will be at the bottom of the results panel.)
DESMOS
Enter data as a list. For example, type: .
For sample standard deviation, type the following in the next input box:
stddev(L) or stdev(L)
For population standard deviation, type:
STATKEY
Results will show on the right under Summary Statistics.
Mean,
Standard Deviation,
FIVE NUMBER SUMMARY
Comparing Values from Different Data Sets: z-Scores
The standard deviation is useful when comparing data values that come from different data sets. If the data sets have different means and standard deviations, then comparing the data values directly can be misleading. But if we calculate how far (or how many standard deviations) away a given data value lies from the mean, we can then meaningfully make the comparison.
The answer to the question "How many standard deviations away from the mean?" is the z-score. The z-score of a particular data value is the number of standard deviations it lies away from the mean of its data set. We can calculate z-scores using the following formula:
\[z=\frac{\text{value}-\text{mean}}{\text{standard deviation}} = \frac{x-\mu}{\sigma}.\]
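The formula can be sketched in a few lines of Python. The exam scores, means, and standard deviations below are hypothetical numbers chosen for illustration, and `z_score` is just an illustrative helper name.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Hypothetical exam results, invented for illustration
z_math = z_score(85, mean=70, sd=10)  # 1.5
z_art = z_score(90, mean=80, sd=8)    # 1.25

print(z_math, z_art)
```

Although 90 is the higher raw score, the 85 lies farther above its own class mean in standard deviation units (1.5 versus 1.25), so it is the more exceptional result relative to its data set.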
Video: About Z-scores
The Empirical Rule: If the distribution is bell-shaped and symmetric, then approximately:
- 68% of the data values in the data set are within 1 standard deviation of the mean (above or below)
- 95% of the data values in the data set are within 2 standard deviations of the mean (above or below)
- 99.7% of the data values in the data set are within 3 standard deviations of the mean (above or below)
We can see that almost all of the data values are expected to lie within 3 standard deviations of the mean.
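The rounded percentages in the rule can be checked against an idealized bell curve. The sketch below assumes an exactly normal distribution (a stronger assumption than "bell-shaped and symmetric") and uses `NormalDist` from Python's standard `statistics` module; `within_k_sd` is a hypothetical helper name.

```python
from statistics import NormalDist

def within_k_sd(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Real data that is only roughly bell-shaped will match these proportions only approximately, which is why the rule is stated with "approximately".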
Practice
For additional practice, visit: Empirical rule @KhanAcademy
Chebyshev’s Rule states that the percentages of data within 2, 3, and 4.5 standard deviations of the mean are at least 75%, 89%, and 95%, respectively, for ANY data set, regardless of the distribution of the data.
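These percentages come from the general Chebyshev bound: for any k greater than 1, at least [latex]1 - \frac{1}{k^2}[/latex] of the data lie within k standard deviations of the mean. A minimal Python sketch of that bound follows; the function name `chebyshev_lower_bound` is a hypothetical name used for illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

for k in (2, 3, 4.5):
    print(k, round(chebyshev_lower_bound(k), 3))
# 2 0.75
# 3 0.889
# 4.5 0.951
```

Because the bound holds for every data set, it is a guaranteed minimum rather than an estimate; bell-shaped data typically has much more than 75% of its values within 2 standard deviations.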