Statistics 101: Exploring measures of dispersion

Catalogue number: 892000062020003

Release date: May 3, 2021 Updated: February 7, 2023

How do we describe data in just a few simple terms? Two really important features of a dataset are the location of the centre, or balance point, and the size of the spread.

Try thinking of it this way: if we were to hold the data in our hands, would they be densely concentrated in one spot like a golf ball, or all over the place like cotton candy? The balance point of data is called the central tendency. The size of the region the data cover, and how spread out they are, is called dispersion. In this video, we will explore the concept of dispersion. However, as a prerequisite to this video, we highly recommend first watching our video called "Statistics 101: Exploring measures of central tendency" as some concepts such as the mean will be discussed in this video.

Data journey step
Explore, clean, describe
Data competency
  • Data exploration
  • Data interpretation
Audience
Basic
Suggested prerequisites
Statistics 101: Exploring measures of central tendency
Length
12:07
Cost
Free

Statistics 101: Exploring measures of dispersion - Transcript

(The Statistics Canada symbol and Canada wordmark appear on screen with the title: "Statistics 101 Exploring measures of dispersion")

Statistics 101: Exploring measures of dispersion

How do we describe data in just a few simple terms? Two really important features of a dataset are the location of the center, or balance point, and the size of the spread. Try thinking of it this way: if we were to hold data in our hands, would they be densely concentrated in one spot like a golf ball, or all over the place like cotton candy? The balance point of data is called the central tendency. The size of the region the data cover, and how spread out they are, is called dispersion. In this video, we will explore the concept of dispersion. However, as a prerequisite to this video, we highly recommend first watching our video called "Exploring Measures of Central Tendency" as some concepts such as the mean will be discussed in this video.

Learning goals

By the end of this video, you should have a basic understanding of such measures of dispersion as range, interquartile range and standard deviation. This video is intended for learners looking to gain a basic understanding of the concept of dispersion, also called variability, what it means and some key related concepts that are used to explore data.

Measures of dispersion

In statistics, dispersion is the extent to which a distribution is stretched or squeezed. Imagine you are expecting a package in the mail. Usually, the mail arrives anytime between 8 a.m. and 4 p.m., which means, if you want to be there when it arrives, your whole day may be spent at home waiting. But, if you know that the mail usually arrives between 8 and 10 a.m., you have a better indication of when to expect it. Measures of dispersion also give an indication of how well the measures of central tendency, such as the mean, describe the distribution of values in the dataset. This is useful when using sample data to draw conclusions about behaviors and characteristics of the entire population. Measures of dispersion are also important because they help us make informed decisions about how to analyze the data and how much uncertainty they contain.

Steps of a data journey

(Text on screen: Supported by a foundation of stewardship, metadata, standards and quality)

(Diagram of the Steps of the data journey: Step 1 - define, find, gather; Step 2 - explore, clean, describe; Step 3 - analyze, model; Step 4 - tell the story. The data journey is supported by a foundation of stewardship, metadata, standards and quality.)

This diagram is a visual representation of the data journey from collecting the data to cleaning, exploring, describing and understanding the data, to analyzing the data and lastly, to communicating with others the story the data tell.

Step 2: Explore, clean and describe

(Diagram of the Steps of the data journey with an emphasis on Step 2 - explore, clean and describe)

Exploring measures of dispersion is part of the explore, clean and describe step in the data journey.

What does the spread of data look like?

(Graph representing the number of pizza deliveries as a function of delivery times in a bell shape Normal distribution)

Before we begin, let's take a quick look at some common ways that data are spread, that is, clustered together or spread out. The distribution of data is often represented using scatter plots or histograms. Their shapes show the spread of the dataset. Data can be represented graphically in a symmetrical bell shape, as we see here in a graph of pizza delivery times, where most of the values are clustered in the middle between 20 and 40 minutes, while some pizza deliveries take less time and others take longer. This is what is called a normal distribution, and we will talk more about that later.

(2 separate graphs on the left and right representing a Normal distribution that is positively and negatively skewed, respectively)

If the dataset is not symmetrical, but instead has more values located to the left or right of the graph, the symmetrical shape becomes skewed, which creates a longer tail on one side or the other. A dataset is considered to be skewed in the direction of the longer tail. When data are positively skewed, there is a large number of values located on the left side or "low end" of the graph, causing a tail stretched out to the right. When data are negatively skewed, we see a larger number of values located in the high end of the graph and the tail stretched out toward the left-hand or low section of the graph.

Measures of dispersion

(Flowchart presenting the three common measures of dispersion: Range, Interquartile range and Standard deviation)

Now back to our measures of dispersion. Three commonly used measures of dispersion are the range, the interquartile range and the standard deviation. The next few slides look at each individually.

Range

The range is the difference between the largest and the smallest values in a dataset. It provides a quick and easy measure of the spread of these values. The range is best used with data that do not have extreme values, like our package delivery. If we know the package will be delivered sometime between 10 a.m. and noon, we feel safe making plans to do other things in the day. This kind of range is very useful information. However, if we are told the package will arrive between 8 a.m. and 8 p.m., well, how useful is this information really? How confident would you feel stepping out to run a quick errand at any point in the day and not missing your delivery? Probably not very.

Knowing that the range is the distance between the largest value and the smallest value, we will now put that into the form of an equation. The range is simply the highest value minus the lowest value. In this example, the lowest value is 1, while the highest value is 7. Therefore, the range is 7 minus 1, which is 6. Here, the range is an appropriate measure because the data points are clustered together.

Example

(Table presenting the exam scores of students. the columns, from left to right, are titled: # | Student | Exam score. The first line to the last contains the following: 1 | John | 80%; 2 | Amy | 85%; 3 | Tony | 85%; 4 | Moe | 86%; 5 | Ali | 87%; 6 | Sofia | 88%; 7 | Jose | 90%; 8 | Maria | 90%; 9 | Hugo | 92%; 10 | Louise | 94%; 11 | Sylvain | 95%; 12 | Jade | 95%)

Let's look at an example. Here we have exam scores from a group of 12 students. The highest exam score is 95%. To determine the range, we subtract the lowest exam score, which is 80%. This makes the range 15% which is quite narrow. An advantage of using the range as a measure of dispersion is that it is easy to calculate.

(Table presenting the exam scores of students. the columns, from left to right, are titled: # | Student | Exam score. The first line to the last contains the following: 1 | John | 10%; 2 | Amy | 85%; 3 | Tony | 85%; 4 | Moe | 86%; 5 | Ali | 87%; 6 | Sofia | 88%; 7 | Jose | 90%; 8 | Maria | 90%; 9 | Hugo | 92%; 10 | Louise | 94%; 11 | Sylvain | 95%; 12 | Jade | 95%)

Now, let's look at a similar example, but with one major difference. Here, we have exam scores from the same group of 12 students. The highest exam score is again 95%. To determine the range, we subtract the lowest exam score, which is now 10%. This makes the range 85%. This is a very wide spread. Upon closer inspection, we see one student, John, did quite poorly on the exam, while everyone else did very well. This makes John's score an outlier because 11 out of the 12 students scored between 85% and 95%. His single score is the main cause of this wide spread. And, because the range compares only the smallest and the largest values, we see here how the range can be a misleading measure of dispersion when there are outliers in the data.
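The two exam-score tables can be checked with a few lines of Python. This is only a minimal sketch: the scores are those shown in the tables, while the variable and function names are illustrative and not part of the video.

```python
# Exam scores (in %) from the first table; names are illustrative.
scores = [80, 85, 85, 86, 87, 88, 90, 90, 92, 94, 95, 95]

# Second table: identical, except John's score is 10 instead of 80.
scores_with_outlier = [10] + scores[1:]


def value_range(values):
    """Return the range: the largest value minus the smallest value."""
    return max(values) - min(values)


print(value_range(scores))               # 15
print(value_range(scores_with_outlier))  # 85
```

Changing a single score moves the range from 15 to 85, even though the other 11 scores are unchanged.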

Interquartile range

Similar to the range is the interquartile range. The interquartile range is also the distance between the largest and the smallest value, but only amongst the middle 50 percent of the whole distribution. This makes it slightly more stable than the full range because it does not consider the bottom and top 25% of the data, which helps insulate it against the impact of most outliers.

While it's true that the interquartile range is slightly more stable than the full range, it is important to know that when using the interquartile range as a measure of dispersion you will lose detail about what is happening at the ends of your distribution.

How to find the interquartile range?

(Text: Dataset= 3, 1, 8, 5, 3, 6, 4, 8, 6, 7)

To find the interquartile range, first, you need to order the data from least to greatest. After placing the 10 numbers that make up the dataset on this slide in a list from smallest to largest, and using the knowledge you obtained in the video on measures of central tendency, you would find the median of the entire dataset, which is the midpoint when you order all observations from smallest to largest. In this case, because there is an even number of observations, we add the two middle numbers, 5 and 6, and divide by two, which gives 5.5. By calculating the median, we are able to break the data into two halves. This allows us to move on to our next step.

Next, you would again calculate the median, but this time for both the lower and upper halves of the data, which would be 3 for the lower half and 7 for the upper half. Then, you subtract the lower median from the upper. The interquartile range is the difference between those two numbers, which in this case equals 4. It is important to note that this method works well for simple, short lists of values. But for more complicated datasets, the lower and upper medians, known as Q1 and Q3, can easily be obtained using software such as Excel.
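The steps just described can be sketched in Python as follows. The function name is illustrative. The video does not say how to split a dataset with an odd number of values, so leaving the middle value out of both halves is an assumption made here.

```python
from statistics import median

# Dataset shown on the slide.
data = [3, 1, 8, 5, 3, 6, 4, 8, 6, 7]


def interquartile_range(values):
    """IQR by the median-of-halves method described in the video."""
    ordered = sorted(values)  # step 1: order from least to greatest
    half = len(ordered) // 2
    # Assumption: with an odd number of values, the middle value is
    # left out of both halves (the video does not cover this case).
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return median(upper) - median(lower)


print(sorted(data))               # [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]
print(median(data))               # 5.5
print(interquartile_range(data))  # 4
```

Spreadsheets and statistics packages often compute quartiles by interpolating between values, so their Q1 and Q3 can differ slightly from the ones found by this hand method.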

Knowledge check

(Table presenting the time it takes for pizza to be delivered for each household. the columns, from left to right, are titled: Household | Minutes taken for pizza to be delivered. The first line to the last contains the following: 1 | 15; 2 | 20; 3 | 25; 4 | 30; 5 | 30; 6 | 35; 7 | 35; 8 | 40; 9 | 45; 10 | 50)

Your turn. Imagine you have ordered a pizza and they tell you it should take around 30 minutes to be delivered. Then imagine 9 other households have done the same thing. What in this case does "around 30 minutes" really mean? Here we have a table showing exactly how long each of the ten households had to wait to receive their pizza. To test your knowledge so far, try to calculate the range of time, in minutes, in which each household should expect their pizza to arrive. Then, calculate the interquartile range. Pause the video now and restart once you are ready to check your answers. Did you get 35 for the range and 15 for the interquartile range? If so, good for you! Now we can move on to our next measure of dispersion: standard deviation.
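For anyone who would rather check these answers with software, here is a minimal Python sketch using the delivery times from the table. The variable names are illustrative. The last calculation uses the quartile function in Python's standard library, which interpolates between values, to show that software can report a different interquartile range than the hand method.

```python
from statistics import median, quantiles

# Delivery times in minutes for the ten households in the table.
minutes = [15, 20, 25, 30, 30, 35, 35, 40, 45, 50]

ordered = sorted(minutes)
half = len(ordered) // 2  # even number of values, so two equal halves

time_range = ordered[-1] - ordered[0]
iqr_halves = median(ordered[half:]) - median(ordered[:half])

# Interpolated quartiles ("inclusive" method of statistics.quantiles).
q1, _, q3 = quantiles(minutes, n=4, method="inclusive")
iqr_interpolated = q3 - q1

print(time_range)        # 35
print(iqr_halves)        # 15
print(iqr_interpolated)  # 12.5
```

Both interquartile ranges are valid. They follow different conventions for locating Q1 and Q3, so it is worth knowing which one your software uses.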

Standard deviation

(Table presenting the exam scores of students. the columns, from left to right, are titled: # | Student | Exam score. The first line to the last contains the following: 1 | John | 80%; 2 | Amy | 85%; 3 | Tony | 85%; 4 | Moe | 86%; 5 | Ali | 87%; 6 | Sofia | 88%; 7 | Jose | 90%; 8 | Maria | 90%; 9 | Hugo | 92%; 10 | Louise | 94%; 11 | Sylvain | 95%; 12 | Jade | 95%)

So far, this video has explained how both the range and interquartile range can give you a good idea of how spread out the values in a dataset are. But they do not tell you how close the individual values are to the mean. This can be very important information to know. For example, let's go back to a class of students. When the teacher adds up everyone's score, she gets a total of 907. And when she divides that number by the number of scores, which is 12, she gets a mean score of 76%. 76% could be a good score, but is everyone performing at that level? In a class of 12, it is not that difficult to see that a few are struggling. But what about in a class of 200?

(2 separate graphs on the left and right representing a bell shaped Normal distribution with a low and high standard deviation, respectively)

The standard deviation tells you how spread out measurements for a group of values are from the average or mean. It is a number which can be quickly and easily calculated using software such as Microsoft Excel and is considered the most robust of the three different measures of dispersion. Therefore, it is the measure used most often when doing statistical analysis. A low standard deviation means that most of the numbers are close to the mean. So when it comes to a teacher determining how well each of her students is performing, a low standard deviation would tell her that most of her students are performing at around the same level. A high standard deviation would tell her that not everyone is performing at around the same level. So, if the class average were high, a high standard deviation would mean that some students are still struggling.
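To make the idea of a low and a high standard deviation concrete, here is a minimal Python sketch. The two sets of scores are hypothetical, invented so that both classes have the same mean of 76. The video does not give a formula, so the population form (dividing by the number of values) is an assumption. The sample form divides by one less than the number of values and gives a slightly larger result.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical exam scores (in %), invented for illustration only.
class_a = [74, 75, 76, 77, 78]  # everyone close to the mean
class_b = [56, 66, 76, 86, 96]  # same mean, much more spread out


def population_sd(values):
    """Square root of the average squared distance from the mean."""
    centre = mean(values)
    return sqrt(sum((v - centre) ** 2 for v in values) / len(values))


for name, scores in (("A", class_a), ("B", class_b)):
    # The last column uses the standard library as a cross-check.
    print(name, mean(scores), round(population_sd(scores), 2),
          round(pstdev(scores), 2))
# A 76 1.41 1.41
# B 76 14.14 14.14
```

Both classes have the same average, but the standard deviation shows that class B contains students performing well below and well above it.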

(2 separate graphs on the left and right representing a bell shaped Normal distribution with a low and high standard deviation with their means remaining at the center of the distribution, respectively)

But in situations where you just observe and record data, a high standard deviation isn't necessarily a bad thing. It just reflects a large amount of variability in the group that is being studied. For example, if you look at all salaries within any large company, including everyone from the co-op students to the CEO, the standard deviation may be very high. On the other hand, if you narrow the group down by looking only at the co-op students, the standard deviation is lower, because the individuals within this group have salaries that are more similar. The second dataset isn't better, it simply has less variability.

Standard deviation and the Normal distribution

The Normal distribution is one example of a distribution that could help you better understand the concept of standard deviation. In the context of data, a distribution is a mathematical model that mimics how the data points are distributed or dispersed. We often visualize the Normal distribution as a curve shaped like a hilltop or bell. It represents the presence of small and big data points on the left- and right-hand sides of the curve, respectively, while most of the data points are somewhere in the center, where the summit is found. In the Normal distribution, the data points fall in a symmetrical pattern that looks like the curve you see on this slide, which is called a bell curve.

Normal distribution

The Normal distribution is symmetrical, which causes the mean, median and mode to be the same number. These are represented by the line down the center of the bell curve.

(Graph representing a Normal distribution with the mean = median = mode at the summit of the distribution)

For the Normal distribution, the measure of dispersion we call standard deviation, or SD on the slide, has some pretty neat properties. It tells us where to expect the data points to be in the distribution. Sampling theory and the Normal distribution tell us that approximately 68% of the data values in the whole population will fall within the mean +/- 1 standard deviation. Similarly, approximately 95% of the data values will fall within the mean +/- 2 times the standard deviation, and approximately 99.7% of the data values will fall within the mean +/- 3 times the standard deviation.
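These three percentages can be reproduced with Python's standard library. This is a minimal sketch with an illustrative function name. It uses a Normal distribution with mean 0 and standard deviation 1, but the same shares hold for any mean and standard deviation.

```python
from statistics import NormalDist

# Normal distribution with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of values within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```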

Recap of key points

Measures of dispersion provide a quantitative indication of the degree to which data values are spread out or clustered together. In this video, we looked at three common measures of dispersion: range, interquartile range and standard deviation. And we learned that sometimes data can be bell shaped, with most values clustered in the middle, which is often called a Normal distribution.

(The Canada Wordmark appears.)
