Statistics 101: Correlation and causality

Catalogue number: 892000062021002

Release date: May 3, 2021 Updated: December 1, 2021

In this video, you will learn how to prove the existence of a relationship, or lack thereof, between two variables. This is a very important part of data analysis.

By the end of this video, you will learn the answers to the following questions:

  • What is correlation?
  • How can you measure, quantify, or interpret correlation when analyzing your data?
  • What is causality?
  • And finally, what are the differences between the two?

Data journey step: Analyze, model
Data competency:
  • Data analysis
  • Data driven decision making
  • Data interpretation
  • Data visualisation
Audience: Basic
Suggested prerequisites: N/A
Length: 17:27
Cost: Free

Watch the video

Statistics 101: Correlation and causality - Transcript

(The Statistics Canada symbol and Canada wordmark appear on screen with the title: "Statistics 101: Correlation and causality")

Statistics 101: Correlation and causality

This video is intended for viewers who wish to gain a basic understanding of correlation and causality. As a prerequisite, before beginning this video, we highly recommend having already completed our videos titled "What is Data" and "Types of Data".

Learning goals

By the end of this video, you will learn the answers to the following questions: What is correlation? How can you measure, quantify or interpret correlation when analyzing your data? What is causality? And finally, what are the differences between the two?

Steps of a data journey

(Diagram of the Steps of the data journey: Step 1 - define, find, gather; Step 2 - explore, clean, describe; Step 3 - analyze, model; Step 4 - tell the story. The data journey is supported by a foundation of stewardship, metadata, standards and quality.)

This diagram is a visual representation of the data journey, from collecting the data to cleaning, exploring, describing and understanding the data to analyzing the data, and lastly, to communicating with others the story the data tell.

Steps 3 and 4: Analyze, model and tell the story

(Diagram of the Steps of the data journey with an emphasis on Step 3 - analyze, model and Step 4 - tell the story)

Correlation and causality fall under the final two steps of the data journey: analysis and modeling, and telling the story.

Patterns and relationships

(Image combining a hockey stick and a toilet that equals a Stanley Cup with a question mark)

Have you ever noticed the way the human mind really likes patterns? So much so, in fact, that the mind will often create patterns. When two variables appear to be closely associated, it can seem that one depends on the other. For example, Ottawa Senators hockey player Bruce Gardiner was so superstitious that he was convinced the only way he could break the occasional slump in his performance was to dunk his hockey stick in a toilet bowl. Superstitions like this are a great example of how the brain likes to perceive relationships between two things, even when in reality, no such relationships exist. In this video, you will learn how to prove the existence of a relationship, or lack thereof, between two variables. This is a very important part of data analysis.

Correlation in data analysis

In the world of data, correlation refers to the existence of a relationship between two variables. Correlation plays a big part in data analysis. When studying a potential relationship between two variables, it is important to ask yourself the following questions: Does a relationship exist between the two variables? If so, is the relationship positive or negative? What is the strength of this relationship? Is it a strong correlation, a weak correlation, or somewhere in the middle? Correlation can exist between all types of variables, but in statistics, correlation can only be calculated for numeric variables.

What is correlation?

(Table containing data on the change in water temperature in a kettle over time)

Let's start by talking about correlation in everyday life. When we say two or more things are correlated, this means there is a mutual relationship between them. This relationship can be either positive or negative. In a positive correlation, the values of the two related items move in the same direction. Take a kettle full of water, for example: the longer the kettle is on, the hotter the temperature of the water inside the kettle will get. In a negative correlation, the values move in opposite directions. Meaning as one variable increases, the other decreases, and vice versa. For example, imagine you've taken a freshly brewed cup of tea outside on a winter day. The more time you spend outside, the colder your tea will become. In this case, as the time variable increases, the temperature decreases.

Visualizing our data

(Scatter plot visualizing data from the previous slide on water temperature in a kettle over time)

Using a scatter plot is an effective way to show the relationship between two different variables. Here, we used Microsoft Excel to plot the seven points in the table from the previous slide. You can do the same in many other spreadsheet applications. The number of seconds the water is in the kettle is plotted along the horizontal x-axis, and the water temperature is plotted along the vertical y-axis. We can clearly see here that, as the x values increase, so do the y values. This indicates that we have a strong positive correlation.

(Scatter plot visualizing the data on water temperature in a kettle over time, with a trend line running through the data points)

This positive correlation is more clearly seen with the addition of a linear trend line. A trend line is a straight line we draw over the data which gets as close as possible to all of the data points. This can be automatically generated using your choice of software. As shown in the scatter plot, it provides an even clearer visualization, which allows us to see how strongly our variables are correlated. In this example, the line is very obviously trending upwards, which represents a positive correlation. If the line were trending downwards, it would represent a negative correlation.
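
As a rough illustration of how software might generate such a trend line, the sketch below fits a straight line to the seven kettle readings from this video. It assumes that "as close as possible" means ordinary least squares, which is a common choice for linear trend lines. The function name fit_trend_line and the variable names are made up for this example.

```python
# Kettle readings from the table in this video.
seconds = [30, 60, 90, 120, 150, 180, 210]
temps_c = [20, 35, 50, 65, 80, 90, 100]


def fit_trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_trend_line(seconds, temps_c)
direction = "upwards" if slope > 0 else "downwards" if slope < 0 else "flat"
print(f"temperature = {slope:.3f} * seconds + {intercept:.2f} (trending {direction})")
```

A positive slope corresponds to a line trending upwards, and a negative slope to a line trending downwards.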

Measuring correlation

For numeric variables, correlation is measured by a correlation coefficient. Where a scatter plot or trend line can help you visualize your data, a correlation coefficient is a measure of the strength of the linear relationship between two variables and is represented by "r". The value of r is always between a minimum of -1 and a maximum of 1. The correlation coefficient, or r, can be calculated easily in Excel by using the PEARSON function. Similar functions are available in many spreadsheet and statistical applications. Use the one you know and trust!

When r is equal to 1, we are saying that two variables have a perfect positive relationship, meaning the two variables always increase or decrease together. When r is equal to -1, the variables have a perfect negative relationship. This would mean that one variable always increases while the other one decreases. Finally, when r is equal to zero, there is no linear relationship between the two variables.
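
The video does not show the formula behind r. For reference, the Pearson correlation coefficient is usually defined as below, where the symbols are notation assumed for this note: x_i and y_i are the paired observations, x-bar and y-bar are their means, and n is the number of pairs.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two variables tend to be above or below their means together, and negative when one tends to be above its mean while the other is below. The denominator rescales the result so that it always falls between -1 and 1.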

Interpreting the correlation coefficient

(Table containing information on the interpretation of the value of the correlation coefficient. The columns, from left to right, are named as follows: Value of r | Correlation | Direction | Strength. From the first to the last line: 1 | Yes | Positive | Perfect; 0.99 to 0.60 | Yes | Positive | Strong or very strong; 0.59 to 0.20 | Yes | Positive | Low to moderate; 0.19 to -0.19 | No | - | -; -0.20 to -0.59 | Yes | Negative | Low to moderate; -0.60 to -0.99 | Yes | Negative | Strong or very strong; -1 | Yes | Negative | Perfect)

The correlation coefficient, or r, provides information about the existence, direction and strength of a relationship between two variables. In reality, an r value is rarely equal to exactly -1 or 1. This table provides general guidelines about the strength of a relationship between two variables. If an r value is -0.6 or lower, we have a strong negative relationship. Likewise, if its value is 0.6 or higher, we have a strong positive relationship. If an r value is between -0.59 and -0.2, we have a weak negative relationship. Likewise, if its value is between 0.2 and 0.59, we have a weak positive relationship. Finally, if the correlation coefficient is between -0.19 and 0.19, we do not have enough evidence to say that the two variables are correlated.
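
The guideline table can be expressed as a small lookup, sketched below. The function name interpret_r is hypothetical. The sketch also assumes that values falling in the gaps between the table's bands (for example 0.195) belong to the weaker band, which the table itself does not specify.

```python
def interpret_r(r):
    """Map a correlation coefficient to the guideline labels from the table."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)  # strength depends on the size of r, not its sign
    if size < 0.2:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    if size == 1:
        strength = "perfect"
    elif size >= 0.6:
        strength = "strong or very strong"
    else:
        strength = "low to moderate"
    return f"{strength} {direction} correlation"


for value in (0.997, 0.3, 0.03, -0.8):
    print(value, "->", interpret_r(value))
```

These cut-offs are general guidelines only, as the video notes, so the labels should be read as rough descriptions rather than firm conclusions.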

Example 1

(Table containing data on the change in water temperature in a kettle over time. The columns, from left to right, are named as follows: Time in the kettle (seconds) | Water temperature (Celsius). From the first line to the last: 30 sec | 20 C; 60 sec | 35 C; 90 sec | 50 C; 120 sec | 65 C; 150 sec | 80 C; 180 sec | 90 C; 210 sec | 100 C)

Let's go back to our example of water boiling in a kettle. This data table provides the temperature of water in a kettle at seven equally spaced moments in time. After the first 30 seconds, the water is at a temperature of 20 degrees Celsius. At the final moment, the water has reached its boiling point of 100 degrees Celsius. We can show there is a positive correlation between time and temperature through both the correlation coefficient and data visualization.

Calculating the correlation coefficient

(Table containing the same data as the previous slide)

(Scatter plot with a trend line viewing data from the same table)

(Text: Use PEARSON function --> r = 0.997)

As we mentioned earlier, the correlation coefficient, or r, can be calculated easily by using the PEARSON function. The values in the first column represent the first variable: number of seconds spent in the kettle. The values in the second column represent the water temperature at each point in time. Here, we see that the r value turns out to be greater than 0.99. Remember, an r value of 1 would have indicated a perfect positive correlation. This means that our r value indicates a positive correlation that is close to perfect. In other words, there is a strong positive correlation between time and temperature, which is visible on the scatter plot and trend line.
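
If you would rather check this value outside a spreadsheet, the sketch below computes the Pearson correlation coefficient for the same seven kettle readings in plain Python. The helper name pearson_r is an assumption made for this example, not a built-in function.

```python
from math import sqrt

# Kettle readings from the table in this video.
seconds = [30, 60, 90, 120, 150, 180, 210]
temps_c = [20, 35, 50, 65, 80, 90, 100]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(seconds, temps_c)
print(f"r = {r:.3f}")  # prints "r = 0.997"
```

The result agrees with the value shown on the slide.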

Example 2

(Scatter plot representing the rate of Cybercrime per 100,000 population as a function of the Growth Rate (%) in 2017-2018. The trend line rises slightly)

In reality, the relationship between two variables is unlikely to be as obvious as the link between the amount of time in a kettle and water temperature. Let's look at a real-life example that compares population growth with cybercrime in 2018. What does the scatter plot tell us? First, we see that, as the population growth rate values on the x-axis increase, so do the cybercrime rate values on the y-axis. This implies that we should have a positive correlation. At the same time, we notice that the data points are well spread out. It's hard to draw a straight line through these data points while keeping each data point close to the line. This would lead us to believe that there is not a strong correlation. To be sure, we use software to calculate our correlation coefficient and we see that r equals 0.3, which signifies a weak positive correlation. Therefore, after visualizing the data and determining the correlation coefficient, we can conclude that in 2018, there was a weak positive correlation between population growth and cybercrime.

Knowledge check

(Scatter plot in which the data points decrease in value as the x-axis values increase)

Let's take a break to test your knowledge about correlation. Take a look at the scatter plot on the right-hand side of the slide. What is it telling us? Is there A) a positive correlation between these two variables? B) A negative correlation? Or C) no correlation at all? The answer is B! This scatter plot is visualizing a strong negative correlation between these two variables.

Next, imagine that you are analyzing three pairs of variables. The correlation coefficients for these three pairs are a) -0.8, b) 0.03 and c) 0.42. Which r value indicates the strongest relationship? The answer is a), r equals -0.8. This indicates a strong negative relationship. The weakest of these three options is b), r equals 0.03, which indicates no relationship between the variables.
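
The reasoning behind this answer is that strength depends on how far r is from zero, not on its sign. A minimal sketch of that comparison follows, with the labels a, b and c assigned in the order the values are listed:

```python
# The three correlation coefficients from the knowledge check.
coefficients = {"a": -0.8, "b": 0.03, "c": 0.42}

# Compare absolute values: the sign gives direction, the size gives strength.
strongest = max(coefficients, key=lambda label: abs(coefficients[label]))
weakest = min(coefficients, key=lambda label: abs(coefficients[label]))

print(f"strongest: {strongest}) r = {coefficients[strongest]}")
print(f"weakest: {weakest}) r = {coefficients[weakest]}")
```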

Correlation =/= Causality

Now, let's move on to causality. In fact, if there is one key message you take away from this video, let it be this: Correlation and causality, though sometimes used incorrectly as interchangeable concepts, are anything but. So far, we've learned that the correlation coefficient tells us how strongly a pair of variables are linearly related and change together. However, it does NOT tell us the reason why or how. Causality does. Causality is when there is a real world explanation for WHY this is logically happening. You may have also heard this referred to as "cause and effect".

Causality

Causality is a relationship between two events, or variables, in which one event or process causes an effect on the other event or process. For example, research tells us that there is a positive correlation between ice cream sales and sunburns. Meaning, as ice cream sales increase, so do instances of sunburns. But this doesn't mean that buying an ice cream cone causes a sunburn, now does it? Of course not. Causality adds real world context and meaning to the correlation.

(Series of images showing that ice cream sales and the number of sunburns are correlated but that each is caused by the sun)

Causality refers to a relationship between two events, or variables, which has a valid explanation. Unlike correlation, with causality, this valid explanation turns possibility into actuality. To say something causes an effect on another variable means the result of one event is directly influenced by the other. Either the cause precedes the effect, or the effect changes when the cause changes. For example, dry, hot and sunny weather will cause people to buy more ice cream than in cold weather. Dry, hot and sunny weather will also cause an increase in sunburns when compared to colder, rainy weather. This can make it appear that buying ice cream causes sunburns, but this is just not true. When it comes to hot, sunny weather, ice cream sales and sunburns, all three are correlated, but the only causal relationships in this scenario are between the weather and ice cream sales and the weather and sunburnt people.

Beware the confirmation bias!

Similar to how the human mind loves to see patterns, it also tends to more easily accept evidence that agrees with existing beliefs, rather than that which refutes them. This is called confirmation bias. So, when analyzing your data, it is very important to scrutinize conclusions you like just as rigorously as ones you don't, in order to avoid claiming a causal relationship exists between two things when in fact, it does not.

How to determine a causal relationship

There is no simple statistical test for a causal relationship; statistical confirmation of causality typically requires advanced modeling techniques. However, when the following four criteria are met, there is a greater chance of causality between your two variables. First, just as with correlation, the two variables must vary together, meaning a positive or negative correlation has been shown to exist. Next, that relationship must be plausible. And really, what this is saying is that the relationship needs to make sense. Third, the cause must precede the effect in time. Meaning, the cause must take place first, in order for the effect to occur. And finally, the relationship must not be due to a third variable. A relationship that appears to be between two variables but could also be explained by a third is referred to as a spurious relationship. We previously saw this in our example referring to increased ice cream sales being correlated with increased instances of sunburns, but really, both increases were the effect of a third variable, the sun.
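
The spurious relationship in the ice cream example can be sketched with simulated data. Everything below is invented for illustration: the numbers, the variable names and the helper pearson_r are assumptions, not data from the video. The sketch also uses a partial correlation to hold the third variable constant, which is one standard technique the video does not cover.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical daily sunshine drives BOTH outcomes; neither outcome affects the other.
sun_hours = [rng.uniform(0, 12) for _ in range(1000)]
ice_cream_sales = [10 * s + rng.gauss(0, 10) for s in sun_hours]
sunburns = [2 * s + rng.gauss(0, 2) for s in sun_hours]

r_ice_burn = pearson_r(ice_cream_sales, sunburns)
r_ice_sun = pearson_r(ice_cream_sales, sun_hours)
r_burn_sun = pearson_r(sunburns, sun_hours)

# Partial correlation of sales and sunburns, holding sunshine constant.
partial = (r_ice_burn - r_ice_sun * r_burn_sun) / sqrt(
    (1 - r_ice_sun ** 2) * (1 - r_burn_sun ** 2)
)

print(f"ice cream vs sunburns: r = {r_ice_burn:.2f}")
print(f"same pair, controlling for sunshine: r = {partial:.2f}")
```

In this simulation, ice cream sales and sunburns are strongly correlated even though neither causes the other, and the correlation should fall close to zero once sunshine is accounted for. A result like this is consistent with a spurious relationship, but it does not by itself prove or rule out causality.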

Knowledge check: Is this relationship causal?

(Scatter plot representing the hours before the person eats again based on the weight of the cake consumed (kg). The trend line is rising)

Now let's take a look at the scatter plot and try to determine whether or not there is a causal relationship between the amount of cake a person eats and how full they feel, which we measure by the amount of time that passes before the person eats again. In this example, we will assume that all respondents are similar except for the amount of cake they have consumed. Think about the four criteria we just went through: do the two variables vary together? Is the relationship plausible? Does the cause precede the effect in time? And is the relationship due to a third variable?

(Text: Yes - r = 0.918; Yes - digestion processes; Yes - cake is eaten first; Not likely - if controlled for other food eaten)

After addressing the four criteria we established to help determine whether the relationship is causal, we have determined that first, the variables do indeed vary together. Yes, there is a plausible relationship. Yes, the cake is eaten first and that's what causes the effect of fullness. And, in this instance, it is unlikely that the feeling of fullness has been caused by a third variable, since we have controlled for all non-cake-based foods.

The importance of knowing the difference

(Scatter plot representing the Grade Point Average (GPA) as a function of the years of music lessons. The trend line appears to be rising)

A common problem occurs when two correlated trends are presented as one phenomenon causing the other. For example, this scatter plot shows a relationship between taking music lessons and achieving a high grade point average, or GPA. The graph seems to indicate that there is a correlation between the years of music lessons and average GPA. But do music lessons directly impact or cause an increase in GPA? Social research shows these high performing students are also more likely to have grown up in an environment with a large emphasis on education and the resources needed to succeed academically. It is therefore possible that these students would have higher academic achievements with or without music lessons, and that their socio-economic status would actually explain the relationship. So while music lessons and academic achievements are correlated, there are other factors that should prevent us from establishing causality.

Recap of key points

Here is a review of the key points we've covered in this video. First, correlation refers to the relationship between two variables. It is important to look for the existence, direction, and strength of the relationship. Then, we learned how to assess the strength and direction of a correlation by calculating the correlation coefficient, r. Data visualization also provides us with a quick way to identify correlations. Next, we explained how causality refers to a relationship between two events or variables, which has a valid explanation. And finally, it is important to remember that correlation does not always imply causation. Even if two variables are strongly correlated, it could just be a coincidence.

(The Canada Wordmark appears.)
