Types of Data: Understanding and exploring data - Transcript
(The Statistics Canada symbol and Canada wordmark appear on screen with the title: "Data types")
Types of data: Understanding and exploring data
It's important to define the different types of data and understand them in order to choose the appropriate method for analyzing data and presenting the results.
Learning goals
In this video, you will learn about data and statistical information, and explore the different types of data. After completing this video, you will be able to identify categorical and quantitative data, nominal and ordinal data, and discrete and continuous data. This video is intended for learners who want to acquire a basic understanding of data concepts and types.
Steps of a data journey
(Diagram of the Steps of the data journey: Step 1 - define, find, gather; Step 2 - explore, clean, describe; Step 3 - analyze, model; Step 4 - tell the story. The data journey is supported by a foundation of stewardship, metadata, standards and quality.)
This diagram is a visual representation of the data journey, from collecting the data, to cleaning, exploring, describing and understanding the data, to analyzing the data, and lastly to communicating to others the story the data tell.
Step 2: Explore, clean and describe
(Diagram of the Steps of the data journey with an emphasis on Step 2 - explore, clean, describe.)
Exploring the different types of data is part of the explore, clean and describe step of the data journey. Understanding the various data types will help with the analyze and model steps.
Difference between data and statistical information: Data
Data are the raw material for making information. They can be, for example, in the form of numbers, texts, observations, or recordings. Data can be structured, meaning that they are organized into predefined categories or concepts such as lists, tables, datasets, databases or spreadsheets.
Data can also be unstructured, which means they're not organized. Unstructured data need to be processed or parsed to become structured before any further work can be done on them.
A paragraph of text is an example of unstructured data, since the main ideas have to be extracted or the phrases have to be parsed into smaller segments to use the text as data.
Satellite images are another example of unstructured data. The images have to be interpreted and encoded, for example by type of crop or type of building.
Difference between data and statistical information: Statistical information
When we apply statistical methods to data, we produce statistical information such as means, totals, ratios, percentiles, frequency distributions, and parameter estimates. Data have meaning and value, but that meaning and value can be difficult to identify. Statistical methods are a way of summarizing the data so that the meaning becomes clear.
Turning data into statistical information
Statistical methods are applied to data to derive meaning or find relationships. The end product is statistical information which is interpreted and used to increase knowledge about the topic in question.
Data types
(Image of a tree diagram of the different types of data. The root of the tree diagram is "data", which branches out into "categorical" and "quantitative" data. Categorical data branches out into "nominal" and "ordinal" categorical data. Quantitative data branches out into "discrete" and "continuous" quantitative data.)
Data can be divided into 2 main categories: categorical and quantitative. Categorical data can be further subdivided into nominal and ordinal data. Quantitative data can be discrete or continuous and are also known as numerical data. These concepts are explored further in the next few slides.
Categorical data
Categorical data represent characteristics such as gender, languages spoken, types of diseases or clothing sizes.
For example, the languages spoken by a particular person could be French, English, German and Spanish. The categories are referred to as classes or classifications. Every possible value for a characteristic should be in one and only one category.
Categorical data: Nominal
When the categories have no inherent order, the data are called nominal. The data values in this situation are labels.
Examples of categories are types of diseases or languages spoken. Nominal data can be analyzed and summarized using frequencies, proportions, percentages, cross-tabulations, and the mode, and they can be visualized using pie charts and bar graphs.
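As a rough illustration of these summaries, here is a minimal Python sketch. The list of languages is made up for the example and the function name `summarize_nominal` is hypothetical; it simply counts labels, converts the counts to proportions and picks the most frequent label as the mode.

```python
from collections import Counter

# Hypothetical nominal data: the first language reported by eight people.
languages = ["French", "English", "English", "German",
             "French", "English", "Spanish", "English"]

def summarize_nominal(values):
    """Return frequencies, proportions and the mode for a list of labels."""
    frequencies = Counter(values)          # how often each label occurs
    total = len(values)
    proportions = {label: count / total for label, count in frequencies.items()}
    mode = frequencies.most_common(1)[0][0]  # the most frequent label
    return frequencies, proportions, mode

frequencies, proportions, mode = summarize_nominal(languages)

for label, count in frequencies.items():
    print(f"{label}: {count} ({proportions[label]:.0%})")
print("Mode:", mode)
```

Notice that nothing here depends on the order of the labels, which is what we would expect for nominal data.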
Categorical data: Ordinal
Ordinal values represent categorical data that can be ordered. Ordinal data are very similar to nominal data, but as the name implies, order is important. The categories follow some logical order; for example, size can be categorized as small, medium and large. Like nominal data, ordinal data can be analyzed, summarized and visualized. However, because the categories are ordered, ordinal data can also be described using percentiles and medians, as well as modes. If the ordinal data are numeric, interquartile ranges can also be used.
For example, you could look at the interquartile range of exam scores that are in percentages and arranged from lowest to highest, but it would not make any sense to try to find the interquartile range of clothing sizes that go from extra small to extra large. For an example of when to use interquartile range, check out the video on exploring measures of dispersion.
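To make the contrast concrete, here is a small Python sketch using made-up values. The exam scores and clothing sizes are hypothetical. The quartiles come from `statistics.quantiles` with its default "exclusive" method; other quartile conventions can give slightly different numbers. For the clothing sizes, the sketch finds the median category by ranking the sizes, but it does not attempt a range, since the distance between "small" and "large" is not a number.

```python
import statistics

# Hypothetical exam scores in percentages (numeric ordinal data).
exam_scores = [74, 52, 93, 61, 85, 68, 79]

# quantiles() sorts the data and returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(exam_scores, n=4)
iqr = q3 - q1
print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)

# Hypothetical clothing sizes (non-numeric ordinal data).
SIZE_ORDER = ["extra small", "small", "medium", "large", "extra large"]
observed_sizes = ["medium", "small", "extra large", "medium",
                  "large", "small", "medium"]

# Replace each size by its position in the order, then take the middle one.
ranks = [SIZE_ORDER.index(size) for size in observed_sizes]
median_size = SIZE_ORDER[statistics.median_low(ranks)]
print("Median size:", median_size)
```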
Quantitative data
Quantitative data, also called numerical data, can be either discrete or continuous. When the data values are distinct and separate, and they can take on certain values only, they're called discrete data. Discrete data can only be counted, not measured.
For example, the number of sheep on a farm is discrete. Continuous data, on the other hand, represent measurements, not counts. Continuous data can take on an infinite number of values, but for practical reasons continuous data are measured using a discrete scale. Distance is an example of continuous data. It is continuous in that you could keep adding or removing small amounts and the distance would change. However, centimeters or kilometers are used to measure distance on a discrete scale.
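Here is a tiny Python sketch of that idea, with made-up distances. The function name `to_whole_centimeters` is hypothetical; it records a distance given in meters to the nearest whole centimeter, so two slightly different true distances end up with the same recorded value.

```python
def to_whole_centimeters(distance_m):
    """Record a distance in meters as a whole number of centimeters."""
    return round(distance_m * 100)

# Two hypothetical true distances that differ by a tiny amount.
first = to_whole_centimeters(1.23456)
second = to_whole_centimeters(1.23461)

print(first, second)    # both are recorded as the same number of centimeters
print(first == second)
```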
Example: How old are the people in a community?
Let's look at an example of working with different types of data. Let's say we want to know how old the people in a community are so that we can plan appropriate services and activities for them. In our example we have the birth dates of the people in a particular community. Because time can be divided in an infinite number of ways, for example, into seconds or milliseconds, it is a continuous variable. However, for practical reasons, a hospital usually records the year, month, day, hour and minute of birth. For administrative purposes, we usually just report the year, month and day of birth, which means we're using a discrete representation of a continuous variable. To determine someone's age from their date of birth, we calculate the time between the current date and their date of birth. For convenience's sake, let's round their age to the nearest year, which is also a discrete value.
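The age calculation described here could be sketched in Python as follows. The birth date and reference date are made up, the function names are hypothetical, and dividing by 365.25 days is an approximation chosen for this sketch to account for leap years. A fixed reference date is used instead of today's date so that the result does not change from day to day.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # approximation that averages in leap years

def exact_age_in_years(birth_date, on_date):
    """Time elapsed between two dates, expressed as a fraction of years."""
    return (on_date - birth_date).days / DAYS_PER_YEAR

def rounded_age_in_years(birth_date, on_date):
    """Age rounded to the nearest whole year (a discrete value)."""
    return round(exact_age_in_years(birth_date, on_date))

# Hypothetical person and reference date.
birth_date = date(1990, 3, 15)
reference_date = date(2024, 1, 10)

print(exact_age_in_years(birth_date, reference_date))
print(rounded_age_in_years(birth_date, reference_date))
```

Note that rounding to the nearest year is not the same as counting completed birthdays: this hypothetical person has had 33 birthdays but is closer to 34 years old.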
If our community is very small, we could look at all the ages on a list and be able to interpret them. However, if there are a lot of people, it would be very hard to look at a list of ages and say anything meaningful about them, especially if they were in no particular order. When converting age data into statistical information, it's common practice to group the ages into categories. Let's use ranges of 10 years for our example. Now the data are ordinal because there is a particular order to the age categories.
Example: How old are the people in a community?
(Image of a table in which the left column, called "Age category", lists the different age groups and the right column gives the "Count of people". The following is the table content:
- 0 to 10 years: 5
- 11 to 20 years: 12
- 21 to 30 years: 25
- 31 to 40 years: 30
- 41 to 50 years: 23
- 51 to 60 years: 14
- 61 to 70 years: 3
- 71 to 80 years: 0
- 81 years and older: 0)
Let's use the same example. Now that we have age categories, we want to know how many people are in each category. The statistical method we apply to the ordinal data produces a frequency distribution which is shown in the table on the right. Now it becomes quite clear that the community is relatively young. This table is statistical information that can be used by community planners and organizers to plan services and activities that are age appropriate for the community members. It's much easier to interpret the statistical information in this table than it would be to interpret a long list of birth dates.
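For learners who like to see the steps spelled out, here is a minimal Python sketch of turning a list of ages into a frequency distribution. The ages are made up and are not the community shown in the table. The category labels follow the table, and the function names `age_category` and `frequency_distribution` are hypothetical.

```python
from collections import Counter

# The age categories from the table, in their natural order.
AGE_CATEGORIES = (
    ["0 to 10 years"]
    + [f"{low} to {low + 9} years" for low in range(11, 80, 10)]
    + ["81 years and older"]
)

def age_category(age):
    """Map an age in whole years to its 10-year category label."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age <= 10:
        return "0 to 10 years"
    if age >= 81:
        return "81 years and older"
    low = (age - 1) // 10 * 10 + 1  # 11, 21, 31, ... 71
    return f"{low} to {low + 9} years"

def frequency_distribution(ages):
    """Count the people in each category, keeping empty categories as 0."""
    counts = Counter(age_category(age) for age in ages)
    return {category: counts[category] for category in AGE_CATEGORIES}

# Hypothetical ages, rounded to the nearest year.
ages = [3, 15, 27, 34, 38, 45, 29, 52, 8, 33]
distribution = frequency_distribution(ages)

for category, count in distribution.items():
    print(f"{category}: {count}")
```

Keeping the categories in a fixed order, and reporting the empty ones as zero, is what lets the table be read from youngest to oldest.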
Quantitative data: Be careful with Zero
There's one very important value to be careful with in quantitative data: the value of 0. Sometimes 0 means there is none of something. For example, zero apples means there are no apples. Sometimes 0 does mean something. For example, zero degrees Celsius means it's cold outside, not that there is no temperature. In some cases, negative values are valid. For example, if I have -$5, it means I owe $5. However, sometimes negative values are not valid. For example, there can't be minus five sheep on a farm. Be mindful of the meaning of 0 when working with quantitative data.
Quantitative data: Basic statistics
There are many basic statistics that can be used with quantitative data. In fact, all of the basic statistics shown on this slide can be used in a meaningful way with quantitative data.
(Text on screen: Basic statistics include counts, ranks, means, totals and variances. Other basic statistics include: proportions, frequencies and cross-tabulations; mode, median, ranks and percentiles; means, totals and variances.)
Data types
Remember that data can be categorical or quantitative. Categorical data can be nominal, which are labels only, or ordinal, which have a particular order. Quantitative data can be discrete, which are things we count, or continuous, which are things we measure. The next slide provides examples of different types of data and you will have to determine the data type: nominal, ordinal, discrete or continuous.
Guided practice: What is the data type?
Pause the video here and take the time you need to determine whether each example is nominal, ordinal, discrete or continuous. Continue to play the video to see the answers.
(Image on screen where 4 different examples need to be answered: 1) Names of instruments in an orchestra; 2) Temperature outside right now; 3) Number of pounds gained over the holidays; 4) Rank in a household based on age.)
Do you agree with our suggestions?
The names of instruments in an orchestra are categorical, nominal data because they can be in any order, although violin players would probably say they should come first.
Temperature is quantitative continuous data because it can be measured in small increments. We use degrees Celsius for convenience.
Number 3 is a trick question. Weight is measured in pounds or kilograms, which are continuous, but the question asked for the number of pounds gained, which is a count, meaning that these are quantitative discrete data.
Lastly, a person's rank in a household by age is categorical ordinal data because rank by age means putting household members in order from youngest to oldest. How did you do?
Summary of key points
Data can be in the form of numbers, texts, observations or recordings. Statistical methods are applied to data to produce statistical information. Categorical data can be nominal, which are categories, or ordinal, which are categories in a particular order. Numerical or quantitative data can be continuous, in which case we need to take measurements, or discrete, in which case we need to count. We also learned to be careful with the value of zero, which could mean different things depending on the nature of the data.
(The Canada Wordmark appears.)