Why do we conduct this survey?

The purpose of this survey is to produce monthly statistics on production and stocks of various dairy products and sales of fluid milk and cream from dairy processors in Canada.

The information is grouped with other dairy statistics to provide valuable information for milk marketing agencies, farmers and processor associations, and government departments.

Your information may also be used by Statistics Canada for other statistical and research purposes.

Your participation in this survey is required under the authority of the Statistics Act.

Other important information

Authorization to collect this information

Data are collected under the authority of the Statistics Act, Revised Statutes of Canada, 1985, Chapter S-19.

Confidentiality

By law, Statistics Canada is prohibited from releasing any information it collects that could identify any person, business, or organization, unless consent has been given by the respondent, or as permitted by the Statistics Act. Statistics Canada will use the information from this survey for statistical purposes only.

Record linkages

To enhance the data from this survey and to reduce respondent burden, Statistics Canada may combine it with information from other surveys or from administrative sources.

Data-sharing agreements

To reduce respondent burden, Statistics Canada has entered into data-sharing agreements with provincial and territorial statistical agencies and other government organizations, which have agreed to keep the data confidential and use them only for statistical purposes. Statistics Canada will only share data from this survey with those organizations that have demonstrated a requirement to use the data.

Section 11 of the Statistics Act provides for the sharing of information with provincial and territorial statistical agencies that meet certain conditions. These agencies must have the legislative authority to collect the same information, on a mandatory basis, and the legislation must provide substantially the same provisions for confidentiality and penalties for disclosure of confidential information as the Statistics Act. Because these agencies have the legal authority to compel businesses to provide the same information, consent is not requested and businesses may not object to the sharing of the data.

For this survey, there are Section 11 agreements with the provincial statistical agencies of Newfoundland and Labrador, Nova Scotia, New Brunswick, Quebec, Ontario, Manitoba, Saskatchewan, Alberta and British Columbia. The shared data will be limited to information pertaining to business establishments located within the jurisdiction of the respective province.

Section 12 of the Statistics Act provides for the sharing of information with federal, provincial or territorial government organizations. Under Section 12, you may refuse to share your information with any of these organizations by writing a letter of objection to the Chief Statistician, specifying the organizations with which you do not want Statistics Canada to share your data and mailing it to the following address:

Chief Statistician of Canada
Statistics Canada
Attention of Director, Enterprise Statistics Division
150 Tunney's Pasture Driveway
Ottawa, Ontario
K1A 0T6

You may also contact us by email at statcan.esdhelpdesk-dsebureaudedepannage.statcan@statcan.gc.ca or by fax at 613-951-6583.

For this survey, there is a Section 12 agreement with the Prince Edward Island statistical agency.

For agreements with provincial and territorial government organizations, the shared data will be limited to information pertaining to business establishments located within the jurisdiction of the respective province or territory.

Business or organization and contact information

1. Please verify or provide the business or organization's legal and operating name and correct where needed.

Note: Legal name modifications should only be done to correct a spelling error or typo.

Legal Name

The legal name is the one recognized by law; it is therefore the name liable for pursuit or for debts incurred by the business or organization. In the case of a corporation, it is the legal name as fixed by its charter or the statute by which the corporation was created.

Modifications to the legal name should only be done to correct a spelling error or typo.

To report the legal name of a different legal entity, use question 3 instead: select 'Not currently operational', choose the applicable reason, and provide the legal name of the other entity along with any other requested information.

Operating Name

The operating name is the name by which the business or organization is commonly known, if different from its legal name. The operating name is synonymous with trade name.

  • Legal name
  • Operating name (if applicable)

2. Please verify or provide the contact information of the designated business or organization contact person for this questionnaire and correct where needed.

Note: The designated contact person is the person who should receive this questionnaire. The designated contact person may not always be the one who actually completes the questionnaire.

  • First name
  • Last name
  • Title
  • Preferred language of communication
    • English
    • French
  • Mailing address (number and street)
  • City
  • Province, territory or state
  • Postal code or ZIP code
  • Country
    • Canada
    • United States
  • Email address
  • Telephone number (including area code)
  • Extension number (if applicable)
    The maximum number of characters is 5.
  • Fax number (including area code)

3. Please verify or provide the current operational status of the business or organization identified by the legal and operating name above.

  • Operational
  • Not currently operational (e.g., temporarily or permanently closed, change of ownership)
    Why is this business or organization not currently operational?
    • Seasonal operations
      • When did this business or organization close for the season?
        • Date
      • When does this business or organization expect to resume operations?
        • Date
    • Ceased operations
      • When did this business or organization cease operations?
        • Date
      • Why did this business or organization cease operations?
        • Bankruptcy
        • Liquidation
        • Dissolution
        • Other - Specify the other reasons why the operations ceased
    • Sold operations
      • When was this business or organization sold?
        • Date
      • What is the legal name of the buyer?
    • Amalgamated with other businesses or organizations
      • When did this business or organization amalgamate?
        • Date
      • What is the legal name of the resulting or continuing business or organization?
      • What are the legal names of the other amalgamated businesses or organizations?
    • Temporarily inactive but will re-open
      • When did this business or organization become temporarily inactive?
        • Date
      • When does this business or organization expect to resume operations?
        • Date
      • Why is this business or organization temporarily inactive?
    • No longer operating due to other reasons
      • When did this business or organization cease operations?
        • Date
      • Why did this business or organization cease operations?

4. Please verify or provide the current main activity of the business or organization identified by the legal and operating name above.

Note: The described activity was assigned using the North American Industry Classification System (NAICS).

This question verifies the business or organization's current main activity as classified by the North American Industry Classification System (NAICS). NAICS is an industry classification system developed by the statistical agencies of Canada, Mexico and the United States. Created against the background of the North American Free Trade Agreement, it is designed to provide common definitions of the industrial structure of the three countries and a common statistical framework to facilitate the analysis of the three economies. NAICS is based on supply-side or production-oriented principles, to ensure that industrial data, classified to NAICS, are suitable for the analysis of production-related issues such as industrial performance.

The target entities for which NAICS is designed are businesses and other organizations engaged in the production of goods and services. They include farms, incorporated and unincorporated businesses and government business enterprises. They also include government institutions and agencies engaged in the production of marketed and non-marketed services, as well as organizations such as professional associations and unions and charitable or non-profit organizations and the employees of households.

The associated NAICS should reflect those activities conducted by the business or organizational units targeted by this questionnaire only, as identified in the 'Answering this questionnaire' section and which can be identified by the specified legal and operating name. The main activity is the activity which most defines the targeted business or organization's main purpose or reason for existence. For a business or organization that is for-profit, it is normally the activity that generates the majority of the revenue for the entity.

The NAICS classification contains a limited number of activity classifications; the associated classification might be applicable for this business or organization even if it is not exactly how you would describe this business or organization's main activity.

Please note that any modifications to the main activity through your response to this question might not necessarily be reflected prior to the transmitting of subsequent questionnaires and as a result they may not contain this updated information.

The following is the detailed description including any applicable examples or exclusions for the classification currently associated with this business or organization.

Description and examples

  • This is the current main activity.
  • This is not the current main activity.

Please provide a brief but precise description of this business or organization's main activity.

e.g., breakfast cereal manufacturing, shoe store, software development

Main activity

5. You indicated that is not the current main activity.

Was this business or organization's main activity ever classified as: ?

  • Yes
    When did the main activity change?
    • Date
  • No

6. Please search and select the industry classification code that best corresponds to this business or organization's main activity.

Select this business or organization's activity sector (optional)

  • Farming or logging operation
  • Construction company or general contractor
  • Manufacturer
  • Wholesaler
  • Retailer
  • Provider of passenger or freight transportation
  • Provider of investment, savings or insurance products
  • Real estate agency, real estate brokerage or leasing company
  • Provider of professional, scientific or technical services
  • Provider of health care or social services
  • Restaurant, bar, hotel, motel or other lodging establishment
  • Other sector

Method of collection

1. Indicate whether you will be answering the remaining questions or attaching files with the required information.

  • Answering the remaining questions
  • Attaching files

Attachments

1. Please attach the files that will provide the information required for the Monthly Dairy Factory Production and Stocks Survey.

To attach files

  • Press the Attach files button
  • Choose the file to attach. Multiple files can be attached

Note:

  • Each file attached must not exceed 5 MB.
  • All attachments combined must not exceed 50 MB.
  • The name and size of each file attached will be displayed on the page.
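
The two size limits in the note above can be checked before uploading. The following is a minimal sketch, not part of the questionnaire: the function name check_attachments is hypothetical, and it assumes 1 MB means 1024 * 1024 bytes, which the questionnaire does not specify.

```python
# Assumption: 1 MB = 1024 * 1024 bytes (the questionnaire does not define the unit).
MAX_FILE_BYTES = 5 * 1024 * 1024
MAX_TOTAL_BYTES = 50 * 1024 * 1024


def check_attachments(sizes):
    """Return a list of problems for a mapping of file name -> size in bytes.

    An empty list means both limits from the note are respected.
    """
    problems = []
    # Per-file limit: each file must not exceed 5 MB.
    for name, size in sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name} exceeds the 5 MB per-file limit")
    # Combined limit: all attachments together must not exceed 50 MB.
    if sum(sizes.values()) > MAX_TOTAL_BYTES:
        problems.append("attachments combined exceed the 50 MB limit")
    return problems
```

For example, calling check_attachments with a single small file returns an empty list, while eleven files of exactly 5 MB each pass the per-file check but fail the combined one.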

Dairy products and by-products

1. Which of the following products did this location manufacture or stock in [month]?

Select all that apply.

Include all manufacturer's stocks owned whether they are stored in your storage room, a public warehouse, a cheese grading station or ready for shipment.

Exclude stocks held on Canadian Dairy Commission accounts.

  • Butter and butter oil (creamery butter, whey butter, butter oil or ghee)
  • Cheddar cheese
  • Mozzarella cheese
  • Other varieties of cheeses
    e.g., Brick, Colby, Gouda
    Exclude cheddar and mozzarella.
  • Processed cheese products
  • Cottage cheese, yogurt or sour cream
    Include both spoonable and drinkable yogurt, and kefir.
  • Concentrated products
    e.g., concentrated milks, milk powders
  • Frozen products
    e.g., ice cream, frozen yogurt, milkshake mix
  • None of the above

2. In [month], did this location process and sell any fluid milk or cream in [Province/Territory]?

Milk and Cream Sales

This question covers all fluid milk and cream processed and packaged in your plant and sold in your province.

Exclude bulk cream sent to other processing plants for packaging into fluid creams.

  • Yes
  • No

Butter and butter oil

3. What were the total production and stocks in kilograms (kg) for the following butter and butter oil products?

Butter and butter oil

Include:

  • production for the entire month
  • stocks on the last day of the month
  • all manufacturer's stocks owned whether they are stored in your storage room, a public warehouse, a cheese grading station or ready for shipment.

Exclude stocks held on Canadian Dairy Commission accounts.

Butter oil and ghee

Butter oil or ghee is the pure butterfat left after milk solids and water are removed from butter.

  Total production for [month] (kg) | Total stocks on the last day of [month] (kg)
a. Creamery butter
Include salted, unsalted, whipped, light, cultured, sweet, calorie-reduced butter and dairy spread.
Exclude reworked butter and manufacturing cream.
   
b. Whey butter    
c. Butter oil and ghee    

Cheddar cheese

4. What were the total production and stocks in kilograms (kg) for cheddar cheese?

Include all sizes: block, stirred curd, curd and cheddar cheese used to make processed cheese.

Cheddar cheese

Include:

  • 'light' or 'lite' varieties of cheddar cheeses
  • production for the entire month
  • stocks on the last day of the month
  • all manufacturer's stocks owned whether they are stored in your storage room, a public warehouse, a cheese grading station or ready for shipment.

Exclude stocks held on Canadian Dairy Commission accounts.

Total production for [month]:

Total stocks on the last day of [month]:

5. Of the total cheddar cheese stocks reported above, what were the total stocks in kilograms (kg) for the following types of cheddar cheese?

Cheddar cheese

Include:

  • 'light' or 'lite' varieties of cheddar cheeses
  • stocks on the last day of the month
  • all manufacturer's stocks owned whether they are stored in your storage room, a public warehouse, a cheese grading station or ready for shipment.

Exclude stocks held on Canadian Dairy Commission accounts.

  Total stocks on the last day of [month] (kg)
a. Mild cheddar
Include stocks of cheddar cheese matured for less than 3 months or processed, sold and labeled as 'mild' cheddar cheese.

b. Medium cheddar
Include stocks of cheddar cheese matured for 3 to 9 months or processed, sold and labeled as 'medium' cheddar cheese.

c. Old, strong, extra-old cheddar
Include stocks of cheddar cheese matured for more than 9 months or processed, sold and labeled as 'old', 'strong', 'extra-old' cheddar cheese.
 
Total stocks for [month] for cheddar cheese  
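
The maturation thresholds in the table above (less than 3 months, 3 to 9 months, more than 9 months) can be sketched as a small helper for sorting stock lots before reporting. This is an illustration only: the function names and the (months, kg) lot format are hypothetical, and the sketch covers the age rule alone, not cheese that is processed, sold and labeled under a given category.

```python
def cheddar_category(months_matured):
    """Map maturation time in months to the reporting category (age rule only)."""
    if months_matured < 0:
        raise ValueError("months_matured must be non-negative")
    if months_matured < 3:
        return "mild"
    if months_matured <= 9:
        return "medium"
    return "old, strong, extra-old"


def stocks_by_category(lots):
    """Sum stocks in kg per category.

    lots: iterable of (months_matured, kg) pairs (hypothetical input format).
    """
    totals = {"mild": 0.0, "medium": 0.0, "old, strong, extra-old": 0.0}
    for months, kg in lots:
        totals[cheddar_category(months)] += kg
    return totals
```

The sum of the three category totals should equal the total cheddar stocks reported in question 4, which gives a simple consistency check before submitting.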

Mozzarella cheese

6. What were the total production and stocks in kilograms (kg) for the following types of mozzarella cheese?

Mozzarella cheese

Include:

  • production for the entire month
  • stocks on the last day of the month
  • all manufacturer's stocks owned whether they are stored in your storage room, a public warehouse, a cheese grading station or ready for shipment.

Exclude stocks held on Canadian Dairy Commission accounts.

  Total production for [month] (kg) | Total stocks on the last day of [month] (kg)
a. Mozzarella American full fat
27% to 28% Butter fat
   
b. Mozzarella American low fat
17% to 20% Butter fat
   
c. Mozzarella Italian full fat
22% to 24% Butter fat
   
d. Mozzarella Italian low fat
15% Butter fat
   
e. All other mozzarella cheese    
Total production and stocks for [month] for mozzarella cheese    

Other varieties of cheeses other than cheddar and mozzarella

7. What was the total production in kilograms (kg) for the following other varieties of cheeses?

Other varieties of cheeses other than cheddar and mozzarella

Report varieties of 'light' or 'lite' cheeses with the respective category of cheese, for example: report 'light' feta cheese at question m. Feta.

Include production for the entire month.

Exclude:

  • cheddar and mozzarella
  • stocks held on Canadian Dairy Commission accounts.
  Total production for [month] (kg)
a. Bakers  
b. Bocconcini  
c. Brie  
d. Brick  
e. Caciocavallo  
f. Camembert  
g. Casata  
h. Colby  
i. Cream cheese  
j. Edam  
k. Emmental  
l. Farmer's  
m. Feta  
n. Friulano  
o. Gouda  
p. Havarti  
q. Marble  
r. Monterey Jack  
s. Parmesan  
t. Pizza
Include cheeses other than mozzarella cheese that are used as topping for pizza.
 
u. Provolone  
v. Ricotta  
w. Romano  
x. Skim milk  
y. Swiss  
z. Curd cheese
Include cheese curd other than cheddar curd.
 
aa. Other - specify other variety of cheese 1:  
ab. Other - specify other variety of cheese 2:  
ac. Other - specify other variety of cheese 3:  
ad. Other - specify other variety of cheese 4:  
ae. Other - specify other variety of cheese 5:  
af. Other - specify other variety of cheese 6:  
ag. Other - specify other variety of cheese 7:  
ah. Other - specify other variety of cheese 8:  
ai. Other - specify other variety of cheese 9:  
aj. Other - specify other variety of cheese 10:  
Total production for [month] for other varieties of cheeses  

8. What were the total stocks in kilograms (kg) for other varieties of cheeses?

Exclude cheddar and mozzarella.

Total stocks on the last day of [month]:

Processed cheese products

9. What were the total production and stocks in kilograms (kg) for processed cheese products?

Include processed cheese, processed cheese food, processed cheese spread made from cheddar cheese or other cheeses.

Processed cheese products

Include:

  • production for the entire month
  • stocks on the last day of the month
  • all manufacturer's stocks owned whether they are stored in your storage room, a public warehouse, a cheese grading station or ready for shipment.

Exclude stocks held on Canadian Dairy Commission accounts.

Total production for [month]:

Total stocks on the last day of [month]:

Cottage cheese, yogurt and sour cream

10. What was the total production in kilograms (kg) for the following products?

Cottage cheese, yogurt and sour cream

Include production for the entire month.

  Total production for [month] (kg)
a. Cottage cheese
Include curds and creamed cottage cheeses.
 
b. Yogurt
Include both spoonable and drinkable yogurt, and kefir.
Exclude volumes of fruits and additives.
 
c. Sour cream
Include regular and light sour cream.
 

Concentrated products

11. What were the total production and stocks in kilograms (kg) for the following concentrated products?

Concentrated products

Include:

  • production for the entire month
  • stocks on the last day of the month
  • all manufacturer's stocks owned whether they are stored in your storage room, a public warehouse, a cheese grading station or ready for shipment.

Exclude stocks held on Canadian Dairy Commission accounts.

  Total production for [month] (kg) | Total stocks on the last day of [month] (kg)
a. Concentrated milk (evaporated whole milk)    
b. Sweetened concentrated milk (condensed whole milk)    
c. Concentrated skim milk (evaporated skim milk)    
d. Sweetened concentrated skim milk (condensed skim milk)    
e. Concentrated partly skimmed milk - 2% (evaporated partly skimmed milk - 2%)    
f. Skim milk powder
Include instantized.
   
g. Whole milk powder    
h. Buttermilk powder    
i. Whey powder    
j. Other - specify other concentrated product 1:    
k. Other - specify other concentrated product 2:    
l. Other - specify other concentrated product 3:    
m. Other - specify other concentrated product 4:    
n. Other - specify other concentrated product 5:    

Frozen products

12. What was the total production in litres (L) for the following frozen products?

Frozen products

Include production for the entire month.

  Production mix for [month] (L) | Production frozen for [month] (L)
a. Soft ice cream
Over 5% Butter fat
   
b. Hard ice cream
Over 5% Butter fat
   
Total ice cream mix    
c. Soft frozen yogurt mix    
d. Hard frozen yogurt mix
Less than 5% Butter fat.
   
e. Ice milk mix    
f. Milkshake mix    
g. Sherbet    
h. Water ices    
i. Other - specify all other frozen products:    

Milk and cream sales

13. What was the total volume in litres (L) sold for the following milk and cream products?

Include all fluid milk and cream processed and packaged in your plant and sold in [Province/Territory].

Exclude bulk cream sent to other processing plants for packaging into fluid creams.

  Total volume of sales for [month] (L)
a. Standard milk
3.25% Butter fat and over
 
b. 2% partly skimmed milk
1.9% to 2.1% Butter fat
 
c. 1% partly skimmed milk
0.9% to 1.1% Butter fat
 
d. Skim milk
Under 0.3% Butter fat
 
e. Buttermilk  
f. Chocolate milk and other flavoured milk  
g. Light cream
5.0% to 9.9% Butter fat
 
h. Cereal cream
10.0% to 15.9% Butter fat
 
i. Table cream
16.0% to 31.9% Butter fat
 
j. Whipping cream
32.0% Butter fat and over
 
k. Eggnog  
l. Other - specify other milk or cream product 1:  
m. Other - specify other milk or cream product 2:  
n. Other - specify other milk or cream product 3:  
o. Other - specify other milk or cream product 4:  
p. Other - specify other milk or cream product 5:  
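
The butterfat ranges given for the cream rows (g to j) can be expressed as a lookup. This is a minimal sketch under stated assumptions: the names CREAM_CLASSES and cream_class are hypothetical, the bounds are treated as inclusive, and values that fall between the listed ranges (for example, 9.95%) are left unclassified because the table does not say where they belong.

```python
# Butterfat ranges in percent, copied from rows g to j of the table above.
# An upper bound of None means "and over".
CREAM_CLASSES = [
    ("Light cream", 5.0, 9.9),
    ("Cereal cream", 10.0, 15.9),
    ("Table cream", 16.0, 31.9),
    ("Whipping cream", 32.0, None),
]


def cream_class(butterfat_pct):
    """Return the cream row name for a butterfat percentage, or None if no range matches."""
    for name, low, high in CREAM_CLASSES:
        if butterfat_pct >= low and (high is None or butterfat_pct <= high):
            return name
    return None
```

A product that returns None here would need to be matched manually, or reported under one of the 'Other - specify' rows.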

Changes or events

14. Indicate any changes or events that affected the reported values for this business or organization, compared with the last reporting period.

Select all that apply.

  • Strike or lock-out
  • Exchange rate impact
  • Price changes in goods or services sold
  • Contracting out
  • Organizational change
  • Price changes in labour or raw materials
  • Natural disaster
  • Recession
  • Change in product line
  • Sold business or business units
  • Expansion
  • New or lost contract
  • Plant closures
  • Acquisition of business or business units
  • Equipment failure
  • Seasonal operations
  • Increased market demand
  • Decreased market demand
  • Other - specify the other changes or events:
  • No changes or events

Contact person

15. Statistics Canada may need to contact the person who completed this questionnaire for further information.

Is [Provided Given Names] [Provided Family Name] the best person to contact?

  • Yes
  • No

Who is the best person to contact about this questionnaire?

  • First name
  • Last name
  • Title
  • Email address
  • Telephone number (including area code)
  • Extension number (if applicable)
    The maximum number of characters is 5.
  • Fax number (including area code)

Feedback

16. How long did it take to complete this questionnaire?

Include the time spent gathering the necessary information.

  • Hours
  • Minutes

17. Do you have any comments about this questionnaire?

Formative Evaluation of Microdata Access: Virtual Data Lab Project

Evaluation Report

December 2020

The report in short

The Data Access Division (DAD) is responsible for providing microdata access to researchers outside Statistics Canada and for maintaining the repository of all deemed employees of the agency. This includes providing support, expertise and standards of data provision to subject-matter areas within Statistics Canada, and ensuring that the confidentiality of information is protected in all microdata access agreements.

Building on the foundation of the research data centres (RDCs) and the Canadian Centre for Data Development and Economic Research (CDER), Statistics Canada created the Virtual Data Lab (VDL) pilot that—in its end state—will provide remote access to detailed anonymized social and business microdata for research and analysis through a secure cloud-based interface. The VDL aims to increase the number of researchers that use Canadian data, increase the number of data holdings that users can access, modernize existing IT infrastructure, develop shared-risk partnerships and move existing operations to a cloud-based architecture.

The evaluation was conducted by Statistics Canada's Evaluation Division in accordance with the Treasury Board's Policy on Results (2016) and Statistics Canada's Risk-Based Audit and Evaluation Plan (2019–20 to 2023–24). The objective of the evaluation was to provide a neutral, evidence-based assessment of program alignment with identified user needs. The evaluation reviewed whether existing VDL assessments conducted within the agency were adequately comprehensive in their coverage, and whether findings and recommendations from these assessments were integrated into the design, delivery, strategic planning and implementation of the VDL project. In addition, the evaluation assessed the extent to which performance measurement and risk assessment frameworks have been implemented.

The evaluation methodology consisted of a document review, key informant interviews with Statistics Canada professionals working in the DAD and Digital Solutions Field, and additional lines of inquiry, where applicable.

It should be noted that the conducting phase of the evaluation, during which all data for the report were collected, was completed before the COVID-19 pandemic. As a result, the findings and recommendations do not consider the activities or decisions that took place after March 2020.

Key findings and recommendations

Issue 1: user experience

Question 1

To what extent are existing DataLab assessments comprehensive in considering all users, identifying user needs and aligning project planning with desired outcomes?

Findings

The assessments are comprehensive in their coverage of existing users and in identifying user needs and feedback trends that can be used to inform strategic planning, particularly among federal government and academic users.

Collectively, existing assessments are comprehensive in their coverage of all project dimensions and in alignment with the desired outcomes set out in project planning and framework documents.

Issue 2: design, delivery and implementation

Question 2

To what extent has information from completed assessments been integrated into the DataLab project?

Findings

Assessments of user experience (UX), privacy and security have been leveraged to inform strategies and planning. However, timelines for delivery were not always clear.

The DAD performed an international scan of national statistical offices (NSOs) with similar capabilities and programs to determine best practices and assess similar program offerings around the world.

Question 3

To what extent have performance measurement and risk assessment frameworks been implemented?

Findings

Performance indicators have been articulated and work to improve them is ongoing.

A risk management framework exists and active risk management is ongoing. Contingency planning and a clarification of responsibilities would be useful.

Recommendation 1

The Assistant Chief Statistician, Strategic Engagement (Field 4), should ensure that, given the level of complexity and dependence on other parts of the agency, contingencies and clearer timeframes are articulated for activities.

Recommendation 2

The Assistant Chief Statistician, Strategic Engagement (Field 4), should ensure that governance mechanisms are in place that effectively manage horizontal activities in a holistic manner, including the clear establishment and understanding of roles and responsibilities.

Update

Since the end of the reference period for the evaluation (March 2020), the VDL Project team has launched and completed a variety of initiatives to meet urgent emerging needs and at the same time accelerate project development. These initiatives included: launching an interim access solution to facilitate COVID-19 research; piloting a cloud environment for a project on opioids with federal partners; working closely with governing bodies to approve frameworks and approaches; and creating and revising key documents such as contracts and agreements to ensure they reflected the VDL framework and governance. The team also updated the project plan to account for progress in several areas including the development and implementation of a corporate client relationship management system and initiated working groups to address issues such as onboarding roles and responsibilities.

What is covered

The evaluation was conducted in accordance with Treasury Board's Policy on Results (2016) and Statistics Canada's Integrated Risk-based Audit and Evaluation Plan (2019–20 to 2023–24). In support of decision-making, accountability and improvement, the main objective of the evaluation was to provide a neutral, evidence-based assessment of the Virtual Data Lab (VDL) pilot. Because of the status of the project at the time of the evaluation, a formative approach was employed (footnote 1).

Data Access Division

The Data Access Division (DAD) is responsible for governing access to confidential microdata for researchers and deemed employees, both outside and within Statistics Canada. This includes providing support, expertise and standards of data provision to subject-matter areas within the agency and ensuring that the confidentiality of the information is protected in all microdata access agreements. The DAD delivers microdata access through various programs based on the type of data requested, user-specific needs, the purpose of accessing the dataset and the sensitivity level of the requested data file. The provision of microdata access is then administered through a continuum in accordance with organizational security requirements and the sensitivity of the data files.

Statistics Canada microdata access continuum

The continuum runs from less sensitive, self-service access to most sensitive, restricted access:

  • Data Liberation Initiative: This is a subscription-based service that provides unlimited access to available, anonymized and non-aggregated data in the collection.
  • Access to public use microdata files: Postsecondary institutions and Statistics Canada partnerships provide faculty members and students with unlimited access to a variety of public use data and geographic files.
  • Real Time Remote Access: This is an online service that allows users to run SAS programs in real time using data located in a secured location.
  • Research data centres, the Canadian Centre for Data Development and Economic Research, Virtual Data Labs: Secure Statistics Canada physical environments are made available to accredited researchers and federal government employees to access anonymized microdata for research purposes, ensuring that all personal information is removed from outputs.

As part of its ongoing efforts to support its modernization agenda (Appendix D), Statistics Canada developed the Confidentiality Classification Tool (CCT), a corporate tool for classifying confidential data. The CCT is a short questionnaire that generates a standard confidentiality classification along a continuum of risk (a score from 0 to 9). CCT scores are used to determine the nature and degree of confidentiality protections Statistics Canada employs in its information holdings. In particular, this tool is used to determine access conditions for trusted external data users, such as VDL partner organizations and researchers, through an access management framework adapted from the Five Safes framework developed by the United Kingdom's Office for National Statistics. This is a holistic approach to data protection that goes beyond information technology (IT) and physical site protection.

Figure 1 - Five Safes framework
Description for Figure 1 - Five Safes framework

The figure depicts the five safes along with four elements that tie the safes together. Overarching the safes is the governance. The five safes are considered in combination and on a sliding scale, depending on the type of researcher, access and data.

The five safes are:

  • Safe people - Can the users be trusted to use the data in an appropriate manner?
  • Safe projects - Is the use of the data appropriate?
  • Safe data - Is there a risk of disclosure?
  • Safe setting - Does the access facility prevent unauthorized use?
  • Safe outputs - Are the statistical results non-disclosive?

The four elements are:

  • Security and monitoring requirements
  • Accreditation of data users and sharing accountability with the host organization
  • Treatment of datasets – vetting of outputs
  • Review and approval process

Virtual Data Lab pilot project

Building on the foundation of the research data centres (RDC) and the Canadian Centre for Data Development and Economic Research (CDER), Statistics Canada created the VDL pilot that—in its end state—will provide remote access to detailed, anonymized social and business microdata for research and analysis through a secure cloud-based interface. The VDL business case outlined the following key outcomes (Appendix B):

  • Enhance UX when accessing anonymized microdata.
  • Leverage new technology, methods and data.
  • Apply risk management practices (to optimize data accessibility with protective measures).
  • Increase collaboration and partnerships between data users and providers.
  • Build analytical and research capacity.
  • Support stronger evidence-based decision-making.

The first prototype of the VDL is the virtual Federal Research Data Centre, which is located at the Canada Mortgage and Housing Corporation (CMHC) headquarters in Ottawa. With new partnerships being developed, this shared-risk model will be expanded to additional partner agencies and their users. The CMHC facility—where deemed employees are provided secure access to anonymized data—is being used to test UX, access and security protocol, starting with a designated certified room similar to an RDC and moving into regulated access for approved CMHC employees to use anonymized research files in separate authorized workspaces. The VDL pilot will evolve gradually toward remote cloud-based access once Statistics Canada's cloud infrastructure becomes available.

The anonymized data used by CMHC researchers are housed on secure Statistics Canada servers and authorized researchers are required to use Statistics Canada secure encrypted devices to connect to these central servers through Virtual Desktop Infrastructure. Less sensitive (i.e., with a lower CCT score) information may be accessed in authorized workspaces, while more sensitive anonymized data (CCT 8) are accessible only within the designated certified room.

The VDL project will also be expanding to include the Bank of Canada (BoC) and the Health Portfolio (Opioid Pathfinder Project), with additional partnerships planned.

Figure 2 - Location of access and file sensitivity for deemed employees
Description of Figure 2 - Location of access and file sensitivity for deemed employees

The figure depicts the types of locations of access mapped against the accreditation level and file sensitivity level (Confidentiality Classification Tool [CCT] score).

In the figure, the file sensitivity (CCT score) ranges from 1 to 9 with 1 being the lowest and 9 the highest.

The accreditation level ranges from level 0 to level 4 with level 0 being the lowest.

Locations of access include open office (cubicle environment), closed office (personal office or conference room) and physical enclave (designated certified room, including virtual federal research data centre, research data centre and federal research data centre).

In terms of location access:

  • An accreditation level of 0 provides no access, regardless of the file sensitivity score
  • For file sensitivity scores of 9, there is no access outside of Statistics Canada headquarters, regardless of accreditation level
  • Level 1 accreditation means access via physical enclave for file sensitivity scores of 1 to 8
  • Level 2 accreditation means access via open office for file sensitivity scores 4 and below, closed office for file sensitivity scores 5 to 7, and physical enclave for a file sensitivity score of 8
  • Level 3 accreditation means access via open office for file sensitivity scores 5 and below, closed office for file sensitivity scores 6 and 7, and physical enclave for a file sensitivity score of 8
  • Level 4 accreditation means access via open office (experimental research) for file sensitivity scores of 5 and below, closed office for file sensitivity scores 6 and 7, and physical enclave for a file sensitivity score of 8

The highest CCT score of the datasets associated with a project dictates the mode and location of access.
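
The access rules described for Figure 2 can be read as a small lookup. The Python sketch below is illustrative only and is not a Statistics Canada tool: the function name `access_location`, the constants and the returned labels are hypothetical, and the sketch assumes CCT scores from 1 to 9 and accreditation levels from 0 to 4, as stated in the figure description.

```python
from typing import Iterable

# Location labels taken from the Figure 2 description (names are illustrative).
NO_ACCESS = "no access"
HQ_ONLY = "Statistics Canada headquarters only"
OPEN_OFFICE = "open office"
CLOSED_OFFICE = "closed office"
PHYSICAL_ENCLAVE = "physical enclave"

# Highest CCT score permitted in an open office, by accreditation level.
# Level 1 is absent because it only grants physical enclave access.
OPEN_OFFICE_MAX = {2: 4, 3: 5, 4: 5}


def access_location(accreditation_level: int, cct_scores: Iterable[int]) -> str:
    """Return the location of access for a project (hypothetical helper)."""
    scores = list(cct_scores)
    if not scores:
        raise ValueError("a project needs at least one dataset")
    if accreditation_level not in range(5):
        raise ValueError("accreditation level must be 0 to 4")
    if any(score not in range(1, 10) for score in scores):
        raise ValueError("CCT scores must be 1 to 9")

    # The most sensitive dataset in the project dictates the location.
    highest = max(scores)

    if accreditation_level == 0:
        return NO_ACCESS
    if highest == 9:
        return HQ_ONLY
    if highest == 8 or accreditation_level == 1:
        return PHYSICAL_ENCLAVE
    if highest <= OPEN_OFFICE_MAX[accreditation_level]:
        return OPEN_OFFICE
    return CLOSED_OFFICE


print(access_location(2, [3, 6]))  # closed office
```

In the usage line, a level 2 researcher whose project combines files scored 3 and 6 is routed to a closed office, because the file scored 6 governs the whole project.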

The length and complexity of the process to apply for access to Statistics Canada microdata files, as well as the limitations on use and outputs, depend on the sensitivity level (CCT score) of the data file to which access has been requested (and granted) and therefore vary by access program. The RDC, CDER and VDL programs provide access to more sensitive data files, so application, access and use protocols are similar across these programs.

Progress toward the implementation of a cloud-based platform for VDL users is contingent on progress made by other Statistics Canada areas, namely the Data Analytics as a Service (DAaaS) and IT teams, as well as partner agencies and cloud services. With relatively short timelines, all teams involved must continue their coordinated efforts to properly test the product and UX, minimize risks, and work toward key project management timelines, deliverables and thresholds.

In summary, the VDL is leveraging new and innovative tools, as well as testing new delivery platforms and formats. The project takes an incremental approach to development and leverages lessons learned to adapt and improve quickly.

Evaluation

The scope of the evaluation was established based on a document review and meetings with DAD experts. The following areas were identified for review:

Evaluation issues and questions

User experience

  1. To what extent are existing DataLab assessments comprehensive in considering all users, identifying user needs and aligning project planning with desired outcomes?

Design, delivery and implementation

  1. To what extent has information from completed assessments been integrated into the DataLab project?
  2. To what extent have performance measurement and risk assessment frameworks been implemented?

Guided by a formative evaluation approach, two main collection methods were used:

  • Internal stakeholder interviews
    Semi-structured interviews with the DAD and Digital Solutions
  • Document review
    Review of internal agency documents

Four main limitations were identified, and mitigation strategies were employed:

Limitations and mitigation strategies

  • Limitation: Because of the large number of (potential) users, the perspectives gathered through user research and assessments may not be fully representative.
    Mitigation strategy: User feedback was considered from a variety of assessments with differing primary objectives. Evaluators were able to identify consistent trends across these reports, which were representative of general user needs from all current and prospective user groups.
  • Limitation: User feedback may contain self-reported bias, which occurs when individuals who are reporting on their own activities portray themselves in a more positive light.
    Mitigation strategy: By reviewing information from a variety of assessments with different scopes, evaluators were able to find consistent overall patterns and trends on which to elaborate.
  • Limitation: Evidential documents are not always available.
    Mitigation strategy: Additional documentation was requested and provided as required and key DAD staff were interviewed to fill any gaps.
  • Limitation: The program has not yet worked with corporate services to develop key performance indicators (KPIs) and associated risk profiles.

Using a formative evaluation approach allowed evaluators to assess indicators and intermediate objectives—a requirement of the Departmental Project Management Framework project approval guidelines (project plan and charter).

Desired outcomes for the pilot were identified and repeated throughout the document review process and could be used as an informal evaluation framework.

The conducting phase of the evaluation, during which all data for the report were collected, was completed before the COVID-19 pandemic. As a result, the findings and recommendations do not consider the activities or decisions that took place after March 2020.

What we learned

User experience

Evaluation question

To what extent are existing DataLab assessments comprehensive in their coverage of all users, in identifying user needs and aligning project planning with desired outcomes?

The assessments are comprehensive in their coverage of existing users and in identifying user needs and feedback trends that can be used to inform strategic planning, particularly among federal government and academic users.

The large majority of existing user feedback was sourced from federal government researchers. Additional insights were provided by RDC employees who were able to confirm that the trends identified in existing assessments align with those of RDC user feedback (mainly academic users and government employees). Although these assessments are representative of current microdata users, future user needs assessments should be expanded to cover a broader range of users and identify project opportunities and prospective partners.

For improved understanding of VDL user requirements, the team will need to extend its coverage of user consultations to provincial and territorial governments to assess the unique needs of these key potential user groups, in adherence with the intended business outcome of establishing new and increased data partnerships with other government departments. In the longer term, consulting private industry and civil society organizations would provide additional insights on the user needs of these groups, in alignment with the long-run business outcome of increasing the number of researchers that use Canadian data and microdata files.

Total number of users consulted in all assessments, by user type
Users consulted, by assessment:
User type | DAaaS UX journey maps | Internal risk review | CMHC feedback | MAP usability testing | All
Private | 2 | - | - | - | 2
Federal government | 25 | 6 | 17 | 5 | 53
Provincial government | 5 | - | - | - | 5
Academic | 5 | - | - | - | 5
Civil society | 3 | - | - | - | 3
Total | | | | | 68
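
The row totals in this table can be checked, and each user type's share of all users consulted derived, with a few lines of Python. This is a minimal sketch: the variable names are illustrative, and a dash in the table is assumed to mean that no users of that type were consulted in that assessment.

```python
# Users consulted per assessment, in table column order:
# DAaaS UX journey maps, internal risk review, CMHC feedback, MAP usability testing.
# Assumption: a dash in the table is recorded here as 0.
consulted = {
    "Private": [2, 0, 0, 0],
    "Federal government": [25, 6, 17, 5],
    "Provincial government": [5, 0, 0, 0],
    "Academic": [5, 0, 0, 0],
    "Civil society": [3, 0, 0, 0],
}

# Row totals (the "All" column) and the grand total.
totals = {user_type: sum(counts) for user_type, counts in consulted.items()}
grand_total = sum(totals.values())

# Share of all users consulted, as a percentage.
shares = {user_type: 100 * n / grand_total for user_type, n in totals.items()}

for user_type, n in totals.items():
    print(f"{user_type}: {n} of {grand_total} ({shares[user_type]:.1f}%)")
```

Federal government users account for 53 of the 68 users consulted, or roughly 78%, which is consistent with the observation that the large majority of existing feedback came from federal government researchers.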

Collectively, existing assessments are comprehensive in their coverage of all project dimensions and aligned with the desired outcomes set out in project planning and framework documents.

A review of the existing assessments found that the user process was covered effectively. Information was gathered on all seven steps, beginning with the proposal and ending with the vetting.

Figure 3 - Statistics Canada microdata access user process, user trends identified
Description - Statistics Canada microdata access user process, user trends identified

The figure depicts the user process from start to finish:

  • proposal
  • application
  • security clearance, MRC (microdata research agreement), and oath
  • deemed employee
  • training and orientation
  • controlled access
  • vetting

Five trends are identified:

  • At the proposal step - metadata and dataset information need to be more readily available and discoverable.
  • At the application step - the application process should be made more user friendly, with clear service delivery standards and timelines.
  • At the training and orientation step - training seems to be limited to security, business processes and vetting. Some self-learning is required. Virtual support and training will be needed for remote users.
  • At the controlled access step - access should be available remotely 24/7. Collaboration with other researchers is essential. Visualization tools are required.
  • At the vetting step - vetting is clunky and unclear to many researchers. Procedures vary between staff and rules are not standard across datasets. Clearer processes and standards could be implemented.

The evaluation team identified areas of the microdata access process in which trends in user feedback were consistent, including increasing data discoverability, simplifying the proposal process, implementing modern and user-centric tools and support mechanisms, allowing collaboration between researchers within and across research projects, and improving the vetting process. These key areas were examined further in interviews with experts from the VDL team and the DAD.

In terms of alignment with the VDL desired outcomes, the evaluation found that the assessments covered four key outcomes sufficiently:

  • Enhance UX when accessing anonymized microdata.
  • Leverage new technology, methods and data.
  • Apply risk management practices (to optimize data accessibility with protective measures).
  • Increase collaboration and partnerships between data users and providers.

Figure 4 - Alignment of assessments with Virtual Data Lab project outcomes
Description of Figure 4 - Alignment of assessments with Virtual Data Lab project outcomes

The figure depicts how the assessments align with the virtual data lab project outcomes. There are four project outcomes: enhance user experience (UX) for users accessing microdata; leverage new technology, methods and data; apply risk management practices; increase collaboration and partnerships. There are 6 assessments: MAP usability testing, UX research, CMHC feedback, pilot testing, international scan, internal risk review.

The outcomes are covered by the assessments as follows:

  • Enhance user experience (UX) for users accessing microdata is covered by 4 assessments: MAP usability testing, UX research, CMHC feedback, pilot testing
  • Leverage new technology, methods and data is covered by 5 assessments: MAP usability testing, UX research, CMHC feedback, pilot testing, international scan
  • Apply risk management practices is covered by 4 assessments: CMHC feedback, pilot testing, international scan, internal risk review
  • Increase collaboration and partnerships is covered by 4 assessments: UX research, CMHC feedback, pilot testing, internal risk review

Virtual data lab project outcomes and findings
VDL outcome | Findings
Enhance the user experience when accessing anonymized microdata
  1. UX research provided first-hand user feedback on the full microdata access process, including categorizing different types of users and their needs by "persona" and mapping each group's specific tasks, needs and concerns.
  2. CMHC feedback provided context-specific insights on the UX of researchers and the Trusted Individual Responsible for Access Controls (TIRAC) in the first partner location of the VDL pilot, including business and security processes, ease of access and IT.
  3. Microdata Access Portal (MAP) usability testing assessed whether the platform was easy and intuitive to use and navigate.
  4. The VDL team was allowing the Public Health Agency of Canada, Health Canada and Statistics Canada researchers to test the cloud interface prior to its implementation.
Leverage new technology, methods and data
  1. An assessment of the UX research findings was performed collaboratively by the IT, DAaaS and VDL teams to identify possible interim and long-term solutions, many of which will rely on new platforms (e.g., DAaaS, JupyterHub), cloud-based secure infrastructure, data discoverability solutions, and the fostering and promotion of an open and innovative business culture.
  2. An international scan provided insights on microdata access projects and programs to identify best practices (including the Five Safes framework and the Australian Bureau of Statistics risk-sharing approach) to be integrated into the VDL project.
  3. Recommendations from a comprehensive internal risk review included the development and implementation of a partner agency capability assessment tool.
Apply risk management practices (to optimize data accessibility with protective measures)
  1. Collaborative assessments done by the DAD, IT and DAaaS provided brainstormed solutions ("how might we" [HMW] statements) for optimizing UX, including—but not limited to—optimizing data accessibility.
  2. A comprehensive internal risk review identified key recommendations for business processes, risk management and security practices.
  3. The risk management portion of the VDL project plan assessed risks and identified mitigation strategies, as well as mapped internal and external dependencies to be considered throughout project development and how they will be managed.
  4. The privacy impact assessment provided insights on privacy considerations and associated risks, which are used to inform active monitoring plans and procedures.
Increase collaboration and partnerships between data users and providers
  1. Assessments done by the DAD, IT and DAaaS teams promoted collaborative work practices and shared accountability.
  2. CMHC feedback provided first-hand partner agency UX feedback from users, the TIRAC and support staff to identify solutions.
  3. The VDL team continued to work with partners in the Health Portfolio and BoC to consult on user needs, system requirements, training and support.

Design, delivery and implementation

Evaluation question

To what extent has information from completed assessments been integrated into the DataLab project?

The evaluation considered whether findings of all assessments were leveraged to inform decision-making, project management and strategic planning, including the degree to which they were integrated into the development of tools and processes for the VDL. Assessment findings and recommendations have been integrated into VDL plans to a significant degree.

Assessments of UX, privacy and security have been leveraged to inform strategies and planning. However, timelines for delivery were not always clear.

The evaluation found that the VDL team—and the DAD more generally—has made a significant effort to consult users and test products prior to implementation. Assessments of user needs and UX research were leveraged to develop business requirements for the new MAP and Client Relationship Management System (CRMS). During the planning stages, the VDL team also performed an international scan to identify best practices that could be leveraged for the project, its services and framework.

Based on the UX assessments and discussions with RDC experts, the evaluation team was able to identify recurring trends in user feedback across the microdata access process. These have been further articulated below, including planned and proposed solutions.

User needs identified and planned or proposed solutions
Microdata access process: Proposal for data discoverability

User issue identified:

Researchers reported experiencing difficulties in determining which datasets to request and identifying which variables were available in each one. Improved discoverability and availability of metadata and documents, such as data dictionaries, codebooks and variable lists, would facilitate the proposal process for users, as they could be used to inform proposal writing and develop research plans.

Planned or proposed solution:

The VDL team is working with key partners such as DAaaS, IT, Corporate Services, other units within the DAD and the Canadian Research Data Centre Network to implement a data discoverability tool. This process has been delayed because of the CRMS acquisition, which will act as the administrative back end of the MAP. The data discoverability function was included as a business requirement for the development of the MAP and CRMS. In the interim, the DAD is working with subject-matter areas to include variables in the dataset descriptions available through the Integrated Metadatabase. Timelines for a fully functioning data discoverability tool were not identified, but were considered to be more long term.

Microdata access process: Application

User issue identified:

Users noted that the application process could be made more accessible with clear directions, as well as defined service standards and timelines, so that they could better plan their research.

The microdata access application process should be streamlined or automated, where possible, to reduce wait times and limit redundant or unnecessary processes that may strain resources.

Planned or proposed solution:

The MAP (user interface) is being developed to include functions such as registering an account, modifying information, uploading or downloading documents, applying for and being assigned a CCT access score, joining a project or inviting team users, and saving information securely. The MAP is currently being used for researcher applications, but will be expanded to cover other aspects of the microdata access process. The MAP was first user tested and updated before its rollout.

Timelines for a final MAP were not identified.

Microdata access process: Training

User issue identified:

Training is limited to security, business processes and vetting with self-learning being necessary. Virtual support and training modules would be beneficial, especially once users work exclusively from the cloud platform.

Planned or proposed solution:

Training and orientation modules would be administered through the MAP and repeated at specific frequencies to ensure users are informed and aware of processes and policies.

Timelines for this component of the MAP were identified during discussions as being short to medium term.

Microdata access process: Controlled access

User issue identified:

UX could be improved through the implementation of a "one-stop-shop" MAP (with one set of login credentials, virtual training modules, project and security reminders, and a collaboration space for researchers).

Access should also be available remotely on a 24/7 basis.

Planned or proposed solution:

The VDL team (DAD) plans to use the MAP to manage administrative processes, including the proposal, fees, training, support and vetting (currently limited to the proposal process). An additional planned MAP capability provides access to the VDL interface. The implementation of the MAP (in its full capacity) is contingent on a new CRMS that will support the VDL in managing user processes, as well as in extracting KPI data.

The cloud interface will provide a centralized data access point and may eventually include a collaboration space for researchers, but this will require a review of the Policy on Microdata Access.

As part of the planned shift toward data access through the cloud, users would also be able to access data from their personal device remotely, subject to the CCT score of the dataset to which they have been granted access. When data are deemed highly sensitive, researchers who must access their data from a secure facility or approved closed office space would be required to follow the security access protocol (business hours) of their respective organization.

Timelines for the use of personal devices were not identified.

Microdata access process: Vetting

User issue identified:

Users described the vetting process as clunky and otherwise unclear or inconsistent. Procedures vary between staff and rules are not always standard across datasets. The vetting process should be streamlined or automated, where possible, to reduce inconsistencies and facilitate understanding among users, in particular to standardize training and approaches, and support users through vetting rules and the vetting process.

Planned or proposed solution:

The DAD is currently exploring tools and programs available to streamline this process. It is examining a system developed by another jurisdiction that, while limited in its ability to vet most VDL outputs, may be used as a basis upon which further development can take place.

Vetting is being considered for inclusion in the MAP training modules to help increase consistency.

Timelines for improved vetting processes were not articulated, but appear to be long term (one to two years or more).

Overall, the evaluation found that the VDL team worked collaboratively with the DAaaS team to identify trends in UX and possible solutions (HMW statements). Although these statements were beneficial in presenting possible solutions, they were not included in project planning, nor were timelines established for their integration. An additional area for improvement in UX could come from formalized lessons learned over the first few months of the project rollout, especially as new partners are onboarded.

Lastly, because of the complex integrated nature of the project, timelines are heavily reliant on other areas of the agency. Some activities related to the MAP and cloud services are scheduled to take place in the near term, while others have been pushed to future iterations of the project. The timing of the activities and items being deferred was generally unclear.

The DAD performed an international scan of national statistical offices (NSOs) with similar capabilities and programs to determine best practices and assess similar program offerings worldwide.

The approach taken by the VDL team was developed by completing a point-in-time international scan of similar offerings among other NSOs. The use of the Five Safes framework came from this scan. Best practices of the NSOs in Australia, Denmark, Finland, France, Germany, Sweden and the United Kingdom were assessed when developing the possible format of the VDL, and its services align with international best practices from this point in time.

It would be beneficial for the VDL team to establish an international network of microdata access experts to share best practices proactively, replacing point-in-time international scans that may miss key elements. In this way, the VDL team would remain informed of new and innovative approaches to microdata access that could be used or adopted.

Evaluation question

To what extent have performance measurement and risk assessment frameworks been implemented?

Performance indicators have been articulated and work to improve them is ongoing.

Members of the VDL team noted that, while the performance measurement framework and indicators are outlined in the business case and project plan, they should be viewed as preliminary (Appendix C). At the time of the evaluation, the DAD was set to begin working with internal performance measurement experts to develop divisional performance indicators. It was unclear whether this would include additional KPIs for the VDL project. The VDL team could work proactively with other DAD project teams to identify KPIs that balance outcomes at the divisional level while also monitoring the progress of the VDL project. In terms of monitoring, there was little indication that regular monitoring of progress against the indicators in the project plan was taking place.

A risk management framework exists and active risk management is taking place. Contingency planning and a clarification of responsibilities would be useful.

A comprehensive review of the business process and risk framework was completed in 2019 and documentation and regular reporting on risks take place through the Departmental Project Management Framework. Members of the VDL team noted that they will take an iterative approach to amending risks and proposed mitigations or response strategies as changes arise within the project and as they learn from partners and users, although no formal documentation was provided.

Risk management includes effective contingency planning as part of mitigation development. The evaluation found that the articulation of contingency plans could be improved. Some examples where such plans would be beneficial include the proposal and vetting processes (Appendix E). For example, changes to the proposal process are highly dependent on both a new CRMS and MAP. However, no contingencies were found in the case of either of them being delayed or not meeting requirements. For the vetting process, a standard operating procedure could be established in the short term to improve consistency and clarity.

Lastly, the evaluation found that a clarification of responsibilities would be useful, which is particularly important given the complexity of the project and the strong dependency on partners. In particular, it was noted that, while there was a difference between what was considered a VDL project responsibility and what was considered a DAD responsibility, it was unclear how and when things were intended to be shared, who was ultimately making decisions, and how the work would be distributed.

Performance measurement and risk management

  • The Data Access Division (DAD) and the virtual data lab (VDL) team will work with internal performance measurement experts to develop and implement a logic model.
  • As part of existing assessments, a comprehensive review of security procedures and processes of the virtual Federal Research Data Centre (vFRDC) was completed.
  • Service-centric metrics and reporting were identified as a mandatory and high-priority business requirement for a new Client Relationship Management System (CRMS).
  • The DAD will take a responsive approach to amending policies and directives as needs arise. This includes amending the risk management framework to reflect risks from the transition to remote access.

How to improve the program

Recommendation 1

The Assistant Chief Statistician, Strategic Engagement (Field 4), should ensure that, given the level of complexity and dependence on other parts of the agency, contingencies and clearer timeframes are articulated for activities.

Recommendation 2

The Assistant Chief Statistician, Strategic Engagement (Field 4), should ensure that governance mechanisms are in place that effectively manage horizontal activities in a holistic manner, including the clear establishment and understanding of roles and responsibilities.

Management response and action plan

Recommendation 1

The Assistant Chief Statistician, Strategic Engagement (Field 4), should ensure that, given the level of complexity and dependence on other parts of the agency, contingencies and clearer timeframes are articulated for activities.

Management response

Management agrees with the recommendation.

The VDL project will maintain evergreen plans, schedules and contingencies. An enhanced tracking process will be implemented to ensure plans, schedules, risks and contingencies for VDL project activities and deliverables are complete, continually updated, and regularly reviewed with service providers. These will be reported on using Departmental Project Management Framework tools such as the monthly Dashboard, and monthly Interdependency Report to ensure clear timeframes are articulated across the agency.

In collaboration with service partners, the VDL project will continue to report to committees (such as the Strategic Management Committee [SMC]) in order that interdependencies, risks, contingencies and issues be shared and addressed with other parts of the agency.

Deliverables and timelines

The Assistant Chief Statistician, Strategic Engagement (Field 4), will ensure the delivery of:

  • An enhanced tracking process for plans, schedules, risks and contingencies. (July 2021 – Director General [DG] Data Access and Dissemination Branch)

Recommendation 2

The Assistant Chief Statistician, Strategic Engagement (Field 4), should ensure that governance mechanisms are in place that effectively manage horizontal activities in a holistic manner including the clear establishment and understanding of roles and responsibilities.

Management response

Management agrees with the recommendation.

The existing governance structure and processes for the VDL for the regular reporting of progress, planning of upcoming activities, reviewing of roles and responsibilities, and evaluating of risks will be formalized to ensure they are complete and conducted in a holistic horizontal manner; the updated structure will be presented to a senior level committee such as SMC. The VDL project will regularly report progress through governance and oversight mechanisms to ensure issues are addressed consistently, effectively, and holistically.

The VDL project will continue to participate in a number of multi-divisional working groups and committees where horizontal activities can be discussed. In collaboration with service partners, a complete review of roles and responsibilities for all remaining key activities will be undertaken. Findings will be reported to senior management committees such as the SMC, where changes can be approved. The effectiveness of the redefined roles and responsibilities will be evaluated through the pilots, and Service Level Agreements based on the results will be prepared before the launch of the VDL in production.

Deliverables and timelines

The Assistant Chief Statistician, Strategic Engagement (Field 4), will ensure the delivery of:

  1. A formalized governance structure and processes for the VDL process. (June 2021 - DG Data Access and Dissemination Branch)
  2. A completed review of roles and responsibilities for the VDL project. (July 2021 - DG Data Access and Dissemination Branch)
  3. Completed Service Level Agreements. (October 2021 - DG Data Access and Dissemination Branch)

Appendix A

Evaluation types
Description for Evaluation types

The figure depicts three types of evaluations mapped against a program lifecycle which includes four stages (planning, implementation, maturity, and decline/termination).

Formative evaluations are conducted during program implementation to gather insights on how to improve or strengthen the process. Process evaluations are also conducted during this time. These evaluations seek to gather data to understand what's actually going on in a program and whether the intended service recipients are or will be receiving the services they need.

Summative evaluations are conducted near the end of the program cycle and are intended to show whether the program has achieved its intended outcomes and determine whether the program should be continued, replicated or curtailed.

Impact evaluations (or outcome evaluations) gather and analyze data to show the ultimate and longer-lasting effects of a program. Impact evaluations determine causal effects, including measuring whether the program achieved its intended outcomes.

Appendix B – Virtual Data Lab business case

The historical background for this project

In 2017, Statistics Canada developed a modernization vision, which included an expansion of access to social and business microdata and administrative data. The approved direction is to develop new ways of accessing microdata (e.g., through virtual access), modernize existing tools that support access to microdata and make more types of data available to the user.

In modernizing our access to microdata, this project plans to

  • enhance the user experience when accessing anonymized microdata
  • leverage new technology, methods and data
  • apply risk management practices (to optimize data accessibility with protective measures)
  • increase collaboration and partnerships between data users and providers
  • build analytical and research capacity
  • support stronger evidence-based decision-making.

The overall business context into which the project fits

Research communities have expressed the need to better access Statistics Canada data holdings. As we improve data access, we are

  • enhancing users' experience when working with anonymized microdata
  • harnessing new technologies
  • changing the culture in terms of our risk measures, reducing barriers and reconsidering how we can share risk with our partners
  • increasing our collaborative efforts and increasing our analytical and research capacities
  • providing more types of data and more integrated datasets to researchers and policy makers, and allowing for stronger evidence-based decision making and policy making.

In increasing data access, our objectives are to

  • facilitate its use for leading-edge research and analysis across the country
  • drive innovation and inclusion
  • support evidence-based decision-making and policy-making
  • better enable the mobilization of data across the federal government using Statistics Canada's expertise and capacity.

The drivers that triggered the change

Because of advances in technology, the current access programs no longer meet the needs of researchers. Existing tools are becoming outdated and need to be modernized. Our metadata systems also need to be strengthened and modernized.

Appendix C – Business outcome indicators and performance measurement method

Business outcome indicators and performance measurement method
| Business outcome title | Indicator | Baseline | Target | Timeframe to achieve outcomes | Performance measurement method | Responsible |
| --- | --- | --- | --- | --- | --- | --- |
| Growth in the number of researchers using Canadian data and microdata files | Number of researchers and public servants accessing microdata virtually | 1,000/year in research data centres and Federal Research Data Centres (FRDCs); 100 at the Canadian Centre for Data Development and Economic Research. | Gradual increase to meet demand | 31/03/2021 | Tracking methods, vetting reports, notification emails, feedback | Data Access Division (DAD) |
| New microdata sets made available to researchers | Data holdings increase | | Increase business data into RDCs and FRDCs | 2021 | | DAD |
| New and increased data partnerships with other government departments | Data holdings increase | | Increase number of holdings by 100% | 31/03/2021 | | DAD |
| Faster and more timely onboarding of researchers to the Virtual Data Lab (VDL) | Number of processing days | | One week | Six months after creation of VDL | Tracking methods | DAD |
| Increased performance of system surrounding data retrieval and export | Processing time, user feedback on system performance | Variable based on scope of research project by researchers | | March 2021 | Tracking methods, feedback | DAD |
| Increased data accessibility | Number of data files made available for access | | | March 2021 | | DAD |
| Increased client satisfaction | Percentage increase in user satisfaction measured through client survey | | | March 2021 | | DAD |

Appendix D – Statistics Canada's modernization initiative

The Vision: A Data-driven Society and Economy

Modernizing Statistics Canada's workplace culture and its approach to collecting and producing statistics will result in greater and faster access to needed statistical products for Canadians. Specifically, the initiative and its projects will:

  • Ensure more timely and responsive statistics – Ensuring Canadians have the data they need when they need it!
  • Provide leadership in stewardship of the Government of Canada's data asset: Improve and increase alignment and collaboration with counterparts at all levels of government as well as private sector and regulatory bodies to create a whole of government, integrated approach to collection, sharing, analysis and use of data
  • Raise the awareness of Statistics Canada's data and provide seamless access
  • Develop and release more granular statistics to ensure Canadians have the detailed information they need to make the best possible decisions.

The Pillars

User-Centric Delivery Service:

  • Users have the information/data they need, when they need it, in the way they want to access it, with the tools and knowledge to make full use of it.
  • User-centric focus is embedded in Statistics Canada's culture.

Leading-edge Methods and Data Integration:

  • Access to new or untapped data modifies the role of surveys.
  • Greater reliance on modelling and integration capacity through R&D environment.

Statistical Capacity Building and Leadership:

  • Whole of government, integrated approach to collection, sharing, analysis and use of data.
  • Statistics Canada is the leader in identifying, building and fostering savvy information and critical analysis skills beyond our own perimeters.

Sharing and Collaboration:

  • Programs and services are delivered taking a coordinated approach with partners and stakeholders.
  • Partnerships allow for open sharing of data, expertise and best practices.
  • Barriers to accessing data are removed.

Modern Workforce and Flexible Workplace:

  • Organization is agile, flexible and responsive to client needs.
  • We have the talent and environment required to fulfill our current business needs and are open and nimble to continue to position ourselves for the future.

Expected Outcome

Modern and Flexible Operations: Reduced costs to industry, streamlined internal processes and improved efficiency/support of existing and new activities.

Appendix E – Contingency plans

  • Proposal process: Data Analytics as a Service's "how might we" statements (developed in collaboration with the Virtual Data Lab [VDL] team) and research data centre (RDC) experts noted this as a challenge that should be overcome for VDL users. This sentiment was echoed by the business outcome indicators of the VDL project plan, according to which the project should facilitate faster and timelier onboarding of researchers to the VDL (a goal of one week of onboarding within the first six months of the VDL).
    • Although the proposal process itself was identified as a challenge, the discoverability of data and variables included in the files for which researchers are seeking access was also noted as an obstacle to developing a proposal. Although this challenge has been partly resolved through collaboration with subject-matter areas (including variables in the Integrated Metadatabase), contingency plans for providing this service were not identified. Data discoverability improvements are contingent on the acquisition of a new Client Relationship Management System that includes these capabilities, for which there was no clear timeline. No secondary strategy was identified.
    • Many administrative challenges related to the proposal process are intended to be overcome through the implementation of the Microdata Access Portal. However, no contingency plan was identified in the event of delivery being delayed or specifications being changed.
    • The project plan outlines an intended onboarding process of one week per user, although monitoring of progress toward this and other business indicators was not identified. It would be beneficial to begin monitoring indicators laid out in the project plan as a means to reprioritize and provide clarity on project timelines and intended outcomes.
  • Vetting process: The VDL business case suggested that the VDL will "grant researchers and administrative personnel access to a variety of modern tools such as analytical software, visualization and client relations management tools that will have the appropriate monitoring controls and vetting procedures built-in." Much like the proposal process, users expressed challenges with the existing vetting process. This feedback was consistent across the Canadian Research Data Centres Network, but was also included in initial feedback received from Canada Mortgage and Housing Corporation researchers on the VDL pilot.
    • Addressing the current vetting process was not identified as a short-term priority. It was noted that the process could be enhanced by implementing automated or built-in vetting tools (e.g., the Output Checker Workflow Application, which was being analyzed at the time of the evaluation). In the shorter term, the VDL team could work with other areas of the Data Access Division to implement a standard operating procedure to ensure that—at a bare minimum—the consistency and clarity of vetting training would be improved.

Retail Trade Survey (Monthly): CVs for Total sales by geography - March 2021

CVs for Total sales by geography - March 2021
This table displays the results of Retail Trade Survey (Monthly): CVs for Total sales by geography - March 2021. The information is grouped by Geography (appearing as row headers), Month and Percent (appearing as column headers).
| Geography | Month 202103 (%) |
| --- | --- |
| Canada | 0.6 |
| Newfoundland and Labrador | 1.0 |
| Prince Edward Island | 6.7 |
| Nova Scotia | 2.0 |
| New Brunswick | 2.1 |
| Quebec | 1.5 |
| Ontario | 1.2 |
| Manitoba | 1.0 |
| Saskatchewan | 3.1 |
| Alberta | 1.5 |
| British Columbia | 1.3 |
| Yukon Territory | 0.7 |
| Northwest Territories | 0.4 |
| Nunavut | 1.0 |

2021 Census - Data Quality Project - Dwelling Classification Survey

Form 91Q

Confidential when completed.

This information is collected under the authority of the Statistics Act, R.S.C., 1985, c. S-19.

Control use

SSID

  • Prov.
  • CD No.
  • CU No.
  • VR Line No.

Contact person

Office Use Only

Result of interview

  1. Completed questionnaire
  2. Incomplete questionnaire

Section I — Address or Exact Location of This Dwelling

Transcribe from the Assignment List (Form 1B)

  • Street and No. or lot and concession
  • Apt. No.
  • City, town, village
  • Province/territory
  • Postal Code
  • AD

Section II — Verification of Dwelling

Interviewer check item:
1. Is there a dwelling (a set of living quarters with a private entrance) at the address listed above?

  1. Yes; Continue with Question 2
  2. No; What is located at this address?
    1. Business or professional office of some sort (e.g., dentist office, gas station); Continue with Question 2
    2. Dwelling under construction; Continue with Question 2
    3. Dwelling demolished; Continue with Question 2
    4. Empty lot; Continue with Question 2
    5. Could not locate address; End interview
    6. Apartment no longer used as a separate dwelling; Continue with Question 2
    7. Other – Specify; End interview

Read script: refer to Form 91R
2. On Census Day, Tuesday, May 11, was there a single set of living quarters at this address, or was there more than one?

  1. None; End interview
  2. One; Continue with Question 3
  3. More than one; Did each have a private entrance?
    1. Yes; Go to Question 3 and complete a separate questionnaire for each dwelling
    2. No; Go to Question 3

Interviewer check item:
3. Identify person contacted.

  • Family name:
  • Given name and initial(s)
  1. Occupant
  2. Neighbour
  3. Superintendent or building manager
  4. Other; Specify

Section III — Dwelling Occupancy Status on May 11, 2021

4. Was someone living in the dwelling on Census Day?

  1. Yes; Continue with Question 5
  2. No; Go to Question 6
  3. Don't know; End interview and find another contact
  • If the dwelling is now occupied but the occupancy on May 11, 2021 is unknown, check "Don't know".
  • Only check "Yes" or "No" based on the occupancy on Census Day, Tuesday, May 11, 2021.

5. On Census Day, were they living in the dwelling on a temporary or occasional basis, or was it their usual home?
A temporary or occasional basis would include such things as staying at a summer home or a second home.

  1. Temporary – Specify; Go to Question 15
  2. Usual Home; Go to Question 17
  3. Don't know; Go to Question 17

6. Is the dwelling generally occupied on a temporary or occasional basis, or is it someone's usual home?
A temporary or occasional basis would include such things as a summer home or a second home.

  1. Temporary – Specify; Go to Question 15
  2. Usual Home; Continue with Question 7
  3. Don't know; Continue with Question 7

7. Were the usual residents temporarily away, or staying outside of Canada on Census Day, Tuesday, May 11?
Temporarily away includes being away on business, at a summer home, on vacation, or at school.

  1. Yes – Specify; Go to Question 17
  2. No; Continue with Question 8
  3. Don't know; Continue with Question 8
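
The skip pattern for Questions 4 to 7 above can be sketched as a small lookup table. This is only an illustration of the routing printed on the form: the names `SECTION_III_ROUTING`, `next_step`, `section_iii_exit` and `END`, and the shortened answer labels, are hypothetical and are not part of Form 91Q.

```python
# Hypothetical encoding of the Section III skip pattern (Questions 4 to 7).
# Answer labels and the END marker are illustrative names, not form fields.
END = "End interview and find another contact"

SECTION_III_ROUTING = {
    4: {"Yes": 5, "No": 6, "Don't know": END},
    5: {"Temporary": 15, "Usual home": 17, "Don't know": 17},
    6: {"Temporary": 15, "Usual home": 7, "Don't know": 7},
    7: {"Yes": 17, "No": 8, "Don't know": 8},
}


def next_step(question, answer):
    """Return the next question number, or END, for a Section III answer."""
    try:
        return SECTION_III_ROUTING[question][answer]
    except KeyError:
        raise ValueError(
            f"No routing for question {question!r}, answer {answer!r}"
        ) from None


def section_iii_exit(answers):
    """Follow the routing from Question 4 until it leaves Section III.

    `answers` maps each question number to the answer recorded for it.
    """
    step = 4
    while step in SECTION_III_ROUTING:
        step = next_step(step, answers[step])
    return step
```

For example, under these assumptions a dwelling that was occupied on Census Day as a usual home exits Section III at Question 17 (Interview B), while an unoccupied usual home whose residents were not temporarily away exits at Question 8 (Interview A).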

Section IV — Interview A — Dwelling Unoccupied on May 11, 2021

8. Was anyone living in the dwelling at any time between May 1st and Census Day, Tuesday, May 11?

  1. Yes; Continue with Question 9
  2. No; Go to Question 11
  3. Don't know; Go to Question 11

9. Were they living in this dwelling on a temporary or occasional basis, or was it their usual home?
A temporary or occasional basis would include such things as staying at a summer home or a second home.

  1. Temporary; Go to Question 15
  2. Usual home; Continue with Question 10
  3. Don't know; Go to Question 11

10. Could you tell me when these former occupants moved out of the dwelling?

  1. On or before May 10, 2021; Continue with Question 11
  2. On or after May 11, 2021; Return to Question 4, and obtain information about the dwelling for Census Day. Use a new questionnaire if necessary
  3. Don't know; Continue with Question 11

Interviewer check item:
11. Is the person being interviewed an occupant of the dwelling listed in SECTION I?

  1. Yes; Continue with Question 12
  2. No; Go to Question 13

12. On what date did your household move into this dwelling?

  1. On or before May 11, 2021; Return to Question 4, and obtain information about the dwelling for Census Day. Use a new questionnaire if necessary.
  2. On or after May 12, 2021; Go to Question 15

13. Is someone currently living in the dwelling?

  1. Yes; Continue with Question 14
  2. No; Go to Question 15
  3. Don't know; Go to Question 15

14. On what date did the current occupant(s) move into the dwelling?

  1. On or before May 11, 2021; Return to Question 4, and obtain information about the dwelling for Census Day. Use a new questionnaire if necessary.
  2. On or after May 12, 2021; Continue with Question 15
  3. Don't know; Continue with Question 15

15. Was this dwelling suitable for year-round occupancy on Census Day, Tuesday, May 11?
That is, did it have a source of heat or power, and provide complete shelter from the elements?

  1. Yes; Continue with Question 16
  2. No; Continue with Question 16
  3. Don't know; Continue with Question 16

16. Was this dwelling under construction or major renovation on Census Day, Tuesday, May 11?

  1. Yes; End interview and complete Question 24 and Question 25
  2. No; End interview and complete Question 24 and Question 25
  3. Don't know; End interview and complete Question 24 and Question 25

Section V — Interview B — Dwelling Occupied on May 11, 2021

17. How many persons were living in the dwelling on Census Day, Tuesday, May 11?

Include:

  • All persons who had their main residence at this address on May 11, 2021, including newborn babies, room-mates and persons who were temporarily away,
  • Canadian citizens, landed immigrants (permanent residents), persons asking for refugee status (refugee claimants), persons from another country with a work or study permit and family members living here with them,
  • Persons staying at this address temporarily on May 11, 2021 who have no main residence elsewhere.

Exclude:

  • Visitors who had their main residence elsewhere in Canada,
  • Government representatives of another country or members of the Armed Forces of another country and their families,
  • Residents of another country visiting Canada, for example, on a business trip or on vacation.
  1. number of persons; Continue with Question 18
    If "00" persons (meaning ALL persons living in this dwelling are in the 'Exclude' group) End interview and complete Question 24 and Question 25
  2. Don't know; Continue with Question 18

18. When did these people move into this dwelling?

  1. On or before May 11, 2021; Go to Question 21
  2. On or after May 12, 2021; Continue with Question 19
  3. Don't know; Go to Question 23

19. Did anyone live in the dwelling prior to these people?

  1. Yes; Continue with Question 20
  2. No; Return to Question 4, and obtain information about the dwelling for Census Day. Use a new questionnaire if necessary.
  3. Don't know; Go to Question 23

20. When did these former occupants move out of this dwelling?

  1. On or before May 10, 2021; Return to Question 4, and obtain information about the dwelling for Census Day. Use a new questionnaire if necessary
  2. On or after May 11, 2021; Return to Question 17 and obtain information for the May 11 occupants
  3. Don't know; Go to Question 23

21. Do these people still live in the dwelling?

  1. Yes; Go to Question 23
  2. No; Continue with Question 22
  3. Don't know; Go to Question 23

22. When did these people move out of the dwelling?

  1. On or before May 10, 2021; Return to Question 4, and obtain information about the dwelling for Census Day. Use a new questionnaire if necessary.
  2. On or after May 11, 2021; Go to Question 23
  3. Don't know; Go to Question 23

23. What is the sex and age of each person usually living in the dwelling on Census Day, May 11?

Interviewer instructions:
Refer to Question 17 to obtain the total number of persons.
If Question 17 has no response or the number of persons is "00" or more than "06", End interview and complete Question 24 and Question 25

  1. Number of persons

Obtain the sex and age for each person.
If the age of a person is unknown, an approximate age is acceptable.

List of household members - Census Day, May 11, 2021
This table contains no data. It is an example of an empty data table used by respondents to provide data to Statistics Canada.
| | Person 1 | Person 2 | Person 3 | Person 4 | Person 5 | Person 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Male / Female | | | | | | |
| Age | | | | | | |
For children under the age of 1, enter 0.
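
The interviewer instruction for Question 23 amounts to a simple range check on the Question 17 count. The sketch below is a minimal illustration; the function name `should_collect_roster` is hypothetical, and treating a "Don't know" answer to Question 17 as "no response" is an assumption, since the form does not state this explicitly.

```python
def should_collect_roster(persons_q17):
    """Apply the Question 23 interviewer instruction.

    `persons_q17` is the count recorded in Question 17, or None when
    Question 17 has no response (assumed here to include "Don't know").
    Returns True when sex and age should be collected for each person,
    and False when the interview ends and Questions 24 and 25 are
    completed instead.
    """
    if persons_q17 is None:
        return False
    # "00" persons or more than "06" persons ends the interview.
    return 1 <= persons_q17 <= 6
```

The upper limit of six matches the six person columns in the household list.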

Section VI — Classification of Dwelling

Interviewer:
To be completed by interviewer upon completion of interview.

24. What is the "dwelling type" of the dwelling listed in Section I? Mark one circle only.
For a list of dwelling types and their definitions refer to page 6.

  1. Single-detached house
  2. Semi-detached house
  3. Row house
  4. Apartment or flat in a duplex
  5. Apartment in a building that has five or more storeys
  6. Apartment in a building that has fewer than five storeys
  7. Other single-attached house
  8. Mobile home
  9. Other movable dwelling

25. Is the dwelling listed in Section I suitable for year-round occupancy?
That is, does it have a source of heat or power, and provide complete shelter from the elements?

  1. Yes
  2. No
  3. Don't know

Section VII — Situations in the Field

Interviewer check item:
Check all that apply. Explain situations in Section VIII - Comments.

  1. "No dwelling exists" or "Could not locate address" - explain the situation
  2. More than one dwelling at the same address – explain the situation & write down the exact number of dwellings at this address
  3. Two addresses describe the same dwelling (i.e. AD=01) – write down the SSID for each dwelling
  4. Only one of the two addresses associated with the dwelling is listed on the Assignment List (Form 1B), and you will conduct one interview for the entire dwelling (i.e. AD=02) – write the other address associated with the dwelling
  5. Dwelling is a business or collective dwelling with a private dwelling at the address – explain the situation
  6. Refusal by the occupant
  7. Refusal by NON-occupant
  8. Received a completed Form 2A during DCS
  9. Other – explain the situation

Section VIII — Comments

(Space for comments)

Private dwelling type codes - Definitions

1. Single-detached house
A single dwelling not attached to any other dwelling or structure (except its own garage or shed).
A single-detached house has open space on all sides, and has no dwellings either above it or below it.
A mobile home fixed permanently to a foundation should be coded as a single-detached house. (See Code 8.)
2. Semi-detached house
One of two dwellings attached side by side (or back to back) to each other, but not attached to any other dwelling or structure (except its own garage or shed). A semi-detached house has no dwellings either above it or below it and the two units, together, have open space on all sides.
3. Row house
One of three or more dwellings joined side by side (or occasionally side to back), such as a townhouse or garden home, but not having any other dwellings either above it or below. If townhouses are attached to high-rise buildings, assign Code 3 to each townhouse.
4. Apartment or flat in a duplex
One of two dwellings, located one above the other. If duplexes are attached to triplexes or other duplexes or to other non-residential structures (e.g., a store), assign Code 4 to each apartment or flat in the duplexes.
5. Apartment in a building that has five or more storeys
A dwelling unit in a high-rise apartment building which has five or more storeys. Also included are apartments in a building that has five or more storeys where the first floor and/or second floor are commercial establishments.
6. Apartment in a building that has fewer than five storeys
A dwelling unit attached to other dwelling units, commercial units, or other non-residential space in a building that has fewer than five storeys.
7. Other single-attached house
A single dwelling that is attached to another building and that does not fall into any of the other categories, such as a single dwelling attached to a non-residential structure (e.g., store or church) or occasionally to another residential structure (e.g., apartment building).
8. Mobile home

A single dwelling, designed and constructed to be transported on its own chassis and capable of being moved to a new location on short notice. It may be placed temporarily on a foundation pad and may be covered by a skirt.

A mobile home must meet the following two conditions:

  • It is designed and constructed to be transported on its base frame (or chassis) in one piece.
  • The dwelling can be moved on short notice. This dwelling can be easily relocated to a new location, because of the nature of its construction, by disconnecting it from services, attaching it to a standard wheel assembly and moving it without resorting to significant renovations and reconstructions.
9. Other movable dwelling
A single dwelling, other than a mobile home, used as a place of residence, but capable of being moved on short notice, such as a tent, recreational vehicle, travel trailer, houseboat or floating home.

Private dwelling type codes — Chart

Is this dwelling attached to another dwelling or structure (other than its own garage or shed)?

  • No
    Can this dwelling be moved on short notice?
    • No: Code 1
    • Yes
      Is this dwelling designed and constructed to be transported on its own frame (i.e., mobile home)?
      • No: Code 9
      • Yes: Code 8
  • Yes
    Does this dwelling have any other dwelling(s) above or below it?
    • No
      Is this dwelling in a building that has more than two dwellings attached side by side or back to back?
      • No
        Is this dwelling attached to only one other dwelling side by side or back to back (i.e., semi-detached)?
        • No: Code 7
        • Yes: Code 2
      • Yes: Code 3
    • Yes
      Is this dwelling in a building that has five or more storeys?
      • No
        Are there exactly two dwellings in this building?
        • No: Code 6
        • Yes: Code 4
      • Yes: Code 5
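
The chart above is a decision tree, and it can be written out as a short function. This is a sketch only: the function name `dwelling_type_code` and its parameter names are hypothetical labels for the chart's yes/no questions, not official variable names.

```python
def dwelling_type_code(
    *,
    attached,
    movable=False,
    transportable_on_own_frame=False,
    dwelling_above_or_below=False,
    more_than_two_side_by_side=False,
    attached_to_only_one_other=False,
    five_or_more_storeys=False,
    exactly_two_dwellings=False,
):
    """Return the private dwelling type code (1 to 9) by following the chart.

    Each argument is the yes/no answer to one question in the chart.
    Only the answers on the path actually taken are consulted.
    """
    if not attached:
        if not movable:
            return 1  # Single-detached house
        # Movable: mobile home if built to travel on its own frame.
        return 8 if transportable_on_own_frame else 9

    if not dwelling_above_or_below:
        if more_than_two_side_by_side:
            return 3  # Row house
        # Semi-detached if attached to exactly one other dwelling.
        return 2 if attached_to_only_one_other else 7

    if five_or_more_storeys:
        return 5  # Apartment in a building with five or more storeys
    # Fewer than five storeys: duplex if the building has exactly two dwellings.
    return 4 if exactly_two_dwellings else 6
```

For instance, a townhouse (attached, nothing above or below, more than two dwellings side by side) returns Code 3, and a tent or recreational vehicle (not attached, movable, not transportable on its own frame) returns Code 9.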

Meltwater: Social Trends Monitoring Tool - Privacy impact assessment summary

Introduction

Statistics Canada will use Meltwater, a social trends monitoring tool, to search, monitor and analyze social media and traditional media trends and conversations on issues and topics relevant to the Agency’s mandate, in an effort to maintain and improve its public relations, communications, outreach and engagement activities with Canadians.

Objective

A privacy impact assessment for Meltwater was conducted to determine if there were any privacy, confidentiality or security issues with this tool and, if so, to make recommendations for their resolution or mitigation.

Description

Statistics Canada will use Meltwater, a social monitoring tool, to search, monitor and analyze social media and traditional media trends and conversations on issues and topics relevant to Statistics Canada. Using Application Programming Interfaces (APIs), Meltwater performs searches of social and traditional media content based on specific search query keywords relevant to the agency's mandate, indexes the related information found and then presents the results to licensees accessing this tool. The results are then aggregated in summary reports and can be shared internally within the Agency, on a need-to-know basis. The use of Meltwater will allow the Agency to better understand current opinion, sentiment and overall conversation on specific Statistics Canada issues to create communications products that resonate with target audiences.

Risk Area Identification and Categorization

The PIA identifies the level of potential risk (level 1 is the lowest level of potential risk and level 4 is the highest) associated with the following risk areas:

a) Type of program or activity

Program or activity that does not involve a decision about an identifiable individual.

Risk scale: 1

b) Type of personal information involved and context

Sensitive personal information, including detailed profiles, allegations or suspicions and bodily samples, or the context surrounding the personal information is particularly sensitive.

Risk scale: 4

c) Program or activity partners and private sector involvement

Within the institution (among one or more programs within the same institution)

Risk scale: 1

d) Duration of the program or activity

Long-term program or activity.

Risk scale: 3

e) Program population

The program's use of personal information is not for administrative purposes. Information is collected under the Statistics Act.

Risk scale: N/A

f) Personal information transmission

The personal information is used in a system that has connections to at least one other system.

Risk scale: 2

g) Technology and privacy

While the Meltwater web-based tool is new to Statistics Canada, its functionalities and objectives are not new. Monitoring of social media and traditional media environments has been a common practice at Statistics Canada for years. The use of Meltwater will make this practice more efficient and will directly support the Agency's desire to further engage with Canadians and translate data stories via social media, media relations, outreach and engagement activities. It will also be a critical support system for the 2021 Census collection strategy. Daily reports on public sentiment will allow for immediate adjustments to our communication strategy in aid of collection. There will also be an update to the agency's social media terms of use to include a section detailing the use of social monitoring tools such as Meltwater.

h) Potential risk that in the event of a privacy breach, there will be an impact on the individual or employee.

Statistics Canada has robust risk mitigation procedures in place to guard against accidental and purposeful privacy breaches. The use of the Meltwater software poses a low risk to the individual because Meltwater will only access individuals' publicly available personal information, which is made available by the user through the social media accounts' privacy settings. Meltwater uses open Application Programming Interfaces (APIs) to access public social media and traditional media databases. An API is a software intermediary that allows two applications to talk to each other. Meltwater performs searches of social and traditional media content based on specific search query keywords relevant to the agency's mandate, indexes the related information found, and then presents the results to licensees accessing this tool. The results are then aggregated in summary reports and can be shared internally within the agency, on a need-to-know basis.

The Agency's risk mitigation practices/measures will be fully utilized to guard against purposeful breaches. This includes the employee security screening process, leveraging Statistics Canada's culture (of privacy and confidentiality protection), and leveraging practices and training material used with the Agency's interviewers designed to physically control access to confidential data. Training material will outline the policies and directives governing employees' day-to-day activities.

There is a low risk of a breach in which some of the personal information is disclosed without proper authorization, and the impact on the individual would be low. This is because the information available through Meltwater is limited to the specific search criteria, and because any personal information collected by the application is already publicly available and consistent with the parameters outlined by the social media platforms.

i) Potential risk that in the event of a privacy breach, there will be an impact on the institution.

There is a very low risk of a breach of some of the personal information being disclosed without proper authorization. The impact on the institution would be low.

j) Potential risk of Meltwater users being able to identify an individual by the information obtained such as their IP address, a picture or a chain of unique characters.

It is not intended for Meltwater users to access IP addresses, pictures or videos. None of that information is necessary, and it has no value to the Agency. However, certain identifiable information may be visible to our select Meltwater users due to the nature and settings of an individual's account or social media platform. For example, information about location is accessible in Meltwater if the individual has agreed and consented to a) publishing location information in social media posts, b) publishing location in their profile, or c) enabling social media platforms to access their device's internal GPS (location services are enabled). It is important to mention that access to Meltwater will be controlled through licenses administered to only a select group of individuals on the social monitoring and social media teams. Personal information accessed through the tool will only be disclosed on a need-to-know basis.

There is a very low risk of an individual being identified by their IP address, pictures, videos or via a chain of unique characters in their messages. The impact on the institution would be low.

Conclusion

This assessment of Meltwater did not identify any privacy risks that cannot be managed using existing safeguards.

Why are we conducting this survey?

This survey collects data from plants in Western Canada that use grain mainly to produce ethanol or biodiesel. The data will be used by Statistics Canada to calculate grain deliveries and to produce supply and disposition statistics. Information from agricultural surveys is used by Agriculture and Agri-Food Canada and other federal and provincial departments for economic research, and to develop and administer agricultural policies.

Your information may also be used by Statistics Canada for other statistical and research purposes.

Your participation in this survey is required under the authority of the Statistics Act.

Other important information

Authorization to collect this information

Data are collected under the authority of the Statistics Act, Revised Statutes of Canada, 1985, Chapter S-19.

Confidentiality

By law, Statistics Canada is prohibited from releasing any information it collects that could identify any person, business, or organization, unless consent has been given by the respondent, or as permitted by the Statistics Act. Statistics Canada will use the information from this survey for statistical purposes only.

Record linkages

To enhance the data from this survey and to reduce the reporting burden, Statistics Canada may combine the acquired data with information from other surveys or from administrative sources.

Data-sharing agreements

To reduce respondent burden, Statistics Canada has entered into data-sharing agreements with provincial and territorial statistical agencies and other government organizations, which have agreed to keep the data confidential and use them only for statistical purposes. Statistics Canada will only share data from this survey with those organizations that have demonstrated a requirement to use the data.

Section 11 of the Statistics Act provides for the sharing of information with provincial and territorial statistical agencies that meet certain conditions. These agencies must have the legislative authority to collect the same information, on a mandatory basis, and the legislation must provide substantially the same provisions for confidentiality and penalties for disclosure of confidential information as the Statistics Act. Because these agencies have the legal authority to compel businesses to provide the same information, consent is not requested and businesses may not object to the sharing of the data.

For this survey, there are Section 11 agreements with the provincial statistical agencies of Newfoundland and Labrador, Nova Scotia, New Brunswick, Quebec, Ontario, Manitoba, Saskatchewan, Alberta and British Columbia. The shared data will be limited to information pertaining to business establishments located within the jurisdiction of the respective province.

Business or organization and contact information

1. Verify or provide the business or organization's legal and operating name and correct where needed.

Note: Legal name modifications should only be done to correct a spelling error or typo.

Note: Press the help button (?) for additional information.

Legal Name

The legal name is one recognized by law, thus it is the name liable for pursuit or for debts incurred by the business or organization. In the case of a corporation, it is the legal name as fixed by its charter or the statute by which the corporation was created.

Modifications to the legal name should only be done to correct a spelling error or typo.

To indicate a legal name of another legal entity you should instead indicate it in question 3 by selecting 'Not currently operational' and then choosing the applicable reason and providing the legal name of this other entity along with any other requested information.

Operating Name

The operating name is a name the business or organization is commonly known as if different from its legal name. The operating name is synonymous with trade name.

  • Legal name
  • Operating name (if applicable)

2. Verify or provide the contact information of the designated business or organization contact person for this questionnaire and correct where needed.

Note: The designated contact person is the person who should receive this questionnaire. The designated contact person may not always be the one who actually completes the questionnaire.

  • First name
  • Last name
  • Title
  • Preferred language of communication
    • English
    • French
  • Mailing address (number and street)
  • City
  • Province, territory or state
  • Postal code or ZIP code
  • Country
    • Canada
    • United States
  • Email address
  • Telephone number (including area code)
  • Extension number (if applicable)
    The maximum number of characters is 10.
  • Fax number (including area code)

3. Verify or provide the current operational status of the business or organization identified by the legal and operating name above.

  • Operational
  • Not currently operational
    Why is this business or organization not currently operational?
    • Seasonal operations
      • When did this business or organization close for the season?
        • Date
      • When does this business or organization expect to resume operations?
        • Date
    • Ceased operations
      • When did this business or organization cease operations?
        • Date
      • Why did this business or organization cease operations?
        • Bankruptcy
        • Liquidation
        • Dissolution
        • Other - Specify the other reasons for ceased operations
    • Sold operations
      • When was this business or organization sold?
        • Date
      • What is the legal name of the buyer?
    • Amalgamated with other businesses or organizations
      • When did this business or organization amalgamate?
        • Date
      • What is the legal name of the resulting or continuing business or organization?
      • What are the legal names of the other amalgamated businesses or organizations?
    • Temporarily inactive but will re-open
      • When did this business or organization become temporarily inactive?
        • Date
      • When does this business or organization expect to resume operations?
        • Date
      • Why is this business or organization temporarily inactive?
    • No longer operating due to other reasons
      • When did this business or organization cease operations?
        • Date
      • Why did this business or organization cease operations?

4. Verify or provide the current main activity of the business or organization identified by the legal and operating name above.

Note: The described activity was assigned using the North American Industry Classification System (NAICS).

Note: Press the help button (?) for additional information, including a detailed description of this activity complete with example activities and any applicable exclusions.

This question verifies the business or organization's current main activity as classified by the North American Industry Classification System (NAICS). NAICS is an industry classification system developed by the statistical agencies of Canada, Mexico and the United States. Created against the background of the North American Free Trade Agreement, it is designed to provide common definitions of the industrial structure of the three countries and a common statistical framework to facilitate the analysis of the three economies. NAICS is based on supply-side or production-oriented principles, to ensure that industrial data, classified to NAICS, are suitable for the analysis of production-related issues such as industrial performance.

The target entities for which NAICS is designed are businesses and other organizations engaged in the production of goods and services. They include farms, incorporated and unincorporated businesses and government business enterprises. They also include government institutions and agencies engaged in the production of marketed and non-marketed services, as well as organizations such as professional associations and unions and charitable or non-profit organizations and the employees of households.

The associated NAICS should reflect those activities conducted by the business or organizational units targeted by this questionnaire only, as identified in the 'Answering this questionnaire' section and which can be identified by the specified legal and operating name. The main activity is the activity which most defines the targeted business or organization's main purpose or reason for existence. For a business or organization that is for-profit, it is normally the activity that generates the majority of the revenue for the entity.

The NAICS classification contains a limited number of activity classifications; the associated classification might be applicable for this business or organization even if it is not exactly how you would describe this business or organization's main activity.

Please note that any modifications to the main activity through your response to this question might not necessarily be reflected prior to the transmitting of subsequent questionnaires and as a result they may not contain this updated information.

The following is the detailed description including any applicable examples or exclusions for the classification currently associated with this business or organization.

Description and examples

  • This is the current main activity
    Provide a brief but precise description of this business or organization's main activity
    • e.g., breakfast cereal manufacturing, shoe store, software development
  • This is not the current main activity

Main activity

5. You indicated that is not the current main activity.

Was this business or organization's main activity ever classified as: ?

  • Yes
  • No

When did the main activity change?
Date

6. Search and select the industry classification code that best corresponds to this business or organization's main activity.

How to search:

  • if desired, you can filter the search results by first selecting this business or organization's activity sector
  • enter keywords or a brief description that best describes this business or organization's main activity
  • press the Search button to search the database for an activity that best matches the keywords or description you provided
  • then select an activity from the list.

Select this business or organization's activity sector (optional)

  • Farming or logging operation
  • Construction company or general contractor
  • Manufacturer
  • Wholesaler
  • Retailer
  • Provider of passenger or freight transportation
  • Provider of investment, savings or insurance products
  • Real estate agency, real estate brokerage or leasing company
  • Provider of professional, scientific or technical services
  • Provider of health care or social services
  • Restaurant, bar, hotel, motel or other lodging establishment
  • Other sector

Enter keywords or a brief description, then press the Search button

7. You have indicated that the current main activity of this business or organization is:

Main activity

Are there any other activities that contribute significantly (at least 10%) to this business or organization's revenue?

  • Yes, there are other activities
    Provide a brief but precise description of this business or organization's secondary activity
    • e.g., breakfast cereal manufacturing, shoe store, software development
  • No, that is the only significant activity

8. Approximately what percentage of this business or organization's revenue is generated by each of the following activities?

When precise figures are not available, provide your best estimates.

Approximately what percentage of this business or organization's revenue is generated by each of the following activities?
  Percentage of revenue
Main activity  
Secondary activity  
All other activities  
Total percentage  

Grains purchased for industrial purposes

1. Which of the following grains did this company purchase for industrial purposes from the beginning of the crop year to the reference date?

Include:

  • purchases from farmers
  • quantities purchased from companies
  • imported grains.

Select all that apply

  • Wheat
    • Excluding durum.
  • Durum wheat
  • Canola
  • Corn
  • Barley
  • Oats
  • Flaxseed
  • Rye
  • Other grain purchased for industrial purposes
    • Specify the other grain purchased for industrial purposes

Quantity of grain purchased for industrial purposes

2. From the beginning of the crop year to the reference date, how much grain was purchased for industrial use from farmers and companies?

Include:

  • purchases from farmers
  • quantities purchased from companies
  • imported grains.

If your unit of measure is kilograms, please convert it to metric tonnes and round to one decimal place.

From the beginning of the crop year to the reference date, how much grain was purchased for industrial use from farmers and companies?
Grain | Quantity purchased from farmers (metric tonnes) | Quantity purchased from companies (metric tonnes)
Wheat (excluding durum) | |
Durum wheat | |
Canola | |
Corn | |
Barley | |
Oats | |
Flaxseed | |
Rye | |

Grain stocks

3. On the reference date, what were the stocks in metric tonnes of the following grains held in your company's elevators?

Include imported grains.

If your unit of measure is kilograms, please convert it to metric tonnes and round to one decimal place.

On the reference date, what were the stocks in metric tonnes of the following grains held in your company's elevators?
Grain | Total stocks (metric tonnes)
Wheat (excluding durum) |
Durum wheat |
Canola |
Corn |
Barley |
Oats |
Flaxseed |
Rye |

Changes or events

1. Indicate any changes or events that affected the reported values for this business or organization, compared with the last reporting period.

Select all that apply.

  • Strike or lock-out
  • Exchange rate impact
  • Price changes in goods or services sold
  • Contracting out
  • Organizational change
  • Price changes in labour or raw materials
  • Natural disaster
  • Recession
  • Change in product line
  • Sold business or business units
  • Expansion
  • New or lost contract
  • Plant closures
  • Acquisition of business or business units
  • Other
    Specify the other changes or events:
  • No changes or events

Contact person

1. Statistics Canada may need to contact the person who completed this questionnaire for further information.

Is the provided given names and the provided family name the best person to contact?

  • Yes
  • No

Who is the best person to contact about this questionnaire?

  • First name:
  • Last name:
  • Title:
  • Email address:
  • Telephone number (including area code):
  • Extension number (if applicable):
    The maximum number of characters is 5.
  • Fax number (including area code):

Feedback

1. How long did it take to complete this questionnaire?

Include the time spent gathering the necessary information.

  • Hours:
  • Minutes:

2. Do you have any comments about this questionnaire?

Topic Modelling and Dynamic Topic Modelling: A technical review

By: Loic Muhirwa, Statistics Canada

In the machine learning subfield of Natural Language Processing (NLP), a topic model is a type of unsupervised model that is used to uncover abstract topics within a corpus. Topic modelling can be thought of as a sort of soft clustering of documents within a corpus. Dynamic topic modelling refers to the introduction of a temporal dimension into a topic modelling analysis. The dynamic aspect of topic modelling is a growing area of research and has seen many applications, including semantic time-series analysis, unsupervised document classification, and event detection. In the event detection case, if the semantic structure of a corpus represents real-world phenomena, a significant change in that semantic structure can be used to represent and detect real-world events. To that end, this article presents the technical aspects of a novel Bayesian dynamic topic modelling approach in the context of event detection problems.

A proof-of-concept dynamic topic modelling system has been designed, implemented and deployed using the Canadian Coroner and Medical Examiner Database (CCMED), a new database developed at Statistics Canada in collaboration with the 13 provincial and territorial Chief Coroners, Chief Medical Examiners and the Public Health Agency of Canada. The CCMED contains standardized information on the circumstances surrounding deaths reported to coroners and medical examiners in Canada. In particular, the CCMED contains unstructured data in the form of free-text variables, called narratives, that provide detailed information on the circumstances surrounding these reported deaths. The collection of the narratives forms a corpus (a collection of documents) that is suitable for text-mining and so the question follows: can machine learning (ML) techniques be used to uncover useful and novel hidden semantic structures? And if so, can these semantic structures be analyzed dynamically (over time) to detect emerging death narratives?

The initial results look promising and the next step is twofold. First, the system will be further fine-tuned and event detections will be constructed. Second, since this system will be used as an aid for analysts to study and investigate the CCMED, the insights coming out of the system will need to be aligned with human interpretability. This article gives a technical overview of the methodology behind topic modelling, explaining the basis of Latent Dirichlet Allocation and introducing a temporal dimension into the topic modelling analysis. A future article will showcase the application of these techniques on the CCMED.

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) (Footnote 1) is an example of a topic model commonly used in the ML community. Owing to its performance, LDA has several production-level implementations in popular data-oriented scripting languages like Python (Footnote 2). LDA was first introduced as a generalization of Probabilistic Latent Semantic Analysis (PLSA) (Footnote 3) and presents significant improvements, one of them being that it is fully generative (Footnote 4).

The model

LDA is regarded as a generative model as the joint distribution (likelihood-prior product) is explicitly defined; this allows documents to be generated by simply sampling from the distribution. The model assumptions are made clear by examining the generative process that describes how each word in a given document is generated.

Put formally, suppose $T, V \in \mathbb{N}$ are the number of topics and the size of the entire vocabulary, respectively. The vocabulary refers to the set of all terms used to generate the documents. Furthermore, suppose that $\theta \in \mathbb{R}^T$ and $\phi \in \mathbb{R}^V$ are vectors representing discrete distributions over topics and vocabulary, respectively. In LDA a document is represented by a distinct topic distribution and a topic is represented by a distinct word distribution. Let $w \in \{0,1\}^V$ be a one-hot vector representing a particular word in the vocabulary and let $z \in \{0,1\}^T$ be a one-hot vector representing a particular topic.

The notation $\theta$ and $\phi$ can be used to describe the generative process that generates a word in a document by sampling a topic distribution and a word distribution. LDA assumes that these distributions are drawn from Dirichlet distributions, that is, $\theta \sim \mathrm{Dir}(\alpha)$ and $\phi \sim \mathrm{Dir}(\beta)$, where $\alpha$ and $\beta$ are the sparsity parameters. Then, using those distributions, first draw a topic assignment $z \sim \mathrm{Multinomial}(\theta)$ and then, from that topic, draw a word $w \sim \mathrm{Multinomial}(\phi)$. In other words, the words within a document are sampled from a word distribution governed by a fixed topic distribution representing that document. Figure 1 demonstrates this generative process in graphical plate notation, for a corpus of size $M$ with documents of a fixed size $N$. Though the document size is usually assumed to come from an independent Poisson process, for now, to simplify the notation, it is assumed without loss of generality that the documents have a fixed size.

Figure 1: Plate notation of the generative process. The boxes are “plates” representing replicates and shaded nodes are observed.

Description of Figure 1

An illustration of the LDA generative process in plate notation. The diagram is a directed acyclic graph in which the nodes represent variables and the edges represent variable dependencies. The outermost nodes of the graph are the model hyperparameters; these nodes have no in-edges, meaning they do not depend on any other parameter of the model. From the hyperparameters, edges lead into the other variables until they reach a final node, representing a word. From one end, the topic hyperparameter node leads into a word distribution node, which in turn leads into the word node. From the other end, the document hyperparameter node leads into a topic distribution node, which leads into a word-topic assignment node and then finally into the word node. The word node is the only shaded node; the shading indicates that the node represents an observed variable, implying that every other node in the graph is unobserved. Some nodes are enclosed in a rectangular box with a variable at the bottom-right corner of the box. The boxes represent repetitions and the bottom-right variable represents the number of repetitions. The word distribution node is enclosed in a box with T repetitions. The word-topic assignment and word nodes are enclosed in a box with N repetitions. This box is in turn enclosed in a larger box, which also contains the topic distribution node, with M repetitions. Since the word-topic assignment and word nodes are enclosed in two boxes, those two variables are repeated a number of times equal to the product of the variables at the bottom-right corners of the two boxes, in this case N times M.

Table 1: Notation
Variable | Description
$D$ | A set representing all raw documents, i.e., the corpus
$T$ | Number of topics
$V$ | Number of words in the vocabulary
$\theta_j$ | Topic distribution representing the jth document; this is a dense vector in $\mathbb{R}^T$
$N_j$ | Word count in the jth document
$\phi_t$ | Word distribution representing the tth topic; this is a dense vector in $\mathbb{R}^V$
$z_{ij}$ | Topic assignment for the ith word in the jth document; this is a one-hot vector in $\mathbb{R}^T$
$w_{ij}$ | Vocabulary assignment for the ith word in the jth document; this is a one-hot vector in $\mathbb{R}^V$
$\beta$ | Dirichlet sparsity parameter for topics
$\alpha$ | Dirichlet sparsity parameter for documents

Let $Z$ be a set representing the collection of all topic assignments; this is a set of size $\sum_{j=1}^{|D|} N_j$. Let $\Theta$ be a set representing the collection of all topic distributions (documents) and, finally, let $\Phi \in \mathbb{R}^{V \times T}$ be a random matrix representing the collection of all word distributions (topics), i.e., $\Phi = [\phi_1, \dots, \phi_T]$. It follows that if the tth entry in a given topic assignment, say $z_{ij}$, is 1, then:

Equation 1: $\phi_t = \Phi \cdot z_{ij}$
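
As a small numerical illustration of Equation 1, the sketch below assumes that the matrix of word distributions is stored as a NumPy array with one column per topic. The sizes and the random values are illustrative assumptions, not quantities from the article.

```python
import numpy as np

V, T = 4, 3  # toy vocabulary size and number of topics (illustrative values)
rng = np.random.default_rng(0)

# Each column of Phi is the word distribution of one topic.
Phi = rng.dirichlet(np.ones(V), size=T).T  # shape (V, T)

# One-hot topic assignment selecting the topic stored at column index 1.
z_ij = np.zeros(T)
z_ij[1] = 1.0

# Equation 1: multiplying by the one-hot vector picks out one column of Phi.
phi_t = Phi @ z_ij
```

The product returns the selected column unchanged, which is why the word can then be drawn directly from that topic's word distribution.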

Following the notation from above, the joint distribution can be defined as follows:

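Equation 2: $p(W, Z, \Theta, \Phi \mid \alpha, \beta) = p(\Phi \mid \beta) \prod_{j=1}^{|D|} p(\theta_j \mid \alpha) \prod_{i=1}^{N_j} p(z_{ij} \mid \theta_j)\, p(w_{ij} \mid \Phi, z_{ij})$

Since one of the model assumptions is that the topic distributions are conditionally independent given $\beta$, the following form is equivalent:

Equation 3: $p(W, Z, \Theta, \Phi \mid \alpha, \beta) = \prod_{t=1}^{T} p(\phi_t \mid \beta) \prod_{j=1}^{|D|} p(\theta_j \mid \alpha) \prod_{i=1}^{N_j} p(z_{ij} \mid \theta_j)\, p(w_{ij} \mid \Phi \cdot z_{ij})$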

Now that the model is specified, the generative process might seem clearer in pseudo-code. Following the joint distribution, the generative process goes as follows:

Given: V, T, |D|, α, β
for t ∈ [1, ..., T] do
    ϕ_t ~ Dir(β)
end for
Φ ← [ϕ_1, ..., ϕ_T]
for j ∈ [1, ..., |D|] do
    θ_j ~ Dir(α)
    for i ∈ [1, ..., N_j] do
        z_ij ~ Multinomial(θ_j)
        w_ij ~ Multinomial(Φ z_ij)
    end for
end for
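
The pseudo-code above can be turned into a short runnable sketch. The version below is a minimal illustration in Python with NumPy, assuming symmetric (scalar) sparsity parameters and document lengths supplied by the caller; the function name `generate_corpus` and the toy values are hypothetical.

```python
import numpy as np

def generate_corpus(V, T, doc_lengths, alpha, beta, seed=0):
    """Sample a synthetic corpus from the LDA generative process.

    Assumes symmetric Dirichlet parameters (scalar alpha and beta).
    Returns (Phi, Theta, docs), where docs[j] is a list of word indices.
    """
    rng = np.random.default_rng(seed)
    # One word distribution per topic, phi_t ~ Dir(beta), stacked as columns.
    Phi = rng.dirichlet(np.full(V, beta), size=T).T  # shape (V, T)
    Theta, docs = [], []
    for N_j in doc_lengths:
        theta_j = rng.dirichlet(np.full(T, alpha))   # theta_j ~ Dir(alpha)
        words = []
        for _ in range(N_j):
            t = rng.choice(T, p=theta_j)             # z_ij ~ Multinomial(theta_j)
            w = rng.choice(V, p=Phi[:, t])           # w_ij ~ Multinomial(Phi z_ij)
            words.append(int(w))
        Theta.append(theta_j)
        docs.append(words)
    return Phi, np.array(Theta), docs

Phi, Theta, docs = generate_corpus(
    V=20, T=3, doc_lengths=[8, 12, 5], alpha=0.5, beta=0.5
)
```

Here the one-hot vectors of the pseudo-code are replaced by integer indices, which is an equivalent and more compact representation.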

It is worth pointing out that T, the number of topics, is fixed and being fixed is indeed a model assumption and requirement—this also implies, in the Bayesian setting, that T is a model parameter and not a latent variable. This difference is far from trivial, as shown in the inference section.

It is important to distinguish LDA from a simple Dirichlet-multinomial clustering model. A Dirichlet-multinomial clustering model would involve a two-level model in which a Dirichlet is sampled once for a corpus, a multinomial clustering variable is selected once for each document in the corpus, and a set of words is selected for the document conditional on the cluster variable. As with many clustering models, such a model restricts a document to being associated with a single topic. LDA, on the other hand, involves three levels, and notably, the topic node is sampled repeatedly within the document. Under this model, documents can be associated with multiple topics (Footnote 1).

Inference

Inference with LDA amounts to reverse-engineering the generative process described in the previous section. As the generative process goes from topic to word, posterior inference goes from word to topic. With LDA we assume that Θ, Φ and Z are latent variables rather than model parameters. This difference has a drastic impact on how the quantities of interest, the distributions Θ and Φ, are inferred. If Θ and Φ were instead modelled as parameters, the Expectation Maximization (EM) algorithm could be used to find the maximum likelihood estimate (MLE); once EM converges, the learned parameters could be retrieved to accomplish the goal of finding the abstract topics within the corpus. EM provides point estimates of the model parameters by marginalizing out the latent variables. The issues here are that the quantities of interest would be marginalized out and that point estimation would not be faithful to the Bayesian inference approach. For a true Bayesian inference, access to the posterior distribution of the latent variables Θ, Φ and Z is needed. Next, this posterior distribution is examined and some computational difficulties that help motivate an inference approach are pointed out.

The posterior has the following form:

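Equation 4: $p(Z, \Theta, \Phi \mid W, \alpha, \beta) = \dfrac{p(W, Z, \Theta, \Phi \mid \alpha, \beta)}{p(W \mid \alpha, \beta)}$

A closer look at the denominator:

Equation 5: $p(W \mid \alpha, \beta) = \int_{\Phi} p(\Phi \mid \beta) \int_{\Theta} p(\Theta \mid \alpha) \sum_{Z} p(Z \mid \Theta)\, p(W \mid Z, \Phi)\, d\Theta\, d\Phi$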

Equation (5) is known as the evidence and acts as a normalizing constant. Calculating the evidence requires computing a high-dimensional integral over the joint probability. As shown in equation (5), the coupling of Θ and Φ makes them inseparable in the summation and thus this integral is at least exponential in dim(Θ)×dim(Φ), making it intractable. The intractability of the evidence integral is a common problem in Bayesian inference and is known as the Inference Problem (Footnote 1). LDA inference and implementations differ in the way they overcome this problem.

Variational inference

In modern machine learning, variational (Bayesian) inference (VI) is most often used to infer the conditional distribution over the latent variables given the observations and parameters. This is also known as the posterior distribution over the latent variables (equation (4)). At a high level, VI is straightforward: the goal is to approximate the intractable posterior with a distribution that comes from a family of tractable distributions. The members of this family are called variational distributions (from variational calculus). Once the family is specified, one approximates the posterior by finding the variational distribution that optimizes some metric between itself and the posterior. A common metric used to measure the similarity between two distributions is the Kullback-Leibler (KL) divergence, defined as follows:

Equation 6: $\mathrm{KL}(q \,\|\, p) = \mathbb{E}_z\left[\log \dfrac{q(z)}{p(z \mid x)}\right] = \sum_z q(z) \log \dfrac{q(z)}{p(z \mid x)}$
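
For discrete distributions, the sum in Equation 6 can be evaluated directly. The sketch below is a minimal illustration with NumPy; the helper name `kl_divergence` and the example distributions are assumptions made for this example, and the function assumes p is strictly positive wherever q is.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions over the same support."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q(z) = 0 contribute zero by convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = [0.5, 0.5]
p = [0.9, 0.1]
kl_qp = kl_divergence(q, p)  # divergence of q from p
kl_pq = kl_divergence(p, q)  # generally a different value
```

The two values differ, which shows that the KL divergence is not symmetric; it is zero only when the two distributions coincide.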

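Where $q(\cdot)$ and $p(\cdot)$ are probability distributions over the same support. In the original LDA paper (Footnote 1), the authors propose a family of distributions having the following form:

Equation 7: $q(W, Z, \Theta, \Phi \mid \lambda, \pi, \gamma) = \prod_{t=1}^{T} \mathrm{Dir}(\phi_t \mid \lambda_t) \prod_{j=1}^{|D|} \mathrm{Dir}(\theta_j \mid \gamma_j) \prod_{i=1}^{N_j} \mathrm{Multi}(z_{ij} \mid \pi_{ij})$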

Where $\lambda$, $\pi$ and $\gamma$ are free variational parameters. This family of distributions is obtained by decoupling $\Theta$ and $\Phi$ (this coupling is what led to intractability), which makes the latent variables conditionally independent given the variational parameters. Thus, the approximate inference is reduced to the following deterministic optimization problem:

Equation 8: $\lambda^*, \pi^*, \gamma^* = \arg\min_{\lambda, \pi, \gamma} \mathrm{KL}(q \,\|\, p)$

Where $p$ is the posterior of interest and its final approximation is given by:

Equation 9: $q(W, Z, \Theta, \Phi \mid \lambda^*, \pi^*, \gamma^*)$

In the context of the problem, the optimization problem in equation (8) is ill-posed since it requires p(·) and approximating p(·) is the original inference problem. It is straightforward to show the following:

Equation 10: $\mathbb{E}_z\left[\log \dfrac{p(z, x)}{q(z)}\right] = -\mathrm{KL}(q \,\|\, p) + \log p(x)$
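
For completeness, the identity in equation (10) can be sketched in a few lines. The derivation assumes that the expectation is taken with respect to $q(z)$ and uses only the product rule $p(z, x) = p(z \mid x)\, p(x)$ together with the definition of the KL divergence in equation (6).

```latex
\begin{aligned}
\mathbb{E}_z\left[\log \frac{p(z, x)}{q(z)}\right]
  &= \mathbb{E}_z\left[\log \frac{p(z \mid x)\, p(x)}{q(z)}\right] \\
  &= -\mathbb{E}_z\left[\log \frac{q(z)}{p(z \mid x)}\right] + \mathbb{E}_z\left[\log p(x)\right] \\
  &= -\mathrm{KL}(q \,\|\, p) + \log p(x)
\end{aligned}
```

The last step holds because $\log p(x)$ does not depend on $z$. Since the KL divergence is non-negative, the left-hand side can never exceed $\log p(x)$.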

Equation 11: Let $L = \mathbb{E}_z\left[\log \dfrac{p(z, x)}{q(z)}\right]$

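$L$ is called the Evidence Lower Bound (ELBO). Although it depends on the joint distribution $p(z, x)$ (the likelihood-prior product), it is free of the posterior $p(z \mid x)$ and is therefore tractable. Because $\log p(x)$ does not depend on the variational parameters, equation (10) shows that minimizing the KL divergence is the same as maximizing $L$, so the optimization problem in equation (8) is equivalent to the following optimization problem:

Equation 12: $\lambda^*, \pi^*, \gamma^* = \arg\max_{\lambda, \pi, \gamma} L$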
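
The equivalence between the two optimization problems can be checked numerically on a toy model. The sketch below assumes a single fixed observation and a latent variable with three states; the joint probabilities are made-up values chosen only for illustration.

```python
import numpy as np

# Toy joint p(z, x) for one fixed observation x and a latent z with 3 states.
joint = np.array([0.10, 0.25, 0.05])  # p(z, x); sums to the evidence p(x)
evidence = joint.sum()                # p(x)
posterior = joint / evidence          # p(z | x)

def elbo(q):
    """E_q[log p(z, x) / q(z)] for a strictly positive discrete q."""
    return float(np.sum(q * np.log(joint / q)))

def kl(q, p):
    """KL(q || p) for strictly positive discrete distributions."""
    return float(np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet(np.ones(3))     # an arbitrary variational distribution
    # Equation 10: ELBO = -KL(q || posterior) + log p(x)
    assert np.isclose(elbo(q), -kl(q, posterior) + np.log(evidence))
    # The ELBO never exceeds the log evidence.
    assert elbo(q) <= np.log(evidence) + 1e-12

# The bound is attained when q equals the posterior, where the KL is zero.
best = elbo(posterior)
```

In this toy setting the posterior is available in closed form, so the check is only a sanity test of the identity; in LDA the posterior is intractable and only the ELBO side can be computed.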

Thus, inference in LDA maximizes the ELBO over a tractable family of distributions to approximate the posterior. Typically, a stochastic optimization approach, stochastic coordinate descent in particular, is implemented to overcome the computational complexity. Further details on the analysis of VI are provided in sections 5.2, 5.3, and 5.4 of Footnote 1 and in section 4 of Footnote 4.
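The relationship between the ELBO and the KL divergence can be checked numerically on a toy problem. The sketch below assumes a single fixed observation x and a latent variable z with three possible values. The joint probabilities and the helper names `elbo` and `kl` are hypothetical and chosen only for illustration. It is not the LDA inference procedure itself.

```python
import math

# Hypothetical joint p(z, x) for one fixed observation x and z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                        # p(x) = sum over z of p(z, x)
posterior = [j / evidence for j in joint]    # p(z | x)

def elbo(q):
    """L = E_z[log p(z, x) / q(z)], with the expectation taken under q."""
    return sum(q_z * math.log(j / q_z) for q_z, j in zip(q, joint) if q_z > 0)

def kl(q, p):
    """KL(q || p) for discrete distributions with p > 0 wherever q > 0."""
    return sum(q_z * math.log(q_z / p_z) for q_z, p_z in zip(q, p) if q_z > 0)

q = [0.2, 0.5, 0.3]                          # an arbitrary variational distribution
lhs = elbo(q)                                # left-hand side of equation (10)
rhs = -kl(q, posterior) + math.log(evidence) # right-hand side of equation (10)
best = elbo(posterior)                       # ELBO when q equals the posterior
```

In this toy setting `lhs` and `rhs` agree up to floating-point error, and the ELBO reaches log p(x) exactly when q is the posterior, which is the sense in which maximizing L approximates the posterior.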

Dynamic topic modelling

Dynamic topic modelling refers to the introduction of a temporal dimension into the topic modelling analysis. In particular, dynamic topic modelling in the context of this project refers to studying the change over time of specific topics. The project aims to analyze fixed topics over a particular time interval. Since the documents coming out of the CCMED have a natural time stamp, the date of death (DOD), they provide a canonical way to split the complete dataset into corpora covering a specific time interval. Once the data are split, LDA can be applied to each individual corpus, and it is then possible to analyze how each topic evolves over time.

One challenge with this dynamic approach is mapping the topics from two adjacent time windows. Because of the stochastic nature of the optimization problem in the inference stage, every time an instance of LDA is run, the ordering of the resulting abstract topics is random. Specifically, given two adjacent time windows indexed by t and t-1 and a fixed topic indexed by i, how can it be assured that the ith topic at time t corresponds to the ith topic at time t-1? To answer this, it is possible to construct topic priors for time t by using learned topic parameters from time t-1. To clarify the mechanism, the term prior here refers to the parameters of the prior distributions and not to the distributions themselves; equivalently, it refers to quantities that are proportional to the location (expectation) of the prior distributions. Under this setting, the topic prior β can be represented by a matrix such that the entry βij is the prior on the ith term given the jth topic. Note that without any prior information or domain knowledge about Φij, the probability parameter of the ith term given the jth topic, a uniform prior would be imposed by making β a constant, and it would therefore be minimally represented by a scalar. Whenever β is constant, the resulting Dirichlet distribution is symmetric and is said to have a symmetric prior, which is the constant. Suppose that at time t-1 the topic parameter matrix Φ(t-1) has been learned. Before learning Φ(t), a prior β(t) of the following form is imposed:

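Equation 13: $\beta^{(t)} = \eta\,\Phi^{(t-1)} + (1 - \eta)\,\beta^{(0)}$
where $\beta^{(0)} = \begin{pmatrix} \frac{1}{V} & \cdots & \frac{1}{V} \\ \vdots & \ddots & \vdots \\ \frac{1}{V} & \cdots & \frac{1}{V} \end{pmatrix}$ and $\eta \in [0, 1]$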

Matrix Φ(t-1) serves as an informative prior for Φ(t), which amounts to assuming that the topic distributions from adjacent time windows are similar in some sense. β(0) serves as an uninformative uniform prior; this matrix smooths out the sharp information from Φ(t-1). The vocabulary also evolves over time: some words are added and some are removed as the model sees new corpora, and the prior has to account for this. It is necessary to ensure that any topic still to be learned could potentially contain any new word, even if the same topic had a probability of 0 of containing that word in the previous time windows. Including β(0) with a non-zero weight (1-η), that is, with η < 1, ensures that any new word has a non-zero probability of being picked up by any evolving topic.
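A minimal sketch of how the prior in equation (13) could be assembled when the vocabulary changes between windows is given below. The function name `build_topic_prior`, the toy vocabularies and the matrix values are hypothetical. The sketch also assumes that words new to the vocabulary get a zero row in place of Φ(t-1), that rows for removed words are simply dropped, and that V is the size of the new vocabulary; the project may handle these details differently.

```python
def build_topic_prior(phi_prev, vocab_prev, vocab_new, eta):
    """Return beta(t) = eta * Phi(t-1) + (1 - eta) * beta(0), aligned to vocab_new.

    phi_prev is a list of rows, one per term in vocab_prev, with one column per topic.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    n_topics = len(phi_prev[0])
    uniform = 1.0 / len(vocab_new)          # every entry of beta(0) is 1/V
    row_of = {term: i for i, term in enumerate(vocab_prev)}
    zero_row = [0.0] * n_topics             # assumption: unseen words start at 0
    beta = []
    for term in vocab_new:
        prev_row = phi_prev[row_of[term]] if term in row_of else zero_row
        beta.append([eta * p + (1.0 - eta) * uniform for p in prev_row])
    return beta

# Toy example: "bravo" leaves the vocabulary and "delta" enters it.
vocab_prev = ["alpha", "bravo", "charlie"]
vocab_new = ["alpha", "charlie", "delta"]
phi_prev = [[0.7, 0.1],    # alpha
            [0.2, 0.1],    # bravo
            [0.1, 0.8]]    # charlie
beta_t = build_topic_prior(phi_prev, vocab_prev, vocab_new, eta=0.8)
```

With η = 0.8 the new word "delta" receives the smoothed value (1-η)/V in every topic, whereas with η = 1 its prior would be exactly zero.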

A Dirichlet distribution with a non-constant β is said to have a non-symmetric prior. Typically, the literature advises against non-symmetric priors (Footnote 5), since it is usually unreasonable to assume that there is sufficient a priori information about the word distributions within unknown topics. This case is different. It is reasonable to assume that adjacent time corpora share some level of word distribution information, and to further justify this prior, an overlap between the adjacent corpora will be imposed. Suppose D(t-1) and D(t) are the corpora from time t-1 and t respectively. The following condition will be imposed:

Equation 14: $D^{(t-1)} \cap D^{(t)} \neq \emptyset$

The proportion of overlap is controlled by a hyperparameter that is set beforehand. Note that the overlap strengthens the assumption that βt is a reasonable prior for Φt. However, one might still reasonably assume that this prior is sensible, even with non-overlapping corpora, since Dt-1 and Dt would be quite close in time and thus share some level of information as far as word distribution goes.
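One way to produce overlapping corpora of this kind is sketched below. It assumes that documents are (date, text) pairs, that windows hold a fixed number of documents, and that the overlap hyperparameter is the proportion of a window shared with the next one. The function name `overlapping_windows` and these conventions are illustrative assumptions, not the project's actual splitting rule.

```python
from datetime import date

def overlapping_windows(docs, window_size, overlap):
    """Split date-sorted documents into windows sharing a proportion `overlap`."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    ordered = sorted(docs, key=lambda d: d[0])   # sort by time stamp
    # Each new window starts `step` documents after the previous one.
    step = max(1, round(window_size * (1.0 - overlap)))
    windows = []
    start = 0
    while start < len(ordered):
        windows.append(ordered[start:start + window_size])
        if start + window_size >= len(ordered):
            break                                # last window reached the end
        start += step
    return windows

# Ten hypothetical documents, one per day.
docs = [(date(2020, 1, day), f"doc{day}") for day in range(1, 11)]
windows = overlapping_windows(docs, window_size=4, overlap=0.5)
```

With a window of four documents and an overlap of 0.5, each pair of adjacent windows shares two documents, so the condition in equation (14) holds by construction.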


Official Languages in Natural Language Processing

By: Julien-Charles Lévesque, Employment and Social Development Canada; Marie-Pier Schinck, Employment and Social Development Canada

It is no secret that English is the dominant language in the field of Natural Language Processing. This can be a challenge for Government of Canada data scientists, who must ensure the quality of French data and that data from both official languages receive equal treatment so as to avoid any biases.

The Data Science Division of the Chief Data Office (CDO) at Employment and Social Development Canada (ESDC) is launching a research project on the use of natural language processing (NLP) for both official languages. This initiative, funded by ESDC's Innovation Lab, aims to deepen the CDO's understanding of the impact of language (French or English) on the merit of tools and techniques used in NLP to enable more informed decisions in future NLP projects.

Why is it important to explore the use of both official languages in NLP?

ESDC has experienced this challenge firsthand through their work on NLP projects, and some of their partners in other departments have reported facing the same issue. While there are numerous possible approaches to the treatment and processing of data in multiple languages, it is unclear whether some of these approaches work better than others to provide predictions of comparable quality for both official languages. Essentially, because the impact of how language is processed is never the sole focus of projects, data scientists can only invest limited time and resources to explore that question, which could lead to suboptimal decisions. For the French language, there is a need to better understand the implications of the choices made by data scientists when applying NLP techniques. This will lead to better quality handling of French data, thus helping to reduce language-driven biases and increase fairness for solutions impacting service delivery to clients, as well as internal solutions.

New research into NLP techniques for official languages

To address this problem, ESDC is launching a research project that will focus on some recurrent questions surrounding the application of NLP techniques to both official languages. This includes techniques for preprocessing, embedding and modelling text, as well as techniques to mitigate the impact of imbalanced datasets. They want to gain transferable knowledge that could be leveraged by their team, as well as the GoC data science community, to help bridge the gap between French and English when it comes to the quality of NLP applications in the federal government.

For now, they will only be using text classification problems as use cases. Text classification is both a very common NLP task and one that they have worked on in numerous projects. They have access to several labelled data sets from these past projects, enabling them to ground their findings in a more applied context, using real-life data from their department.

Leveraging existing data sets

The ESDC team will be using datasets from four past text classification problems they solved. These vary in terms of the length of documents, the quality of the text, the classification task (binary vs. multiclass), the proportion of French and English content as well as the way the French content was handled. For context, each of these problems is explored in more detail below.

  • The T4 project is a binary classification problem of notes written by call centre agents; the objective was to predict if a T4 had already been re-sent to a client or not.
  • The Media Monitoring project is a binary classification problem of NewsDesk news articles; the aim was to predict if articles were relevant to senior management.
  • The Record of Employment Comments (ROEC) project is a multiclass classification problem, where the objective was to predict which reason for separation corresponded to employers' comments on Record of Employment forms.
  • The Human Resources (HR) project is a research project that explored the pre-selection of candidates for large entry-level staffing processes. It was framed as a binary classification problem where the objective was to predict the label attributed by HR staff based on the candidates' answers to screening questions.

Table 1. Overview of each problem's data and final solution

| Project name | Problem type | Dataset size | Proportion of French content | Input length | Method used |
|---|---|---|---|---|---|
| T4 | Binary | Small (6k) | 35% | Short | Tokens in both languages, N-Gram & Chi-Square + MLP |
| Media Monitoring | Binary | Large (1M) | 25% | Long | French translated to English, Meta-embedding (from GloVe, FastText and Paragram), Ensemble model (LSTM, GRU, CNN) |
| ROEC | Multiclass | Medium-Large (300k+) | 28% | Short | Tokens in both languages, N-Gram & Chi-Square + MLP |
| HR | Binary | Small (5k) | 6% | Medium to long | Pretrained multilingual contextual embeddings (BERT Base) followed by fine-tuning |

Key research questions

This work will explore key questions that typically arise when developing NLP solutions for classification. The recurring question of imbalanced datasets in GoC data (more observations in English than in French) will also be addressed. More specifically, this project will attempt to answer the following questions:

  1. What is the difference between using a separate model for French and English and using a single model for both? Can general rules or guidelines be inferred for when each approach might be preferable?
  2. Is the strategy of translating French data to English and then training a monolingual English model valid? What are the main factors to take into consideration when using that approach?
  3. Are models trained on a multitude of languages biased in favour of one language over the other? Is the understanding of French documents equivalent to the understanding of English ones?
  4. What is the impact of the imbalance in language representation in the data? Is there a minimum French to English ratio that should be targeted? Which methods should be used to mitigate the implications of this imbalance?

Sharing the results

The bulk of the experiments will be completed over the summer, and a presentation and report will be prepared and circulated sometime during the fall. This detailed report will document the research and exploration that took place as well as the findings. The report will be technical, with data scientists as the targeted audience, since the main goal of this initiative is to enable them to make more informed decisions when handling French data on NLP projects. Additionally, a Machine Learning Seminar will be prepared to discuss this research initiative. The specific topics discussed, and the number of sessions offered, will be driven by the conclusions of the study.

Let's connect!

The team hopes that this research initiative will bring value to future bilingual NLP projects through a more informed handling of French content, thus allowing a higher quality final product. In the meantime, if you have also been facing challenges when using NLP on bilingual datasets, if you have comments, ideas, or maybe some lessons learned that you think would be of interest, or if you simply would like to be kept in the loop, don't hesitate to reach out! The project team invites you to chat with the GC data science community by joining the conversation in the Artificial Intelligence Practitioners and Users Network!

Team Members

Marie-Pier Schinck (Data Scientist), Julien-Charles Lévesque (Data Scientist)


Federal government expenditures on COVID-19 response measures

On March 11, 2020, the World Health Organization declared the COVID-19 pandemic. To address the consequences of the pandemic on the Canadian economy, the federal government of Canada announced and implemented various support and recovery measures for businesses, households, students, the vulnerable population and organizations helping individuals. The table Federal government expenditures on COVID-19 response measures presents the major federal measures announced and implemented, their treatment in the national accounts (in particular, in the Income and Expenditure Accounts), the table numbers where the pertinent series may be found and the amount of expenditure on a quarterly basis.

For comprehensive explanations of the treatment of COVID-19 government support measures in the national accounts, please refer to the documents Recording COVID-19 measures in the national accounts and Recording new COVID measures in the national accounts.

Treatment in national accounts: Subsidies on production, by quarter at quarterly rates ($ millions)

| COVID-19 measure | First quarter 2020 | Second quarter 2020 | Third quarter 2020 | Fourth quarter 2020 | First quarter 2021 |
|---|---|---|---|---|---|
| Canada Emergency Wage Subsidy (CEWS) - business | 4,359 | 29,351 | 22,711 | 10,703 | 8,307 |
| Temporary Wage Subsidy (TWS) - business | 169 | 739 | | | |
| Canada Emergency Rent Subsidy (CERS) - business | | | 52 | 1,558 | 1,222 |
| Lockdown Support (LS) - business | | | 5 | 209 | 255 |

Source: Statistics Canada, tables 36-10-0103, 36-10-0118, 36-10-0477.

Treatment in national accounts: Current transfers to non-profit institutions serving households (NPISH), by quarter at quarterly rates ($ millions)

| COVID-19 measure | First quarter 2020 | Second quarter 2020 | Third quarter 2020 | Fourth quarter 2020 | First quarter 2021 |
|---|---|---|---|---|---|
| Canada Emergency Wage Subsidy (CEWS) - NPISH | 200 | 1,095 | 1,051 | 573 | 364 |
| Temporary Wage Subsidy (TWS) - NPISH | 13 | 46 | | | |
| Canada Emergency Rent Subsidy (CERS) - NPISH | | | 1 | 36 | 23 |
| Lockdown Support (LS) - NPISH | | | 0 | 4 | 4 |

Source: Statistics Canada, tables 36-10-0118, 36-10-0477, 36-10-0115.

Treatment in national accounts: Subsidies on products and imports, by quarter at quarterly rates ($ millions)

| COVID-19 measure | First quarter 2020 | Second quarter 2020 | Third quarter 2020 | Fourth quarter 2020 | First quarter 2021 |
|---|---|---|---|---|---|
| Canada Emergency Commercial Rent Assistance (CECRA) | | 1,130 | 904 | | |
| • Federal contribution | | 849 | 679 | | |
| • Provincial contribution | | 281 | 225 | | |

Source: Statistics Canada, tables 36-10-0103, 36-10-0118, 36-10-0477.

Treatment in national accounts: Current transfers to households - Employment Insurance benefits, by quarter at quarterly rates ($ millions)

| COVID-19 measure | First quarter 2020 | Second quarter 2020 | Third quarter 2020 | Fourth quarter 2020 | First quarter 2021 |
|---|---|---|---|---|---|
| Canada Emergency Response Benefit (CERB) - EI stream | | 19,127 | 9,239 | 864 | |

Source: Statistics Canada, tables 36-10-0118, 36-10-0477, 36-10-0112.

Treatment in national accounts: Transfers to households - Other federal transfers to households, by quarter at quarterly rates ($ millions)

| COVID-19 measure | First quarter 2020 | Second quarter 2020 | Third quarter 2020 | Fourth quarter 2020 | First quarter 2021 |
|---|---|---|---|---|---|
| Canada Emergency Response Benefit (CERB) - CRA stream | | 29,002 | 15,597 | 704 | |
| Canada Emergency Student Benefit (CESB) | | 1,386 | 1,550 | 8 | |
| Canada Recovery Benefit (CRB) | | | | 6,073 | 7,280 |
| Canada Recovery Caregiving Benefit (CRCB) | | | | 900 | 960 |
| Canada Recovery Sickness Benefit (CRSB) | | | | 246 | 144 |

Source: Statistics Canada, tables 36-10-0118, 36-10-0477, 36-10-0112.

Why are we conducting this survey?

The purpose of this survey is to collect reliable and timely information on special crops. Results from this survey are used to:

  • validate crop production such as farm stock and marketing data, and
  • calculate the contribution of the special crops sector to the Canadian economy.

The Canadian Special Crops Association, Pulse Canada, and federal and provincial governments, such as Agriculture and Agri-Food Canada, use this information for establishing programs and policies.

Your information may also be used by Statistics Canada for other statistical and research purposes.

Your participation in this survey is required under the authority of the Statistics Act.

Other important information

Authorization to collect this information

Data are collected under the authority of the Statistics Act, Revised Statutes of Canada, 1985, Chapter S-19.

Confidentiality

By law, Statistics Canada is prohibited from releasing any information it collects that could identify any person, business, or organization, unless consent has been given by the respondent, or as permitted by the Statistics Act. Statistics Canada will use the information from this survey for statistical purposes only.

Record linkages

To enhance the data from this survey and to reduce the reporting burden, Statistics Canada may combine the acquired data with information from other surveys or from administrative sources.

Data-sharing agreements

To reduce respondent burden, Statistics Canada has entered into data-sharing agreements with provincial and territorial statistical agencies and other government organizations, which have agreed to keep the data confidential and use them only for statistical purposes. Statistics Canada will only share data from this survey with those organizations that have demonstrated a requirement to use the data.

Section 11 of the Statistics Act provides for the sharing of information with provincial and territorial statistical agencies that meet certain conditions. These agencies must have the legislative authority to collect the same information, on a mandatory basis, and the legislation must provide substantially the same provisions for confidentiality and penalties for disclosure of confidential information as the Statistics Act. Because these agencies have the legal authority to compel businesses to provide the same information, consent is not requested and businesses may not object to the sharing of the data.

For this survey, there are Section 11 agreements with the provincial statistical agencies of Newfoundland and Labrador, Nova Scotia, New Brunswick, Quebec, Ontario, Manitoba, Saskatchewan, Alberta and British Columbia. The shared data will be limited to information pertaining to business establishments located within the jurisdiction of the respective province.

Business or organization and contact information

1. Verify or provide the business or organization's legal and operating name and correct where needed.

Note: Legal name modifications should only be done to correct a spelling error or typo.

Note: Press the help button (?) for additional information.

Legal Name

The legal name is one recognized by law, thus it is the name liable for pursuit or for debts incurred by the business or organization. In the case of a corporation, it is the legal name as fixed by its charter or the statute by which the corporation was created.

Modifications to the legal name should only be done to correct a spelling error or typo.

To indicate a legal name of another legal entity you should instead indicate it in question 3 by selecting 'Not currently operational' and then choosing the applicable reason and providing the legal name of this other entity along with any other requested information.

Operating Name

The operating name is the name by which the business or organization is commonly known, if different from its legal name. The operating name is synonymous with trade name.

  • Legal name
  • Operating name (if applicable)

2. Verify or provide the contact information of the designated business or organization contact person for this questionnaire and correct where needed.

Note: The designated contact person is the person who should receive this questionnaire. The designated contact person may not always be the one who actually completes the questionnaire.

  • First name
  • Last name
  • Title
  • Preferred language of communication
    • English
    • French
  • Mailing address (number and street)
  • City
  • Province, territory or state
  • Postal code or ZIP code
  • Country
    • Canada
    • United States
  • Email address
  • Telephone number (including area code)
  • Extension number (if applicable)
    The maximum number of characters is 10.
  • Fax number (including area code)

3. Verify or provide the current operational status of the business or organization identified by the legal and operating name above.

  • Operational
  • Not currently operational
    Why is this business or organization not currently operational?
    • Seasonal operations
      • When did this business or organization close for the season?
        • Date
      • When does this business or organization expect to resume operations?
        • Date
    • Ceased operations
      • When did this business or organization cease operations?
        • Date
      • Why did this business or organization cease operations?
        • Bankruptcy
        • Liquidation
        • Dissolution
        • Other - Specify the other reasons for ceased operations
    • Sold operations
      • When was this business or organization sold?
        • Date
      • What is the legal name of the buyer?
    • Amalgamated with other businesses or organizations
      • When did this business or organization amalgamate?
        • Date
      • What is the legal name of the resulting or continuing business or organization?
      • What are the legal names of the other amalgamated businesses or organizations?
    • Temporarily inactive but will re-open
      • When did this business or organization become temporarily inactive?
        • Date
      • When does this business or organization expect to resume operations?
        • Date
      • Why is this business or organization temporarily inactive?
    • No longer operating due to other reasons
      • When did this business or organization cease operations?
        • Date
      • Why did this business or organization cease operations?

4. Verify or provide the current main activity of the business or organization identified by the legal and operating name above.

Note: The described activity was assigned using the North American Industry Classification System (NAICS).

Note: Press the help button (?) for additional information, including a detailed description of this activity complete with example activities and any applicable exclusions.

This question verifies the business or organization's current main activity as classified by the North American Industry Classification System (NAICS). NAICS is an industry classification system developed by the statistical agencies of Canada, Mexico and the United States. Created against the background of the North American Free Trade Agreement, it is designed to provide common definitions of the industrial structure of the three countries and a common statistical framework to facilitate the analysis of the three economies. NAICS is based on supply-side or production-oriented principles, to ensure that industrial data classified to NAICS are suitable for the analysis of production-related issues such as industrial performance.

The target entities for which NAICS is designed are businesses and other organizations engaged in the production of goods and services. They include farms, incorporated and unincorporated businesses and government business enterprises. They also include government institutions and agencies engaged in the production of marketed and non-marketed services, as well as organizations such as professional associations and unions and charitable or non-profit organizations and the employees of households.

The associated NAICS should reflect those activities conducted by the business or organizational units targeted by this questionnaire only, as identified in the 'Answering this questionnaire' section and which can be identified by the specified legal and operating name. The main activity is the activity which most defines the targeted business or organization's main purpose or reason for existence. For a business or organization that is for-profit, it is normally the activity that generates the majority of the revenue for the entity.

The NAICS classification contains a limited number of activity classifications; the associated classification might be applicable for this business or organization even if it is not exactly how you would describe this business or organization's main activity.

Please note that any modifications to the main activity through your response to this question might not necessarily be reflected prior to the transmitting of subsequent questionnaires and as a result they may not contain this updated information.

The following is the detailed description including any applicable examples or exclusions for the classification currently associated with this business or organization.

Description and examples

  • This is the current main activity
  • This is not the current main activity

Provide a brief but precise description of this business or organization's main activity

e.g., breakfast cereal manufacturing, shoe store, software development

Main activity

5. You indicated that is not the current main activity.

Was this business or organization's main activity ever classified as: ?

  • Yes
    When did the main activity change?
    • Date
  • No

6. Search and select the industry classification code that best corresponds to this business or organization's main activity.

How to search:

  • if desired, you can filter the search results by first selecting this business or organization's activity sector
  • enter keywords or a brief description that best describes this business or organization's main activity
  • press the Search button to search the database for an activity that best matches the keywords or description you provided
  • then select an activity from the list.

Select this business or organization's activity sector (optional)

  • Farming or logging operation
  • Construction company or general contractor
  • Manufacturer
  • Wholesaler
  • Retailer
  • Provider of passenger or freight transportation
  • Provider of investment, savings or insurance products
  • Real estate agency, real estate brokerage or leasing company
  • Provider of professional, scientific or technical services
  • Provider of health care or social services
  • Restaurant, bar, hotel, motel or other lodging establishment
  • Other sector

7. You have indicated that the current main activity of this business or organization is:

Main activity

Are there any other activities that contribute significantly (at least 10%) to this business or organization's revenue?

  • Yes, there are other activities
    Provide a brief but precise description of this business or organization's secondary activity
    • e.g., breakfast cereal manufacturing, shoe store, software development
  • No, that is the only significant activity

8. Approximately what percentage of this business or organization's revenue is generated by each of the following activities?

When precise figures are not available, provide your best estimates.

  Percentage of revenue
Main activity  
Secondary activity  
All other activities  
Total percentage  

Physical stocks of special crops

1. On the reference date, which of the following special crops were held as physical stocks in your facilities?

Include only stocks held in Canadian facilities such as elevators, cleaning plants, and stocks in-transit.
Exclude stocks held on farms or outside Canada.

Select all that apply.

  • Canary seed
  • Chickpeas
  • Dry field peas
    • Include feed peas.
  • Lentils
  • Mustard seed
  • Sunflower seed
    • Include sunola and other dwarf varieties.
  • No physical stocks of these special crops on the reference date

2. On the reference date, please indicate the physical stocks in metric tonnes for the following special crops.

Include only stocks held in Canadian facilities such as elevators, cleaning plants, and stocks in-transit.
Exclude stocks held on farms or outside Canada.

  Metric tonnes
Canary seed  
a. Owned by this company  
b. Held for farmers  
c. Held for other companies  
Chickpeas  
d. Owned by this company  
e. Held for farmers  
f. Held for other companies  
Dry field peas  
g. Owned by this company  
h. Held for farmers  
i. Held for other companies  
Lentils  
j. Owned by this company  
k. Held for farmers  
l. Held for other companies  
Mustard seed  
m. Owned by this company  
n. Held for farmers  
o. Held for other companies  
Sunflower seed  
p. Owned by this company  
q. Held for farmers  
r. Held for other companies  

Changes or events

1. Indicate any changes or events that affected the reported values for this business or organization, compared with the last reporting period.

Select all that apply.

  • Strike or lock-out
  • Exchange rate impact
  • Price changes in goods or services sold
  • Contracting out
  • Organizational change
  • Price changes in labour or raw materials
  • Natural disaster
  • Recession
  • Change in product line
  • Sold business or business units
  • Expansion
  • New or lost contract
  • Plant closures
  • Acquisition of business or business units
  • Other
    Specify the other changes or events:
  • No changes or events

Contact person

1. Statistics Canada may need to contact the person who completed this questionnaire for further information.

Is the person identified by the provided given names and the provided family name the best person to contact?

  • Yes
  • No

Who is the best person to contact about this questionnaire?

  • First name:
  • Last name:
  • Title:
  • Email address:
  • Telephone number (including area code):
  • Extension number (if applicable):
    The maximum number of characters is 5.
  • Fax number (including area code):

Feedback

1. How long did it take to complete this questionnaire?

Include the time spent gathering the necessary information.

  • Hours:
  • Minutes:

2. Do you have any comments about this questionnaire?