CVs for Total sales by geography - October 2020
Table summary
This table displays the results of Annual Retail Trade Survey: CVs for Total sales by geography - October 2020. The information is grouped by Geography (appearing as row headers), Month and Percent (appearing as column headers).
Geography                  October 2020 CV (%)
Canada                     1.3
Newfoundland and Labrador  1.3
Prince Edward Island       0.9
Nova Scotia                1.4
New Brunswick              2.8
Quebec                     1.9
Ontario                    3.2
Manitoba                   2.1
Saskatchewan               2.7
Alberta                    2.0
British Columbia           2.2
Yukon Territory            1.3
Northwest Territories      0.5
Nunavut                    1.8
StatCan+ beta consultation
Consultation objectives
This new product is part of Statistics Canada's quest to make statistics more accessible and understandable for all Canadians. StatCan+ provides highlights of data and analysis in plain language on key topics that include links to full articles, data visualizations and additional statistical resources.
Through StatCan+, Statistics Canada will deliver more data, more often. We'll also be sharing news about events, webinars, and profiles of some of our data users. In addition, the product will cover the agency's activities within the community and invite users to get to know the agency better.
This consultation will help Statistics Canada improve and expand StatCan+ to respond to users' information needs.
Consultation methodology
Statistics Canada will ask individuals to consult the beta version of the webpage and to provide feedback through an online feedback form.
Statistics Canada is committed to respecting the privacy of consultation participants. All personal information created, held or collected by the agency is protected by the Privacy Act. For more information on Statistics Canada's privacy policies, please consult the privacy notice.
Natural Resources Canada (NRCan) has a strong foundation integrating advanced analytics in its science and research programs. This expertise makes the department an authority in areas such as geospatial data and projecting ecosystem disturbances in Canada’s forests. NRCan aims to lead the digital transformation of the natural resource sector. To that end, the Digital Accelerator was formed to explore innovative applications of digital solutions and develop strategic partnerships to augment NRCan’s expertise.
What is the Digital Accelerator?
The Digital Accelerator is a team of data scientists and analysts who provide a cross-functional, client-centric service to science and policy experts throughout NRCan. The Accelerator takes a hands-on approach to growing the department’s artificial intelligence-related competencies by delivering tangible products in collaboration with partners and identifying opportunities to share knowledge, expertise and resources.
For example, researchers from NRCan, Transport Canada, Environment and Climate Change Canada, the University of Waterloo and the University of Ontario Institute of Technology are collaborating to provide tools and knowledge to utility planners, policy makers, university researchers and consultants to better inform electrical grid management. The Digital Accelerator is utilizing a variety of datasets—including techno-economic factors, environmental considerations and drivers’ social behaviour—to develop machine learning models to analyze the optimization of electric grids and charging infrastructure for mass vehicle penetration. This analysis will help forecast the impact on charging infrastructure, future grid extensions and utilities generation capacities.
These types of strategic partnerships are fundamental to accelerating the adoption of advanced analytics and providing more value to Canadians. The Digital Accelerator is excited to announce three new partnerships with data science teams from Statistics Canada, Microsoft Canada and Google Canada. These collaborations represent a new way for departments and technology firms to work together and keep pace with the rate of advancements in AI.
Are you interested in learning more or collaborating? Check out the NRCan website or email them today!
Project team and contributors:
Alida Rubwindi, Natural Resources Canada; Lisa VandenBerg, Natural Resources Canada
Version Control with Git for Analytics Professionals
By: Collin Brown, Statistics Canada
Analytics and data science workflows are becoming more complex than ever before: there are more data to analyze, computing resources continue to become cheaper, and the availability of open source software has surged.
For these reasons and more, there has been a significant uptake in programming by analytics professionals who do not have a classical computer science background. These advances have allowed analytics professionals to expand the scope of their work, perform new tasks and leverage these tools to deliver more value.
However, this rapid adoption of programming by analytics professionals introduces new complexities and exacerbates old ones. In classical computer science workflows, such as software development, many tools and techniques have been rigorously developed over decades to accommodate this complexity.
As more analytics professionals integrate programming and open source software into their work, they may benefit significantly from also adopting some of the best practices from computer science that allow for the management of complex analytics and workflows.
When should analytics professionals leverage tools and techniques to manage complexity? Consider the problem of source code version control. In particular, how can multiple analytics professionals work on the same code base without conflicting with each other, and how can they quickly revert to previous versions of the code?
Leveraging Git for version control
Even if you are not familiar with the details of Git, the following scenario will demonstrate the benefits of such a tool.
Imagine there is a small team of analytics professionals making use of Git (a powerful tool typically used in software engineering) and GCCode (a Government of Canada internal instance of GitLab).
The three analytics professionals—Jane, John and Janice—create a monthly report that involves producing descriptive statistics and estimating some model parameters. The code they use to implement this analysis is written in Python, and the datasets they perform their analysis on are posted to a shared file system location that they all have access to. They must produce the report on the same day that the new dataset is received and, afterwards, send it to their senior management for review.
The team uses GCCode to centrally manage their source code and documentation, written in GitLab Flavoured Markdown. They use a pared-down version of a successful Git branching model to ensure there are no conflicts when they each push code to the repository. The team takes a peer review approach to pull requests (PRs), meaning that someone other than the person who submitted the PR must review and approve the changes it implements.
This month is unusual; with little notice, the team is informed by their supervisor that there will be a change in the format that one of the datasets is received in. This format change is significant and requires non-trivial changes to the team’s codebase. In particular, once the changes are made, the code will support data preprocessing in the new format, but will no longer accommodate the old format.
The three employees quickly delegate responsibilities to incorporate the necessary changes to the codebase:
Jane will write the new piece of code required to accommodate the new data format
John will write automated tests that verify the correctness of Jane’s code
Janice will update the documentation to describe the data format changes.
The team has been following good version control practices, so the main branch of their GCCode repository is up to date and correctly implements the analysis required to produce the previous months’ reports.
Jane, John and Janice begin by pulling from the GCCode repository to make sure each of their local repositories is up to date. Once this step is done, they each check out a new branch from the main branch. Since the team is small, they choose to omit much of the overhead presented in a successful branching model and simply check out their own branches directly from the main branch.
The three go about their work on their local workstations, committing their changes as they go while following good commit practices. By the end of the business day, they push their branches to the remote repository. At this point, the remote repository has three new branches that are each several commits ahead of the main branch. Each of the three assigns another to be their peer reviewer, and the next day the team approves changes and merges each member’s branch to main.
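The workflow just described can be sketched with a handful of Git commands. The demo below runs in a throwaway repository so it is self-contained; in practice these commands would run in the team's existing clone, and the branch, file and author names here are purely illustrative.

```shell
# Set up a throwaway repository standing in for the team's existing clone
cd "$(mktemp -d)"
git init -q
git checkout -qb main
git config user.email "jane@example.ca"
git config user.name "Jane"
echo "print('preprocessing')" > preprocess.py
git add . && git commit -qm "Initial analysis code"

# 1. Make sure the local main branch is up to date
#    (with a real remote: git pull origin main)
# 2. Create a feature branch directly off main
git checkout -qb new-data-format

# 3. Edit files, then commit in small, well-described steps
echo "# handle the new input format" >> preprocess.py
git add preprocess.py
git commit -qm "Support new input data format in preprocessing"

# 4. With a real remote, publish the branch for peer review:
#    git push -u origin new-data-format
git log --oneline main..new-data-format
```

The final `git log` shows the commits the feature branch has added ahead of main, which is exactly what a peer reviewer would examine in the resulting pull request.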
On the day that the report must be generated, they run their new code and successfully generate and send the report for their senior management using the new data.
Later that day, they receive an urgent request asking them to reproduce the previous three months’ reports for audit purposes. Given that the code has changed to accommodate the new data format, the current code is no longer compatible with the previous datasets.
Git to the rescue!
Fortunately, the team is using Git to manage their codebase. This means they can check out the commit made just before the format change, temporarily reverting the working folder to its earlier state. With the old code restored, they can retroactively produce the three reports using the previous three months' data. Finally, they check out the most recent commit of the main branch again, so that the new codebase accommodating the format change is used going forward.
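This rescue can be sketched in a few commands. The demo below builds a tiny throwaway repository with an "old format" commit and a "new format" commit so that it is self-contained; the file contents and commit messages are illustrative only.

```shell
# Throwaway repository with one commit before and one after the change
cd "$(mktemp -d)"
git init -q
git checkout -qb main
git config user.email "jane@example.ca"
git config user.name "Jane"
echo "old format" > preprocess.py
git add . && git commit -qm "Old-format pipeline"
old=$(git rev-parse HEAD)       # the commit just before the format change
echo "new format" > preprocess.py
git add . && git commit -qm "New-format pipeline"

# Temporarily restore the working folder to the pre-change state
# (in practice, find the hash with: git log --oneline)
git checkout -q "$old"
cat preprocess.py               # old code: rerun the past reports here

# ...then return to the latest code on main for future production runs
git checkout -q main
cat preprocess.py               # new code, back in place
```

Checking out an old commit never destroys later work: the new commits stay on main, and a single `git checkout main` returns the folder to the current state.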
Even though the team described above is performing an analytics workflow, they were able to leverage Git to prevent a situation that otherwise may have been very inconvenient and time-consuming.
Learn more about Git
Would your work benefit from using the practices described above? Are you unfamiliar with Git? Here are a few resources to get you started:
The first half of IBM’s How Does Git Work provides a mental model for how Git works, and introduces many of the technical terms of Git and how they relate to that model.
This article about a successful git branching model provides a guide on how to perform collaborative workflows using a branching model and a framework that can be adjusted to suit particular needs.
The Git book provides a very detailed review of the mechanics of how Git works. It is broken down by section, so you can review whichever portion(s) are most relevant to your current use case.
What’s next?
Applying version control to one’s source code is just one of many computer science-inspired practices that can be applied to analytics and data science workflows.
In addition to versioning source code, many data science and analytics professionals may find themselves benefiting from data versioning (see Data Version Control for an implementation of this concept) or model versioning (e.g. see MLFlow model versioning).
These resources are a great place to start as you begin to discover how complexity management practices from computer science can be used to improve data science and analytics workflows!
The Data Science Division (DScD) at Statistics Canada recently completed a research project for the Field Crop Reporting Series (FCRS) Footnote 1 on the use of machine learning techniques (more precisely, supervised regression techniques) for early-season crop yield prediction.
The project objective was to investigate whether machine learning techniques could be used to improve the precision of the existing crop yield prediction method (referred to as the Baseline method).
The project faced two key challenges: (1) how to incorporate any prediction technique (machine learning or otherwise) into the FCRS production environment in a methodologically sound way, and (2) how to evaluate any prediction method meaningfully within the FCRS production context.
For (1), the rolling window forward validation Footnote 2 protocol (originally designed for supervised learning on time series data) was adapted to safeguard against temporal information leakage. For (2), the team opted to perform testing by examining the actual series of prediction errors that would have resulted had the method been deployed in past production cycles.
Motivation
Traditionally, the FCRS publishes annual crop yield estimates at the end of each reference year (shortly after harvest). In addition, full-year crop yield predictions are published several times during the reference year. Farms are contacted in March, June, July, September and November for data collection, resulting in a heavy response burden for farm operators.
In 2019, for the province of Manitoba, a model-based method—essentially, variable selection via LASSO (Least Absolute Shrinkage and Selection Operator), followed by robust linear regression—was introduced to generate the July predictions based on longitudinal satellite observations of local vegetation levels as well as region-level weather measurements. This allowed the removal of the question about crop yield prediction from the Manitoba FCRS July questionnaire, reducing the response burden.
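The two-stage model-based method described above can be sketched as follows. This is an illustrative sketch only, not the production code: the data are synthetic, and scikit-learn's LassoCV and HuberRegressor stand in for whatever LASSO implementation and robust linear estimator the program actually uses.

```python
import numpy as np
from sklearn.linear_model import LassoCV, HuberRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the real inputs: columns could be longitudinal
# satellite vegetation indices and region-level weather measurements.
n, p = 200, 20
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:3] = [2.0, -1.5, 1.0]        # only a few informative features
y = X @ true_coef + rng.normal(scale=0.5, size=n)

# Stage 1: variable selection via LASSO, with a cross-validated penalty.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)

# Stage 2: robust linear regression on the selected variables only,
# so outlying parcels have limited influence on the fitted coefficients.
robust = HuberRegressor().fit(X[:, selected], y)
predictions = robust.predict(X[:, selected])
```

Splitting selection and estimation this way lets LASSO prune the many candidate predictors while the final coefficients come from an estimator that is not biased by the shrinkage penalty.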
Core regression technique: XGBoost with linear base learner
A number of prediction techniques were examined, including: random forests, support vector machines, elastic-net regularized generalized linear models, and multilayer perceptrons. Accuracy and computation time considerations led us to focus attention on XGBoost Footnote 3 with linear base learner.
Rolling Window Forward Validation to prevent temporal information leakage
The main contribution of the research project is the adaptation of rolling window forward validation (RWFV) Footnote 2 as hyperparameter tuning protocol. RWFV is a special case of forward validation Footnote 2, a family of validation protocols designed to prevent temporal information leakage for supervised learning based on time series data.
Suppose you are training a prediction model for deployment in production cycle 2021. The following schematic illustrates a rolling window forward validation scheme with a training window of five years and a validation window of three years.
The blue box at the bottom represents production cycle 2021, and the five white boxes to its left correspond to the five-year training window. This means that the training data for production cycle 2021 will be those from the five years strictly and immediately prior (2016 to 2020). For validation (that is, hyperparameter tuning) for production cycle 2021, the three black boxes above the blue box correspond to the choice of a three-year validation window.
The RWFV protocol is used to choose the optimal configuration from the hyperparameter search space, as follows:
Fix temporarily an arbitrary candidate hyperparameter configuration from the search space.
Use that configuration to train a model for validation year 2020, using data from the five immediately preceding years: 2015 to 2019.
Use that resulting trained model to make predictions for the validation year 2020. Compute accordingly the parcel-level prediction errors for 2020.
Aggregate the parcel-level prediction errors down to an appropriate single numeric performance metric.
Repeat for the two other validation years (2018 and 2019).
Average the performance metrics across the validation years 2018, 2019 and 2020; the result is a single numeric performance metric (validation error) for the temporarily fixed hyperparameter configuration.
This procedure is repeated for every candidate hyperparameter configuration in the search space, and the configuration actually deployed in production is the one that yields the best aggregated performance metric. This is rolling window forward validation, or more precisely, our adaptation of it to the crop yield prediction context.
Note that the above protocol respects the operational constraint that, for production cycle 2021, the trained prediction model must have been trained and validated on data from strictly preceding years; in other words, the protocol prevents temporal information leakage.
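The validation loop in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: `train_and_predict` is a hypothetical callback standing in for whatever fitting and error-aggregation routine is used for a given hyperparameter configuration.

```python
def rwfv_score(train_and_predict, production_year,
               train_window=5, validation_window=3):
    """Rolling window forward validation score for one hyperparameter
    configuration (illustrative sketch).

    train_and_predict(train_years, validation_year) must fit a model on
    the listed training years and return a single aggregated prediction
    error for the validation year.
    """
    errors = []
    # Validation years strictly precede the production year,
    # e.g. 2018, 2019 and 2020 for production cycle 2021.
    for val_year in range(production_year - validation_window, production_year):
        # e.g. 2015 to 2019 when validating on 2020 with a 5-year window;
        # only years strictly before val_year are ever used for training,
        # which is what prevents temporal information leakage.
        train_years = list(range(val_year - train_window, val_year))
        errors.append(train_and_predict(train_years, val_year))
    # One number per configuration: the average validation error.
    return sum(errors) / len(errors)
```

In the full protocol this score is computed for every candidate configuration in the search space, and the best-scoring configuration is the one deployed in production.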
Production-pertinent testing via prediction error series from virtual production cycles
To evaluate—in a way most pertinent to the production context of the FCRS—the performance of the aforementioned prediction strategy based on XGBoost(Linear) and RWFV, the data scientists computed the series of prediction errors that would have resulted had the strategy actually been deployed for past production cycles. In other words, these prediction errors of virtual past production cycles were regarded as estimates of the generalization error within the statistical production context of the FCRS.
The following schematic illustrates the prediction error series of the virtual production cycles:
Now repeat, for each past virtual production cycle (represented by an orange box), what was just described for the blue box. The difference is the following: for the blue box, namely the current production cycle, it is NOT yet possible to compute the prediction errors at the time of crop yield prediction (in July), since the current growing season has not ended. For the past virtual production cycles (the orange boxes), however, it is.
These prediction errors in virtual past production cycles can be illustrated in the following plot:
The red line illustrates the Baseline model prediction errors, while the orange line illustrates the XGBoost/RWFV strategy prediction errors. The gray lines illustrate the prediction errors for each of the candidate hyperparameter configurations in our chosen search grid (which contains 196 configurations).
The XGBoost/RWFV prediction strategy exhibited smaller prediction errors than the Baseline method, consistently over consecutive historical production runs.
Currently, the proposed strategy is in the final pre-production testing phase, to be jointly conducted by subject matter experts and the agricultural program’s methodologists.
The importance of evaluating protocols
The team chose not to use a more familiar validation method such as hold-out or cross validation, nor a generic generalization error estimate such as prediction error on a testing data set kept aside at the beginning.
These decisions were taken based on our determination that our proposed validation protocol and choice of generalization error estimates (RWFV and virtual production cycle prediction error series, respectively) would be much more relevant and appropriate given the production context of the FCRS.
Methodologists and machine learning practitioners are encouraged to evaluate carefully whether generic validation protocols or evaluation metrics are indeed appropriate for their use cases at hand, and if not, seek alternatives that are more relevant and meaningful within the given context. For more information about this project, please email statcan.dsnfps-rsdfpf.statcan@statcan.gc.ca.
The Annual Survey of Manufacturing and Logging Industries (ASML) measures both revenues from goods manufactured and total revenues. When comparing to the sales of goods manufactured variable from the Monthly Survey of Manufacturing (MSM), users should use the first concept, revenues from goods manufactured from the ASML. Total revenues published from the ASML measure a broader concept, as they include revenues from activities other than manufacturing: for example, revenues from goods purchased for resale, as well as investment and interest revenues. Total revenues from the ASML therefore cannot be compared to sales of goods manufactured published by the MSM.
The two surveys answer different user needs. The Monthly Survey of Manufacturing is built to provide an indicator on the state of the manufacturing sector and track monthly changes, i.e. provide information on the trend, while the Annual Survey of Manufacturing and Logging Industries is built to paint a detailed picture on the total dollar values of the industries, i.e. to provide information on the levels.
In order to provide information on trend that is not altered by changes in the sample, the sample of the Monthly Survey of Manufacturing is redrawn every five years, while the sample of the Annual Survey of Manufacturing and Logging Industries is renewed every year.
Both surveys are subject to revisions; however, the two surveys will not produce identical results, mainly because of methodological differences. For example, there are differences in sampling strategies (as described above), in reporting periods (some respondents report to the annual survey for a fiscal year that differs from the January-to-December calendar year), in auxiliary data sources (MSM uses GST data, while ASML uses T2 tax data for imputation and calibration), and in imputation methods (for a particular record, MSM may use historical imputation while ASML uses a donor, or vice versa).
For more information on data sources and methodology please visit the following links:
Inter-city indexes of price differentials of consumer goods and services show estimates of price differences between 15 Canadian cities in all provinces and territories, as of October 2019. These estimates are based on a selection of products (goods and services) purchased by consumers in each of the 15 cities.
In order to produce optimal inter-city indexes, product comparisons were initially made by pairing cities that are in close geographic proximity. The resulting price level comparisons were then extended to include comparisons between all of the cities, using a chaining procedure. The following initial pairings were used:
St. John's, Newfoundland and Labrador – Halifax, Nova Scotia
Charlottetown-Summerside, Prince Edward Island – Halifax, Nova Scotia
Saint John, New Brunswick – Halifax, Nova Scotia
Halifax, Nova Scotia – Ottawa, Ontario
Montréal, Quebec – Toronto, Ontario
Ottawa, Ontario – Toronto, Ontario
Toronto, Ontario – Winnipeg, Manitoba
Regina, Saskatchewan – Winnipeg, Manitoba
Edmonton, Alberta – Winnipeg, Manitoba
Vancouver, British Columbia – Edmonton, Alberta
Calgary, Alberta – Edmonton, Alberta
Whitehorse, Yukon – Edmonton, Alberta
Yellowknife, Northwest Territories – Edmonton, Alberta
Iqaluit, Nunavut – Yellowknife, Northwest Territories
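The chaining procedure mentioned above can be illustrated with a small sketch: direct price relatives exist only for the geographically close pairs, and any other city-to-city comparison is obtained by multiplying relatives along a chain of pairs. The numeric relatives below are made up purely for illustration and are not published figures.

```python
# Direct price relatives for paired cities (made-up numbers).
# ("A", "B"): r means prices in A are r times prices in B.
price_relative = {
    ("St. John's", "Halifax"): 1.04,
    ("Halifax", "Ottawa"): 0.97,
    ("Ottawa", "Toronto"): 0.95,
}

def chained_relative(chain):
    """Multiply direct relatives along a chain of paired cities."""
    result = 1.0
    for a, b in zip(chain, chain[1:]):
        if (a, b) in price_relative:
            result *= price_relative[(a, b)]
        else:
            # The pair is known in the opposite direction, so invert it.
            result /= price_relative[(b, a)]
    return result

# St. John's relative to Toronto, chained through Halifax and Ottawa:
st_johns_vs_toronto = chained_relative(
    ["St. John's", "Halifax", "Ottawa", "Toronto"])
```

Because every city is connected to every other through some sequence of pairings, this yields a complete set of comparisons among all 15 cities from a much smaller set of directly matched pairs.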
Reliable inter-city price comparisons require that the selected products be very similar across cities. This ensures that the variation in index levels between cities is due to pure price differences and not to differences in the attributes of the products, such as size and/or quality.
Within each city pair, product price quotes were matched on the basis of detailed descriptions. Whenever possible, products were matched by brand, quantity and with some regard for the comparability of retail outlets from which they were selected.
Additionally, the target prices for this study are final prices and as such, include all sales taxes and levies applied to consumer products within a city. This can be an important source of variation when explaining differences in inter-city price levels.
It should be noted that price data for the inter-city indexes are drawn from the sample of monthly price data collected for the Consumer Price Index (CPI). Given that the CPI sample is optimized to produce accurate price comparisons through time, and not across regions, the number of matched price quotes between cities can be small. It should also be noted that, especially in periods when prices are highly volatile, the timing of the product price comparison can significantly affect city-to-city price relationships.
The weights used to aggregate the food indexes within a city are based on the combined consumption expenditures of households living in the 15 cities tracked; as such, one set of weights is used for all 15 cities for the food indexes. Because Iqaluit has only the food major component index and its selected sub-groups published, the weights used to aggregate the non-food product indexes within a city are based on the combined consumption expenditures of households living in the other 14 cities tracked. Currently, 2017 expenditures are used to derive the weights; these expenditures are expressed in October 2019 prices.
The inter-city index for a particular city is compared to the weighted average of all 15 cities, which is equal to 100. For example, an index value of 102 for a particular city means that prices for the measured commodities are 2% higher than the weighted, combined city average.
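The normalization described above can be shown with a toy calculation. The three cities, expenditure shares and raw price levels below are invented solely to illustrate the arithmetic.

```python
# Made-up expenditure shares (summing to 1) and raw city price levels.
weights = {"CityA": 0.5, "CityB": 0.3, "CityC": 0.2}
price_levels = {"CityA": 98.0, "CityB": 105.0, "CityC": 99.0}

# Rescale so the weighted all-cities average equals 100.
weighted_avg = sum(weights[c] * price_levels[c] for c in weights)
index = {c: 100 * price_levels[c] / weighted_avg for c in price_levels}

# By construction, the weighted average of the published indexes is 100;
# a city at 102 would have prices 2% above the combined-city average.
check = sum(weights[c] * index[c] for c in weights)
```

Here `check` comes out to exactly 100 regardless of the raw levels chosen, which is what makes each city's index directly readable as a percentage deviation from the combined-city average.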
These estimates should not be interpreted as a measure of differences in the cost of living between cities. The indexes provide price comparisons for a selection of products only, and are not meant to give an exhaustive comparison of all goods and services purchased by consumers. Additionally, the shelter price concept used for these indexes is not conducive to making cost-of-living type comparisons between cities (see below).
Additional Information on Shelter
Shelter prices were absent from the inter-city index program prior to 1999 because of methodological and conceptual issues associated with their measurement. The diverse nature of shelter means that accurate matches between cities are often difficult to make.
To account for some of these difficulties, a rental equivalence approach is used to construct the inter-city price indexes for owned accommodation. Such an approach uses market rents as an approximation to the cost of the shelter services consumed by homeowners. It is important to note that this approach may not be suitable for the needs of all users. For instance, since the rental equivalence approach does not represent an out-of-pocket expenditure, the indexes should not be used for measuring differences in the purchasing power of homeowners across cities.
The relatively small size of the housing market in Whitehorse and Yellowknife makes it difficult to construct reliable price indexes for rented accommodation and owned accommodation. To compensate, housing information is collected using different pricing frequencies and collection methods than in the rest of the country. Consequently, users should exercise caution when using the indexes for rented accommodation and owned accommodation for these two cities.