All times listed in the schedule are in Eastern Daylight Time (EDT), UTC-4
Wednesday October 30, 2024
08:45 – 09:00
Opening Remarks
Simon Goldberg Room
- Éric Rancourt, Assistant Chief Statistician, Strategic Data Management, Methods and Analysis Field, Statistics Canada, Canada
09:00 – 10:00
Session 1 – Keynote Address
Simon Goldberg Room
- The evolving role of National Statistical Institutes – Challenges and Opportunities
Pádraig Dalton, Former Director General, Central Statistics Office, Ireland-
Abstract
The demand for data continues to grow unabated. The range of subject matter areas on which evidence and insight are required continues to expand, while users also want more timely, more frequent and more granular data. All of this is happening in a context where National Statistical Institutes (NSIs) are just one part of an ever-growing data market, and where fake news and alternative facts present significant risks to decision makers and indeed the general public. National Statistical Institutes must step forward, play a strong leadership role, establish their point of difference from other, unregulated data providers, and create an understanding of the value-add that can and must be delivered from within the Official Statistical System. To meet the growing range of demands in this challenging environment, NSIs will need to be agile, adaptive and innovative while adhering to our fundamental principles and core values. We will also need to exploit all of the opportunities offered by advancements in technology, methodology and the ongoing emergence of new data sources. This presentation will look at the evolving context within which NSIs now find themselves and consider the associated opportunities and challenges this presents.
-
10:00 – 10:30
Morning Break
10:30 – 12:00
Session 2A – Balancing Disclosure Risk and Analytic Utility with Synthetic Data
Simon Goldberg Room
- Inference from Synthetic Data: Challenges and Solutions
Anne-Sophie Charest, Université Laval, Canada-
Abstract
As concerns over privacy and data security mount, the generation of synthetic data emerges as a promising avenue for mitigating disclosure risks associated with personal data. But while there is a rapidly growing body of literature on methods to generate such synthetic datasets, the focus has predominantly been on generating more complex and accurate data, rather than the implications of synthesis for downstream analysis.
This presentation aims to address this gap by providing an overview of current research efforts in the field, including some of my own contributions. Two main approaches to obtain accurate uncertainties will be discussed: using combining rules with multiple synthetic datasets as is sometimes done with missing data, and leveraging direct knowledge of the synthesis methodology, which is available for example in the case of differential privacy. We will discuss the advantages and limitations of both and offer some thoughts on practical implications.
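To make the first approach concrete, here is a minimal sketch (not taken from the presentation itself) of one commonly cited set of combining rules for partially synthetic data (Reiter, 2003), applied to point estimates q and variance estimates u computed on each of m synthetic datasets; other variants, such as the rules for fully synthetic data, differ in how the between-synthesis variance enters the total.

    # Illustrative sketch of combining rules for m partially synthetic datasets
    # (after Reiter, 2003); q[l] is the point estimate and u[l] its estimated
    # variance computed on the l-th synthetic dataset. Assumption: these are
    # the rules being alluded to; the talk may discuss other variants.
    import numpy as np

    def combine_partially_synthetic(q, u):
        q, u = np.asarray(q, float), np.asarray(u, float)
        m = len(q)
        q_bar = q.mean()                      # combined point estimate
        b_m = q.var(ddof=1)                   # between-synthesis variance
        u_bar = u.mean()                      # average within-synthesis variance
        t_p = u_bar + b_m / m                 # total variance for partial synthesis
        return q_bar, t_p

    # Example: estimates of a mean from m = 10 synthetic datasets
    q_bar, t_p = combine_partially_synthetic(
        q=[4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.1],
        u=[0.04] * 10)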
-
- Generating and Analyzing Synthetic Data
Khaled El Emam, University of Ottawa, Canada-
Abstract
Synthetic data can be a privacy-protective way of sharing microdata and are increasingly seen as a more robust form of non-personal information. This presentation will summarize the learnings from a series of real-data and simulation studies on the generation and statistical analysis of synthetic data: (a) common metrics to assess the privacy risks of synthetic data are attribution and membership disclosure, (b) valid statistical analysis requires the generation of 10 synthetic datasets and the use of combining rules to obtain parameter estimates and standard errors, (c) data amplification has marginal benefit, (d) the performance of generative models varies with the number of variables in the training dataset, (e) when and how to augment microdata to maximize model performance, and (f) using synthetic data generation to mitigate bias in real-world data.
These results will be illustrated on health datasets across multiple types of generative models: sequential synthesis with decision trees, Bayesian networks, generative adversarial networks, and variational autoencoders.
-
- Generating Select Synthetic Survey Data
Minsun Riddles, Westat, USA-
Abstract
As the demand for microdata access grows alongside privacy concerns, interest in synthetic data has surged. For example, synthetic data are recognized as offering a promising solution for sharing vast amounts of health data to accelerate research while safeguarding privacy. However, generating synthetic data presents challenges, particularly for survey data, in balancing the reduction of disclosure risk with the retention of the original data's integrity. One approach to addressing these challenges is the 'select' data synthesis approach, which involves synthesizing select variables of select records with high disclosure risks. In this paper, we address these challenges and propose solutions for generating select synthetic data in two large-scale U.S. national surveys. Additionally, we introduce a replication variance estimation method to appropriately measure the additional variation introduced by data synthesis.
-
10:30 – 12:00
Session 2B – Modern Approaches for Different Data Sources
Jean Talon Conference Room
- Low response rate from merchants? No problem, just ask consumers! An application of indirect sampling to consumer payment diary data
Joy Wu, Heng Chen, Bank of Canada, Canada-
Abstract
Merchant surveys traditionally have very low response rates. An alternative is to leverage consumer surveys, but this raises two issues: the representativeness of consumers and the accuracy of their responses. This paper addresses both issues by applying indirect sampling and survey weight calibration. When survey practitioners cannot sample directly from a target population because the sampling frame is difficult to compile or response rates are very low, indirect sampling can help by leveraging the links between the frame population and the target population. We apply this approach to generate indirect estimates of Canadian merchants' cash, debit, and credit acceptance (the target population) from consumer survey data (the frame population), which contain transaction-level details recorded in the diary section of the survey. Not only do we leverage the consumer-merchant transaction data to construct merchant weights through the generalized weight share method (GWSM), but we also construct the payment acceptance variables of these linked merchants from both consumers' revealed and stated payment choices. We do so without needing to interact with merchants, overcoming the challenge of traditionally low response rates. We also innovate on the existing GWSM methodology by using an external data set to estimate the total number of visits received by a merchant (a key component of the GWSM), which is underestimated in the consumer data due to its structure. Lastly, our GWSM weights are calibrated so that our final indirect merchant sample is representative of the merchant population. Our results show that the direct and indirect estimates of merchant payment acceptance are very similar.
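As a rough illustration of the weight-share idea behind the GWSM (a sketch under simplifying assumptions, not the authors' implementation; the transactions, weights and total-visit figures are hypothetical, with the latter standing in for the external data set mentioned above):

    # Minimal sketch of the generalized weight share method (GWSM) for indirect
    # sampling: consumer (frame) weights are shared with the merchants (target
    # units) they are linked to through diary transactions.
    from collections import defaultdict

    def gwsm_merchant_weights(transactions, consumer_weights, total_visits):
        """transactions: list of (consumer_id, merchant_id) links observed in diaries
        consumer_weights: dict consumer_id -> survey weight
        total_visits: dict merchant_id -> estimated total visits L_k (external data)
        """
        w = defaultdict(float)
        for consumer_id, merchant_id in transactions:
            # each observed link passes on a share w_j / L_k of the consumer weight
            w[merchant_id] += consumer_weights[consumer_id] / total_visits[merchant_id]
        return dict(w)

    # Toy example with two merchants and three diary transactions
    weights = gwsm_merchant_weights(
        transactions=[("c1", "m1"), ("c2", "m1"), ("c2", "m2")],
        consumer_weights={"c1": 1200.0, "c2": 900.0},
        total_visits={"m1": 5000.0, "m2": 2000.0})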
-
- The Field Data Collector Labor Force: Considerations for the Future of In-Person Data Collection
Brad Edwards, Rick Dulaney, Jill Carle, Tammy Cook, Westat, USA-
Abstract
In-person data collection is critical for the success of many large government-sponsored surveys. Despite response rate declines and increasing costs, the mode remains the gold standard for meeting the most rigorous survey requirements for national survey programs, particularly as part of a multimode data collection strategy (Schober 2018). However, over the last ten years, critical labor market and workforce changes, exacerbated by the pandemic, have made in-person data collection efforts prohibitive for many of the largest survey organizations. New ideas about job flexibility and job satisfaction, coupled with the increasingly technical role and demanding nature of the job, have impacted recruitment and retention for survey organizations across the U.S. and Europe (Charman et al 2024; Carle et al 2023). Indeed, some European countries have abandoned in-person data collection altogether for lack of work to sustain a field data collector (FDC) labor force.
This presentation will summarize trends in U.S. FDC employment over the past decade and outline key challenges in recruiting and retaining high quality FDCs. Through surveys of current and exiting FDCs and a unique administrative dataset of over 27,000 FDCs across more than 80 large survey projects, we will highlight key takeaways from an ongoing research program, including the impact of changing demographics on FDC retention, and how training mode efficacy differs by FDC experience. We will discuss considerations for the future of in-person data collection, including supplementing field data collection with multimode alternatives such as video interviewing, professionalizing the FDC role, and updating value propositions for respondents.
-
- Constructing a synthetic population to evaluate redesign options through simulation, for a rotating panel survey
Pauline Summers, Andrew Brennan, Statistics Canada, Canada-
Abstract
The methodology of Canada's Labour Force Survey (LFS) undergoes a thorough review every ten years. For the current review, we developed a simulation system that reproduces the complex LFS survey process, from sampling to estimation, for investigating alternative methodologies or identifying places for improvement. The complexity of the LFS – its rotating panel design, regression composite estimator, and other features – made the simulation system both challenging to construct and extremely useful for understanding how the various components interact. Since its development, it has proved to be an indispensable tool, and has helped us generate knowledge specific to the LFS that would otherwise have been very difficult or impossible to obtain.
This presentation describes the methodology we used to construct the synthetic population that is the basis for those simulations. We faced a series of unique modelling challenges, juggling requirements with respect to variable specification, cross-sectional consistency, and longitudinal consistency, to support simulations of the complex LFS survey process. Our solution is a "rotating panel population," modelled over a six-year period using a combination of cross-sectional and longitudinal modelling techniques.
The synthetic population consists of six sets of parallel "clones" of a base population, where each "clone" unit is longitudinally modelled for six months, and then reset; this produces the six-month series required for each simulated LFS respondent, while mitigating the drift that arises from projecting a population over an extended period. We developed an innovative methodological procedure to generate the data in multiple stages, using several different statistical tools and techniques, and implemented it in R.
-
- Improving the automated capture of Survey of Household Spending receipts using advanced machine learning techniques
Joanne Yoon, Oladayo Ogunnioki, Statistics Canada-
Abstract
The Survey of Household Spending (SHS) conducted by Statistics Canada collects paper diaries and shopping receipts as a source of household expenditure data. An auto-capturing algorithm was created for SHS 2023 to reduce statistical clerks' manual work of extracting important information from scanned receipts of common store brands. The algorithm used Tesseract optical character recognition (OCR) to extract text characters from images of receipts, and it identified store and product entities using regular expressions, also known as regex. The goal of this study was to enhance the current auto-capture algorithm by experimenting with more advanced OCR and machine learning methods. As a result, PaddleOCR, an open-source OCR toolkit, was selected as the new default OCR engine due to its overall performance in recognizing texts, especially digits, accurately across receipts of various qualities. Additionally, entity classifiers based on support vector machines were trained on historical SHS records and existing regex patterns. By using classifiers to categorize different elements present on receipts instead of relying solely on regex patterns, product and store recognition improved. It is expected that this new algorithm will be used for SHS 2025 to improve the auto-capture quality and reduce the manual burden associated with capturing receipt variables.
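A hedged sketch of the classifier component described above (illustrative only: the training lines, labels, and model settings are made up, and the OCR step that produces the text lines, e.g. PaddleOCR, is assumed to have already run):

    # Illustrative sketch only: classify OCR'd receipt lines into entities
    # (e.g., store, product, total, other) with a linear SVM, as a complement
    # to regex patterns. Training lines and labels here are hypothetical.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train_lines = ["SUPERMART #104", "2L MILK 4.99", "TOTAL 23.47", "THANK YOU"]
    train_labels = ["store", "product", "total", "other"]

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams are robust to OCR noise
        LinearSVC())
    clf.fit(train_lines, train_labels)

    # New lines coming from the OCR engine would then be categorized as:
    print(clf.predict(["BREAD WHT 2.49", "SUBTOTAL 21.98"]))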
-
- Using LLMs for Automating the Analysis of Alternative Data for the Physical Flow Account of Plastic Materials
Alexandre Istrate, Oladayo Ogunnioki, Statistics Canada-
Abstract
The Physical Flow Account for Plastic Material (PFAPM) pilot program aims to track the flow of plastic materials throughout the Canadian economy. Within this program, analysts rely heavily on alternative data sources consisting of a diverse array of annual reports sourced from various companies and organizations. These reports are essential for conducting thorough research and validating activities pertinent to the account.
Due to their unstructured format and diversity, analyzing annual reports from various companies and organizations is a laborious and inefficient process, requiring significant time and effort from analysts. To address this, the project leverages advanced natural language processing (NLP) techniques, notably Large Language Models (LLMs), to automate two key objectives: sector classification and COVID-19 impact summarization.
The goal was to develop an algorithmic pipeline that can ingest PDF documents and, based on their content, classify companies into distinct sectors (Residential, Commercial, Institutional, Industrial, and Construction) and summarize the impact of the COVID-19 pandemic on plastic-related activities such as collection and recycling rates, logistical disruptions, etc.
By automating these tasks, the project seeks to enhance the efficiency of data extraction, reduce manual workload, and improve the quality of insights derived from document analysis. The ultimate objective is to contribute to the advancement of environmental-economic research and inform sustainable plastic resource management strategies. By harnessing the power of LLMs, the project aims to unlock the potential of alternative data sources, providing more accurate and effective insights to shape environmental-economic policy and decision-making processes.
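A hypothetical sketch of the two LLM tasks in such a pipeline (the call_llm helper is a placeholder rather than an actual API, and the prompts and truncation limit are illustrative; in practice the report text would first be extracted from the PDF):

    # Hypothetical sketch of the two LLM tasks described above; `call_llm` is a
    # placeholder for whatever model endpoint is used, not an actual API.
    SECTORS = ["Residential", "Commercial", "Institutional", "Industrial", "Construction"]

    def call_llm(prompt: str) -> str:
        """Placeholder: send `prompt` to a large language model and return its reply."""
        raise NotImplementedError("wire this to the LLM service in use")

    def classify_sector(report_text: str) -> str:
        prompt = (
            "Classify the company described below into exactly one of these sectors: "
            + ", ".join(SECTORS) + ". Reply with the sector name only.\n\n"
            + report_text[:4000])              # keep the prompt within context limits
        return call_llm(prompt).strip()

    def summarize_covid_impact(report_text: str) -> str:
        prompt = (
            "In three sentences, summarize how the COVID-19 pandemic affected this "
            "organization's plastic-related activities (collection, recycling rates, "
            "logistical disruptions).\n\n" + report_text[:4000])
        return call_llm(prompt)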
-
12:00 – 13:30
Lunch
13:30 – 15:00
Session 3A – Small Area Estimation: Extensions, Applications and New Developments sponsored by the International Association of Survey Statisticians (IASS)
Simon Goldberg Room
- Hierarchical Bayes small area estimation for county-level health prevalence to having a personal doctor
Andreea Erciulescu, Westat, USA-
Abstract
The complexity of survey data and the availability of data from auxiliary sources motivate researchers to explore estimation methods that extend beyond traditional survey-based estimation. The U.S. Centers for Disease Control and Prevention's Behavioral Risk Factor Surveillance System (BRFSS) collects a wide range of health information, including whether respondents have a personal doctor. While the BRFSS focuses on state-level estimation, there is demand for county-level estimation of health indicators using BRFSS data. A hierarchical Bayes small area estimation model is developed to combine county-level BRFSS survey data with county-level data from auxiliary sources, while accounting for various sources of error and nested geographical levels. To mitigate extreme proportions and unstable survey variances, a transformation is applied to the survey data. Model-based county-level predictions are constructed for the prevalence of having a personal doctor for all counties in the U.S., including those where BRFSS survey data were not available. An evaluation study comparing a model fit using only the counties with large BRFSS sample sizes to a model fit using all counties with BRFSS data is also presented.
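For orientation, an illustrative area-level formulation of this type of model (generic notation under standard assumptions, not necessarily the exact model presented) can be written as:

    % Illustrative Fay-Herriot-type area-level model for county c, after a
    % variance-stabilizing transformation of the survey proportion
    \hat{\theta}_c \mid \theta_c \sim N\!\left(\theta_c,\, \sigma^2_c\right), \quad \sigma^2_c \text{ treated as known (smoothed)}
    \theta_c = \mathbf{x}_c^{\top}\boldsymbol{\beta} + v_c, \qquad v_c \sim N(0, \sigma^2_v)
    \boldsymbol{\beta},\, \sigma^2_v \sim \text{weakly informative priors}

Under such a formulation, predictions for counties without BRFSS data come from the linking model alone and are back-transformed to the proportion scale.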
-
- New Twists on Old Tricks: Applications and Extensions of Traditional Small Area Models
Emily Berg, Iowa State University, USA-
Abstract
The mixed model is a widely accepted tool for small area estimation. The thesis of this talk is that this vetted approach remains well-suited to addressing the challenges that arise in modern small area estimation problems. As evidence of this, we review applications and extensions of area-level and unit-level small area models. We first discuss the use of multivariate Fay-Herriot models for small area estimation of crime victimization rates. This application uses data from the National Crime Victimization Survey and illustrates the value of area-level models in an application of current interest. We then review approaches to estimating nonlinear small area parameters, such as the median or the quartile, in the context of the unit-level model. In this setting, we discuss methods for informative sampling and extensions to exponential dispersion family models.
The review encompasses modest innovations to well-established procedures. The methods illustrate that the basic construct of the mixed model remains a useful tool for addressing small area problems of recent interest.
-
- Reverse-Engineering a Hypothetical Raking Process for the Estimation of Mean Squared Error of Raked Small Area Estimates
François Verret and Braedan Walker, Statistics Canada, Canada-
Abstract
Small area estimation (SAE) of total employment and unemployment rate is performed monthly for the Labour Force Survey (LFS) using the Fay-Herriot model. These estimates are required for a mapping of the ten provinces defined by Census Metropolitan Areas, Census Agglomerations and a complementary geography called Self-contained Labour Areas.
To protect against model inadequacies and for consistency, total employment small area estimates are raked to their provincial direct/published estimates, which are of good quality by design. To get an uncertainty estimate for the raked small area estimates, a parametric bootstrap procedure making the usual assumption of independence between area-level direct estimates yields an MSE for the provincial aggregate of the raked estimates that is considerably greater than the variance of the provincial direct estimate, especially for smaller provinces. This is because province-level LFS weight calibration introduces negative correlations between sub-provincial direct estimates. Ignoring this negative correlation thus produces MSE estimates for the raked estimates that are artificially inflated.
Instead, a parametric bootstrap is used with a working covariance matrix of the direct estimates obtained by smoothing the variance components (as in the independence case) and by reverse-engineering a hypothetical raking process to derive the covariance terms. Application of the resulting theory gives a variance of the raked SAE estimates that is consistent with that of the provincial direct estimates. The MSE estimates of the raked SAE estimates obtained are in turn reduced and more reasonable.
-
13:30 – 15:00
Session 3B – Evaluating and Improving Surveys
Jean Talon Conference Room
- Correcting Selection Bias in a Non-probability Two-phase Payment Survey
John Tsang, University of Ottawa, Canada
Heng Chen, Bank of Canada, Canada -
Abstract
This presentation extends the Pseudo Maximum Likelihood (PML) estimator to non-probability two-phase sampling by leveraging the probability sample at the individual level. Using the Bank of Canada 2020 Cash Alternative Survey Wave 2, we compare the performance of our proposed method to alternative methods, which either do not account for a two-phase sampling design or do not explicitly model the selection probability. The results show that the PML-based approach performs better than raking in reducing the selection bias for both phases' payment-related variables, especially for the low-response youth group. Furthermore, the two-phase PML weighting scheme, which accounts for both phases' selection mechanisms, has a smaller bias than Phase-2-only alternatives that ignore Phase 1's design and simply treat Phase 2 as a standalone single-phase design. The presentation will end with a discussion of variance estimation for both phases.
-
- Recruitment and Collection of Web Panels at Statistics Canada
Krista MacIsaac, Cilanne Boulet, Marnie Thomas, Statistics Canada-
Abstract
In 2020, Statistics Canada started to use probabilistic web panels as an alternate method of collecting official statistics. In a web panel, respondents to other surveys are asked for contact information to participate in future short surveys. This presentation will highlight Statistics Canada's experience with panels after four years, including what has been learned about the recruitment of panel participants and how to subsequently collect data using panel surveys. For instance, recruitment questions have been displayed in various ways, resulting in very different rates of participation. Moreover, the wealth of auxiliary information available from the recruitment survey can not only be incorporated into the creation of weights, but can also be used to actively manage collection operations by predicting the probability of response in order to target follow-up efforts.
-
- A Bias Evaluation for Probabilistic Web Panels at Statistics Canada
Anne Mather, Cilanne Boulet, Statistics Canada, Canada-
Abstract
Statistics Canada began to implement probabilistic web panels in 2020.
Participants for these panels are recruited via questions on other Statistics Canada surveys that ask those willing to participate to provide their contact information. Web panels are a timely, cost-effective collection method to meet emerging data needs. However, they have lower response rates than are typically observed with traditional survey methods. Despite the lower response rates, the recruitment surveys provide a lot of auxiliary information for both respondents and non-respondents to the panel, which can be used in non-response adjustments.
A study was carried out to explore the potential bias associated with these lower response rates and the degree to which it can be corrected during weighting. We will present the highlights of that study.
-
- Life in the FastText Lane: Harnessing Linear Programming Constrained Machine Learning for Classifications Revision
Justin Evans, Laura Wile, Statistics Canada, Canada-
Abstract
Statistics Canada's Labour Force Survey (LFS) plays an essential role in the estimation of labour market conditions in Canada. Periodically, the LFS revises its data to the most recent industry and occupational classification versions. Differences between versions can be extensive, including high-level and unit-group structural changes, creations, deletions, split-offs and combinations of classification units (classes). Historically, to reconcile split-off classes - where one class splits into multiple classes - a sample of LFS split-off records would be manually recoded to the new classification version. Based on the split-off proportions observed in the recoded sample, a random allocation method would be applied to all data to reflect the changing Canadian labour market over time. This article proposes using machine learning (fastText), constrained to the split-off proportions using linear programming, to revise industry and occupation classifications in the LFS.
The hybrid framework benefits from a text-based revision mechanism while adhering to traditional proportion-driven estimates, thus ensuring a minimal impact on the comparability of published labour market indicators.
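As a toy illustration of the constrained-allocation idea (assuming fastText has already scored each record against the candidate split-off classes; the scores and target counts below are made up), the allocation can be posed as a small transportation-type linear program:

    # Toy sketch: allocate records to split-off classes so that class totals
    # match target proportions while maximizing fastText-style scores
    # (linear programming relaxation of the assignment problem).
    import numpy as np
    from scipy.optimize import linprog

    scores = np.array([[0.9, 0.1],      # record 0: model strongly prefers class A
                       [0.6, 0.4],
                       [0.2, 0.8],
                       [0.5, 0.5]])     # 4 records, 2 candidate classes
    targets = np.array([2, 2])          # class counts implied by the recoded sample

    n, k = scores.shape
    c = -scores.ravel()                                  # maximize total score
    A_eq = np.zeros((n + k, n * k))
    for i in range(n):
        A_eq[i, i * k:(i + 1) * k] = 1                   # each record fully allocated
    for j in range(k):
        A_eq[n + j, j::k] = 1                            # class totals hit the targets
    b_eq = np.concatenate([np.ones(n), targets])

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    allocation = res.x.reshape(n, k).argmax(axis=1)      # dominant class per record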
-
- Data-Driven Imputation Strategies and their Associated Quality Indicators in Economic Surveys
Matei Mireuta, Ahalya Sivathayalan, Stephen Styles, Statistics Canada, Canada-
Abstract
The majority of economic surveys at Statistics Canada are processed in the Integrated Business Statistics Program (IBSP). The IBSP framework relies on the generalized system BANFF for editing and imputation (E&I) and on the generalized system G-EST for estimation and variance estimation. Currently, variance estimation is carried out analytically using the System for the Estimation of Variance due to Nonresponse and Imputation (SEVANI), which is part of G-EST.
Classical E&I strategies for economic surveys are typically based on composite linear imputation given the difference in availability of auxiliary data among units. However, this can lead to tens (and often more) of possible imputation models for a given variable which makes implementation, support and analysis of a survey's overall imputation strategy very difficult. As part of recent modernization initiatives within the agency, our team has investigated several machine learning alternatives and examined their application and quality of imputation in the context of economic surveys.
This presentation will cover some of these results as well as the challenges faced in the estimation of the variance due to imputation of these methods.
The authors will describe two approximate fit-for-use methods for the estimation of total variance and their advantages/disadvantages from the point of view of a harmonized processing framework such as IBSP.
-
15:00 – 15:30
Afternoon Break
15:30 – 17:00
Session 4 – Collection Initiatives in Challenging Situations
Simon Goldberg Room
- Collection of Social Data – designing and changing
Fiona O'Riordan, Central Statistics Office, Ireland -
Abstract
Social data is key to a properly functioning society. Our age, our sex, our educational attainment, whether we work, whether we have a warm coat, whether we have experienced discrimination: this is data that informs government, researchers, and citizens about our lives. Asked continually, it shows how we are changing and how our requirements are changing.
The task of collecting social data has changed considerably over the last 5-10 years. Field interviewers still have a role, but it is now time to supplement this mode with other ways of collecting these data. Innovation is key: understanding the respondent and designing campaigns that are efficient and, in some way, appealing is now necessary for success. The Office uses the concept of adaptive design, where the design of the survey is adapted to suit the requirements of the respondent, to ensure continued success in the collection of household data.
The Central Statistics Office (Ireland) is moving to a multi-mode environment. The Office has reviewed, and continues to review, its sample design for all surveys, and it is trying to augment and keep the sample frame as current as possible by using administrative data. Behavioural science, analysis of previous response behaviours and the use of panel surveys are some of the key projects currently in progress.
The Office is adapting to societal changes and societal requirements by using different tools and methods to continue to collect quality social data.
-
- The challenges of collecting data in remote locations: The example of data collection for the Québec Health Survey of High School Students in Nunavik – UVIKKAVUT QANUIPPAT?
Catherine Côté, Marcel Godbout, Institut de la statistique du Québec, Canada-
Abstract
Since 2010, the Institut de la statistique du Québec (ISQ) has been conducting the Québec Health Survey of High School Students (QHSHSS) every six years for the Ministère de la Santé et des Services sociaux (MSSS). There were editions in 2010/2011, in 2016/2017 and in 2022/2023. This survey collects data on the physical and mental health, lifestyle habits, and social adjustment of high school students in Québec's regions. These three editions of the survey did not cover Nunavik, a region in Québec's Far North that comprises 14 communities, because the health needs and realities are different from those in the southern regions of the province. In 2016, the Nunavik Regional Board of Health and Social Services (NRBHSS), the School Board of Nunavik (Kativik Ilisarniliriniq) and the MSSS indicated to the ISQ that they would like a survey similar to the QHSHSS to be conducted just for Nunavik. The ISQ therefore began planning this survey that same year. However, there were many challenges along the way and it was only conducted in 2022. Collection was done in all communities, and the high school students in each one were enumerated because of their low numbers. The QHSHSS questionnaire was used for this survey, but was largely adapted to reflect the particularities of the region. In this presentation of the survey, called Québec Health Survey of High School Students in Nunavik – UVIKKAVUT QANUIPPAT?, we will examine the various challenges encountered (geographic, logistic, context, etc.) during data collection and how they were overcome.
-
- Measuring urban Indigenous health using respondent driven sampling
Lisa Avery and Sara Wolfe, University Health Network (Toronto) and Aboriginal Health and Wellness Centre, Canada-
Abstract
Our Health Counts (OHC) uses respondent-driven sampling (RDS), which combines a peer-to-peer recruitment strategy with statistical methods to sample from populations that are difficult to reach because they lack a sampling frame. Embedding the sampling process within a community enables the collection of comprehensive socio-demographic data that might otherwise be underreported in a standard census by leveraging the strong social connections within a group. The survey process gathers information about social connectivity that, alone or in combination with information about the ties in the recruitment chain, is used to adjust for unequal sampling probability and homophily in the social network and obtain unbiased estimates of the population parameters. An initial set of 10-12 seeds from the target population is selected to complete the OHC survey; these seeds then refer family and friends, and the first wave of recruits is identified through these referrals. Referrals are provided with a unique coupon serial number and an honorarium, and their survey participation is facilitated. The second wave of participants is recruited from the first wave's referrals, and so on, until the target sample size has been reached. Child recruitment occurs through custodial caregivers who complete the survey. Every OHC project is Indigenous-led, and the data remain community-owned and governed. OHC successfully produces meaningful, culturally relevant health data for Indigenous adults and children and has completed its data collection phase in Winnipeg, Manitoba, led by the Aboriginal Health and Wellness Centre of Winnipeg. Key to the OHC approach are the principles of reciprocity, relationality, and local community self-determination.
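For readers unfamiliar with RDS estimation, one widely used estimator (the RDS-II or Volz-Heckathorn estimator, shown only as a generic illustration of how reported network size enters the weighting, not as the OHC methodology) weights each respondent inversely to their reported number of social connections:

    # Illustrative RDS-II (Volz-Heckathorn) style estimate: respondents reporting
    # larger personal networks are more likely to be recruited, so each response
    # is weighted by the inverse of the reported network size (degree).
    # Degrees and outcomes below are made up.
    def rds_ii_mean(y, degree):
        """y: outcome values; degree: self-reported network sizes (> 0)."""
        inv = [1.0 / d for d in degree]
        return sum(yi * wi for yi, wi in zip(y, inv)) / sum(inv)

    # Example: proportion with a regular health-care provider among 5 respondents
    estimate = rds_ii_mean(y=[1, 0, 1, 1, 0], degree=[10, 3, 25, 8, 5])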
-
Thursday October 31, 2024
9:00 – 10:00
Session 5 – Waksberg Award Winner Address
Simon Goldberg Room
- Waksberg Lecture 2024: Sample Design Using Models
Richard Valliant, Research Professor Emeritus, University of Michigan and Joint Program in Survey Methodology, University of Maryland, USA-
Abstract
Joseph Waksberg was an important figure in survey statistics mainly through his applied work in the design of samples. He took a design-based approach to sample design by emphasizing uses of randomization with the goal of creating estimators with good design-based properties. Since his time on the scene, advances have been made in the use of models to construct designs and in software to implement elaborate designs. This paper reviews uses of models in balanced sampling, cutoff samples, stratification using models, multistage sampling, and mathematical programming for determining sample sizes and allocations.
-
10:00 – 10:30
Morning Break
10:30 – 12:00
Session 6A – Recent Advances in Time Series Modelling
Simon Goldberg Room
- Multilevel time series model for mobility trends in the Netherlands
Harm Jan Boonstra, Maastricht University, Netherlands
Jan van den Brakel, Statistics Netherlands, Maastricht University, Netherlands-
Abstract
The purpose of the Dutch Travel Survey (DTS) is to produce reliable estimates on mobility of the Dutch population. In this paper a multilevel time-series model is proposed to estimate mobility trends at several aggregation levels. The method is developed to solve different problems. The sample size for many publication domains is so small that direct estimates for target parameters are very noisy and unreliable. The time series model is developed as a form of small area estimation to obtain more precise domain estimates and smoother trend series.
Another problem is the presence of systematic shocks in the sample estimates, which are the result of three major survey process redesigns. These shocks or discontinuities disturb comparability with figures published in the past. The time series model accounts for the discontinuities through an intervention component, such that uninterrupted series of trend estimates are obtained. Finally, the coronavirus pandemic had a major impact on mobility, which requires additional adjustments to the model.
The DTS is a multipurpose survey that produces many different output tables. Instead of developing separate models for each individual output table, one multivariate time series model is developed for a breakdown of the population parameter at the most detailed level into about 700 domains. This breakdown is based on the cross-classification of all output tables of interest. Predictions at higher aggregation levels are obtained by aggregating the predictions for these 700 domains. This results in a numerically consistent set of estimates for all target variables, corrected for the different discontinuities.
Modelling time series at the most detailed level requires random effect components to avoid overfitting of the time series, in particular the discontinuities. Additionally, non-normally distributed random effects, such as Laplace and Horseshoe distributions, are used as a regularization method to suppress noisy model coefficients while still allowing large effects that are sufficiently supported by the data. The model is developed in a hierarchical Bayesian framework and fitted with MCMC simulation. The method is implemented in production for the publication of official statistics on mobility.
-
- The impact of environmental disasters on Canadian personal debt
Cristina Agatep, Bank of Canada, Canada-
Abstract
Canada has been experiencing an increase in the occurrence of natural disasters in recent years, resulting in a growing interest in the effect that weather events have on the personal finances of its citizens. Tracking individual consumer debt incurs a high privacy cost. Instead, the distribution of consumer debt is tracked over time to balance the need for privacy with strong insights into the behaviour of the population.
The focus of this talk is on the impact of wildfires on debt distribution through a causal inference lens, where the evolution of densities across time is treated as functional data objects. A synthetic control model is used to account for the counterfactual evolution of debt distribution arising from the 2016 Fort McMurray wildfires.
-
- Privacy Mechanisms That Balance Utility for Time Series Data
Anindya Roy, University of Maryland Baltimore County and United States Census Bureau, USA-
Abstract
Ensuring privacy in released data is of paramount importance for data-producing agencies. Privacy and confidentiality issues in data collection and data release mechanisms have seen revolutionary changes in recent years. The procedures guaranteeing the desired privacy of released data mostly use noise addition to achieve privacy goals. While appealing and appropriate for general databases, noise addition for time series data typically changes the sample autocorrelation structure, thereby compromising data utility for time series data. We propose a privacy mechanism that satisfies the dual objectives of privacy and data utility for time series. The proposed mechanism uses noise convolution instead of noise addition to achieve a privacy-utility trade-off. In the context of our proposal, we investigate filter design that allows us to define quantities that remain invariant under the mechanism, thereby excluding such data functionals from the privacy budget. We also propose generalizations that make the filtering-based privacy mechanism model-agnostic as well as those that are applicable to multiple time series.
-
10:30 – 12:00
Session 6B – Data Ethics and Confidentiality
Jean Talon Conference Room
- Advancing Equitable Data Collection: Insights from Statistics Canada's Statistical Integration Methods Division Disaggregated Data Action Plan (DDAP) Research Project
Andrew Pearce, Kenza Sallier, Christiane Laperrière, Statistics Canada, Canada-
Abstract
In an era marked by the advocacy for Indigenous rights, racial justice, and economic equity, Statistics Canada has embarked on a transformative journey through the Disaggregated Data Action Plan (DDAP). This initiative aims to modernize data collection methods to better understand and address the challenges faced by diverse population groups, including women, Indigenous peoples, racialized communities, and individuals with disabilities. In this context, the Statistical Integration Methods Division of Statistics Canada has undertaken a comprehensive investigation of the ethical and practical implications inherent in evolving survey designs and data sources.
The research culminated in the creation of "Guiding Principles: Leveraging the 2021 Census of Populations Data for DDAP Groups of Interest". This document summarizes valuable insights drawn from our investigation and literature review. It explains the organizational framework of DDAP within Statistics Canada, navigates existing data sources, addresses ethical considerations, and rigorously examines sampling methods tailored for DDAP initiatives.
Drawing upon theoretical frameworks and applications based on Statistics Canada's experience, the findings highlight the importance of accounting for population characteristics such as hiddenness and social connectedness in selecting appropriate sampling methods. Through concrete examples and a detailed analysis of pros and cons, the guiding principles equip decision-makers with a comprehensive toolkit for navigating the intricacies of data collection in DDAP contexts.
This presentation will discuss the methodological gaps that non-traditional sampling methods can fill in the context of DDAP, the importance of clear standards when creating and using such methods, and the practical considerations for implementing them within a National Statistical Organization.
-
- On the interplay of legal requirements, quality aspects and ethical risks when using ML in official statistics
Florian Dumpert, Federal Statistical Office of Germany, Germany-
Abstract
Ethics makes statements about how we should act. Ethical questions and ethical risks can arise when machine learning is used in official statistics and, more generally, whenever methods and technologies are used to produce official statistics. The presentation deals with the interplay and the dependencies among such ethical risks, legal requirements and quality aspects, and discusses the approach of German official statistics to addressing the topic "Ethics of Machine Learning".
-
- Statistical Disclosure Control Analysis for Small Area Estimation
Cissy Tang, Statistics Canada, Canada-
Abstract
Currently, Statistics Canada has no official guidance on confidentiality rules for releasing small area estimates and no official study has yet been conducted on the subject. In recent years, there has been increasing demand from Research Data Centres (RDC) researchers for comprehensive confidentiality guidelines such that they can publish small area estimates in their research. This confidentiality analysis applies to area-level small area estimation.
A simulation study is conducted in R to create simulated populations from which samples are selected. The simulated population contains an auxiliary variable, a variable of interest, and domain information. The strength of the relationship between the auxiliary variable and the variable of interest is controlled through an "error" variable with a random component.
Stratified random samples are drawn, and area-level small area estimates are calculated using the "sae" R package (Molina and Marhuenda 2015). The simulation is run for various sampling rates and various auxiliary-variable strength levels to identify potential areas of disclosure risk. The disclosure risk of the small area estimate is compared against that of the direct Horvitz-Thompson estimate to demonstrate that small area estimates are inherently less risky than direct estimates, especially when sampling rates are extremely low. The results are then analyzed and, finally, comprehensive confidentiality guidelines for the release of area-level small area estimates are proposed. The presentation will outline the simulation process and discuss the justifications for the proposed confidentiality guidelines.
-
- Synthetic Data Disclosure Risk Assessment
Zhe Si Yu, Statistics Canada, Canada-
Abstract
The adoption of synthetic data generation as a confidentiality measure is increasing in statistical agencies worldwide, including at Statistics Canada. This approach provides an alternative to the traditional dissemination of anonymized public microdata files, offering both privacy protection and data utility. However, the creation of synthetic data presents challenges in assessing and mitigating disclosure risks. This paper reviews the different types of disclosure risk, namely attribute, membership and identity disclosure, and presents some of the associated methods for measuring risk. Identity disclosure is recognized as a non-issue for fully synthesized data, but remains an issue for partially synthesized data. The paper presents prominent risk assessment metrics and discusses practical methods for disclosure control in data synthesis. Methods for assessing disclosure risks usually produce a metric that can be used to gauge the risk, but there is little consensus on threshold values for these metrics. It is also important to balance utility and confidentiality, which needs further discussion in the context of these methods.
The paper concludes by offering insights and recommendations about managing disclosure risk while creating synthetic data as well as providing some ideas on future directions for research and practical implications for managing disclosure risks in synthetic data.
-
- Exploration of Deep Learning Synthetic Data Generation for Sensitive Utility Data Sharing
Julian Templeton, Benjamin Santos, Rafik Chemli, Statistics Canada, Canada-
Abstract
Utilities hold crucial information about energy usage and building characteristics which can be utilized by government agencies to improve their corresponding analytics. However, this data is associated with private customer records and thus the building data and energy usage may be too sensitive to share. Often, high-level aggregated versions of this data are shared through robust contracts, limiting the statistics that can be derived.
With the advancement of generative machine learning techniques, Statistics Canada and Natural Resources Canada have explored the feasibility of using these models to produce synthetic versions of utility data which may be shared in full with requesting organizations. These synthetic datasets can be created by a utility through a locally run program, and the outputs can be approved before being sent. This work has identified that certain generative models can feasibly be used by utilities to generate new versions of a dataset, and has identified the issues that must be addressed prior to implementing this in practice. Both tabular and time-series models have been tested for different data sharing scenarios; the TimeGAN model successfully captured the general energy peaks and valleys over a given day with reasonable computational requirements. Although this process takes days for annual energy amounts over thousands of customer records, it can enable new data sharing initiatives between utilities and National Statistical Offices while managing privacy risks. As work progresses in future phases with real utility partners, trust can be built for these approaches, and they can begin being tested on real data by actual data holders.
-
12:00 – 13:30
Lunch
13:30 – 15:00
Session 7A – Strategies to mitigate potential nonresponse bias in social surveys
Simon Goldberg Room
- Strategies to battle bias when preparing for digital first collection and a smaller geographical footprint for people surveys in Australia
Anders Holmberg, Australian Bureau of Statistics, Australia-
Abstract
Although non-probability data sources are not new to official statistics, a revived interest in the topic has emerged from pressures due to falling survey response rates, increasing data collection costs and a desire to take advantage of new data source opportunities from the ongoing societal digitalisation. Due to the exclusion of certain segments of the target population, inference derived solely from a non-probability data source is likely to result in bias. This work approaches the challenge of addressing the bias by integrating non-probability data with reference probability samples. We focus on methods to model the propensity of inclusion in the non-probability dataset with the help of the accompanying reference sample, with the modelled propensities then applied in an inverse probability weighting approach to produce population estimates. The reference sample is sometimes assumed as given. In this presentation however, we pursue an objective of finding an optimal strategy, that is, the combination of a data integration-based estimator and sample design for the reference probability sample. We discuss recent work in which we take advantage of the good unit identification possibilities in business surveys to study an estimator based on propensities and derive optimal (unequal) selection probabilities for the reference sample.
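A simplified sketch of the estimation side of such a strategy (not the ABS implementation; the data, design weights, and model choice below are hypothetical, and the pseudo-likelihood refinements discussed in the literature are glossed over): stack the non-probability records with a design-weighted reference sample, model the propensity of appearing in the non-probability source, and weight the non-probability records by the inverse of the fitted propensity.

    # Simplified sketch of propensity-based data integration: fit a weighted
    # logistic model for membership in the non-probability dataset using a
    # stacked file of non-probability records and a reference probability
    # sample, then apply inverse-propensity weights to the non-prob records.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    x_np = rng.normal(0.5, 1, size=(500, 2))        # covariates, non-probability source
    x_ref = rng.normal(0.0, 1, size=(300, 2))       # covariates, reference probability sample
    w_ref = np.full(300, 100.0)                     # design weights of the reference sample
    y_np = x_np[:, 0] + rng.normal(size=500)        # outcome observed only in the non-prob data

    X = np.vstack([x_np, x_ref])
    z = np.r_[np.ones(500), np.zeros(300)]          # 1 = non-probability record
    sw = np.r_[np.ones(500), w_ref]                 # reference units carry their design weights

    prop = LogisticRegression().fit(X, z, sample_weight=sw).predict_proba(x_np)[:, 1]
    ipw = 1.0 / prop
    estimate = np.sum(ipw * y_np) / np.sum(ipw)     # IPW estimate of the population mean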
-
- Dealing with non-response and attrition bias in a Transformed LFS
Petya Kozhuharova, Office for National Statistics, United Kingdom-
Abstract
The UK Office for National Statistics is undergoing a transformation of its Labour Force Survey. The key changes include switching to an online-first mode, implementing an adaptive survey design and changing the structure of the longitudinal component. Extensive testing of the attrition and non-response biases present in the Transformed Labour Force Survey (TLFS) quarterly files (at GB level) has been carried out. Prominent non-response biases were detected by tenure, by age and in geographically less affluent areas. A pre-calibration step was included in the weighting to offset this non-response and adjust distributions before final weighting. The ONS also implemented an adaptive survey design and targeted field forces to areas of known non-response. This step has successfully increased response rates as intended, complementing the methods adjustments.
Investigations showed different longitudinal attrition patterns in the new survey. Previous employment status in the TLFS is a significant predictor of longitudinal attrition even after controlling for other predictors, with employed and self-employed people dropping out significantly more from subsequent waves. Thus, following preliminary calibration as a non-response adjustment for Wave 1, for waves 2-5 we adjust individual design weights using sequential attrition probabilities estimated via logistic regression models. These wave-specific weights are then combined before final calibration. Final calibration is done by combinations of age group, sex, local authority, region and country. Without an attrition adjustment that includes previous economic status, the quarterly estimate was biased towards economically inactive people, who are more likely to respond in later waves.
Therefore, including the attrition adjustments increases the employment estimate compared to weighting with a non-response adjustment alone.
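A schematic illustration of a sequential attrition adjustment of this kind (simulated data and hypothetical predictors; not the ONS production weighting):

    # Schematic sketch of wave-on-wave attrition adjustment: for each wave, model
    # the probability that a wave-1 respondent is still responding (with previous
    # economic status among the predictors), then inflate design weights by the
    # inverse of the cumulative predicted retention probability. Data are simulated.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)
    n = 1000
    X = np.column_stack([
        rng.integers(0, 3, n),          # previous economic status (0=employed, 1=self-emp., 2=inactive)
        rng.integers(16, 90, n),        # age
    ])
    design_w = np.full(n, 250.0)

    cum_retention = np.ones(n)
    responded = np.ones(n, dtype=bool)
    for wave in range(2, 6):                                                  # waves 2 to 5
        still_in = responded & (rng.random(n) < 0.8 - 0.05 * (X[:, 0] != 2))  # simulated attrition
        model = LogisticRegression().fit(X[responded], still_in[responded])
        cum_retention[still_in] *= model.predict_proba(X[still_in])[:, 1]
        responded = still_in

    wave5_weights = design_w[responded] / cum_retention[responded]   # before final calibration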
-
- Investigations into administrative data for measuring persistent child poverty
Adam O'Neill, Keith McLeod, Robert Templeton, Statistics New Zealand, New Zealand-
Abstract
Using information from the Household Economic Survey (HES), Stats NZ provides yearly estimates of child poverty rates in Aotearoa (New Zealand). To extend this, a new longitudinal survey, the Living in Aotearoa Survey, was developed to inform on rates of persistent child poverty.
Collection challenges associated with increasing non-response, along with keeping to budgeted costs, posed significant risk to the quality of downstream estimates of persistent child poverty. This highlighted a need to explore alternative avenues for estimating poverty persistence.
Advances in the construction of households using administrative data have been a major focus at Stats NZ. Current admin-based households place 90% of individuals at the right address; at the population level, however, this still leaves notable room for intra-household discrepancies.
Leveraging the high coverage of low-income families observed through administrative welfare data, we show that the formation of these admin-based households can be improved for the purposes of measuring persistent child poverty in Aotearoa. Specifically, family welfare data are coupled with HES households and with high-confidence admin-based households. Household income derived from administrative sources allows the poverty status of children to be determined not only for the current year but also for the previous three years.
Validations of these households along with the resulting poverty rates will also be presented.
-
13:30 – 15:00
Session 7B – Record Linkage
Jean Talon Conference Room
- Efficient Record Linkage for large datasets by Business Names
Hanan Ather, Statistics Canada, Canada-
Abstract
Record linkage across diverse datasets presents significant challenges in big data applications, particularly when business names serve as unique identifiers. Traditional linkage methods often struggle with the variability in formatting, abbreviations, and errors within administrative and alternative data sources. To address this challenge, we introduce a robust and user-friendly method that effectively links business records from external datasets to the SBR, overcoming the computational constraints of previous practices.
The system utilizes a suite of string-matching algorithms, including edit distance, n-gram, and Jaro-Winkler, to facilitate record linkage and statistical matching. It distinguishes between true matches and false matches by calculating similarity measures, s(x, y), for entities within the SBR and any external list of businesses. By establishing a scalar threshold, t, we refine the linkage criteria, enhancing the precision of match declarations.
Our method enhances data integration efforts by enabling a precise, yet computationally efficient identification of matches. Moreover, our approach is evaluated on accuracy metrics, balancing between sensitivity to detect true cases and specificity to minimize false positives.
The strategic advantage of our approach lies in its efficiency, providing rapid processing times without sacrificing the accuracy of the matches. This efficiency is especially advantageous in big data environments, where quick data processing and efficient use of computational resources are critical.
Our methodology is designed to contribute to improved standards in record linkage and statistical matching, aiming to enhance both the speed and precision of matching across databases.
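A toy version of the thresholding step (difflib's generic similarity ratio stands in for the edit-distance, n-gram, and Jaro-Winkler measures described above; the names, cleaning rules and threshold value are illustrative):

    # Toy illustration of similarity-based business-name matching: declare a link
    # when the similarity s(x, y) between a cleaned external name and a register
    # name exceeds a scalar threshold t.
    from difflib import SequenceMatcher

    def clean(name: str) -> str:
        return " ".join(name.upper().replace(".", "").replace(",", "").split())

    def best_match(external_name, register_names, t=0.85):
        x = clean(external_name)
        scored = [(SequenceMatcher(None, x, clean(y)).ratio(), y) for y in register_names]
        s, y = max(scored)
        return (y, s) if s >= t else (None, s)     # link only above the threshold t

    print(best_match("ABC Plumbing Ltd.",
                     ["A.B.C. PLUMBING LIMITED", "ABC PLUMBING LTD", "XYZ ROOFING INC"]))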
-
- Evaluating the accuracy when linking records in waves
Abel Dasylva, Arthur Goussanou, Statistics Canada, Canada-
Abstract
At Statistics Canada, many data sets are linked with quasi-identifiers such as the first name, last name, or address. In such cases, linkage errors are a potential concern and must be measured. In that regard, previous studies have shown that the evaluation may be based on modeling the number of links from a given record while accounting for all the interactions among the linkage variables and dispensing with clerical reviews, so long as the decision to link two records does not involve other records. In this communication, the methodology is adapted for a class of practical strategies, which violate this constraint by linking the records in consecutive waves, where a given wave links a subset of the records that are not linked in previous waves. In particular, the linkage may be based on a deterministic wave followed by a probabilistic one.
-
- Model Based Threshold Selection for Agricultural Probabilistic Linkages
Christian Arsenault, Statistics Canada, Canada-
Abstract
With the increasing importance of administrative data usage in producing official statistics, conducting quality probabilistic record linkages has become paramount to the success of many programs at Statistics Canada. Under the Fellegi and Sunter methodology, the weight threshold for determining a match is a critical parameter on which the optimality of the procedure depends. Solutions for setting this parameter have, up to this point, had major limitations, either relying on overly optimistic assumptions or requiring training data or clerical review. However, a new model was developed which estimates the linkage error based on the number of links from a given record, accounting for all the interactions among the linkage variables. This error model serves as the foundation for evaluating different algorithms for setting the linkage threshold. Among exhaustive search, binary search, and a more sophisticated recursive partitioning procedure, each method offers distinct advantages in terms of runtime and the quality metrics produced. Using real farm data, we were able to validate the results of the error model with goodness-of-fit tests, as well as weigh the practical considerations of each of these methods by experimenting with different datasets. Although automating the selection of linkage thresholds presents challenges given the varying quality, structure, and size of agricultural data, this work provides a practical path forward to navigate these issues.
-
- The T1 Partnership Process: leveraging clustering and graphing methods
Shaundon Holmstrom, Statistics Canada, Canada-
Abstract
All individuals who own a sole proprietorship or partnership business must complete a statement of business activities when filing their personal income taxes (T1 General Form). A T1 partnership is a collection of filers who form a single business but file their shares individually. Within the T1 business population it can be challenging to determine whether a filer is a sole proprietorship or a partnership, and in most cases there are no linkage keys available to identify the collection of filers belonging to a given partnership. Therefore, a T1 partnership identification process has been developed with the goal of performing internal record linkage within the T1 business population to identify partnerships. Failing to identify valid partnerships can lead to duplication in the population of T1 businesses (overcoverage), while incorrectly linking individuals as partners can lead to businesses being incorrectly removed from the population of T1 businesses (undercoverage). The T1 processing system is currently undergoing a redesign.
The new T1 partnership identification process leverages numerical clustering using the DBSCAN algorithm, then compares matching fields to trim down potential clusters in the set of pairwise comparisons. When the final set of pairwise comparisons is complete, graph theory is used to create the set of all finalized partnerships.
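A compact sketch of the two building blocks named above (illustrative data and a hypothetical matching field; not the production T1 system): DBSCAN groups filers with similar numerical profiles, pairwise checks trim the candidate links, and connected components of the resulting graph define the finalized partnerships.

    # Illustrative sketch: cluster filers on numerical tax fields with DBSCAN,
    # keep only pairs that also agree on a matching field (here, postal code),
    # then take connected components of the link graph as finalized partnerships.
    from itertools import combinations
    import networkx as nx
    from sklearn.cluster import DBSCAN

    filers = {
        "f1": {"gross": 100000, "net": 40000, "postal": "K1A0B1"},
        "f2": {"gross": 100000, "net": 40000, "postal": "K1A0B1"},
        "f3": {"gross": 100500, "net": 39500, "postal": "H3Z2Y7"},
        "f4": {"gross": 25000,  "net": 10000, "postal": "V5K0A1"},
    }
    ids = list(filers)
    X = [[filers[i]["gross"], filers[i]["net"]] for i in ids]
    labels = DBSCAN(eps=2000, min_samples=2).fit_predict(X)

    G = nx.Graph()
    G.add_nodes_from(ids)
    for a, b in combinations(range(len(ids)), 2):
        same_cluster = labels[a] == labels[b] and labels[a] != -1      # -1 = noise
        if same_cluster and filers[ids[a]]["postal"] == filers[ids[b]]["postal"]:
            G.add_edge(ids[a], ids[b])

    partnerships = [sorted(c) for c in nx.connected_components(G) if len(c) > 1]
    print(partnerships)    # e.g., [['f1', 'f2']]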
-
- Effects of a two-category gender variable (men+ and women+) in the development of linkage-adjusted weights for the 2021 Canadian Census Health and Environment Cohort (CanCHEC)
Eric Hortop, Yubin Sung, Statistics Canada, Canada-
Abstract
The 2021 CanCHEC consists of the 2021 Census long-form sample records linked to the Derived Record Depository (DRD) with a linkage rate of 96%. The DRD includes persons from Statistics Canada's survey and administrative data, such as the Census and deaths. Thus, the linkage facilitates research examining sociodemographic factors influencing the health of Canadians. The 2021 CanCHEC is highly anticipated because the 2021 Census collected data specifically about gender for the first time, differentiating from sex at birth, enabling an examination of health outcomes among gender diverse persons. Linkage-adjusted record weights allow researchers to estimate and evaluate the variance of summary measures of the Canadian population in the presence of missed links. To create the final weights, we construct response homogeneous groups using the linkage propensity score from a logistic model fitted for the long-form sample records. We then use a cell calibration procedure that further adjusts the weights such that the weighted totals for a multivariate tabulation of the cohort match the weighted totals from the long-form sample. The tabulation variables include the two-category gender variable among other characteristics. We examine slippage on estimates of cisgender, transgender and non-binary populations to evaluate our practical weighting procedure.
Decisions in the weighting procedure can affect the accuracy of estimates for small population groups such as gender-diverse persons. This presentation will highlight the importance of routinely measuring slippage. A limitation is that our linkage-adjusted weights only consider the linkage of the cohort to the DRD while many population health analyses would need to account for additional linkages to health administrative data.
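To make the weighting steps concrete, a sketch of the propensity-score portion is given below (hypothetical column names, and omitting the subsequent cell calibration described in the abstract).

# Sketch only: fit a logistic model for the probability that a long-form
# record links to the DRD, form response homogeneous groups (RHGs) by
# propensity quintile, and inflate design weights by the inverse of the
# weighted linkage rate within each RHG. Column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def linkage_adjusted_weights(df, covariates, n_groups=5):
    model = LogisticRegression(max_iter=1000)
    model.fit(df[covariates], df["linked"])  # linked: 0/1 link indicator
    df = df.assign(propensity=model.predict_proba(df[covariates])[:, 1])
    df["rhg"] = pd.qcut(df["propensity"], q=n_groups, labels=False)
    rate = df.groupby("rhg").apply(
        lambda g: (g["weight"] * g["linked"]).sum() / g["weight"].sum()
    )
    df["adj_weight"] = df["weight"] * df["linked"] / df["rhg"].map(rate)
    return df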
-
15:00 – 15:30
Afternoon Break
15:30 – 17:00
Session 8 – The Future of National Statistical Organizations
Simon Goldberg Room
- The Future of National Statistical Organisations - the Longer-Term Role and Shape of NSOs
Osama Rahman, Office for National Statistics, United Kingdom-
Abstract
The rise of a data-driven world is radically reshaping the context in which National Statistical Offices (NSOs) operate. There is rising demand from both policymakers and wider society to provide information that is more accurate, more timely and more granular compared to previous decades. At the same time, NSOs face increased competition from the emergence of many new sources of analysis and statistics, which is challenging their previously pre-eminent role as one of the most relevant and trusted sources of information. The increasing granularity and velocity of new forms of data and data science techniques are also leading to a reduction in the traditional gaps between policy development, implementation and delivery, and monitoring and feedback. This in turn raises the question of how far the gap between statistics/research and operational analysis/use of management information needs to narrow, and what role NSOs should play here.
In response, NSOs are investing considerable effort in initiatives to remain relevant to their users. They are innovating by developing new methods and using new sources of data to produce more robust statistics and analysis that better meet the demands of government and society. Some NSOs are also looking to the longer term, developing their vision for what their work will need to look like for their organisations to remain relevant in ten to twenty years' time.
This session will discuss how NSOs are already adapting and ways they may need to continue to adapt to succeed in an increasingly data-driven world.
Other members of the panel:
André Loranger, Chief Statistician of Canada, Statistics Canada
Francesca Kay, Central Statistics Office, Ireland
Anders Holmberg, Australian Bureau of Statistics, Australia
-
Friday November 1, 2024
08:30 – 10:00
Session 9A – Young Statisticians from Statistics Canada's Modern Statistical Methods and Data Science Branch
Simon Goldberg Room
-
-
Abstract
The 2024 Statistics Canada International Methodology Symposium on "Shaping the Future of Official Statistics" will feature a special invited session designed to showcase the innovative thinking and strategic vision of young statisticians. Indeed, part of envisioning the future of official statistics demands an investment in tomorrow's statistical leaders. By cultivating a diverse and healthy pool of talent equipped with both technical expertise and essential soft skills, we can ensure a strong succession for the future of official statistics. Selected through a competitive process that took place within Statistics Canada's Modern Statistical Methods and Data Science Branch, winners will present their strategic insights on pressing issues facing National Statistical Organizations (NSOs). This session presents a unique opportunity to gain fresh viewpoints with the hope of stimulating strategic thinking, encouraging dialogue, and contributing to the advancement of NSOs, ultimately shaping the future of statistical practice, advancing statistical methodology, and promoting data-driven decision making at both national and global levels.
Presentation 1: How do you foresee the evolving role and significance of National Statistical Organizations (NSOs) such as Statistics Canada over the next 25 years? [Assuming the speed of data and IT evolution remains the same as in recent years]
Namita Chhabra, Johan Fernandes, Craig Hilborn and Joshua Miller, Statistics Canada, Canada
Presentation 2: What would be the key characteristics, skills, abilities and capacities of the ideal (perfect) future statistician (or employee of Statistics Canada) in 25 years?
Neal Jin, Andrew Jay, Andrew Pearce and Abhishek Singh, Statistics Canada, Canada
Presentation 3: How can Statistics Canada effectively and explicitly leverage ethical principles and democratic values for all Canadians, particularly in the context of social and economic polarization?
Marc Beauparlant, Bassirou Diagne and Beni Ngabo Nsengiyaremye, Statistics Canada, Canada
Presentation 4: What are the issues facing Statistics Canada in reconciling the imperative of collecting disaggregated data for minority subgroups with the need to uphold privacy, data ethics, and historical sensitivities?
David Ahn, Alexandre Istrate, Samuel Sombo and Nicholas Wilker, Statistics Canada, Canada
-
09:00 – 10:00
Session 9B – Nowcasting for Economics Statistics
Jean Talon Conference Room
- High Frequency Data Collection at the U.S. Census Bureau - the Business Trends and Outlook Survey (BTOS)
Cory Breaux, Kathryn Bonney, U.S. Census Bureau, USA-
Abstract
The Business Trends and Outlook Survey (BTOS), first launched in July 2022, is an experimental data product from the U.S. Census Bureau intended to capture high-frequency changes in economic conditions. The BTOS draws from a large sample of nearly 1.2 million businesses to provide representative, bi-weekly data on economic conditions and trends. The survey collects information on a wide range of business conditions including current performance, changes in revenue, employment, demand, and prices, operating status, the impact of natural disasters, and Artificial Intelligence usage. Firms are asked about the previous two weeks and for a six-month projection.
From December 2023 to February 2024, the BTOS added supplemental content which provides a detailed real-time look at U.S. businesses' use of AI.
During this period, bi-weekly estimates of AI use rose from 3.7% to 5.4%, with an expected use rate of about 6.6% by early Fall 2024. AI usage was found to vary across sector and geography, with the highest use rates occurring in the Information sector and states in the western U.S. Amongst AI-using businesses, the most common applications include marketing automation, virtual agents, natural language processing, and data/text analytics. Many businesses report using AI to replace worker tasks and existing equipment/software, though there is little evidence that AI use is associated with decline in firm employment. Businesses were also asked why they do not expect to use AI; the predominant reason reported was the inapplicability of the technology to their business. Future research can integrate BTOS data with other Census survey and administrative data, exploring the connection between AI use and firm performance.
-
- Improving Nowcasts for the U.S. Census Bureau Index of Economic Activity (IDEA)
Elizabeth Marra, Rebecca L Weaver, William R Bell, Tucker S McElroy, Valerie E Pianin, Jose Asturias, Rebecca J Hutchinson, U.S. Census Bureau, USA-
Abstract
The U.S. Census Bureau Index of Economic Activity (IDEA) is an experimental data product that was first released in February of 2023. It is constructed from 15 of the Census Bureau's primary monthly economic time series, providing a single time series reflecting the variation of the full set of component series over time. The component series are monthly measures of activity in retail and wholesale trade, manufacturing, construction, international trade, and business formations. One of the challenges of producing a monthly economic indicator from 15 different component series is that not all the series are released on the same day in a month. To account for these varying release dates, the index is calculated daily, incorporating the most recently released values from the component series for the current release month. For component series whose values have not yet been released for the current month, we predict (nowcast) their values using a multivariate autoregressive time series model. We estimate index weights for these 15 series using Principal Components Analysis applied to the standardized monthly growth rates. Series with larger weights will have more influence over the index, compared to those with smaller weights. If nowcasts for a highly weighted series are inaccurate, then when a new monthly estimate for the series is released, the index could exhibit a large revision for that month. This presentation reviews the index, discusses the nowcasting procedure, and presents a potential improvement to the nowcasting procedure.
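As background on the weighting step, a sketch under simplified assumptions (not the Census Bureau's production code) is given below: the first principal component of the standardized monthly growth rates supplies the index weights, with nowcasts filling in any series not yet released.

# Sketch: index weights from the first principal component of standardized
# monthly growth rates; the multivariate autoregressive nowcasting model that
# fills in unreleased series is not shown here.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def index_weights(growth_rates: pd.DataFrame) -> pd.Series:
    # growth_rates: rows = months, columns = component series, standardized
    pca = PCA(n_components=1).fit(growth_rates.values)
    loadings = np.abs(pca.components_[0])  # sign convention for this sketch
    return pd.Series(loadings / loadings.sum(), index=growth_rates.columns)

def index_value(latest_growth: pd.Series, weights: pd.Series) -> float:
    # latest_growth mixes released values and nowcasts for unreleased series
    return float((weights * latest_growth).sum())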
-
- Leveraging Transformers for Now-Casting Canadian Labor Indicators
Luke Budny, Aziz Al-Najjar, Tariq El Bahrawy, Carleton University, Canada-
Abstract
In the rapidly advancing data-centric world, the acuity of labor market analytics is pivotal for economic strategy and policy-making. This study explores the application of the Lag-Llama model, a state-of-the-art transformer-based architecture, for nowcasting key Canadian labor indicators such as employment, job vacancies, payroll, and hours worked. Addressing the challenge of delayed or incomplete labor data from Statistics Canada surveys, the study proposes a forecasting methodology that leverages both historical trends and contemporary data inputs. The model is pre-trained on diverse datasets across multiple domains, enhancing its predictive robustness and interpretability. Through systematic fine-tuning of hyperparameters and judicious selection of external variables via forward and backward selection techniques, the Lag-Llama model achieves significant performance gains over classical time series forecasting methods. Experimental results exhibit substantial improvements in Mean Absolute Percentage Error (MAPE) for both point estimates and prediction intervals, especially in the All Industries MAPE for earnings data. The study also identifies certain industry-specific forecasting challenges and discusses potential solutions including separate models. These findings highlight the potential of transformer models like Lag-Llama in macroeconomic forecasting and set a new benchmark for future analyses in this domain.
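For reference, the accuracy metric cited above is, in its usual form,

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|,$$

where $y_t$ is the observed value and $\hat{y}_t$ the nowcast for period $t$; lower values indicate more accurate point estimates.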
-
10:00 – 10:30
Morning Break
10:30 – 12:00
Session 10A – The use of machine learning in official statistics
Simon Goldberg Room
- Statistical Inference in the Presence of Non-Response and Machine Learning Methods: Some Recent Works
David Haziza, University of Ottawa, Canada-
Abstract
In recent years, machine learning methods have generated increasing interest within national statistical institutes. These methods allow for precise predictions by analyzing large datasets and identifying complex patterns and relationships. These predictions can be applied at various stages of a survey, particularly for handling missing data and small area estimation.
In this presentation, we will present the results of recent or ongoing work on inference in the presence of item non-response. We will begin by discussing statistical inference when random forests are used to impute missing values. Then, we will address doubly robust estimation methods that incorporate predicted response probabilities and imputed values—obtained using any machine learning method—into the construction of the estimators. We will highlight the advantages of doubly robust methods within the framework of machine learning and discuss their practical implementation. We will show how to estimate the variance of doubly robust estimators. Finally, we will present the results of simulation studies aimed at evaluating the performance of point and variance estimators.
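As a point of reference (not necessarily the exact form used in the work described), a doubly robust estimator of a population total under item non-response typically combines imputed values with inverse estimated response probabilities:

$$\hat{t}_{DR} = \sum_{i \in s} w_i\,\hat{m}(\mathbf{x}_i) + \sum_{i \in s} w_i\,\frac{r_i}{\hat{p}(\mathbf{x}_i)}\bigl(y_i - \hat{m}(\mathbf{x}_i)\bigr),$$

where $w_i$ are the survey weights, $r_i$ the response indicator, and $\hat{p}(\mathbf{x}_i)$ and $\hat{m}(\mathbf{x}_i)$ the estimated response probability and imputed value, either of which may come from a machine learning method; consistency holds if at least one of the two models is correctly specified.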
-
- Fitting Classification Trees Accounting for Complex Survey Design
Minsun Riddles, Westat, USA-
Abstract
Survey practitioners have increasingly embraced the benefits of modern machine learning techniques, including classification and regression tree algorithms. These methods, which do not require a predefined functional relationship between outcomes and predictors, offer a practical means of conducting variable selection and deriving interpretable structures that link predictors with a target outcome. However, when applying these algorithms to survey data, it is common to overlook crucial factors like sampling weights, as well as sample design features such as stratification and clustering. To address this oversight, we propose an extension of the well-known Chi-square Automatic Interaction Detector (CHAID) approach. Our enhancement incorporates a Rao-Scott correction into the splitting criterion, accounting for the survey design. We discuss the statistical properties of the resulting algorithm under a design-based framework. Using data from the U.S. American Community Survey, we illustrate the use of the method and evaluate its performance through comparisons with existing weighted and unweighted algorithms.
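As a sketch of how such a correction typically enters the splitting criterion (the exact form used by the authors may differ), the Pearson chi-square statistic computed from the survey-weighted cell proportions is divided by an estimated mean generalized design effect before the split's p-value is evaluated:

$$X^2_{RS} = \frac{X^2_{P}}{\hat{\delta}_{\cdot}},$$

where $\hat{\delta}_{\cdot}$ averages the estimated generalized design effects of the cells; $X^2_{RS}$ is then referred to a chi-square distribution with the usual degrees of freedom, and the candidate split with the smallest design-adjusted p-value is selected, as in standard CHAID.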
-
- Tree-Based Algorithms for Official Statistics
Daniell Toth, US Bureau of Labor Statistics, USA-
Abstract
Over the past decade, machine learning algorithms have been increasingly employed to produce official statistics. This trend has been particularly notable following the adaptation of various algorithms for use with survey data collected from an informative sample design. Among these techniques are several tree-based methods such as regression and boosted trees, random forest models, and Bayesian tree models. These models have been utilized in analyzing survey data for purposes such as assessing data quality and nonresponse, as well as estimating official statistics through small area estimation, model-assisted estimators, and data imputation. In this article, we discuss how these tree-based methods have been adapted for use with survey data and explore their significance in the production of official statistics.
-
10:30 – 12:00
Session 10B – Society and Official Statistics
Jean Talon Conference Room
- A Safe and Inclusive Approach to Disseminating Statistical Information about the Non-binary Population in Canada
Claude Girard, France-Pascale Ménard, Statistics Canada, Canada-
Abstract
In 2022, Canada became the first country to release statistical information about its transgender and non-binary populations based on data collected from the 2021 Census of Population. Moreover, following a recent government-wide directive, Statistics Canada's surveys have begun to collect and disseminate information about gender rather than sex at birth.
Due to the small size of the transgender and non-binary populations – according to the 2021 Census, these represent 0.3% of individuals aged 15 or older in Canada – disseminating safe statistical information about them at detailed sociodemographic or geographical levels poses a challenge.
The dissemination strategy adopted for the 2021 Census, which was subsequently adapted and recommended for surveys, is centered on a new two-category gender variable (Men+ and Women+) that includes non-binary individuals, and which is to be used at all but the highest dissemination levels. In this talk, we retrace the methodological considerations that have gone into creating and adopting this novel approach, which is deemed both inclusive of non-binary people and statistically safe.
-
- One-Stop-Shop for AI/ML for Official Statistics - Methodology at the Heart
Francesca Kay, Brendan O'Dowd, Central Statistics Office, Ireland-
Abstract
In a world of constant change, new and improved statistical products and services have become a necessity for official statistics. On the one hand, demand from users is evolving, as they ask for more granular, more timely and better integrated data. From a methodology perspective, that means Official Statisticians have to reconsider what users need and how we can provide fit-for-purpose and methodologically sound information. On the other hand, technical opportunities appear at such a pace that it is very challenging to provide new and robust methodologies that ensure these opportunities can be integrated in the production of official statistics while maintaining the understanding, transparency and trust in official statistics as a trusted source of information.
In April 2024, Eurostat awarded a grant to a consortium of 14 countries led by the Central Statistics Office of Ireland to create a One-Stop-Shop for AI/ML in Official Statistics (AIMLOS4). The potential for AI/ML is still being developed and the aim of the One-Stop-Shop is to enable systematic learning, sharing of experiences, identification of good practices and reuse of solutions. At the heart of the consortium is both researching and creating new methodologies to support the application of AI/ML solutions as well as the standardisation of methodologies and best practice to facilitate the scaling up or reuse of existing solutions.
The presentation will highlight some of the key objectives of the AIMLOS4 project and how it will look to develop innovative new methodologies in AI/ML to meet the evolving needs and challenges in the production of official statistics.
-
- Citizen-generated data and its impact on official statistics
Haoyi Chen, United Nations Statistics Division, United Nations, New York, USA-
Abstract
Citizen participation throughout the data value chain is increasingly recognized in addressing data gaps for marginalized communities and enhancing data fairness, inclusiveness, openness, accountability and transparency. The global official statistical community has partnered with other data stakeholders, including civil society organizations, human rights institutions and academia, moving forward with the draft "Copenhagen Framework on Citizen Data" to support the sustainable use of data generated by citizens and communities.
The paper discusses how bringing citizen-generated data (CGD) into the official statistics community will impact national official statistics systems and transform the role of national statistical offices as data stewards. The areas that will be covered include: (a) discussing the role of CGD in filling data gaps and improving the inclusivity of official statistics; (b) reassessing the relevance of existing quality frameworks for official statistics to allow data to be used fit-for-purpose; (c) challenges and opportunities in integrating CGD into official statistics; and (d) discussing how the role of national statistical offices needs to evolve in leveraging the power of citizen-generated data.
-
- Exploration of approaches to small area estimation with measurement errors; and their application to Indonesian household surveys
Ika Yuni Wulansari, University of Technology Sydney, Australia, Politeknik Statistika STIS, Indonesia, Statistics Indonesia, Indonesia
Stephen Woodcock, University of Technology Sydney, Australia
James J Brown, University of Technology Sydney, Australia-
Abstract
The UN's SDGs require highly disaggregated data on indicators that are typically available only from household surveys. However, this presents an issue as the required level of disaggregation is beyond what surveys can support through direct estimation. Therefore, NSIs are turning to small area approaches. However, that is not without its issues, as often the auxiliary variables that might be used in estimation models are themselves estimated from surveys, introducing the additional complication of measurement errors.
We aim to apply the approach of Ybarra and Lohr (2008) to adjust for measurement errors in a classic Fay-Herriot area-level model. We use a comprehensive simulation study to explore the impact of measurement error on a single auxiliary variable, and the situation where there are two auxiliary variables, one with and one without measurement error. The results demonstrate the robustness of the standard approach that ignores measurement error, but show there are specific scenarios where correctly adjusting for measurement errors is beneficial. We apply the approach to a case-study example utilising Indonesian household survey data, estimating at the sub-district level. In the case study, we estimate per capita household expenditure from the National Socio-Economic Survey (SUSENAS) as the variable of interest and use village potential data (PODES) for auxiliary variables.
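For context, under the area-level model $\hat{\theta}_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + v_i + e_i$, with model errors $v_i \sim (0, \sigma_v^2)$ and sampling errors $e_i \sim (0, \psi_i)$, and with only a noisy auxiliary estimate $\hat{\mathbf{x}}_i = \mathbf{x}_i + \boldsymbol{\eta}_i$, $\operatorname{Cov}(\boldsymbol{\eta}_i) = C_i$, available, the Ybarra-Lohr predictor (written here in one common notation) is

$$\tilde{\theta}_i = \gamma_i \hat{\theta}_i + (1 - \gamma_i)\,\hat{\mathbf{x}}_i^{\top}\boldsymbol{\beta}, \qquad \gamma_i = \frac{\sigma_v^2 + \boldsymbol{\beta}^{\top} C_i \boldsymbol{\beta}}{\sigma_v^2 + \boldsymbol{\beta}^{\top} C_i \boldsymbol{\beta} + \psi_i},$$

so that larger measurement error in the auxiliary estimate shifts weight toward the direct estimate $\hat{\theta}_i$.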
-
12:00 – 13:30
Lunch
13:30 – 15:00
Session 11 – Special Session in Honour of J.N.K. Rao
Simon Goldberg Room
- J.N.K. Rao's Contributions to Survey Research
Sharon Lohr, Arizona State University, USA-
Abstract
J.N.K. Rao has contributed to almost every subdiscipline of survey research. In this nontechnical talk, I will touch on some of Rao's work on unequal-probability and two-phase sampling, variance estimation, regression and categorical data analysis, small area estimation, and data integration.
For each of these topics, Rao's work anticipated and led future research directions, and I will discuss his contributions in the context of broader trends and current challenges in survey research.
-
- Celebrating J.N.K. Rao's Legacy in Small Area Estimation
Mahmoud Torabi, University of Manitoba, Canada-
Abstract
This talk pays homage to the remarkable contributions of J.N.K. Rao to the field of small area estimation (SAE). Rao's pioneering research has profoundly influenced the development and application of SAE methods, revolutionizing how statisticians tackle inferential challenges in areas with limited data representation. Beginning with an overview of SAE's significance across various domains, from official statistics to public health, the presentation highlights Rao's key methodological advancements.
These include his instrumental work in model-based approaches, notably the area-level and unit-level models, and estimation of mean squared prediction error of small area parameters. Furthermore, the talk explores Rao's enduring legacy in fostering interdisciplinary collaborations and promoting the practical adoption of SAE techniques in decision-making processes.
Through an examination of Rao's influential papers and collaborations, the presentation illustrates his pivotal role in shaping the theoretical foundations and real-world applications of SAE. Additionally, building upon Rao's legacy, it discusses contemporary challenges and recent advancements in SAE research. Ultimately, this talk serves as a tribute to J.N.K. Rao's indelible mark on SAE, inspiring continued innovation and impact in the field for generations to come.
-
- Contributions of J.N.K. Rao to complex survey multilevel models and composite likelihood
Mary Thompson, University of Waterloo, Canada-
Abstract
With H. O. Hartley, J. N. K. Rao was an early contributor to multilevel modeling with survey data, through methods of inference for variance components. In recent years, he has returned to this area of research. With F. Verret and M. Hidiroglou he proposed a weighted composite likelihood approach to inference under a two-level model (Survey Methodology, 2013). This method and its impact, applications and later extensions will be outlined.
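In general terms (a sketch of the idea rather than the exact estimating function of the 2013 paper), a weighted pairwise composite log-likelihood for a two-level model takes the form

$$c\ell_w(\boldsymbol{\theta}) = \sum_{i \in s} w_i \sum_{j<k \,\in\, s_i} w_{jk|i}\, \log f(y_{ij}, y_{ik}; \boldsymbol{\theta}),$$

where $w_i$ and $w_{jk|i}$ are survey weights for sampled clusters and for pairs of units within cluster $i$, and $f$ is the bivariate density implied by the two-level model.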
-
15:00 – 15:30
Afternoon Break
15:30 – 17:00
Session 12A – Integration of probability and non-probability sample data
Simon Goldberg Room
- Some Theoretical and Practical Issues in and Strategies for Dealing with Non-Probability Samples
Changbao Wu, University of Waterloo, Canada-
Abstract
We provide an overview of recent developments in statistical inference for non-probability survey samples. We discuss issues arising from methodological developments related to inverse probability weighting and model-based prediction and concerns with practical applications. Three procedures proposed in the recent literature on the estimation of participation probabilities, namely, the method of Valliant and Dever (2011) based on the pooled sample, the pseudo maximum likelihood method of Chen, Li and Wu (2020), and the method of Wang, Valliant and Li (2021) using a two-step computational strategy, are examined under a joint randomization framework. The inexplicit impact of the positivity assumption on the model-based prediction approach is examined, and the main issue of undercoverage is highlighted. We discuss potential strategies for dealing with standard assumptions and undercoverage problems in practice.
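As background for the comparison (notation is generic rather than taken from any one of the cited papers), the inverse probability weighting estimator of a population mean from a non-probability sample $S_A$ has the form

$$\hat{\mu}_{IPW} = \frac{1}{\hat{N}} \sum_{i \in S_A} \frac{y_i}{\hat{\pi}_i^{A}}, \qquad \hat{N} = \sum_{i \in S_A} \frac{1}{\hat{\pi}_i^{A}},$$

where $\hat{\pi}_i^{A}$ is the estimated participation probability; the three procedures above differ mainly in how $\hat{\pi}_i^{A}$ is estimated from the pooled or reference sample, and the positivity assumption requires $\pi_i^{A} > 0$ for every population unit, which fails under undercoverage.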
-
- Comparison of recent techniques of combining probability and non-probability samples
Julie Gershunskaya, U.S. Bureau of Labor Statistics, USA-
Abstract
We compare several recent quasi-randomization methods for inferences from non-probability samples. The considered techniques are developed under the assumption that the sample selection is governed by an underlying latent random mechanism and that it can be uncovered by combining non-probability survey data with a "reference" probability-based sample, obtained from the same target population. Challenges prompting the development of alternative procedures include (i) non-probability sample participation indicators are available only on the observed sample units and (ii) it is not generally known which units from the underlying population belong to both the non-probability and reference samples. We consider the ways different procedures address these challenges, discuss theoretical properties of the methods and compare them using simulations.
-
- Propensity Score Estimation and Optimal Sampling Design when Integrating Probability Samples with Non-probability Data
Anders Holmberg, Australian Bureau of Statistics, Australia
Lyndon Ang, Australian National University and Australian Bureau of Statistics, Australia
Robert Clark, Bronwyn Loong, Australian National University, Australia-
Abstract
Although non-probability data sources are not new to official statistics, a revived interest in the topic has emerged from pressures due to falling survey response rates, increasing data collection costs and a desire to take advantage of new data source opportunities from the ongoing societal digitalisation. Due to the exclusion of certain segments of the target population, inference derived solely from a non-probability data source is likely to result in bias. This work approaches the challenge of addressing the bias by integrating non-probability data with reference probability samples. We focus on methods to model the propensity of inclusion in the non-probability dataset with the help of the accompanying reference sample, with the modelled propensities then applied in an inverse probability weighting approach to produce population estimates. The reference sample is sometimes assumed to be given. In this presentation, however, we pursue the objective of finding an optimal strategy, that is, the combination of a data integration-based estimator and a sample design for the reference probability sample. We discuss recent work in which we take advantage of the good unit identification possibilities in business surveys to study an estimator based on propensities and to derive optimal (unequal) selection probabilities for the reference sample.
-
15:30 – 17:00
Session 12B – Challenges in Production of Official Statistics
Jean Talon Conference Room
- Using Non-Binary Gender to Calibrate Survey Weights for the Canadian Long-Form Census Sample
Alexander Imbrogno, Statistics Canada, Canada-
Abstract
In 2021, Canada became the first country to collect and publish data on gender in a national census, giving Canadians the option to choose male, female, or non-binary. Because of their small sizes, non-binary population totals were excluded from the 2021 long-form sample calibration due to the risk of increasing the variance of estimates. This talk presents an alternative long-form calibration procedure which aggregates sub-provincial non-binary totals into a large provincial total to protect against variance inflation. Artificial sub-provincial non-binary totals are introduced as a tool to decompose the resulting provincial-level calibration back into independent sub-provincial problems, maintaining the computational efficiencies of the usual long-form calibration. An algebraic expression for the artificial totals under the chi-squared distance is derived. Simulation results are presented demonstrating the benefits of non-binary calibration on data quality for the non-binary domain.
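For readers less familiar with the machinery, calibration under the chi-squared distance has a closed form (stated generically here; the talk's derivation of artificial sub-provincial totals builds on this framework):

$$w_i = d_i\bigl(1 + \mathbf{x}_i^{\top}\boldsymbol{\lambda}\bigr), \qquad \boldsymbol{\lambda} = \Bigl(\sum_{i \in s} d_i\,\mathbf{x}_i \mathbf{x}_i^{\top}\Bigr)^{-1}\Bigl(\mathbf{T}_x - \sum_{i \in s} d_i\,\mathbf{x}_i\Bigr),$$

where $d_i$ are the design weights, $\mathbf{x}_i$ the calibration variables (including the gender category counts) and $\mathbf{T}_x$ the known control totals.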
-
- A New Origin-to-Destination Table of Canadian Manufacturing Sales: Challenges with Imputing a Distribution from Annual Survey Data
Nicholas Huliganga, Statistics Canada, Canada-
Abstract
Detailed data on the destination of manufacturing sales have not historically been available to Canadians. Through integration of annual survey (ASML) data, a destination of sales table by industry and province of origin was developed for the manufacturing annual and monthly surveys at Statistics Canada.
While not a question on the monthly survey, respondents to the annual survey are asked for their distribution of sales as a percentage across 15 destinations. To tackle the difficulty of generating an establishment-level distribution for multi-province respondents, three approaches were compared: using the respondents' total distribution for all their establishments, using optimization, and using the distributions of the single-province respondents. The imputed distribution of destination sales from the annual data was then applied to the monthly sales value. Monthly establishments not linked to annual respondents were imputed with a strategy involving the aggregate distribution of their industry-province group in the annual survey. Finally, point estimates and a sampling variance (and CV) were successfully generated for the entire origin-destination of sales table and for each industry in the monthly survey.
This presentation delves into challenges faced with imputing the destination sales (especially for respondents with establishments in multiple provinces), ensuring sales match marginal origin province totals, and allocating a distribution of destinations based on data from the annual program to the monthly estimates.
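The core allocation step amounts to spreading each establishment's monthly sales across destinations according to a share vector taken from the annual survey, with a group-level fallback; a minimal pandas sketch (all column and index names are hypothetical) is:

# Sketch: allocate monthly sales across destinations using annual shares.
# "shares" has one row per establishment and one column per destination,
# rows summing to 1; establishments without annual shares fall back to the
# aggregate shares of their industry-province group. Names are illustrative.
import pandas as pd

def allocate_destinations(monthly, shares, group_shares):
    est_shares = shares.reindex(monthly.index)
    fallback = group_shares.loc[monthly["group"]].set_index(monthly.index)
    est_shares = est_shares.fillna(fallback)
    return est_shares.mul(monthly["sales"], axis=0)  # one column per destination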
-
- The Usage of the ReliefF Algorithm for Edit & Imputation in the Canadian Census of Population
Irwin Khuu, Statistics Canada, Canada-
Abstract
Historically, the Canadian Census of Population Edit & Imputation (E&I) process has operated using a nearest-neighbour donor imputation methodology wherein the distance between a failed unit and a potential donor is obtained through a weighted combination of auxiliary variables. Revision to the model between cycles can be a complicated and time-consuming process given that there is no standard approach to variable selection and weighting between topics. This presentation will illustrate the potential of the ReliefF feature selection algorithm to create a machine learning-driven approach to variable selection and weighting that is standardized and comparable between Census cycles and among the many topics of the Census. An overview of how this process may be applied in practice will be presented, followed by results on a diverse set of topics that indicate a general improvement over previous methods.
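To make the feature-weighting idea concrete, a compact Relief-style weighting (one nearest hit and one nearest miss per sampled record) is sketched below; it illustrates the algorithm family only, not the Census production implementation, which averages over several neighbours and treats categorical variables differently.

# Minimal ReliefF-style feature weighting for numeric data, illustration only.
import numpy as np

def relief_weights(X, y, n_samples=200, seed=None):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant features
    w = np.zeros(p)
    m = min(n_samples, n)
    for i in rng.choice(n, size=m, replace=False):
        diff = np.abs(X - X[i]) / span  # per-feature normalized differences
        dist = diff.sum(axis=1)
        dist[i] = np.inf                # exclude the record itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class record
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class record
        w += diff[miss] - diff[hit]     # reward features that separate classes
    return w / m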
-
- Factors affecting response propensity, with an interest in units sampled multiple times – an empirical study using social surveys at Statistics Canada
Noah Johnson, Catherine Deshaies-Moreault, Cilanne Boulet, Statistics Canada, Canada-
Abstract
As the need for data has grown over the past several years, the effect and burden of repeatedly sampling the same units for multiple surveys has become an increasing concern. Response burden is generally assumed to contribute to decreasing response rates; however, there are few empirical studies looking into this question. One such study was undertaken by Statistics Canada, aggregating data on responses to social surveys conducted between 2020 and 2023. It investigates factors contributing to the observed response patterns, including the effect of having been previously selected on response propensity. This presentation will describe the study, share its principal results, and address how they can inform decisions about sample coordination among household surveys at Statistics Canada.
-
17:00 – 17:15
Closing Remarks
Simon Goldberg Room
- Wesley Yung, Director General, Modern Statistical Methods and Data Science Branch, Statistics Canada, Canada