Social Data Linkage Environment - Privacy impact assessment
Introduction
The Social Data Linkage Environment (SDLE) builds on past record linkage experience to make possible a program of pan-Canadian socio-economic record linkage research. A well-structured and regulated program of record linkage is required to:
- increase the relevance of existing Statistics Canada surveys without the need to collect new data or re-collect data held by other data sources;
- maintain the relevance of longitudinal surveys that have been terminated, including the National Population Health Survey, the National Longitudinal Survey of Children and Youth, the Youth in Transition Survey, the Longitudinal Survey of Immigrants to Canada and the Survey of Labour and Income Dynamics;
- substantially increase the use of administrative data;
- replace or supplement existing data collection programs in the social domain;
- maintain the highest data privacy and security standards.
Objective
A privacy impact assessment for the Social Data Linkage Environment was conducted to determine if there were any privacy, confidentiality and security issues associated with the program, and if so, to make recommendations for their resolution or mitigation.
Description
Statistics Canada has responsibility for securely storing and processing data sets and for the production of analysis files needed to carry out approved research studies. Because SDLE research projects will involve the use of linked records, approval on a study-by-study basis will also be required from Statistics Canada’s senior management in accordance with the Statistics Canada Directive on Record Linkage. A Derived Record Depository and separate Key Registry will be created to reduce privacy risks and to improve the efficiency and quality of the linkages.
The Derived Record Depository (DRD) is created by linking various Statistics Canada data files for the purpose of producing a list of unique individuals. Each individual in the DRD is assigned an anonymous SDLE identifier. The identifier is randomly assigned and has no value outside of the SDLE. Some of the data files used for the DRD include the Census of Population and National Household Survey, T1 Personal Master Files (Tax), Canadian Child Tax Benefits (CCTB) files, the Canadian Birth Database (CBDB), the Canadian Mortality Database (CMDB), the Landed Immigrant Database and the Indian Registry. The DRD is an unduplicated national longitudinal file and will be updated through further record linkages on an ongoing basis.
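The idea of an unduplicated list with randomly assigned identifiers can be illustrated with a minimal sketch. It is not Statistics Canada's method: the exact-match rule on name and date of birth, the field names, the 16-character identifier format and the functions `match_key`, `new_sdle_id` and `build_depository` are all assumptions made for illustration, and real record linkage would typically tolerate spelling and data-entry differences.

```python
import secrets


def match_key(record):
    """Hypothetical match rule: exact match on normalized name and date of birth."""
    return (
        record["surname"].strip().lower(),
        record["given_name"].strip().lower(),
        record["date_of_birth"],
    )


def new_sdle_id(used):
    """Draw a random identifier that carries no information about the person."""
    while True:
        candidate = secrets.token_hex(8)  # assumed format: 16 hex characters
        if candidate not in used:
            used.add(candidate)
            return candidate


def build_depository(source_files):
    """Map each unique individual (by match key) to one anonymous identifier.

    source_files: dict of source name -> list of record dicts.
    Only the identifying fields are read; no survey variables are stored.
    """
    depository = {}
    used = set()
    for records in source_files.values():
        for record in records:
            key = match_key(record)
            if key not in depository:
                depository[key] = new_sdle_id(used)
    return depository
```

Because the identifier is drawn at random rather than derived from the name or any other attribute, it cannot be reversed to recover personal information and has no meaning outside the environment that issued it.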
Only basic personal identifiers are stored in the DRD. Survey data from the various input databases are not required to create the DRD and will not be stored within the DRD. The DRD will initially comprise the following personal identifiers: Surnames; Given names; Date of birth; Sex; Marital status; Date of landing/immigration; Date of emigration; Date of death; Social Insurance Numbers (SIN), Temporary Taxation Numbers (TTN), Dependant Identifier Numbers (DIN); Spouse’s SIN/TTN; Dependant/Disabled individual SIN/TTN/DIN; Parent SIN/TTN; Health Information Numbers; Addresses; Address Registry Unique Identifier (ARUID); Standard Geography Classification (SGC) codes; Telephone numbers; Spouse’s surname; Mother’s surname; Father’s surname; Alternate surname; and a Statistics Canada-generated sequential identification number for each individual identified through the annual Derived Record Depository linkage process. Access to the Derived Record Depository will be restricted to the Statistics Canada employees responsible for its development and maintenance.
The paired SDLE IDs and source file record IDs identified through the record linkage will be stored in a separate Key Registry. Once a study cohort has been defined, these “linkage keys” can then be used to find the records associated with cohort members across the databases comprising the SDLE. This approach creates a virtual linkage environment which eliminates the need to build a large integrated database. Under SDLE, all survey data will continue to reside in their current locations and be maintained under existing arrangements. Thus, the SDLE is an environment in which data sources can be brought together to create an analysis file for specific, approved linkage studies. The SDLE does not house a large, integrated database of information from across survey data sources.
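A minimal sketch of how such linkage keys might be used to assemble an analysis file on demand follows. The structures and names (`build_key_registry`, `assemble_analysis_file`, the in-memory dictionaries standing in for source databases, and the simplification of one record per person per source) are hypothetical and are not drawn from the SDLE itself.

```python
from collections import defaultdict


def build_key_registry(pairs):
    """Store linkage keys only: SDLE ID -> {source name: source record ID}.

    pairs: iterable of (sdle_id, source_name, record_id) tuples.
    Assumes at most one record per person per source (a simplification).
    """
    registry = defaultdict(dict)
    for sdle_id, source_name, record_id in pairs:
        registry[sdle_id][source_name] = record_id
    return registry


def assemble_analysis_file(cohort, registry, sources, wanted):
    """Build one row per cohort member by looking records up where they reside.

    sources: source name -> {record_id: record dict}; stands in for the
             existing databases, which are read but never copied wholesale.
    wanted:  source name -> list of variables approved for the study.
    """
    rows = []
    for sdle_id in cohort:
        row = {"sdle_id": sdle_id}
        keys = registry.get(sdle_id, {})
        for source_name, variables in wanted.items():
            record = sources[source_name].get(keys.get(source_name), {})
            for variable in variables:
                # Missing links yield None rather than failing the study build.
                row[f"{source_name}.{variable}"] = record.get(variable)
        rows.append(row)
    return rows
```

In this sketch the registry holds nothing but identifiers, and the analysis file contains only the variables requested for a given cohort, which mirrors the point that no large integrated database needs to exist.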
Risk Area Identification and Categorization
The privacy impact assessment (PIA) also identifies the risk areas and categorizes the level of potential risk (level 1 representing the lowest level of potential risk and level 4, the highest) associated with the collection and use of personal information of respondents.
- Type of program or activity – Level 1: Program or activity that does not involve a decision about an identifiable individual.
- Type of personal information involved and context – Level 3: Social insurance number, medical, financial, or other sensitive personal information, or the context surrounding the personal information is sensitive; personal information of minors or of legally incompetent individuals or involving a representative acting on behalf of the individual.
- Program or activity partners and private sector involvement – Level 1: Within the institution (among one or more programs within the same institution).
- Duration of the program or activity – Level 3: Long-term program or activity.
- Program population – Not applicable: The program’s use of personal information is not for administrative purposes. Information is collected for statistical and related research purposes, under the authority of the Statistics Act.
- Personal information transmission – Level 1: The personal information is used within a closed system (i.e., no connections to the Internet, Intranet or any other system, and the circulation of hardcopy documents is controlled).
- Technology and privacy: The program involves a modified version of the Longitudinal Health and Administrative Data (LHAD) Initiative methodology. It uses the LHAD data model but replaces LHAD’s Health Client Registries (health insurance client registries provided by provincial ministries of health) with a Derived Record Depository (DRD) using data collected or held by Statistics Canada. The program involves automated personal information processing, and personal information matching techniques for statistical analysis purposes only.
- Privacy breach: There is a very low risk of a breach in which some personal information is disclosed without proper authorization. The impact on the individual would be low because personal identifiers such as a person’s name or address are never stored with survey or administrative data. Personal identification data are stored in separate index files that are only accessed by a small number of Statistics Canada staff whose work requires access.
Conclusion
This privacy impact assessment has not identified any outstanding issues relating to confidentiality or security. Confidentiality of information maintained in the secure environment of Statistics Canada is governed by the Statistics Act and the Agency has an exemplary record in that regard. Similarly, from a security perspective, Statistics Canada has had in place for many years security policies and practices that are only now becoming best practice in many other organizations.
Many activities of Statistics Canada, like the SDLE, are by their very nature privacy-intrusive. Although a number of potential privacy concerns were identified, this assessment concludes that with the mitigation measures that have been put in place, any remaining risks are either negligible or are such that Statistics Canada is prepared to accept and manage the risk.