The Open Database of Healthcare Facilities (ODHF)
Metadata document: concepts, methodology and data quality
Version 1.1
Data Exploration and Integration Lab (DEIL)
Centre for Special Business Projects (CSBP)
August 7, 2020
Table of Contents
- 1. Overview
- 2. Target Population
- 3. Data Sources
- 4. Reference Period and Last Update Dates
- 5. Compilation Methodology
- 6. Database Coverage
- 7. Data Quality
- 8. Data Dictionary
- 9. Contact Us
- Appendix A: Open Data Sources
- Appendix B: Other Publicly Available Data Sources or Sources of Directly-Provided Data
1. Overview
The Open Database of Healthcare Facilities (ODHF) is a Canada-wide healthcare facilities database. It has been compiled by the Centre for Special Business Projects (CSBP) at Statistics Canada. This document discusses the methodology used to create the ODHF. This document pertains to the first update of the ODHF (version 1.1) in August 2020. The first version of the ODHF was published in April 2020 and the main updates for version 1.1 include the addition of 5 new data sources, updates to entries with the collaboration of the data providers, and enhanced deduplication.
The database uses both open data as well as publicly available data (a dataset being designated open depending on whether or not the data are distributed under an open data license). Most of the data are sourced from municipal, regional and provincial/territorial governments, federal agencies, or independent not-for-profit organizations specializing in the health information field. The data have been either web-scraped, downloaded, or obtained directly from the data sources.
The main objective of the ODHF is to disseminate this information by harmonizing and integrating the data assembled from the various sources used and, to a limited extent, by adding geolocation information to those data.
Version 1.1 of the ODHF contains 7,033 individual records. This is a reduction of approximately 2,000 records relative to version 1.0. This difference is primarily due to enhanced deduplication (over 1,600 entries removed) applied in version 1.1, but also due to removing some records at the request of data providers and replacing the source of data used for the province of Québec, which was web-scraped in version 1.0 and replaced with an open source in version 1.1. The ODHF is provided as a compressed comma separated values (CSV) file. The database is expected to be updated periodically as new datasets become available or as other improvements are made.
The ODHF is one of several datasets created as part of the Linkable Open Data Environment (LODE), an initiative at CSBP. The LODE is an exploratory initiative that aims at enhancing the use and harmonization of open and publicly available data from authoritative sources by providing a collection of datasets released under a single licence. The LODE also provides open-source code to link these datasets together. Access to the LODE datasets and code is available through the Statistics Canada Linkable Open Data Environment website.
2. Target Population
A healthcare facility is a physical site at which the primary activity is the provision of healthcare. Healthcare facilities in Canada that provide healthcare services are in scope for this dataset. Specifically, in terms of the North American Industry Classification System (NAICS), the following industries are in scope:
- 621 - Ambulatory health care services
- 622 - Hospitals
- 623 - Nursing and residential care facilities
Facilities are included when their primary activities relate to healthcare, regardless of the source of funding, private or public status, operator type, location or other attributes not listed here. Furthermore, only one type is assigned to each facility, so a facility that offers multiple types of service is listed under a single type. Alternative medicine (e.g., herbalists) and specialist areas (e.g., chiropractors, dentists, mental health specialists, etc.) are not included in the current ODHF version (version 1.1). However, where the sources used contained such out-of-scope facilities, some of them might still be present in the ODHF.
Facilities that are in areas indirectly related to overall healthcare delivery, e.g., pharmacies, social assistance, etc., are also not in scope of the current version of the ODHF.
3. Data Sources
The sources used are detailed in Appendix A for open data sources and in Appendix B for publicly available data sources. The links to the original datasets, licenses or terms of use, attribution statements and additional notes are also included in Appendix A and Appendix B. An additional 5 data sources have been added in the 1.1 update. At the request of some of the data providers, some entries have been updated or removed.
Nearly all data sources used to create this database are publicly available sources, such as municipal governments, provincial/territorial governments and health authorities and agencies, and independent not-for-profit organizations specializing in the health information field. The data were obtained either from open data portals located on websites, through web-scraping, or were provided directly by the source. In most cases, sources were discovered using major search engines or through professional contacts. Sources were sought in all Canadian provinces and territories.
The distinction between open and other publicly available data is based on the licensing terms (explicit or implicit) attached to each source dataset used. Open data licenses permit, in varying degrees, usability for any lawful purpose, redistribution (re-sharing) and modification and re-packaging of the data. However, open data licenses can impose some restrictions, such as attribution of original source, share-alike (re-sharing only with like conditions), and no commercial use. Examples of open data licenses are Creative Commons, MIT, GPLv3, and Canada's Open Government Licence. In general, no warranty is expressed and only very minor conditions are stipulated by the provider.
Publicly available data that are not open data might be associated with proprietary licensing or terms of use that may restrict some of the aspects that would otherwise be permitted under open data licensing.
For further information on the individual licenses, users should consult the information provided on the data providers' portals.
4. Reference Period and Last Update Dates
In principle, the reference date of the database would represent the date for which all healthcare facilities in existence at that time would be included in the dataset. Ideally, this would be the same date for all datasets used. However, this is not the case and the reference dates vary by provider. In some cases, such detail was not present in the information made available by data providers.
Appendix A and Appendix B provide the date when each source dataset was last updated by the provider (this information was collected at the time the dataset was accessed for this project). As every data source had only one version available, that version was used and taken to be the most current available.
Users are cautioned that the last update date should not be interpreted as the reference date of the data. If specific information concerning the reference period of data is required, users should contact the appropriate data providers shown in Appendix A: Open Data Sources and Appendix B: Other Publicly Available Data Sources.
5. Compilation Methodology
This section provides an overview of the processing done to compile the ODHF.
Data Cleaning
The primary processing component for the database comprised reformatting the source data to CSV format and mapping the original dataset attributes to the variable (column) names defined for this database. A data dictionary of the variables used for this database is provided in section 8 Data Dictionary. To clean the data, the following was done:
- Address parsing and normalization
- Concatenated address data were parsed and separated into the respective location variables using libpostal, a state-of-the-art natural language processing solution for address parsing. A small number of addresses were parsed incorrectly and were manually corrected.
- Data entry formatting (removal of excess whitespace and punctuation), normalization of postal codes and addresses, province/territory names.
- Some data entries that were filtered out by automated cleaning methods were manually corrected. See section 8 for more details.
- Removal of duplicates
- The removal of duplicates is done using fuzzy string matching based on criteria involving the facility name, street name, street number and geo-coordinates. The criteria were derived empirically and with the intent of avoiding false positives.
- Identification of erroneous entries
- Identifying erroneous entries was done both programmatically and manually. Data entries that could not be correctly processed by automated techniques were filtered and stored in a separate file and manually corrected later.
- Selection of record to retain in case of duplicates
- In some instances, a facility was present in more than one source. In such cases, the record with the most information available was retained. Where information between sources did not match, validation tools were used to decide which to retain.
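The entry-formatting step above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used to build the ODHF: the function names, the partial province lookup, and the choice of `A1A 1A1` as the output format for postal codes are all assumptions made for the example.

```python
import re

# Assumption: a partial name-to-code lookup for illustration only; a full
# table would cover all 13 provinces and territories.
PROVINCE_CODES = {
    "ontario": "ON",
    "quebec": "QC",
    "québec": "QC",
    "british columbia": "BC",
    "new brunswick": "NB",
}

# Canadian postal codes alternate letter-digit-letter digit-letter-digit.
POSTAL_RE = re.compile(r"^[A-Z]\d[A-Z]\d[A-Z]\d$")


def clean_text(value):
    """Collapse runs of whitespace and strip stray leading/trailing punctuation."""
    collapsed = re.sub(r"\s+", " ", value).strip()
    return collapsed.strip(" ,;.")


def normalize_postal_code(value):
    """Return a postal code as 'A1A 1A1', or None if it does not fit the pattern."""
    compact = re.sub(r"[^A-Za-z0-9]", "", value).upper()
    if not POSTAL_RE.match(compact):
        return None
    return compact[:3] + " " + compact[3:]


def normalize_province(value):
    """Map a province name to a two-letter code; pass existing codes through."""
    cleaned = clean_text(value)
    if len(cleaned) == 2:
        return cleaned.upper()
    return PROVINCE_CODES.get(cleaned.lower())
```

Returning `None` for values that do not fit the expected pattern, rather than guessing, mirrors the approach described above of setting aside entries that automated cleaning cannot handle so that they can be corrected manually.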
For the 1.1 update, a more rigorous deduplication process was carried out to remove some duplicates that existed in the first release. This process used the Python Record Linkage Toolkit package to perform string comparisons on the various columns of the database, and the scikit-learn package to perform a machine learning classification to identify potential duplicate records. Entries without enough information to be classified in this way were processed by treating all record pairs in the same province with facility name comparison scores above a certain threshold as potential duplicates. All potential duplicates identified using this approach were then manually verified before removal. For the purpose of this database, the unit of analysis is a healthcare facility rather than any particular service; therefore, in instances where one facility (such as a hospital complex) contains multiple individual services, these are reduced to a single entry. As a result of this process, over 1,600 duplicates were removed.
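The fallback rule described above (pairs in the same province whose facility names score above a threshold) can be illustrated with a small sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for the Python Record Linkage Toolkit comparisons; the threshold value, function names and sample records are hypothetical, since the document does not state the threshold that was actually used.

```python
from difflib import SequenceMatcher
from itertools import combinations

NAME_THRESHOLD = 0.85  # assumption: the actual threshold is not documented


def name_similarity(a, b):
    """Similarity in [0, 1] between two facility names, ignoring case and spacing."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def candidate_duplicates(records, threshold=NAME_THRESHOLD):
    """Return index pairs in the same province whose names score above threshold."""
    pairs = []
    for (i, rec_a), (j, rec_b) in combinations(enumerate(records), 2):
        if rec_a["province"] != rec_b["province"]:
            continue  # only compare records within one province
        score = name_similarity(rec_a["facility_name"], rec_b["facility_name"])
        if score > threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical records for illustration.
records = [
    {"facility_name": "Example General Hospital", "province": "ON"},
    {"facility_name": "Example  General Hospital", "province": "ON"},
    {"facility_name": "Example General Hospital", "province": "BC"},
    {"facility_name": "Sample Walk-in Clinic", "province": "ON"},
]
print(candidate_duplicates(records))
```

The pairs returned are only candidates. As described above, each one would still be verified manually before a record is removed.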
During validation, changes may have been applied to facility names and addresses when deemed appropriate. This may cause occasional discrepancies between the street number and street name columns and the original source address column.
For more details on the software used to process the data, please refer to CSBP's GitHub page.
Determination of Healthcare Facility Types
The original data sources use a variety of standards, classifications and nomenclatures to describe the type of healthcare facility. Unfortunately, there is no classification for healthcare facilities in Canada that is used universally. Health authorities classify their facilities independently using different classification systems. The following classification of healthcare facilities is currently used for the database:
- Ambulatory health care services: Establishments primarily engaged in providing health care services, directly or indirectly, to ambulatory patients. (Example: medical clinic, mental health center.)
- Hospitals: Establishments, licensed as hospitals, primarily engaged in providing diagnostic and medical treatment services, and specialized accommodation services to in-patients. (Example: emergency department, general hospital.)
- Nursing and residential care facilities: Establishments primarily engaged in providing residential care combined with either nursing, supervisory or other types of care as required by the residents. (Example: nursing home.)
The classification is intended to have broad categories that are helpful in distinguishing major types of facilities and yet enable accuracy in mapping source-specific facility types. Facility types are determined from source-specific facility types (e.g., cancer treatment centers are classified as 'Hospitals') and source coverage metadata information. Assignments are done using keywords and validated afterwards, with changes made manually whenever needed. Where facilities were classified based on source metadata information, this was done analytically on a case-by-case basis.
Table 1 illustrates the use of keywords to assign type categories to the healthcare facilities based on the classification used for the ODHF.
Variable | Condition | Value | Classification |
---|---|---|---|
Facility type | contains the keywords | 'community health center', 'clinic' | Ambulatory health care services |
Facility type | contains the keywords | 'hospital', 'cancer treatment', 'emergency', 'cancer centre', 'health centre' | Hospitals |
Facility type | contains the keywords | 'senior active living', 'nursing home', 'long-term care' | Nursing and residential care facilities |
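The keyword rules in Table 1 could be applied along the following lines. This is a sketch rather than the ODHF code: the keyword lists are copied from the table, but the case-insensitive matching, the first-match-wins order and the function name are assumptions.

```python
# Keyword lists copied from Table 1. Checking the rules in table order and
# returning the first match is an assumption of this sketch.
RULES = [
    ("Ambulatory health care services",
     ["community health center", "clinic"]),
    ("Hospitals",
     ["hospital", "cancer treatment", "emergency",
      "cancer centre", "health centre"]),
    ("Nursing and residential care facilities",
     ["senior active living", "nursing home", "long-term care"]),
]


def classify_facility_type(source_facility_type):
    """Return the ODHF type for a source facility type, or None if nothing matches."""
    text = source_facility_type.lower()
    for odhf_type, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return odhf_type
    return None  # no keyword matched: leave for manual review
```

For example, `classify_facility_type("Walk-in Clinic")` returns `"Ambulatory health care services"`. A source type with no matching keyword returns `None`, which corresponds to the cases that are validated or assigned manually.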
Geocoding and Determination of Census Subdivision (CSD or Municipality)
Geocoding was carried out for some sources that provide address data but no geo-coordinates. Latitude and longitude were determined and validated using online tools. A subset of the source-provided geo-coordinates was also validated online. Some coordinates from the original sources were removed when it was determined that they were derived from postal codes or other aggregate geographic areas rather than from street addresses.
Note: While efforts have been made to ensure the accuracy of geo-coordinates, no guarantees are implied, and errors and inaccuracies are possible.
Census subdivision (CSD), or municipality, was derived either from the geographic coordinates, by linking to the CSD polygons through a spatial join operation using the Python package GeoPandas, or from the city name available in the record's address field, using GeoSuite.
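To show what the spatial join does conceptually, the sketch below assigns a CSD to a point with a plain-Python point-in-polygon test. It is a stand-in, not the GeoPandas workflow itself (GeoPandas offers `sjoin` for this kind of operation on real boundary files). The square polygons, CSD names and identifiers are hypothetical.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) falls inside the polygon ring."""
    inside = False
    n = len(ring)
    for k in range(n):
        x1, y1 = ring[k]
        x2, y2 = ring[(k + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_csd(lon, lat, csd_polygons):
    """Return (CSDuid, CSDname) of the first polygon containing the point, else None."""
    for csd in csd_polygons:
        if point_in_polygon(lon, lat, csd["ring"]):
            return csd["CSDuid"], csd["CSDname"]
    return None


# Hypothetical square "CSDs"; real boundaries come from census boundary files.
csd_polygons = [
    {"CSDuid": 9900001, "CSDname": "Exampleville",
     "ring": [(-76.0, 45.0), (-75.0, 45.0), (-75.0, 46.0), (-76.0, 46.0)]},
    {"CSDuid": 9900002, "CSDname": "Sampleton",
     "ring": [(-75.0, 45.0), (-74.0, 45.0), (-74.0, 46.0), (-75.0, 46.0)]},
]
print(assign_csd(-75.5, 45.5, csd_polygons))
```

A point that falls in no polygon returns `None`; in that situation the city-name route described above would be the alternative.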
6. Database Coverage
The ODHF current version (version 1.1) database as provided contains 7,033 healthcare facilities.
As the total number of all healthcare facilities in the country is not known with a reasonable degree of certainty, the coverage obtained with the sources used was not quantitatively assessed. However, many of the sources purport to list all institutions of a certain type (e.g. acute care hospital, residential care) within a jurisdiction. Thus, within these institution type categories and jurisdictions, coverage would be expected to be fairly complete. However, if facilities of a certain category were omitted in a source, e.g., outpatient medical clinics, then these might be missing from the database, unless they were obtained from a different source.
7. Data Quality
The accuracy and completeness of the information is in general a function of the source datasets used. Except as noted, the underlying datasets are taken "as is".
- Classifying facilities
- Assignment of facility type was largely based on facility types provided by source datasets. In instances where facility type was either unclear or not defined by the source, facility type was classified based on further research.
- Duplicates
- Some datasets provide data where the rows do not represent unique facilities. Although deduplication techniques are used, it is expected that there are some duplicates remaining.
- Address parsing
- Natural language processing methods were used to parse and separate address strings into address variables, such as postal code and street number. These methods are known for state-of-the-art performance and accuracy but, as with all statistical learning methods, they have limitations. Poor or unconventional formatting of addresses might result in incorrect parsing. Upon manual review of the database, no incorrect parses were identified. At this stage, address records in the database are expected to be correctly parsed.
- Geo-coordinates
- Some facilities that did not have geo-coordinates were geocoded using OpenStreetMap's Nominatim API. The accuracy of the geocoding was manually validated by using proprietary mapping services available on the internet. In some cases, facility coordinates were also manually determined from online map services.
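A geocoding request of the kind mentioned above might look like the following sketch, which only builds the request URL and parses a response, without making a network call. The helper names and the way the address is assembled into a free-text query are assumptions; the ODHF code may have queried Nominatim differently.

```python
import json
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(street_no, street_name, city, province):
    """Build a Nominatim free-text search URL restricted to Canada."""
    query = f"{street_no} {street_name}, {city}, {province}, Canada"
    params = {"q": query, "format": "json", "limit": 1, "countrycodes": "ca"}
    return NOMINATIM_SEARCH + "?" + urlencode(params)


def first_hit_coordinates(response_text):
    """Return (latitude, longitude) from a Nominatim JSON response, or None."""
    hits = json.loads(response_text)
    if not hits:
        return None  # no match: candidate for manual geocoding
    # Nominatim returns coordinates as strings in the "lat" and "lon" fields.
    return float(hits[0]["lat"]), float(hits[0]["lon"])
```

Anyone sending such requests to the public Nominatim service should check its usage policy first, which asks clients to identify themselves and to limit their request rate. An empty result corresponds to the cases above where coordinates were determined manually.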
8. Data Dictionary
This data dictionary describes the variables contained within the ODHF. The database is provided in a CSV format. Each facility is listed per row and its attributes provided in columns. The corresponding column variables are described in the data dictionary below.
Healthcare Facility Variables
Variable – Index
- Name
- index
- Format
- Alphanumeric
- Source
- Assigned serially
- Description
- Unique serial number for each facility. Supplemental entries to version 1.1 are identified by the prefix "S" followed by an assigned serial number.
Variable – Facility Name
- Name
- facility_name
- Format
- String
- Source
- Provided as is from original data
- Description
- Healthcare facility name
Variable – Source Facility Type
- Name
- source_facility_type
- Format
- String
- Source
- Provided as is from original data
- Description
- Regional health authority assigned healthcare facility type
Variable – ODHF Facility Type
- Name
- odhf_facility_type
- Format
- String
- Source
- Imputed from source data or metadata
- Description
- Value determined using the classification criteria used (see section 5)
Variable – Provider
- Name
- provider
- Format
- String
- Source
- Assigned based on the provider's identity
- Description
- The identity or name of the data provider
Location Variables
Variable – Unit Number
- Name
- unit
- Format
- String
- Source
- Parsed from a full address string or provided as is
- Description
- Civic unit or suite number
Variable – Street Number
- Name
- street_no
- Format
- String
- Source
- Parsed from a full address string or provided as is
- Description
- Civic street number
Variable – Street Name
- Name
- street_name
- Format
- String
- Source
- Parsed from a full address string or provided as is
- Description
- Civic street name (type and direction)
Variable – Postal Code
- Name
- postal_code
- Format
- String
- Source
- Parsed from a full address string or provided as is
- Description
- Civic postal code
Variable – City
- Name
- city
- Format
- String
- Source
- Parsed from a full address string or provided as is
- Description
- City name
Variable – Province/Territory
- Name
- province
- Format
- String
- Source
- Converted to two letter codes after parsing from a full address string, or provided as is, or indicated by the provider
- Description
- Province or territory name
Variable – Source-Format Street Address
- Name
- source_format_str_address
- Format
- String
- Source
- Street address from the data source provided as is
- Description
- Street address in the source data
Variable – CSD Name
- Name
- CSDname
- Format
- String
- Source
- Imputed from geographic coordinates and city names
- Description
- Census subdivision name
Variable – CSD Unique Identifier
- Name
- CSDuid
- Format
- Integer
- Source
- Imputed from CSD name using GeoSuite 2016
- Description
- Census subdivision unique identifier
Variable – Province or Territory Unique Identifier
- Name
- PRuid
- Format
- Integer
- Source
- Imputed from CSD unique identifier by taking the first two digits
- Description
- Province unique identifier
Variable – Latitude
- Name
- latitude
- Format
- Float
- Source
- Provided as is from original data or corrected value if source value found inaccurate during validation
- Description
- Latitude
Variable – Longitude
- Name
- longitude
- Format
- Float
- Source
- Provided as is from original data or corrected value if source value found inaccurate during validation
- Description
- Longitude
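As a usage illustration for the dictionary above, the sketch below reads rows in the ODHF column layout with the standard `csv` module and checks that each PRuid equals the first two digits of the CSDuid, as the dictionary describes. The two sample rows and their values are hypothetical, and the sketch assumes CSDuid and PRuid are populated on every row, which may not hold in the published file.

```python
import csv
import io

# Hypothetical two-row extract using column names from the data dictionary.
SAMPLE = """index,facility_name,odhf_facility_type,province,CSDuid,PRuid,latitude,longitude
1,Example General Hospital,Hospitals,ON,3506008,35,45.40,-75.70
2,Sample Nursing Home,Nursing and residential care facilities,QC,2466023,24,,
"""


def pruid_from_csduid(csduid):
    """Return the province/territory identifier: the first two digits of a CSDuid."""
    return int(str(csduid)[:2])


def to_float_or_none(value):
    """Convert a coordinate field to float, treating blanks as missing."""
    return float(value) if value.strip() else None


def load_facilities(handle):
    """Read ODHF-style rows, typing the identifier and coordinate columns."""
    rows = []
    for row in csv.DictReader(handle):
        row["CSDuid"] = int(row["CSDuid"])
        row["PRuid"] = int(row["PRuid"])
        row["latitude"] = to_float_or_none(row["latitude"])
        row["longitude"] = to_float_or_none(row["longitude"])
        rows.append(row)
    return rows


facilities = load_facilities(io.StringIO(SAMPLE))
# Rows whose PRuid disagrees with the first two digits of the CSDuid.
mismatches = [r["index"] for r in facilities
              if pruid_from_csduid(r["CSDuid"]) != r["PRuid"]]
print(mismatches)
```

The `index` column is kept as a string because, per the dictionary, supplemental entries carry an "S" prefix and the column is therefore alphanumeric.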
9. Contact Us
Statistics Canada's open data projects are modelled on ongoing improvement. To provide information on additions, updates, corrections or omissions, or for more information, please contact us at statcan.lode-ecdo.statcan@statcan.gc.ca. Please include the title of the open database in the subject line of the email.
Appendix A: Open Data Sources
Data provider | Province / territory | Link | License / Terms of Use | Last updated by provider | Description | New source for ODHF v1.1 |
---|---|---|---|---|---|---|
British Columbia (Province) | British Columbia | British Columbia - Data Catalogue - Emergency Rooms in BC | Open Government Licence - British Columbia | 12/24/2019 | Emergency services in British Columbia | No |
British Columbia (Province) | British Columbia | British Columbia - Data Catalogue - Hospitals in BC | Open Government Licence - British Columbia | 12/25/2019 | Hospitals in British Columbia | No |
British Columbia (Province) | British Columbia | British Columbia - Data Catalogue - Residential Care Facilities | Open Government Licence - British Columbia | 12/26/2019 | Residential care in British Columbia | No |
British Columbia (Province) | British Columbia | British Columbia - Data Catalogue - Walk-in Clinics in BC | Open Government Licence - British Columbia | 12/27/2019 | Walk-in clinics in British Columbia | No |
Moncton (Municipality) | New Brunswick | City of Moncton - Senior Care Facilities | City of Moncton - Open Data Terms of Use | 3/19/2010 | Senior care facilities within the Greater Moncton area | Yes |
Moncton (Municipality) | New Brunswick | City of Moncton - Medical Clinics | City of Moncton - Open Data Terms of Use | 3/19/2010 | Medical clinics in the Greater Moncton area | Yes |
New Brunswick (Province) | New Brunswick | Digital New Brunswick - Map of Licensed Nursing Homes | Open Government Licence - New Brunswick | 07/16/2019 | Licensed nursing homes in New Brunswick | Yes |
Nova Scotia (Province) | Nova Scotia | Open Data Nova Scotia - Hospitals | Nova Scotia Open Government Licence | 2/15/2019 | Hospitals in Nova Scotia | No |
Prince Edward Island (Province) | Prince Edward Island | PEI Health Facilities | PEI Health Facilities | 4/17/2020 | Healthcare facilities in Prince Edward Island | Yes |
Prince Edward Island (Province) | Prince Edward Island | Open Data Prince Edward - Health PEI Facility Locations | Open Government Licence - Prince Edward Island | 8/8/2019 | Healthcare facilities in Prince Edward Island | No |
Québec City, Québec (Municipality) | Québec | Données Québec - Ville de Québec - Lieux publics | Creative Commons - Attribution 4.0 International (CC BY 4.0) | 2/24/2020 | Hospitals in Québec City, Québec | No |
Québec (Province) | Québec | Santé et des Services sociaux Québec - Fichier cartographique des installations - M02 | Données Québec - Licence Creative Commons (CC BY) | 5/20/2020 | Healthcare and social services facilities in the province of Québec | Yes |
Gatineau, Québec (Municipality) | Québec | Données Québec - Ville de Gatineau - Lieux publics | Creative Commons - Attribution 4.0 International (CC BY 4.0) | 2/25/2019 | Hospitals in Gatineau, Québec | No |
Nova Scotia (Province) | Nova Scotia | Open Data Nova Scotia - Long Term Care and Residential Care Facilities | Nova Scotia Open Government Licence | 2/15/2019 | Residential care in Nova Scotia | No |
Ontario (Province) | Ontario | Ontario GeoHub - Ministry of Health Service Provider Locations (via: Ontario Data catalogue - Hospital locations) | Open Government Licence - Ontario | 10/15/2019 | Healthcare facilities in Ontario | No |
Horizon Regional Health Authority (New Brunswick) | New Brunswick | Digital New Brunswick - Hospitals in New Brunswick Operated by Horizon Health Network | Open Government Licence - New Brunswick | 3/18/2020 | Hospitals in New Brunswick operated by Horizon | No |
Vitalité Regional Health Authority (New Brunswick) | New Brunswick | Digital New Brunswick - Hospitals in New Brunswick Operated by Vitalité Health Network | Open Government Licence - New Brunswick | 3/18/2020 | Hospitals in New Brunswick operated by Vitalité | No |
Alberta (Province) | Alberta | Alberta Open Government - Hospital services in Alberta | Open Government Licence - Alberta | 7/1/2018 | Hospitals and healthcare facilities in Alberta | No |
Manitoba (Province) | Manitoba | Manitoba Government - Rural Health Care Facilities in Manitoba | (Waived) | 6/30/2017 | Healthcare facilities in Manitoba | No |
Appendix B: Other Publicly Available Data Sources or Sources of Directly-Provided Data
Data Provider | Province/ Territory | Link | License / Terms of Use | Last Updated by Provider | Description |
---|---|---|---|---|---|
Canadian Institute for Health Information | Canada | Provided directly via email | (Waived) | not available | Healthcare facilities in Canada |
Manitoba (Province) | Manitoba | Manitoba Government - Health Services Wait Time Information - Map of Facilities | Manitoba Government - Copyright (Waived) | not available | Hospitals in Manitoba |
Manitoba - Winnipeg Regional Health Authority | Manitoba | Winnipeg Regional Health Authority - Location and Services | Winnipeg Regional Health Authority - Terms of Use and Privacy Statement | not available | Locations of facilities managed by the Winnipeg Regional Health Authority |
Manitoba - Interlake-Eastern Regional Health Authority | Manitoba | Interlake-Eastern Regional Health Authority - Hospital Locations | N/A | not available | Locations of facilities managed by the Interlake-Eastern Regional Health Authority |
Manitoba - Northern Health Region | Manitoba | Northern Health Region | N/A | not available | Locations of facilities managed by the Northern Health Region |
Manitoba - Prairie Mountain Health | Manitoba | Prairie Mountain Health - Locations Map | Prairie Mountain Health - Legal Notice and Disclaimer | not available | Locations of facilities managed by the Prairie Mountain Health Authority |
Manitoba - Southern Health Region | Manitoba | Southern Health - Finding Care | Southern Health - Disclaimers - Terms and Conditions | not available | Locations of facilities managed by the Southern Health Authority |
Nunavut (Territory) | Nunavut | The Government of Nunavut - Qikiqtani General Hospital | N/A | not available | Single hospital in Nunavut |
Public Health Agency of Canada | Canada | Provided directly via email | (Waived) | not available | Hospitals in Canada |
Newfoundland and Labrador (Province) | Newfoundland and Labrador | Government of Newfoundland and Labrador - Services in Your Region | Government of Newfoundland and Labrador- Disclaimer / Copyright / Privacy Statement | not available | Healthcare facilities in Newfoundland and Labrador |
Northwest Territories (Territory) | Northwest Territories | Government of Northwest Territories - Hospitals and Health Centres | Government of Northwest Territories - Terms of use (Waived) | not available | Healthcare facilities in Northwest Territories |
Manitoba (Province) | Manitoba | Interlake-Eastern Regional Health Authority | N/A | not available | Healthcare facilities in Manitoba |
Yukon (Territory) | Yukon | Provided directly to CSBP via email | (Waived) | not available | Healthcare facilities in Yukon |
Saskatchewan (Province) | Saskatchewan | Saskatchewan Health Authority - Locating Facility and Service Information | N/A | not available | Healthcare facilities in Saskatchewan |