Reducing data gaps for training machine learning algorithms using a generalized crowdsourcing application
By: Chatana Mandava and Nikhil Widhani, Statistics Canada
Introduction
Crowdsourcing is an online process in which a company or organization solicits contributions from a large group of people. Contributions can be anything from ideas, content and services to funding. This process allows organizations to tap into the collective intelligence and creativity of individuals they have no connection with. It also gives them access to resources they would otherwise lack, such as new technology or expertise from outside the organization.
As part of Statistics Canada's modernization, crowdsourcing has emerged as a cutting-edge method of gathering important data for statistical purposes. Statistics Canada (StatCan) has implemented multiple crowdsourcing projects, including:
- The OpenStreetMap (OSM) crowdsourcing pilot project, which crowdsourced geographic information by mapping building footprints in the Ottawa, Ontario, and Gatineau, Quebec, areas. This project helped launch the Building Canada 2020 initiative, which mapped all building footprints in Canada in OSM by 2020.
- The COVID-19 crowdsourcing project, for which a public use microdata file was released containing information from crowdsourced questionnaires. The file helps analyze how COVID-19 has affected Canadians' experiences with discrimination, sense of belonging, trust in institutions and access to health care services. This product is provided through StatCan's Electronic File Transfer Service (see: Crowdsourcing: Impacts of COVID-19 on Canadians' Experiences of Discrimination Public Use Microdata File).
- The Crowdsourcing-Cannabis project, in which StatCan crowdsourced details of the public's most recent cannabis transactions, including the amount, quality, location, and reason for use. Respondents were also asked how frequently they consumed cannabis and how much they consumed on average each month (see: Crowdsourcing - Cannabis 2020). This initiative continues to collect information on a relatively new market and helps to monitor prices in a confidential and non-intrusive manner.
There's increasing demand within StatCan and other agencies to collect alternative sources of data generated from crowdsourcing. In a recent proof of concept, StatCan's Data Science Division, in collaboration with the Centre for Special Business Projects (CSBP) and Nutrition North Canada, developed the Indigenous Communities Food Receipts optical character recognition (OCR) project. This proof of concept collects grocery receipt images from northern communities within Canada. Key variables from these receipts, such as price, product name and subsidy, are extracted using OCR methods. In addition, the nutrition AI proof-of-concept project by StatCan's Centre for Population Health Data (CPHD) explored food images to collect nutrition data such as portion sizes and calories (see: Context modelling with transformers: Food recognition). Crowdsourced data is the major component of both projects, although the data collected for each are different. In cases like these, a generalized application helps the organization crowdsource different formats of data. Because the application can be reused across multiple projects, it reduces the work of creating a separate application for each data collection.
These exploratory projects have inspired us to develop a generalized crowdsourcing application and to expand its use cases to various unstructured data formats, such as text, PDFs, and satellite images, which can then be transformed into structured data using machine learning techniques.
Motivation and value proposition
The motivation behind investing in such an application is to provide a one-stop solution that gives government organizations the minimal infrastructure required to host crowdsourcing applications. This will not only generate a new stream of data collection but will also allow us to investigate data diversity with unconventional solutions. The resulting pool of data will cover more use cases where data sources are limited and help our machine learning models improve in performance and scalability.
The value of developing a generalized crowdsourcing application is twofold. First, it's an efficient tool that collects data from a large sample. This enables us to generate more reliable and timelier statistics on various topics, such as population trends or economic development, at low cost. Second, the application could be used to facilitate collaboration between the public and researchers by allowing them to share their knowledge and experiences with one another, generating better insights into important issues facing the country. By leveraging the collective intelligence of Canadians across all demographics, StatCan would have access to rich information that can inform policy decisions and improve public services.
Architecture
The architecture shown in Figure 1 can be divided into three core sections:
Backend
The tables are stored in an SQLite database. SQLite is a relational database management system (RDBMS) contained in a C library. Unlike client-server database systems, it needs no separate server process, so you don't have to install or configure a server to use it. It stores data in tables like other RDBMSs such as MySQL and PostgreSQL but typically requires less memory and disk space than these systems. SQLite is well suited to small and medium-sized applications; because it allows only one writer at a time, applications with heavy concurrent write traffic may eventually need a client-server RDBMS. Data custodians, who own the crowdsourced data, can access it in a structured format. In addition, the application authenticates users who are administrators or developers of the application so that they can manage security and functionality. The schema used for this project is shown in the diagram below.
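As a rough illustration of how such a backend could be set up with Python's built-in sqlite3 module, the sketch below creates a small schema and runs the kind of query a data custodian might use. The table and column names (users, projects, submissions) and the helper functions are assumptions made for this example; they are not the project's actual schema.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the schema used in the project.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,
    role TEXT NOT NULL CHECK (role IN ('admin', 'developer', 'custodian'))
);
CREATE TABLE IF NOT EXISTS projects (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    custodian_id INTEGER NOT NULL REFERENCES users(id)
);
CREATE TABLE IF NOT EXISTS submissions (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES projects(id),
    file_name TEXT NOT NULL,
    data_format TEXT NOT NULL,
    submitted_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_database(path=":memory:"):
    """Open (or create) the database file and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def submissions_for_custodian(conn, custodian_id):
    """Return (file_name, data_format) rows for every project a custodian owns."""
    rows = conn.execute(
        "SELECT s.file_name, s.data_format FROM submissions s "
        "JOIN projects p ON p.id = s.project_id "
        "WHERE p.custodian_id = ? ORDER BY s.id",
        (custodian_id,),
    )
    return rows.fetchall()


conn = open_database()
conn.execute(
    "INSERT INTO users (id, username, role) VALUES (1, 'custodian_a', 'custodian')"
)
conn.execute(
    "INSERT INTO projects (id, name, custodian_id) VALUES (1, 'Food receipts', 1)"
)
conn.execute(
    "INSERT INTO submissions (project_id, file_name, data_format) "
    "VALUES (1, 'receipt_001.jpg', 'image')"
)
print(submissions_for_custodian(conn, 1))
```

Because the whole database lives in a single file (or in memory, as here), no server needs to be provisioned before the first submission is stored.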
Crowdsourcing Builder
The Crowdsourcing Builder is a feature that provides prebuilt interfaces and design templates for building crowdsourcing apps tailored to specific use cases. Data custodians can use the Crowdsourcing Builder from within the application to generate forms without writing code. These custom templates can then be hosted and configured in the application by the data custodians. The idea is to allow users to build and host many crowdsourcing pages using one common application.
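One way to picture a code-free builder is as a declarative template that the application turns into a form. The sketch below is only an assumption about how that could work: the template structure, the field types and the render_form function are hypothetical and do not describe the Crowdsourcing Builder's actual design.

```python
from html import escape

# Hypothetical template a data custodian might assemble in the builder.
# The keys and field types are assumptions made for this example.
RECEIPT_TEMPLATE = {
    "title": "Grocery receipt upload",
    "fields": [
        {"name": "community", "label": "Community", "type": "text"},
        {
            "name": "receipt",
            "label": "Receipt image",
            "type": "file",
            "accept": ".jpg,.png",
        },
    ],
}

SUPPORTED_TYPES = {"text", "number", "file"}


def render_form(template):
    """Turn a declarative template into the HTML for a simple upload form."""
    lines = [
        f"<h1>{escape(template['title'])}</h1>",
        '<form method="post" enctype="multipart/form-data">',
    ]
    for field in template["fields"]:
        if field["type"] not in SUPPORTED_TYPES:
            raise ValueError(f"Unsupported field type: {field['type']}")
        name = escape(field["name"], quote=True)
        attrs = f'type="{field["type"]}" name="{name}" id="{name}"'
        if "accept" in field:
            attrs += f' accept="{escape(field["accept"], quote=True)}"'
        lines.append(f'  <label for="{name}">{escape(field["label"])}</label>')
        lines.append(f"  <input {attrs}>")
    lines.append('  <button type="submit">Submit</button>')
    lines.append("</form>")
    return "\n".join(lines)


print(render_form(RECEIPT_TEMPLATE))
```

A second project, such as a food image collection, would only need a different template; the rendering code stays the same, which is the reuse the generalized application is aiming for.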
Frontend
The final component of the application is its frontend. The frontend of a crowdsourcing application is the interface that users interact with. It includes graphical elements such as buttons, images, menus, and forms that allow users to perform tasks within the application. The frontend also provides visual feedback to help guide users through their tasks. The goal of a well-designed frontend is to make it easy for users to understand how to use the application and to accomplish their goals quickly.
Potential challenges
- Ensuring security: One of the biggest challenges while developing a generalized crowdsourcing application is ensuring that all user data and interactions are secure. This includes protecting users' personal information.
- Creating an engaging UI: Building an intuitive and engaging user interface (UI) is essential for any successful crowdsourcing application. Designing a UI that appeals to both new and experienced users can be difficult, so developers must ensure that the features are easy to use yet powerful and flexible enough to meet users' needs.
- Implementing quality control measures: It's important to implement quality control measures to ensure that only high-quality submissions are accepted. Measures include checking the data submitted by users in real time, for example with image quality checks, grammar checks, sensitive data checks, and verification of the uploaded file extension. Because this generalized application collects multiple formats of data, it's important to develop a time-efficient algorithm that can run these quality checks and notify the user whether the upload passes.
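The quality control measures in the last item can be sketched as a set of fast checks that run at upload time. The example below is a minimal sketch under stated assumptions: the allowed extensions, the size limit, the email pattern used as a stand-in for a sensitive data check, and the check_upload function are all hypothetical. Image quality and grammar checks are left out because they would need additional libraries.

```python
import re
from pathlib import Path

# Assumed policy values for this example only.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".pdf", ".txt"}
MAX_BYTES = 5 * 1024 * 1024  # 5 MiB

# Very rough stand-in for a sensitive data check: flags email-like strings.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_upload(file_name, content):
    """Return a list of problems found; an empty list means the upload passed."""
    problems = []
    extension = Path(file_name).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"File type '{extension}' is not accepted.")
    if len(content) == 0:
        problems.append("The file is empty.")
    elif len(content) > MAX_BYTES:
        problems.append("The file exceeds the size limit.")
    if extension == ".txt":
        # Only plain-text uploads are scanned in this sketch.
        text = content.decode("utf-8", errors="replace")
        if EMAIL_PATTERN.search(text):
            problems.append("The text appears to contain an email address.")
    return problems


for name, data in [
    ("receipt.JPG", b"\xff\xd8"),
    ("notes.txt", b"contact me at a@b.ca"),
    ("script.exe", b""),
]:
    issues = check_upload(name, data)
    print(name, "passed" if not issues else issues)
```

Returning the full list of problems, rather than stopping at the first one, lets the frontend tell the user everything that needs fixing in a single message.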
Conclusions
We have discussed how a single application can be built to perform crowdsourcing for different types of structured and unstructured data. This will allow an organization to investigate alternative data sources, use innovative methods to collect data and develop different solutions. It will also allow us to engage with the public to better understand issues at the planning or design stage of new projects. Crowdsourcing is a modern approach to collecting data from audiences who are interested in bringing about change and in taking part in the development of new statistics. By combining this with machine learning techniques, we can create new solutions that were not possible before because data were limited and costly.
Meet the Data Scientist
If you have any questions about our article or would like to discuss this further, we invite you to Meet the Data Scientist, an event where authors meet the readers, present their topic and discuss their findings.
Thursday, February 16
2:00 to 3:00 p.m. ET
MS Teams – link will be provided to the registrants by email
Register for the Data Science Network's Meet the Data Scientist Presentation. We hope to see you there!
References
Statistics Canada (2008). Crowdsourcing: Impacts of COVID-19 on Canadians' Experiences of Discrimination Public Use Microdata File (accessed January 6, 2023).
Statistics Canada (n.d.-a). Crowdsourcing – Cannabis. Last updated January 22, 2020 (accessed January 6, 2023).
Statistics Canada (n.d.-b). Statistics Canada Data Strategy. Last updated August 16, 2022 (accessed January 6, 2023).