2021 Census Comment Classification
By: Joanne Yoon, Statistics Canada
Once every five years, the Census of Population provides a detailed and comprehensive statistical portrait of Canada and its population. The census is the only data source that provides consistent statistics for both small geographic areas and small population groups across Canada. Census information is central to planning at all levels. Whether starting a business, monitoring a government program, planning transportation needs or choosing the location for a school, Canadians use census data every day to inform their decisions.
Preparation for each cycle of the census requires several stages of engagement, as well as testing and evaluating data to recommend questionnaire content for the next census, as is the case for the upcoming 2021 Census. These steps include content consultations and discussions with stakeholders and census data users, as well as the execution of the 2019 Census Test (which validates respondent behaviours and ensures that questions and census materials are understood by all participants).
At the end of the Census of Population questionnaires, respondents are provided with a text box in which they can share concerns and suggestions, or make comments about the steps to follow, the content or the characteristics of the questionnaire. The information entered in this space is further analyzed by the Census Subject Matter Secretariat (CSMS) during and after the census collection period. Comments pertaining to specific questionnaire content are classified by subject matter area (SMA)—such as education, labour or demography—and shared with the corresponding expert analysts. The information is used to support decision making regarding content determination for the next census and to monitor factors such as respondent burden.
Using machine learning to classify comments
In an effort to improve the analysis of the 2021 Census of Population comments, Statistics Canada's Data Science Division (DScD) worked in collaboration with CSMS to create a proof of concept on the use of machine learning (ML) techniques to quickly and objectively classify census comments. As part of the project, CSMS identified fifteen possible comment classes and provided previous census comments labelled with one or more of these classes. These fifteen classes included the census SMAs as well as other general census themes by which to classify comments from respondents, such as "experience with the electronic questionnaire," "burden of response," "positive census experience" and "unrelated to the census." Using ML techniques along with the labelled data, the data scientists trained a bilingual semi-supervised text classifier: comments can be in either French or English, and the machine uses labelled data to learn each class while leveraging unlabelled data to understand its data space. DScD data scientists experimented with two ML models. The strengths of each model, along with the final model, are detailed in this article.
The data scientists trained the 2021 Census comment classifier using comments from the 2019 Census Test. The CSMS team manually labelled these comments using the fifteen identified comment classes and reviewed each other's coding in an effort to reduce coding biases. The classifier is multi-class since there are fifteen classes into which a comment can be classified. It is also multi-label since a respondent can address multiple topics within a single comment, and so the comment can be coded to one or more classes.
Deterministic question and page number mapping
When a comment contains a question or page number, that number is deterministically mapped to the SMA class associated with the question and then combined with the ML class prediction to produce the final class prediction. For example, say that a respondent completes a questionnaire where question number 22 asks about the respondent's education. In the comment box, the respondent comments on question 22 by explicitly stating the question number and also mentions the sex and gender questions without stating any question numbers. The mapping outputs the education class, and the ML model predicts the sex and gender class based on the words used to mention the sex and gender questions. The program outputs the final prediction, which is the union of the two outputs: the education class and the sex and gender class. When no question or page number is explicitly mentioned, the program only outputs the ML prediction. The ML model is not trained to learn the page number mapping of each question since the location of a question can change depending on the questionnaire format. For example, questions fall on different pages in the regular font and large print questionnaires because fewer questions fit per page in large print, and the electronic questionnaire does not show any page numbers.
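The mapping step described above can be sketched in a few lines. This is a minimal illustration, not the production code: the `QUESTION_TO_CLASS` table, the regular expression and the function names are assumptions, the pattern only handles English phrasing, and page numbers are left out because they depend on the questionnaire format.

```python
import re

# Hypothetical mapping from question number to SMA class; the real mapping
# depends on the questionnaire and is not given in the article.
QUESTION_TO_CLASS = {22: "education"}

# Matches "question 22", "Q22" or "q. 22" (English only, for illustration).
QUESTION_PATTERN = re.compile(r"\b(?:question|q)\.?\s*(\d+)\b", re.IGNORECASE)


def map_question_numbers(comment):
    """Return the set of classes for question numbers stated in the comment."""
    classes = set()
    for number in QUESTION_PATTERN.findall(comment):
        mapped = QUESTION_TO_CLASS.get(int(number))
        if mapped is not None:
            classes.add(mapped)
    return classes


def final_prediction(comment, ml_classes):
    """Union of the deterministic mapping and the ML model's predicted classes."""
    return map_question_numbers(comment) | set(ml_classes)


comment = "Question 22 was confusing, and the sex and gender questions were unclear."
# The ML prediction is stubbed here; in the article it comes from the trained model.
print(final_prediction(comment, {"sex and gender"}))
```

When the comment states no question number, `map_question_numbers` returns an empty set and the final prediction is simply the ML prediction.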
Text cleaning
Before training the classifier, the program first cleans the comments. It identifies the language of the comment (English or French) and then corrects the spelling of each unidentifiable word by replacing it with the word that requires the fewest edits and is most frequently found in the training data. For example, the word "toqn" can be corrected to the valid words "torn" or "town," but is corrected to "town" because "town" was used more frequently in the training data. The words are also lemmatized into their root representation, so the machine understands the words "walk" and "walked" to have the same root meaning. Stop words are not removed since helper words have meaning and imply sentiment. For example, "this should be better" has a different meaning from "this is better," but if the program dropped all stop words (including "this," "should," "be" and "is"), the two sentences would become identical with only one word left: "better." Removing stop words can alter the meaning and the sentiment of a comment.
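The spelling correction and the stop-word example above can be illustrated with a small sketch. It rests on several assumptions: the word counts are toy values standing in for the training data, only candidates one edit away are considered, the alphabet omits accented French letters, and all names are hypothetical.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Toy word frequencies standing in for counts from the training data.
WORD_COUNTS = Counter({"town": 12, "torn": 3, "better": 20})

# Illustrative stop-word list taken from the example in the text.
STOP_WORDS = {"this", "should", "be", "is"}


def one_edit_away(word):
    """All strings reachable from `word` with one delete, swap, replace or insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in ALPHABET]
    inserts = [left + ch + right for left, right in splits for ch in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def correct(word):
    """Keep known words; otherwise pick the most frequent known word one edit away."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in one_edit_away(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing close enough, leave the word unchanged
    return max(candidates, key=lambda w: WORD_COUNTS[w])


def drop_stop_words(sentence):
    """Remove stop words, shown only to demonstrate the loss of meaning."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]


print(correct("toqn"))  # town
print(drop_stop_words("this should be better") == drop_stop_words("this is better"))  # True
```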
Bilingual semi-supervised text classifier
The bilingual semi-supervised text classifier learns from the labelled comments and is used to classify comments. It is not a single technique but a combination of individual pieces chosen to best classify census comments.
The data scientists trained a bilingual model in which French and English comments made up 29% and 71% of the labelled data, respectively, as detected by a language-detection Python program (6,597 French labelled comments and 16,062 English labelled comments). Training the model on both languages let it leverage identical words (such as consultation, journal and restaurant) that have the same meaning in both languages to improve accuracy on French comments, which have fewer labels than English comments.
The model is semi-supervised. Labelled data define the knowledge that the machine needs to replicate. When given the labelled training data, the model uses maximum likelihood to learn its parameters and adversarial training to be robust to small perturbations. Unlabelled data are also used to expand the data space that the machine should handle with low confusion, but they do not teach the model about the meaning of the classes. The unlabelled data are only used to lower the model's confusion, using entropy minimization to minimize the conditional entropy of the estimated class probabilities and virtual adversarial training to maximize the local smoothness of the conditional label distribution against local perturbation.
The text classifier starts with an embedding layer to accept words as input. A lookup table maps each word to a dense vector since the machine learns from numbers and not characters. The embedding layer represents a sequence of words as a sequence of vectors. With this sequence, the model looks for a pattern that is more generalizable and robust than learning individual words. Also, to prevent the machine from memorizing certain expressions rather than semantic meaning, a dropout layer directly follows the embedding layer. When training, the dropout layer drops random words from the training sentence. The proportion of words dropped is fixed, but the dropped words are selected at random. The model is forced to learn without some words so that it generalizes better. When using the model to classify comments, no words are dropped and the model can use all identified knowledge and patterns to make a prediction.
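To make the entropy minimization term more concrete, the sketch below computes the mean entropy of predicted class probabilities over a batch of unlabelled comments. It is only an illustration: the probability values are invented, the function names are hypothetical, and virtual adversarial training is not shown.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def entropy_loss(batch):
    """Mean entropy over a batch of predictions on unlabelled comments."""
    return sum(entropy(p) for p in batch) / len(batch)


# Invented predictions: each inner list is one comment's class probabilities.
confident = [[0.98, 0.02], [0.01, 0.99]]
uncertain = [[0.5, 0.5], [0.6, 0.4]]

# Confident predictions have lower entropy, so minimizing this term pushes
# the model toward low-confusion predictions on unlabelled data.
print(entropy_loss(confident) < entropy_loss(uncertain))  # True
```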
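The embedding lookup and the word dropout described above can be sketched as follows. The embedding values, the `UNKNOWN` vector and the function names are assumptions made for illustration. The sketch follows the description in the text (a fixed proportion of words, chosen at random), which differs from the element-wise dropout found in many deep learning frameworks.

```python
import random

# Hypothetical 3-dimensional embeddings; a trained model learns these values.
EMBEDDINGS = {
    "the": [0.1, 0.0, 0.2],
    "form": [0.4, 0.3, 0.1],
    "was": [0.0, 0.2, 0.1],
    "long": [0.7, 0.1, 0.5],
}
UNKNOWN = [0.0, 0.0, 0.0]  # vector used for words missing from the table


def embed(words):
    """Map a sequence of words to a sequence of dense vectors."""
    return [EMBEDDINGS.get(word, UNKNOWN) for word in words]


def word_dropout(vectors, rate, training, rng):
    """Zero out a fixed proportion of randomly chosen words, during training only."""
    if not training:
        return vectors  # at prediction time every word is kept
    n_drop = int(rate * len(vectors))
    dropped = set(rng.sample(range(len(vectors)), n_drop))
    return [[0.0] * len(v) if i in dropped else v for i, v in enumerate(vectors)]


vectors = embed("the form was long".split())
rng = random.Random(0)  # seeded only to make the illustration repeatable
print(word_dropout(vectors, rate=0.25, training=True, rng=rng))
print(word_dropout(vectors, rate=0.25, training=False, rng=rng))
```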
Comparing CNN to Bi-LSTM
The data scientists compared a convolutional neural network (CNN) to a bidirectional long short-term memory (Bi-LSTM) network. Both networks can classify text by automatically learning complex patterns, but they learn differently because of their different structures. In this proof of concept, the data scientists experimented with three different models to learn all fifteen classes: a single-headed Bi-LSTM model, a multi-headed Bi-LSTM model and a multi-headed CNN model. Overall, the single-headed Bi-LSTM model consistently predicted all the classes the most accurately and will thus be used in production.
An LSTM can capture long-term dependencies between word sequences using input, forget and output gates, as it can learn to retain or forget the previous state's information. The previous state's information is the context made by the group of words that preceded the current word that the network is looking at. If the current word is an adjective, the network knows what the adjective is referring to because it retained that information from earlier in the sentence. If the sentence moves on to a different topic, the network should forget the previous state's information. Since a Bi-LSTM is bidirectional, the model gathers past and future information relative to each word.
The CNN model applies a convolution filter to a sliding window over groups of words and max pooling to select the most prominent information from a phrase rather than looking at each word independently. A CNN defines the semantic context of a word using neighbouring words, whereas an LSTM learns from a sequential pattern of words. Individual features are concatenated to form a single feature vector that summarizes the key characteristics of the input sentence.
A multi-headed classifier was tested with a final sigmoid layer giving a confidence score for each class. The sigmoid layer represents each class prediction confidence score as a decimal between 0 and 1 (i.e. 0% to 100%), where each score is independent of the others. This is ideal for the multi-label problem of comments that talk about multiple topics.
The data scientists also tested a single-headed classifier, where a model only learns to identify whether a single class is present in the text using a softmax activation function. The number of single-headed classifiers is equal to the number of classes. An input comment can have multiple labels if multiple classifiers predict that their topic is mentioned in the comment. For example, if a comment talks about language and education, the language classifier and the education classifier will predict 1 to signal the presence of the relevant SMA classes, and the other classifiers will predict 0 to signal their absence.
A single-headed classifier learns each class better than a multi-headed classifier, which needs to learn fifteen different classes, but there is the added burden for programmers of maintaining fifteen different classifiers. The burden of running the multiple classifiers is minimal since the program can easily run all classifiers in a loop and output the relevant classes that are present. As shown below, the single-headed Bi-LSTM model performs the best across the different classes and also in the weighted average.
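The gating mechanism described above can be illustrated with a heavily simplified LSTM cell in which every quantity is a single number rather than a vector. The weights, the inputs and the function names are assumptions, and a real model would use vector inputs with learned weight matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell; `w` maps each gate to (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the previous cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate information from the current word
    c = f * c_prev + i * g            # updated cell state (the retained context)
    h = o * math.tanh(c)              # hidden state passed to the next step
    return h, c


def run(sequence, w):
    """Run the cell over a sequence and return the hidden state at each position."""
    h, c = 0.0, 0.0
    states = []
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
        states.append(h)
    return states


def bidirectional(sequence, w_forward, w_backward):
    """Pair each position's left-to-right state with its right-to-left state."""
    forward = run(sequence, w_forward)
    backward = run(sequence[::-1], w_backward)[::-1]
    return list(zip(forward, backward))


# Arbitrary illustrative weights; a trained model learns these values.
WEIGHTS = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}
for forward, backward in bidirectional([0.1, -0.4, 0.9], WEIGHTS, WEIGHTS):
    print(round(forward, 3), round(backward, 3))
```

Each position ends up with two states, one summarizing the words before it and one summarizing the words after it, which is the past and future information mentioned above.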
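The convolution and max pooling steps can be sketched on toy data. The word vectors, the filters and the ReLU activation are assumptions chosen for illustration, and the sketch assumes the sentence has at least as many words as the filter window.

```python
def convolve(vectors, kernel):
    """Slide `kernel` (window x dimension weights) over the word vectors.

    Returns one activation per window position.
    """
    window = len(kernel)
    scores = []
    for start in range(len(vectors) - window + 1):
        score = 0.0
        for offset in range(window):
            score += sum(w * x for w, x in zip(kernel[offset], vectors[start + offset]))
        scores.append(max(score, 0.0))  # ReLU activation (an assumption)
    return scores


def sentence_features(vectors, kernels):
    """Max-pool each filter's activations and concatenate them into one vector."""
    return [max(convolve(vectors, kernel)) for kernel in kernels]


# Toy 2-dimensional word vectors for a four-word sentence.
vectors = [[1, 0], [0, 1], [1, 1], [0, 0]]
# Two filters, each covering a window of two words.
kernel_a = [[1, 0], [0, 1]]
kernel_b = [[0, 1], [1, 0]]

print(convolve(vectors, kernel_a))                        # [2.0, 1.0, 1.0]
print(sentence_features(vectors, [kernel_a, kernel_b]))   # [2.0, 2.0]
```

Max pooling keeps only the strongest response of each filter, wherever it occurs in the sentence, which is why the resulting feature vector has one entry per filter regardless of sentence length.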
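A short sketch shows why independent sigmoid scores suit the multi-label case. The class list, the raw scores and the 0.5 threshold are assumptions, since the article does not state the decision threshold used.

```python
import math

CLASSES = ["language", "education", "burden of response"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Score each class independently and keep those at or above the threshold."""
    scores = {name: sigmoid(z) for name, z in zip(CLASSES, logits)}
    labels = [name for name, score in scores.items() if score >= threshold]
    return scores, labels


# Invented raw outputs of the final layer for one comment.
scores, labels = predict_labels([2.0, 1.5, -3.0])
print(labels)  # ['language', 'education']
# The scores need not sum to 1, so several classes can be confident at once.
print(sum(scores.values()) > 1.0)  # True
```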
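The one-classifier-per-class arrangement can be sketched as a simple loop. The keyword-based classifiers below are stand-ins for trained single-headed models, and every name and keyword is hypothetical.

```python
def keyword_classifier(keywords):
    """Stand-in for a trained single-headed model: returns 1 if any keyword appears."""
    def classify(comment):
        text = comment.lower()
        return 1 if any(keyword in text for keyword in keywords) else 0
    return classify


# One classifier per class; the keywords are purely illustrative.
CLASSIFIERS = {
    "language": keyword_classifier(["language", "french", "english"]),
    "education": keyword_classifier(["school", "degree", "education"]),
    "labour": keyword_classifier(["job", "work", "employer"]),
}


def classify_comment(comment):
    """Run every single-headed classifier and collect the classes predicted as present."""
    return [name for name, model in CLASSIFIERS.items() if model(comment) == 1]


print(classify_comment("The education question ignored degrees earned in another language."))
# ['language', 'education']
```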
Table 1: Test weighted average F1-score of different models.
Model | F1-score |
---|---|
Single-headed Bi-LSTM | 90.2% |
Multi-headed CNN | 76% |
Multi-headed Bi-LSTM | 73% |
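For readers unfamiliar with the metric, the sketch below shows one common way to compute a weighted average F1-score: per-class F1 values averaged with weights equal to each class's number of true instances. The article does not state the exact weighting used, so this is an assumption, and the labels are invented toy data.

```python
def f1_score(true_labels, predicted_labels):
    """F1 for one class from parallel lists of 0/1 labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(per_class):
    """Average per-class F1, weighted by each class's count of true instances."""
    total_support = sum(sum(true) for true, _ in per_class.values())
    weighted_sum = sum(f1_score(true, pred) * sum(true)
                       for true, pred in per_class.values())
    return weighted_sum / total_support


# Toy labels for four comments: (true labels, predicted labels) per class.
toy_results = {
    "education": ([1, 1, 1, 0], [1, 1, 0, 0]),
    "language": ([1, 0, 0, 0], [1, 1, 0, 0]),
}
print(round(weighted_f1(toy_results), 4))  # 0.7667
```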
Amongst the multi-headed classifiers, the CNN had a 4.6% higher average test F1-score than the Bi-LSTM when classifying comments into SMA classes such as language and education. On the other hand, the Bi-LSTM model's average test F1-score on the general census theme classes (i.e. "unrelated to the census," "positive census experience," "burden of response," "experience with the electronic questionnaire") was 9.0% higher than the CNN model's. The Bi-LSTM was better at predicting whether a comment was relevant to the Census program because it captured the overall context of what the comment was directed at. For example, a respondent's positive opinion on a Canadian sports team is not relevant to the census, so this type of comment would be classified under the class "unrelated to the census." In this case, the CNN model predicted the comment to be positive in nature and thus assigned it to the "positive census experience" class, whereas the Bi-LSTM tied the positive sentiment to the context (sports teams) and, since the context was unrelated to the census, correctly labelled the comment as being of no value for further analysis by CSMS. The CNN, on the other hand, only looks at a smaller range of words, so it excels at extracting features in the parts of a sentence that are relevant to certain classes.
Next steps
This proof of concept showed that an ML model can accurately classify bilingual census comments. The classifier is multi-class, meaning that there are multiple classes to classify a comment into. It is also multi-label, meaning that more than one class may be relevant to the input comment. The second phase of this project will be to transition this model into production. In production, French and English comments will be spell checked and stemmed to their root words depending on each comment's language. A bilingual semi-supervised text classifier will classify both the cleaned French and English comments. The labelled 2019 data will be used to train the ML model to predict and label incoming comments from the new 2021 Census of Population and ensure that the respondent comments are categorized and shared with the appropriate expert analysts. In the production phase, when 2021 Census comments come in, the CSMS team and data scientists will continue to validate the ML predictions and feed them back to the machine to further improve the model.
If you are interested in text analytics or want to find out more about this particular project, the Applied machine learning for text analysis community of practice (GC employees only) recently featured a presentation on this project. Join the community to ask questions or discuss other text analytics projects.