Official Languages in Natural Language Processing
By: Julien-Charles Lévesque, Employment and Social Development Canada; Marie-Pier Schinck, Employment and Social Development Canada
It is no secret that English is the dominant language in the field of Natural Language Processing. This can be a challenge for Government of Canada data scientists, who must ensure the quality of French-language data and that data in both official languages receives equal treatment, so as to avoid bias.
The Data Science Division of the Chief Data Office (CDO) at Employment and Social Development Canada (ESDC) is launching a research project on the use of natural language processing (NLP) for both official languages. This initiative, funded by ESDC's Innovation Lab, aims to deepen the CDO's understanding of the impact of language (French or English) on the performance of the tools and techniques used in NLP, enabling more informed decisions in future NLP projects.
Why is it important to explore the use of both official languages in NLP?
ESDC has experienced the challenge of English dominance firsthand through its work on NLP projects, and some of its partners in other departments report facing the same issue. While there are numerous possible approaches to processing data in multiple languages, it is unclear whether some of them work better than others at providing predictions of comparable quality in both official languages. Because language handling is never the sole focus of a project, data scientists can invest only limited time and resources in exploring the question, which can lead to suboptimal decisions. For French in particular, there is a need to better understand the implications of the choices data scientists make when applying NLP techniques. This understanding will lead to better-quality handling of French data, helping to reduce language-driven biases and increase fairness in solutions affecting service delivery to clients, as well as in internal solutions.
New research into NLP techniques for official languages
To address this problem, ESDC is launching a research project focused on recurrent questions surrounding the application of NLP techniques to both official languages. This includes techniques for preprocessing, embedding and modelling text, as well as techniques to mitigate the impact of imbalanced datasets. The team wants to gain transferable knowledge that it, along with the Government of Canada (GoC) data science community, can leverage to help bridge the gap between French and English in the quality of NLP applications in the federal government.
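As a concrete illustration of why preprocessing choices matter across languages, consider accent folding: a normalization step that is largely harmless for English text but discards meaningful distinctions in French. The sketch below is illustrative only and is not taken from the project's code.

```python
import unicodedata

def strip_accents(text):
    # NFD decomposition splits accented characters into a base
    # character plus a combining mark; dropping the combining marks
    # (Unicode category "Mn") folds the text to its unaccented form.
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )

# Harmless for most English text...
print(strip_accents("resume"))   # -> "resume"
# ...but in French the accent can be the only thing distinguishing
# two words: "tache" (stain) vs "tâche" (task), "du" vs "dû" (owed).
print(strip_accents("tâche"))    # -> "tache"
```

A pipeline tuned on English data can therefore silently degrade French inputs, which is exactly the kind of asymmetry this research aims to surface.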
For now, text classification problems will serve as the only use cases: text classification is both a very common NLP task and one the team has worked on in numerous projects. They have access to several labelled datasets from these past projects, enabling them to ground their findings in an applied context, using real-life data from their department.
Leveraging existing datasets
The ESDC team will be using datasets from four past text classification problems they solved. These vary in terms of the length of documents, the quality of the text, the classification task (binary vs. multiclass), the proportion of French and English content as well as the way the French content was handled. For context, each of these problems is explored in more detail below.
- The T4 project is a binary classification problem of notes written by call centre agents; the objective was to predict whether a T4 had already been re-sent to a client.
- The Media Monitoring project is a binary classification problem of NewsDesk news articles; the aim was to predict if articles were relevant to senior management.
- The Record of Employment Comments (ROEC) project is a multiclass classification problem, where the objective was to predict which reason for separation corresponded to employers' comments on Record of Employment forms.
- The Human Resources (HR) project is a research project that explored the pre-selection of candidates for large entry-level staffing processes. It was framed as a binary classification problem where the objective was to predict the label attributed by HR staff based on the candidates' answers to screening questions.
| Project name | Problem type | Dataset size | Proportion of French content | Input length | Method used |
| --- | --- | --- | --- | --- | --- |
| T4 | Binary | Small (6k) | 35% | Short | Tokens in both languages, n-gram & chi-square + MLP |
| Media Monitoring | Binary | Large (1M) | 25% | Long | French translated to English, meta-embedding (from GloVe, fastText and Paragram), ensemble model (LSTM, GRU, CNN) |
| ROEC | Multiclass | Medium-large (300k+) | 28% | Short | Tokens in both languages, n-gram & chi-square + MLP |
| HR | Binary | Small (5k) | 6% | Medium to long | Pretrained multilingual contextual embeddings (BERT Base) followed by fine-tuning |
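Two of these projects (T4 and ROEC) used the same general pattern: token n-grams, chi-square feature selection, then a multilayer perceptron (MLP). A minimal scikit-learn sketch of that kind of pipeline follows; the parameter values and toy data are placeholders, not the ones used in the ESDC projects.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neural_network import MLPClassifier

# Token n-grams -> chi-square feature selection -> small MLP.
# k="all" keeps every feature for this toy corpus; a real project
# would tune k to keep only the most discriminative n-grams.
pipeline = Pipeline([
    ("ngrams", CountVectorizer(ngram_range=(1, 2), min_df=2)),
    ("select", SelectKBest(chi2, k="all")),
    ("mlp", MLPClassifier(hidden_layer_sizes=(50,), max_iter=500,
                          random_state=0)),
])

# Toy bilingual data: both languages are fed to a single model,
# as in the "tokens in both languages" approach from the table.
texts = [
    "T4 was re-sent to the client yesterday",
    "client requested a new T4 slip",
    "le T4 a déjà été renvoyé au client",
    "le client demande une copie de son T4",
]
labels = [1, 0, 1, 0]
pipeline.fit(texts, labels)
preds = pipeline.predict(texts)
```

Keeping both languages in one vocabulary is only one of the options the research will compare; the Media Monitoring project instead translated French to English before training.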
Key research questions
This work will explore key questions that typically arise when developing NLP solutions for classification. The recurring question of imbalanced datasets in GoC data (more observations in English than in French) will also be addressed. More specifically, this project will attempt to answer the following questions:
- What is the difference between using a separate model for French and English and using a single model for both? Can general rules or guidelines be inferred for when each approach might be preferable?
- Is the strategy of translating French data to English and then training a monolingual English model valid? What are the main factors to take into consideration when using that approach?
- Are models trained on a multitude of languages biased in favour of one language over the other? Is the understanding of French documents equivalent to the understanding of English ones?
- What is the impact of the imbalance in language representation in the data? Is there a minimum French to English ratio that should be targeted? Which methods should be used to mitigate the implications of this imbalance?
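One concrete way to probe the bias and imbalance questions above is to report evaluation metrics per language rather than only on the pooled test set. The helper below is a hypothetical illustration, not the project's code.

```python
def per_language_accuracy(y_true, y_pred, langs):
    """Break test-set accuracy down by document language, so that a
    gap between French and English performance becomes visible even
    when the pooled score looks acceptable."""
    scores = {}
    for lang in set(langs):
        pairs = [(t, p) for t, p, l in zip(y_true, y_pred, langs)
                 if l == lang]
        scores[lang] = sum(t == p for t, p in pairs) / len(pairs)
    return scores

# Toy example: the pooled accuracy is 5/6 (~0.83), which hides the
# fact that the only error falls on the French minority.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0]
langs  = ["en", "en", "en", "en", "fr", "fr"]
print(per_language_accuracy(y_true, y_pred, langs))
# -> {'en': 1.0, 'fr': 0.5}
```

When the minority language makes up only a small share of the data, as in the HR project (6% French), a pooled metric can mask exactly this kind of disparity.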
Sharing the results
The bulk of the experiments will be completed over the summer, and a presentation and report will be prepared and circulated sometime during the fall. This detailed report will document the research and exploration that took place as well as the findings. The report will be technical, with data scientists as the targeted audience, since the main goal of this initiative is to enable them to make more informed decisions when handling French data on NLP projects. Additionally, a Machine Learning Seminar will be prepared to discuss this research initiative. The specific topics discussed, and the number of sessions offered, will be driven by the conclusions of the study.
Let's connect!
The team hopes that this research initiative will bring value to future bilingual NLP projects through more informed handling of French content, allowing for a higher-quality final product. In the meantime, if you have also faced challenges when using NLP on bilingual datasets, if you have comments, ideas, or lessons learned that you think would be of interest, or if you simply would like to be kept in the loop, don't hesitate to reach out! The project team invites you to chat with the GoC data science community by joining the conversation in the Artificial Intelligence Practitioners and Users Network!
Team Members
Marie-Pier Schinck (Data Scientist), Julien-Charles Lévesque (Data Scientist)