(The Statistics Canada symbol and the Canada wordmarks are on screen with the title: "Learning optimal intervention strategies through agent-based reinforcement learning")
In the next talk, you'll hear how artificial intelligence can be used for public good; more specifically, how reinforcement learning can be used to identify optimal non-pharmaceutical intervention strategies that help reduce the transmission of COVID-19.
What is it that we are trying to accomplish with this project? What is unique about it? And spoiler alert: it's reinforcement learning, and the way we use reinforcement learning in this approach.
We want to move away from picking a subset of scenarios and running the models on just those. Rather, we want to explore an entire space of possible scenarios, a space of agent behaviors if you will, and allow the agents to optimize over that space and learn behaviors particular to themselves (and I'll get to what that means later) in order to minimize the spread of infection.
What is reinforcement learning? Well, at a high level, reinforcement learning is a branch, a subset if you will, of machine learning. To keep it simple, it essentially involves two interacting components: an agent and an environment. The agent lives in some sort of environment, here a simulation environment; it's basically the world that the agent finds itself in. At any given time point "t", the agent finds itself in a certain state "s_t", and the state is all the information that the agent has at its disposal to decide how it's going to behave. What is it going to do? Based on the current state, the agent takes an action "a_t", chosen from a set of actions at its disposal. And when the agent takes an action, life continues: the world progresses, and the environment transitions to a new state.
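As a rough illustration of that loop, here is a minimal sketch in Python. The `environment`, its `reset` and `step` methods, and the agent's `act` method are hypothetical placeholders standing in for the project's actual code, not its real interface.

```python
# Minimal sketch of the agent-environment loop described above.
# `environment` and `agent` are hypothetical placeholders.

def run_episode(environment, agent, max_steps=120):
    state = environment.reset()                      # initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                    # choose a_t based on s_t
        state, reward, done = environment.step(action)  # world transitions to s_{t+1}
        total_reward += reward                       # accumulate the return
        if done:
            break
    return total_reward
```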
How do we encode reinforcement learning? We use something called a Markov decision process. An MDP "M" is usually described as a tuple, and there are lots of different formulations for this; I'm giving a very general one. The tuple "M" is made up of things like "S", a set of states, and "A", a set of actions: the decisions the agent could take given that it's in a certain state. What we ultimately care about for this project is the policy, which maps states to distributions over actions. And we get that through something called a value function, where the agent learns to estimate the expected cumulative return.
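In standard notation, one common formulation of the objects just mentioned looks like this. The transition kernel P, the reward function R, and the discount factor gamma are the usual remaining components of the tuple, assumed here; the talk only names the states, actions, policy, and value function explicitly.

```latex
% One common MDP formulation; P, R, and \gamma are standard components
% assumed here, beyond the S, A, policy, and value function named in the talk.
M = (S, A, P, R, \gamma)
\pi(a \mid s) \quad \text{(policy: a distribution over actions in state } s\text{)}
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right] \quad \text{(expected cumulative return)}
```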
What about the simulation environment? OK, so we're going to have this world where these agents live; how do we build it? What does it look like? Essentially, there are two kinds of objects in this world: agents and nodes. Nodes are essentially locations, and agents are the actors in this simulation. We use purely open data to build this environment. Each agent has demographic information specific to it: its age, its employment status. It lives in a house, so it has housemates or a family, or maybe not. The nodes are locations that agents can go to: obviously their house node, senior centres, nodes representing schools, and nodes representing businesses in the economy. We then run a simulation: we build a population of agents, we start with an initial subset of them infected, and the agents run through a course of time, for example 120 days. Every hour of the day, at least while they're awake, the agents are in a state, they take an action, and they may interact with someone who is infected or not.
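To make the structure concrete, here is a minimal sketch of what agents and nodes might look like, with hypothetical field names, a random stand-in for the learned policy, and a crudely simplified transmission rule; the real simulation's attributes and mechanics are more detailed.

```python
# Sketch of the agents-and-nodes structure described above.
# Field names, the random "policy", and the transmission rule are
# hypothetical simplifications, not the project's actual model.
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                  # e.g. "house", "school", "business"
    occupants: list = field(default_factory=list)

@dataclass
class Agent:
    age: int
    employed: bool
    home: Node
    infected: bool = False

def hourly_step(agents, nodes, transmission_prob=0.02):
    """One simulated hour: each awake agent chooses a node (its action),
    then infection may spread between agents at the same node."""
    for node in nodes:
        node.occupants.clear()
    for agent in agents:
        node = random.choice(nodes)            # placeholder: a learned policy chooses here
        node.occupants.append(agent)
    for node in nodes:
        if any(a.infected for a in node.occupants):
            for a in node.occupants:
                if not a.infected and random.random() < transmission_prob:
                    a.infected = True

def run(agents, nodes, days=120, awake_hours=16):
    for day in range(days):
        for hour in range(awake_hours):
            hourly_step(agents, nodes)
```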
Reinforcement learning is a famously computationally intensive problem. We were approaching it with a large number of agents taking actions in parallel, but the sequence of actions in the Markov decision process has to be computed sequentially. So in order to make these simulations feasible, you have to make them really fast, both to get through all of the different experiments you want done and to run enough training epochs for accurate results. Faced with this computationally intensive problem, we needed infrastructure that would make these simulations possible, and we turned to our advanced analytics workspace, which we used to create pipelines for the individual experiments we were running. We used GitLab and GitLab CI to manage the development of our simulation code. The simulation still took weeks to run, but this made it possible to finish in weeks instead of months.
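One way to picture that split between sequential simulation steps and parallel experiments is the sketch below; `run_experiment` and its configuration fields are hypothetical stand-ins for the project's actual pipeline.

```python
# Sketch: each experiment's time steps must run sequentially, but
# independent experiments can run in parallel across processes.
# `run_experiment` and the config fields are hypothetical stand-ins.
from multiprocessing import Pool

def run_experiment(config):
    days, seed = config["days"], config["seed"]
    # ... build the population, seed initial infections ...
    for day in range(days):               # MDP steps cannot be parallelized
        pass                              # advance all agents one day
    return {"seed": seed, "infections": None}

if __name__ == "__main__":
    configs = [{"days": 120, "seed": s} for s in range(8)]
    with Pool() as pool:                  # experiments run in parallel
        results = pool.map(run_experiment, configs)
```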
So this was quite a successful test of the technology, and a testament to what our new open-source tools can bring.
(Canada wordmark is on screen)