1. Introduction
This report provides the background, general methods and results of the model used to produce estimates for stocks of principal field crops in March. The work was completed by the Agriculture Division and the Economic Statistics Methods Division at Statistics Canada.
The general methodology for the model, including the targeted crop-province combinations, is outlined in Section 2. Section 3 lists the data sources used to build the predictor set. Section 4 describes the modelling methods and evaluation metrics, and Section 5 presents the results. Section 6 describes the confidence interval used as a data quality indicator, and Section 7 covers the release criteria.
2. General methodology for ending stock modelling
A methodology for modelling stocks of principal field crops was developed and tested on the crop-province combinations that are typically published from the Field Crop Reporting surveys, as shown in Table 1. These crop-province combinations account for nearly the entirety of field crops stored on Canadian farms.
Crop type | Quebec | Ontario | Manitoba | Saskatchewan | Alberta |
---|---|---|---|---|---|
Barley | X | X | X | X | X |
Canola | X | X | X | X | X |
Corn for grain | X | X | X | X | |
Oats | X | X | X | X | X |
Rye | X | X | X | X | X |
Dry peas | X | X | X | ||
Flaxseed | X | X | X | ||
Lentils | X | X | X | ||
Soybeans | X | X | X | X | |
Wheat, durum | X | X | |||
Wheat, all excluding durum wheat | X | X | X | X | X |
Canary seed* | |||||
Chickpeas* | |||||
Mustard seed* | |||||
Sunflower seed* | |||||
* Available only at the national level.
Note: Farm stocks were also modelled in the Maritimes and/or British Columbia for select crops (oats, canola, soybeans, corn, barley, dry peas, and wheat) when data were available.
The goal of the model is to produce an official and accurate estimate of March ending on-farm stocks for select crop-province combinations using information from existing data sources. The quality of model estimates will be discussed further in sub-section 4.3 and Section 6.
3. Data sources used in the model
The modelling methodology used three data sources: 1) Statistics Canada's Field Crop Reporting Series data, 2) Statistics Canada's Farm Product Price Index and 3) alternative data from the Canadian Grain Commission.
3.1 Crop Reporting Series data
Statistics Canada's Field Crop Reporting Series obtains information on the Canadian grain industry (for details regarding its methodology, please visit Field Crop Reporting Series). The surveys are run each year in March (cancelled in 2023), July (June from 2020 onward), September (cancelled in 2016), November, and December. As the crop year progresses, data on different aspects of the grain industry are collected. Some of these measures were used directly (seeding intentions [March], on-farm ending stocks [July, December], and the production of field crops [November]), while others were used to derive new predictors (the percentage of supplies which were produced and the percentage of disposition which was delivered).
3.2 Farm Product Price Index data
The Farm Product Price Index publishes estimates of the price of many agricultural products, including nearly every grain and field crop for which ending stocks are published. Prices are published monthly for each province. Since the analysis spans more than 15 years, inflation would obscure the relationship between price and ending stocks, so price in its raw form was not a suitable predictor. Raw monthly provincial prices for each grain and field crop were therefore transformed into two predictors (year-over-year percentage difference and quarter over year percentage difference) that better suited the model.
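As an illustration of the first transformation, the sketch below computes a year-over-year percentage difference from monthly index values. It is written in Python rather than the R used for the model, and the function name, the `(year, month)` keying and the sample values are assumptions made for this example. The second predictor is not sketched because its exact definition is not given here.

```python
def year_over_year_pct(prices):
    """Return the percentage difference from the same month one year earlier.

    prices: dict mapping (year, month) -> price index value (hypothetical layout).
    Months with no usable value one year earlier are left out of the result.
    """
    changes = {}
    for (year, month), value in prices.items():
        previous = prices.get((year - 1, month))
        if previous:  # skips both missing and zero values
            changes[(year, month)] = 100.0 * (value - previous) / previous
    return changes


# Invented index values for one crop in one province.
example_prices = {(2020, 3): 100.0, (2021, 3): 110.0, (2021, 4): 120.0}
example_changes = year_over_year_pct(example_prices)
```

Because the result is a ratio of two prices one year apart, a general rise in price levels over the full analysis period affects it far less than it affects the raw price.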
3.3 Alternative data from the Canadian Grain Commission
The Canadian Grain Commission (CGC) collects data concerning many aspects of the grain industry in Canada. Among these are field crop deliveries from farms to elevators. Deliveries are disseminated each week at the provincial level as part of the CGC's Grain Stats Weekly program.
Given how frequently deliveries data are released, two predictors were derived from the CGC's data: the total deliveries during a given crop year, and the total deliveries since the last Field Crop Reporting Survey.
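The two deliveries predictors can be sketched as simple sums over weekly records. The record layout (week-ending date and a quantity), the function name and every date below are assumptions for illustration. In particular, the crop year start and the survey reference date are passed in as parameters rather than taken from the source.

```python
from datetime import date


def delivery_predictors(weekly, crop_year_start, last_survey_date, as_of):
    """Return (total deliveries in the crop year, total since the last survey).

    weekly: list of (week_ending_date, quantity) tuples (hypothetical layout).
    Both totals only include weeks ending on or before as_of.
    """
    crop_year_total = sum(q for d, q in weekly if crop_year_start <= d <= as_of)
    since_survey_total = sum(q for d, q in weekly if last_survey_date < d <= as_of)
    return crop_year_total, since_survey_total


# Invented weekly deliveries for one crop in one province.
example_weekly = [
    (date(2022, 8, 7), 10.0),
    (date(2022, 12, 4), 20.0),
    (date(2023, 1, 8), 5.0),
    (date(2023, 3, 5), 7.0),
    (date(2023, 4, 2), 9.0),  # after as_of, so excluded from both totals
]
example_totals = delivery_predictors(
    example_weekly,
    crop_year_start=date(2022, 8, 1),    # assumed crop year start
    last_survey_date=date(2022, 12, 31),  # assumed reference date of the last survey
    as_of=date(2023, 3, 31),
)
```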
4. Modelling March ending stocks
4.1 Modelling methods
The model is constructed using the historical relationships between the published estimate of on-farm stocks of grain and various characteristics of the crops and grain industries. Data from 10 years prior to the estimation year are used in deriving the model.
The on-farm stocks of grains are estimated using the statistical programming language R. Two learning algorithms are employed: the Least Absolute Shrinkage and Selection Operator (LASSO) from the glmnet package for feature selection and the robust regression from the MASS package for prediction.
The LASSO is optimized over 100 unique lambdas using leave-one-out cross-validation. Predictors whose coefficients the LASSO has not shrunk to zero are passed into the robust regression. The robust regression uses the MM-estimator.
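To show how a LASSO penalty removes predictors, the sketch below fits a LASSO by coordinate descent for a single, fixed penalty and lists the predictors that keep a non-zero coefficient. It is a plain-Python stand-in, not the glmnet implementation used for the model. It assumes the predictors and target are already centred, and it omits both the lambda search by cross-validation and the robust regression step. The predictor names and data are invented.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within the penalty become 0.0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=200):
    """Minimise (1/2n) * sum((y - Xb)^2) + penalty * sum(|b|).

    Assumes y and every column of X are already centred, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    residuals = list(y)
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # a constant column carries no information
            # Covariance of column j with the residual that excludes its own effect.
            rho = sum(X[i][j] * (residuals[i] + X[i][j] * coefs[j]) for i in range(n)) / n
            updated = soft_threshold(rho, penalty) / col_scale[j]
            change = updated - coefs[j]
            if change != 0.0:
                for i in range(n):
                    residuals[i] -= X[i][j] * change
                coefs[j] = updated
    return coefs


def selected_predictors(names, coefs):
    """Keep the predictors whose coefficients were not shrunk to zero."""
    return [name for name, coef in zip(names, coefs) if coef != 0.0]


# Invented, centred data: the target depends only on the first predictor.
names = ["deliveries_share", "price_yoy"]  # hypothetical predictor names
X = [[-2.0, 1.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, -1.0]]
y = [-6.0, -3.0, 0.0, 3.0, 6.0]
kept = selected_predictors(names, lasso_coordinate_descent(X, y, penalty=1.0))
```

In this toy data the second predictor is unrelated to the target, so it is dropped, and a large enough penalty drops the first predictor as well. Only the surviving predictors would be passed to the regression that makes the prediction.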
4.2 Defining the Target
The highest quality estimate of on-farm stocks of grains is the published estimate. The published estimate is derived from the Field Crop Reporting survey estimate which has been adjusted using the Supply and Disposition equation and alternative data sources as reference. This process is outlined in greater detail here: Supply and Disposition of Grains in Canada.
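The supply and disposition equations are:
$$\text{TOTAL SUPPLY} = \text{BEGINNING STOCKS} + \text{PRODUCTION}$$
and
$$\text{TOTAL DISPOSITION} = \text{DELIVERIES} + \text{SEED USE} + \text{FEED, WASTE and DOCKAGE} + \text{ENDING STOCKS},$$
with total supply equal to total disposition, where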
DELIVERIES estimates are taken from administrative data from the CGC.
PRODUCTION estimates are taken from the November Field Crop Survey.
BEGINNING STOCKS are taken as the ending stocks of the December Field Crop Survey.
ENDING STOCKS estimates are currently derived from the March Field Crop Survey.
SEED USE estimates are derived from the November Field Crop Survey.
FEED, WASTE and DOCKAGE estimates are the residual of the equation outlined above.
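When it comes time to estimate on-farm stocks of grain, the estimates for all variables but feed, waste and dockage are available. Therefore, the equation can be rewritten as:
$$\text{BEGINNING STOCKS} + \text{PRODUCTION} - \text{DELIVERIES} - \text{SEED USE} = \text{ENDING STOCKS} + \text{FEED, WASTE and DOCKAGE}$$
or
$$\text{LEFTOVER} = \text{ENDING STOCKS} + \text{FEED, WASTE and DOCKAGE}$$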
The model is ultimately built to replace the March on-farm ending stocks estimate. However, because the leftover amount is known, it is used to stabilize the target: the model target is the percentage of the leftover that is ending stocks.
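The arithmetic behind the target can be sketched as follows. This assumes the leftover is beginning stocks plus production minus deliveries and seed use, which is one reading of the supply and disposition identity. The function names and all quantities are invented for illustration.

```python
def leftover(beginning_stocks, production, deliveries, seed_use):
    """Supply not yet accounted for by deliveries and seed use (assumed definition)."""
    return beginning_stocks + production - deliveries - seed_use


def target_share(ending_stocks, leftover_amount):
    """Model target: percentage of the leftover that is ending stocks."""
    return 100.0 * ending_stocks / leftover_amount


def stocks_from_share(share_pct, leftover_amount):
    """Convert a predicted percentage back into an ending stocks quantity."""
    return share_pct / 100.0 * leftover_amount


# Invented quantities, all in the same unit.
example_leftover = leftover(beginning_stocks=500.0, production=1000.0,
                            deliveries=900.0, seed_use=100.0)
example_share = target_share(ending_stocks=400.0, leftover_amount=example_leftover)
example_stocks = stocks_from_share(example_share, example_leftover)
```

Modelling a share of a known quantity keeps the target on a comparable scale across large and small crop-province combinations, and the predicted share converts back to a stock level with one multiplication.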
Like the survey estimate, the modelled estimate is subject to adjustments using the Supply and Disposition equation and alternative data sources as reference.
4.3 Model evaluation methods
During the research phase of this project, many model parameters were considered. This included the learning algorithms used for prediction, the size of the predictor pool, and the shape of the target.
Evaluating the success of the model was done by measuring the accuracy and precision of different parameter selections.
Accuracy is measured by comparing the predictions to the published March ending on-farm stock estimates. Accuracy was measured at both the crop and province-crop levels. In both cases, the weighted mean absolute percent error (wMAPE) was used as the accuracy metric.
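The formula for the weighted mean absolute percent error is below:
$$\text{wMAPE} = \frac{\sum_{i=1}^{n} \left| \text{prediction}_i - \text{published}_i \right|}{\sum_{i=1}^{n} \text{published}_i} \times 100$$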
where prediction is the model prediction, published is the published estimate, i is a given observation and n is the total number of observations in a crop or province-crop group.
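A minimal sketch of the metric, assuming the usual wMAPE definition (the sum of absolute errors divided by the sum of published values, expressed as a percentage). The function name and sample numbers are illustrative only.

```python
def wmape(predictions, published):
    """Weighted mean absolute percent error for one crop or province-crop group."""
    if len(predictions) != len(published):
        raise ValueError("predictions and published must have the same length")
    total_published = sum(published)
    if total_published == 0:
        raise ValueError("published values sum to zero; wMAPE is undefined")
    total_error = sum(abs(p - a) for p, a in zip(predictions, published))
    return 100.0 * total_error / total_published


# Invented values: errors of 10 and 10 against a published total of 300.
example_wmape = wmape([110.0, 190.0], [100.0, 200.0])
```

Dividing by the group total rather than averaging per-observation percentages means observations with large published stocks carry more weight, and small observations cannot dominate the metric.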
5. Results
Overall, the model compares favourably with the results of the March survey. In all but two groups (barley and mustard seed), the model outperforms the survey. Further analysis was conducted for barley (feed deliveries published by the Canadian Grain Commission were explored and subject matter experts were consulted), but nothing adequately explained the discrepancy.
6. Data Quality Indicator – the Confidence Interval
All predictive models are subject to errors; therefore, it is important to measure the degree of uncertainty in model estimates. This is done with the use of the confidence interval.
The following is a concise guide to the bootstrap estimation of variance for the March ending stocks model. The main goal of this estimate is to capture the variance of the entire modelling process. Consequently, the key element of the bootstrap – sampling with replacement – occurs at the beginning of the modelling process.
- Step 1: Stratified Random Sampling with Replacement
The training set, which contains all the observations the original model is built from, is stratified by province and then randomly sampled with replacement at the observation level. If there were 10 observations initially, then the new bootstrap set will also have 10 observations, but there may be recurrences of the same observation due to the stratified random sampling with replacement.
- Step 2: LASSO Regression to find Optimal Penalty Size
A LASSO regression is cross-validated (at 4 folds, trying 100 different penalty sizes) on the bootstrap modelling set to find the optimal penalty size. This optimal penalty size is then used in the next step.
- Step 3: LASSO Regression to Create Reduced Pool of Predictors
The LASSO is retrained, using the optimal penalty size derived in Step 2. The resulting optimal model may shrink some predictor coefficients to zero. In this case, the predictors would be removed from the initial pool of predictors, creating a reduced pool of predictors.
- Step 4: Robust Regression with Reduced Pool of Predictors
The bootstrap modelling set is cut down to the reduced pool of predictors and then fitted with a robust regression. This robust regression is the model which provides the final estimate.
- Step 5: Prediction is Made using the Robust Regression Model
The trained robust regression is applied to the prediction set.
- Step 6: Repeat Steps 1-5 1000 times
Steps 1-5 are repeated 1000 times, with each new stratified random sample providing a different variation of the initial modelling set.
- Step 7: Calculate CI using Percentile Method
After the 1000 predictions have been saved, the final output is created as follows:
- Order the 1000 replicates.
- Save the values of the 2.5th and 97.5th percentiles (the 25th and 975th replicates in ascending order).
- Step 8: Repeat Steps 1-7 for each crop
All previous steps must be repeated for each crop.
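The resampling loop in the steps above can be sketched as follows. The sketch covers the stratified sampling with replacement (Step 1) and the repetition (Step 6). The modelling in Steps 2-5 is represented by a caller-supplied function, here a simple mean used purely as a placeholder. The row layout, field names and function names are assumptions for illustration.

```python
import random


def stratified_resample(rows, stratum_of, rng):
    """Resample rows with replacement within each stratum, keeping stratum sizes."""
    strata = {}
    for row in rows:
        strata.setdefault(stratum_of(row), []).append(row)
    sample = []
    for group in strata.values():
        sample.extend(rng.choices(group, k=len(group)))
    return sample


def bootstrap_replicates(rows, fit_predict, n_reps=1000, seed=0):
    """Run fit_predict on n_reps stratified bootstrap samples of rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        fit_predict(stratified_resample(rows, lambda r: r["province"], rng))
        for _ in range(n_reps)
    ]


def placeholder_model(sample):
    """Stand-in for Steps 2-5: returns the mean target of the bootstrap sample."""
    return sum(r["target"] for r in sample) / len(sample)


# Invented training rows for one crop.
training_rows = [
    {"province": "Manitoba", "target": 40.0},
    {"province": "Manitoba", "target": 45.0},
    {"province": "Saskatchewan", "target": 60.0},
    {"province": "Saskatchewan", "target": 55.0},
    {"province": "Saskatchewan", "target": 65.0},
]
replicates = bootstrap_replicates(training_rows, placeholder_model)
```

Resampling before any modelling means that the penalty selection, the predictor selection and the final regression are all repeated on each sample, so the spread of the replicates reflects the variability of the whole process.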
A confidence interval is built from the 1000 bootstrap replicates using the percentile method: removing the bottom 25 and top 25 replicates leaves the 95% confidence interval.
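A minimal sketch of the percentile method for 1000 replicates. The rank convention used here (the 25th and 975th ordered replicates as the bounds) follows Step 7. Percentile conventions can differ by one position, so the ranks are exposed as parameters. The function name is an assumption.

```python
def percentile_interval(replicates, lower_rank=25, upper_rank=975):
    """Return the replicates at the given 1-based ranks after sorting ascending."""
    if not 1 <= lower_rank <= upper_rank <= len(replicates):
        raise ValueError("ranks must lie within the number of replicates")
    ordered = sorted(replicates)
    return ordered[lower_rank - 1], ordered[upper_rank - 1]


# Invented replicates: the integers 1..1000 in reverse order.
example_interval = percentile_interval(list(range(1000, 0, -1)))
```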
These intervals can be used by subject matter experts in their validation process and during the Supply and Disposition process.
7. Release Criteria
A set of rules was established to determine which modelled ending stocks are of an acceptable level of quality to publish. These rules are based on the confidence interval obtained from the bootstrap replicates. Based on these rules, estimates that do not meet quality standards may not be published.