Confidentiality Vetting support: Dominance and homogeneity using R
(The Statistics Canada symbol and Canada wordmark appear on screen with the title: "Confidentiality Vetting support: Dominance and homogeneity using R")
Welcome to Statistics Canada's data access training series.
This video series shows how to perform analyses required for releasing results from the RDCs. The code shown in this video is available. Please ask your analyst if you are unsure of its location. In this video, I will show you how to conduct dominance and homogeneity tests in R. We will be using a dummy census 2016 file, so there are no real cases in any of the examples in this video.
The dominance and homogeneity tests may be required for continuous dollar value variables. These tests are designed to prevent the dissemination of information in two situations. The first one is dominance. This refers to cases where most of the contribution to the statistic comes from one or a few units. The second is homogeneity. That refers to situations where respondents occupy a narrow range of values. NK and p-percent tests are dominance tests.
You should always refer to the official vetting rule documents for detailed vetting requirements.
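To make the two dominance tests named above more concrete, here is a minimal R sketch of the textbook forms of the NK rule and the p-percent rule. The function names, the sample incomes, and the values n = 1, k = 80 and p = 10 are illustrative assumptions only. They are not Statistics Canada's vetting parameters, and this is not a description of how cd_test works internally.

```r
# Share of the cell total contributed by the n largest units.
top_n_share <- function(x, n = 1) {
  x <- sort(abs(x), decreasing = TRUE)
  sum(head(x, n)) / sum(x)
}

# Textbook NK rule: flag the cell when the n largest units
# contribute more than k percent of the cell total.
nk_flag <- function(x, n = 1, k = 80) {
  top_n_share(x, n) > k / 100
}

# Textbook p-percent rule: flag the cell when everything except the
# two largest units adds up to less than p percent of the largest unit.
p_percent_flag <- function(x, p = 10) {
  x <- sort(abs(x), decreasing = TRUE)
  remainder <- sum(x[-(1:2)])
  remainder < (p / 100) * x[1]
}

# Made-up household incomes for two cells.
dominated <- c(900000, 40000, 35000, 25000)  # one unit holds 90% of the total
balanced  <- c(60000, 55000, 50000, 45000)

nk_flag(dominated)         # TRUE
p_percent_flag(dominated)  # TRUE
nk_flag(balanced)          # FALSE
p_percent_flag(balanced)   # FALSE
```

A flagged cell in this sketch simply means that one or a few units dominate the total. The thresholds that apply to a real request are the ones in the official vetting rule documents.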
This video will show three examples of how to use the cd_test function in R to run dollar value tests. Let's start with the first topic - how to set up the R code. The first step is to run the "census_dollar_test.R" file.
This will import the function cd_test. With the function imported, we are ready to conduct the tests. The basic statement of the test function is cd_test. Researchers will need to tailor the parameters of the function to their specific testing needs.
Let's look at the parameters. The most important three are: data, dollar_value, and group.
Data refers to the name of the data object. Before running the code, you will need to first import your dataset and assign it to a data object.
The name of the data object we will see in the example is fake_census.
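As a minimal sketch of that import step, the code below writes a tiny made-up CSV file to a temporary folder and reads it back into an object named fake_census. The file name, the values and the use of read.csv are assumptions for illustration; the actual dummy file and its format are not shown in this transcript.

```r
# Create a tiny stand-in file so the example runs anywhere (values are made up).
csv_path <- file.path(tempdir(), "fake_census_demo.csv")
write.csv(
  data.frame(hhinc = c(52000, 61000, 47000), sex = c(1, 2, 1)),
  csv_path,
  row.names = FALSE
)

# Import the file and assign it to the data object that will be tested.
fake_census <- read.csv(csv_path)

dim(fake_census)    # number of rows and columns
names(fake_census)  # variable names available for testing
```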
Dollar_value refers to the continuous dollar value variable.
Group is the name of the categorical variable. There are other parameters users can specify.
The three dots allow users to conduct multiple tests with the same dollar value variable.
The by statement allows users to conduct the same tests over different sub-samples.
Researchers can also specify the weight variable. Finally, path allows users to specify where the final test outputs will be saved.
Let's look at three examples of different ways of applying the cd_test function. The first example is a simple test involving one dollar value variable and one categorical variable. The two variables involved are household income (hhinc) and sex. Here is the R code for conducting these tests.
As you can see, this code specifies the dataset, the dollar value variable, the group variable, the by variable which equals NULL, the weight variable and the path.
Of course, parameters in a function can be matched positionally.
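The on-screen code is not reproduced in this transcript, so the following is only a sketch of what the first example might look like. The argument names data, dollar_value, group, by and path come from the narration. The argument name weight, the quoting of variable names, the weight column, the output folder and the script location are all assumptions. The call runs only if the script is found.

```r
# Stand-in data with made-up values; the video uses the dummy file fake_census.
fake_census <- data.frame(
  hhinc  = c(52000, 61000, 47000, 75000),
  sex    = c(1, 2, 1, 2),
  weight = c(1, 1, 1, 1)          # hypothetical weight variable
)

script <- "census_dollar_test.R"  # location is an assumption; ask your analyst

if (file.exists(script)) {
  source(script)                  # imports the cd_test function

  cd_test(
    data         = fake_census,
    dollar_value = "hhinc",       # passing names as strings is an assumption
    group        = "sex",
    by           = NULL,
    weight       = "weight",      # assumed argument name and column name
    path         = tempdir()      # hypothetical output folder
  )
}
```

The first three arguments could also be supplied by position, without their names. The remaining ones are kept named here in case they follow the three dots in the function definition, because R matches arguments placed after the dots by name only.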
Let's run the code in R. The object fake_census is a dataset imported into R. It has 7,428 rows and 482 columns.
In the next step, we will run the census_dollar_test.R file in the background using the source command. The cd_test function is then ready to be used.
Let's specify the parameters for cd_test. Data equals fake_census, dollar value variable is hhinc, group variable is sex, by equals NULL, the weight variable and the path.
And then we hit Enter and see "Success!" That means the code ran and researchers can check the result sheet to determine whether outputs are releasable.
The file can be found with the path on-screen. The result files are automatically saved with the dollar value variable name and the date when the tests were conducted.
The file may get overwritten if the same dollar value variable is used on the same day. You may need to rename the result files. You can review the Excel files and use them to support your vetting request.
In our next example, we will conduct four different tests with the same dollar value variable at once.
The four tests are: household income and sex, household income and province, household income and marital status, and household income and age groups.
The cd_test code setup is similar to that for the tests between two variables, except researchers will need to specify all four categorical variables.
So after the dollar value variable, you will enter sex, pr, marst, and agegrp5. You can then fill out the rest of the parameters of the function.
Press ENTER. The result reads "Success!" Good.
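To illustrate how the three dots can carry the extra categorical variables, here is a small sketch using a hypothetical stand-in function, list_tests. It mirrors the argument order described in the narration (data, dollar value, group, three dots, by) but runs no vetting tests; it only lists which pairs would be requested. The function, the made-up data and the quoting of variable names are assumptions, not the actual cd_test code.

```r
# Hypothetical stand-in that mirrors the assumed argument order of cd_test.
# It only lists the tests that would be requested.
list_tests <- function(data, dollar_value, group, ..., by = NULL) {
  groups <- c(group, ...)  # first group plus any extras passed through the dots
  missing_vars <- setdiff(c(dollar_value, groups, by), names(data))
  if (length(missing_vars) > 0) {
    stop("Not found in data: ", paste(missing_vars, collapse = ", "))
  }
  paste(dollar_value, "by", groups)
}

# Made-up rows using the variable names from the video.
fake_census <- data.frame(
  hhinc   = c(52000, 61000, 47000, 75000),
  sex     = c(1, 2, 1, 2),
  pr      = c(35, 24, 35, 59),
  marst   = c(1, 2, 2, 1),
  agegrp5 = c(3, 4, 5, 6)
)

# Extra groups go right after the first group; later arguments are named.
tests <- list_tests(fake_census, "hhinc", "sex", "pr", "marst", "agegrp5",
                    by = NULL)
tests
```

Because everything after the dots has to be matched by name in R, arguments such as by, the weight variable and path would be written with their names whenever several groups are listed this way.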
In this case, all four combinations of tests are saved under different sheets of the same Excel file.
Our final example shows how to conduct the same tests over different sub-populations. In this example, if we want to conduct two combinations of tests for citizens and non-citizens, we can use the by statement. For the two categorical variables sex and marst (marital status), this means that we will need a total of four different tests. The results of the tests are saved in two Excel files - one for citizens and the other for non-citizens. We added the citizen variable in the cd_test command so it shows clearly that only one category of the variable is included in the results for one sub-sample. As expected, the Excel file contains the combinations of tests we are looking for.
Now you have seen how to tailor the function to different testing scenarios. The final part of the video will briefly touch on the interpretation of the Excel files. It is important to note that the focus of this code is on dominance and homogeneity tests; it may not cover all vetting requirements for your vetting request. Please refer to vetting documentation for more details. The Excel files produced by the R code show test results in two column blocks. The first one is a summary of the test results.
This is where you can see whether each category of your variables passes the test or not. "ok" indicates that the cell passes the test, whereas "FAIL" shows that the cell fails the test. Having any FAIL indicates that the outputs may not be released. Of course, there are other data scenarios that may not fit well into either an "ok" or a "FAIL". When that happens, researchers may need to provide additional supporting documents or revisit their analysis.
One example remedy for failed cells is to regroup some categories to increase the counts to the minimum cell size. In addition to the summary results, the Excel files also provide a detailed breakdown of all test values. These values appear in the later columns of the Excel file. They are useful if researchers want to have a more nuanced understanding of the tests.
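The regrouping remedy mentioned above can be sketched in base R as follows. The age group codes, the counts and the minimum count of 5 are made up for illustration; the real minimum cell size is the one stated in your vetting rules.

```r
# Made-up age group codes; group 9 has very few cases.
agegrp5 <- c(rep(7, 12), rep(8, 9), rep(9, 2))
min_count <- 5  # placeholder only; use the threshold in your vetting rules

table(agegrp5)  # counts before regrouping: 12, 9 and 2

# Merge the sparse top group into its neighbour, then recount.
agegrp_regrouped <- ifelse(agegrp5 >= 8, 8, agegrp5)
counts <- table(agegrp_regrouped)
counts          # counts after regrouping: 12 and 11

all(counts >= min_count)  # TRUE
```

After regrouping, the dollar value tests would need to be run again on the new categories, since the earlier results no longer describe the cells being released.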
I hope now you know how to conduct dominance and homogeneity tests using R.
Thank you for watching. If you have any questions, please reach out to your local RDC analysts or email us at the address shown on the screen.
(Canada wordmark appears.)