A Use Case on Metadata Management
By: Ekramul Hoque, Statistics Canada
What is metadata?
Metadata are data that provide information about other data. In other words, they are "data about data." Metadata are one of the core components of data governance, as they impose management discipline on the collection and control of data. Data scientists spend a significant amount of time gathering and understanding data, and access to the underlying metadata helps them generate insights faster.
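As a minimal illustration, a metadata record for a hypothetical dataset might look like the Python sketch below. The field names and values are invented for the example, not a formal standard.

```python
# An invented metadata record for a hypothetical dataset; the field
# names are illustrative, not a formal standard.
dataset_metadata = {
    "title": "Monthly Retail Trade Survey",          # what the data are
    "description": "Sales estimates for retail establishments.",
    "owner": "Economic Statistics Division",         # who is responsible
    "format": "CSV",                                 # how they are stored
    "last_updated": "2022-01-15",                    # when they last changed
    "columns": [
        {"name": "ref_period", "type": "date"},
        {"name": "naics_code", "type": "string"},
        {"name": "sales_cad", "type": "float"},
    ],
}

# The data themselves would be the rows of sales figures; the record
# above is the "data about data" that makes them findable and usable.
print(dataset_metadata["title"], "has", len(dataset_metadata["columns"]), "columns")
```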
Why does an organization need a metadata management system?
When an organization has a metadata management system, its employees can add metadata to their repositories quickly and accurately without affecting access to the data within their systems. This improves creative workflows and enhances business processes. For example, one of Statistics Canada's core activities is statistical analysis of a wide range of data types and quantities. To do this effectively, analysts must be able to quickly locate the most useful data and determine its structure and semantics.
Some key benefits of metadata management include:
- maximizing the use of relevant data and improving its quality;
- providing a common platform on which diverse groups of data citizens can discuss and efficiently manage their work (for example, data engineers who work with technical metadata and data type standards can provide support for generating and consuming metadata); and
- creating faster project delivery timelines due to improved data integration across various platforms.
Naturally, successful data analysis relies on strong metadata management. Strong metadata management also means improved data discovery capabilities: metadata summarize the most basic information about the data, making them easier to find and track.
Metadata automation is a recent industry trend that is replacing the increasingly tedious process of manual data mapping. Key benefits of automation include data quality assurance and faster project delivery timelines due to improved data integration across various platforms. Metadata management also ensures regulatory compliance through data standardization, improves productivity and reduces costs. By giving an organization knowledge of what data exist and their potential value, it promotes digital transformation (Reference 1).
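As an illustration of what automation replaces, the sketch below derives column-level technical metadata directly from a data file with pandas instead of mapping each column by hand. The file contents and field names are invented for the example.

```python
import io
import pandas as pd

# Stand-in for a real extract; in practice, pd.read_csv("survey_extract.csv").
csv_text = io.StringIO(
    "ref_period,naics_code,sales_cad\n"
    "2022-01,44-45,1050.25\n"
    "2022-02,44-45,\n"
)
df = pd.read_csv(csv_text)

# Derive column-level technical metadata from the data themselves,
# rather than documenting each column manually.
generated_metadata = [
    {
        "name": col,
        "inferred_type": str(df[col].dtype),
        "null_count": int(df[col].isna().sum()),
        "distinct_values": int(df[col].nunique()),
    }
    for col in df.columns
]

for record in generated_metadata:
    print(record)
```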
Data standardization
When data are provided by external partners, it's likely their system or application was created independently. Data standardization establishes a mutual understanding of the data's meaning and semantics, allowing its users to interpret and use the data correctly (Reference 2).
As a part of the Data Standardization Governance Collaborative (DSGC), Statistics Canada has aligned with the Statistical Data and Metadata eXchange (SDMX) (Reference 3), an international initiative that aims to standardize and modernize the mechanisms and processes for the exchange of data and metadata. SDMX has been published as an International Organization for Standardization (ISO) standard and has been approved as an official standard within Statistics Canada.
SDMX is a framework that helps standardize both data and metadata. Though it's well-entrenched in the System of National Accounts at Statistics Canada, it's still in the initial phase of being introduced to other areas within the organization. This approach to data interoperability should:
- reduce duplication;
- improve understanding of concepts;
- help identify data gaps;
- facilitate easier reconciliation; and
- allow in-depth analysis.
The SDMX standard could be leveraged and aligned in an "agile-light-standard" format, allowing tools to rapidly produce infrastructure and interoperability layers so that information can be shared quickly.
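To make the standard concrete, here is a minimal sketch of consuming data from a public SDMX web service in Python using the third-party pandasdmx package (pip install pandasdmx). The European Central Bank endpoint, the EXR dataflow and the query values are illustrative choices and are not part of Statistics Canada's implementation.

```python
import pandasdmx as sdmx

ecb = sdmx.Request("ECB")  # client for the ECB's public SDMX web service

# Query the exchange-rate dataflow (EXR), filtering on one dimension.
response = ecb.data(
    "EXR",
    key={"CURRENCY": "USD"},
    params={"startPeriod": "2021"},
)

# Because data and structure follow one shared standard, a generic
# converter can turn the message into a pandas object keyed by the
# dataflow's dimensions.
df = sdmx.to_pandas(response)
print(df.head())
```

The point of the example is interoperability: any SDMX-aware consumer can retrieve and interpret the same data and metadata without a bespoke mapping for each provider.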
Data cataloging
Another key component for metadata management is data cataloging. Data cataloging is commonly defined as the discovery of data assets from participating data holdings. Its primary objective is to use consistent methods to find the data and the information associated with it. Figure 1 illustrates how analysis processes change when analysts work with a data catalog.
Figure 1: Process with and without a Data Catalog. Graph from Alation - Data Intelligence + Human Brilliance
Without a data catalog, analysts search for information by piecing together documentation from its lineage, by collaborating with colleagues and by working with other datasets they recognize. This cycle relies on trial and error with the data, and the analyst is limited to looking through familiar datasets.
With a data catalog, the analyst can search the available datasets, assess the data and make informed decisions about which information to use. They can then examine and prepare their information effectively and with greater certainty (Reference 4). CKAN (a name derived from the acronym for Comprehensive Knowledge Archive Network) was created to support this process.
What is CKAN?
CKAN, the world's leading open-source data management system, is built for national and regional data publishers, governments and organizations that want to publish, share and make their data open and available for use.
Why use CKAN?
- It's open source and free, which means users retain all rights to the data and metadata they store within the software.
- It's written in Python and JavaScript. The JavaScript code in CKAN is broken down into modules: small, independent units of JavaScript code. CKAN themes can add JavaScript features by providing their own modules, and breaking the code into small, independent modules keeps it simple and easy to test, debug and maintain. Developers can also write extensions, which are Python packages that modify or extend CKAN. Each extension contains one or more plugins that must be added to the user's CKAN config file to activate the extension's features.
- It provides user management and data management.
- It supports custom extension development.
- It provides Application Programming Interface (API) endpoints to store, edit, extract and analyze data.
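As a brief illustration of that last point, the sketch below calls two standard CKAN Action API endpoints, package_search and package_create, with the requests library. The instance URL, API token, dataset fields and organization name are placeholders for a real deployment.

```python
import requests

CKAN_URL = "https://catalog.example.org"      # placeholder instance URL
API_TOKEN = "YOUR-API-TOKEN"                  # placeholder credential

# Search the catalog: package_search is a standard CKAN action.
search = requests.get(
    f"{CKAN_URL}/api/3/action/package_search",
    params={"q": "retail trade"},
)
for result in search.json()["result"]["results"]:
    print(result["name"], "-", result.get("title", ""))

# Register a new dataset record: package_create requires authorization.
create = requests.post(
    f"{CKAN_URL}/api/3/action/package_create",
    headers={"Authorization": API_TOKEN},
    json={
        "name": "monthly-retail-trade",       # URL-safe identifier
        "title": "Monthly Retail Trade Survey",
        "notes": "Sales estimates for retail establishments.",
        "owner_org": "economic-statistics",   # placeholder organization
    },
)
print(create.json()["success"])
```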
Metadata use case
At the end of 2019, Statistics Canada's Data Science and Operationalization team began working with the agency's Integrated Business Statistics Program (IBSP). The IBSP is the common data processing system for the majority of Statistics Canada's economic surveys.
The aim of the project is to address the limitations of the current analytical space. A new solution will help:
- address the need for a self-serve analytical solution;
- improve the ability to connect to analytical tools;
- add searchability and discoverability of datasets;
- avoid data duplication;
- move away from one-size-fits-all security access; and
- allow for horizontal analysis using data outside of IBSP.
IBSP and the Data Science Division partnered with the Fair Data Infrastructure (FDI) to determine whether a prototype could be created using open-source tools.
The FDI aims to provide a collaborative data and metadata ecosystem for all data suppliers and users. The core of this space is a data catalog, as well as data and metadata management tools.
Knowledge transfer from analysts to administrator before a Cloud is established
The IBSP has analysts who want to access surveys. These surveys are managed and updated by an administrator from the IBSP team; however, the process of updating and creating access produces duplicate and redundant data. Analysts also struggle to search for these data and the corresponding metadata, since they're only available through shared directories.
Figure 2: Identified bottleneck for the IBSP Proof of Concept (PoC)
Knowledge transfer from analysts to administrator after a Cloud is established
The team introduced three components to find a solution to the identified bottleneck:
- The Search Service from FDI: the FDI team has been facilitating metadata registration and discoverability through a data virtualization layer (Reference 5). The search engine is built on top of Elasticsearch with API endpoints, allowing external and internal users to manage their data assets.
- CKAN
- The Azure Cloud tenant
The IBSP uploads data and metadata into CKAN and the FDI Search Service, enabling analysts to search and access the data and metadata. The two systems are kept in sync with the Azure Cloud tenant, which manages user authentication and data storage.
Figure 3: Solution provided for IBSP PoC
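The sketch below illustrates the kind of indexing and querying such a search service performs, using the official Python Elasticsearch client (pip install elasticsearch). It is illustrative only, not the FDI team's actual implementation; the endpoint, index name and record fields are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint

# Register a dataset's metadata so it becomes discoverable.
es.index(
    index="dataset-metadata",                 # placeholder index name
    id="monthly-retail-trade",
    document={
        "title": "Monthly Retail Trade Survey",
        "description": "Sales estimates for retail establishments.",
        "source_system": "IBSP",
        "tags": ["retail", "economic survey"],
    },
)

# An analyst's full-text search against the same index.
hits = es.search(
    index="dataset-metadata",
    query={"match": {"description": "retail"}},
)
for hit in hits["hits"]["hits"]:
    print(hit["_id"], "-", hit["_source"]["title"])
```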
A metadata management solution
A successful metadata management implementation should include a metadata strategy, metadata integration and publication, metadata capture and storage, and metadata governance and management. A metadata strategy guarantees that an organization's whole data ecosystem is consistent. It explains why the organization tracks metadata and identifies all of the metadata sources and methods it employs. Such a strategy can be very complex, given the volume and variety of the data and the organization's technological capability to support it. The diagram below is a high-level overview of how such a strategy can be implemented.
Figure 4: Metadata Infrastructure
Every organization has a list of data sources that come in various forms, such as structured data, flat files or web APIs. These data are consumed by analysts to visualize and report, create analytics or build cognitive services. A metadata management strategy is central to ensuring that data are well interpreted and can be leveraged to bring results.
The first step of this data management is data ingestion, which commonly involves a set of transformations and classifications. Adopting data standardization at this stage is key, as it establishes a common way of structuring and understanding the data, along with principles and implementation guidance for using it. This process also enables collaborative analysis and exchange with external partners, as sketched below.
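A minimal sketch of such a standardization step follows, using pandas; the partner column names and code mappings are invented for the example.

```python
import pandas as pd

# Raw data as a partner might deliver it, with local names and codes.
raw = pd.DataFrame({
    "Period": ["2022-01", "2022-02"],
    "Industry": ["RT", "RT"],
    "Sales($)": [1050.25, 998.10],
})

# Partner-specific column names mapped to the agreed standard names.
COLUMN_MAP = {"Period": "ref_period", "Industry": "industry_code", "Sales($)": "sales_cad"}
# Partner-specific codes mapped to a standard classification
# (e.g., retail trade under NAICS).
CODE_MAP = {"RT": "44-45"}

standardized = raw.rename(columns=COLUMN_MAP)
standardized["industry_code"] = standardized["industry_code"].map(CODE_MAP)

print(standardized)
```

Once every partner's feed is transformed into the same structure and codes, downstream registration, cataloging and analysis can treat the data uniformly.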
Through this standardization, data custodians should be able to register data assets and metadata. They should be able to ingest and register their metadata, which makes their data assets discoverable and lets them continue to manage their data through a data virtualization layer. This can be achieved by introducing a data cataloging tool that provides a consistent method for finding the data and information available to both internal and external data partners of the organization.
With open-source technology and modern cloud infrastructure, it's possible to create a platform where internal and external partners can ingest raw data from various sources into a secure storage space (i.e., a data lake or blob storage). Rather than backing the data cataloging tool or the metadata registry with an on-premises database such as Postgres, it's more scalable and robust to use cloud storage as the backend for these systems. This not only makes updating, syncing and sharing easy, but also helps manage access control for sensitive data.
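A hedged sketch of that pattern follows, writing a metadata record to Azure Blob Storage with the azure-storage-blob package (pip install azure-storage-blob); the connection string, container and blob path are placeholders.

```python
import json
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "YOUR-AZURE-STORAGE-CONNECTION-STRING"  # placeholder

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
blob = service.get_blob_client(
    container="metadata-store",               # placeholder container
    blob="datasets/monthly-retail-trade.json",
)

record = {
    "title": "Monthly Retail Trade Survey",
    "source_system": "IBSP",
    "last_updated": "2022-01-15",
}

# overwrite=True lets repeated registrations update the same record,
# supporting the easy update-and-sync behaviour described above.
blob.upload_blob(json.dumps(record), overwrite=True)
```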
Search services can be implemented at the last layer of this strategy to make the data and metadata discoverable to end users. When there is a data gap, users should be able to report it so that data stewards know what data are needed to fill it. All communication between the components in the diagram can occur through APIs or SSH, allowing modular system integration.
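As one possible shape for such a gap-reporting endpoint, here is an illustrative Flask sketch; the route, payload fields and in-memory storage are assumptions for the example, not an existing service.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
gap_reports = []                              # in-memory store for the sketch

@app.post("/data-gaps")
def report_gap():
    """Accept a data-gap report so stewards can see what data are needed."""
    payload = request.get_json(force=True)
    gap_reports.append({
        "dataset": payload.get("dataset"),
        "description": payload.get("description"),
        "reported_by": payload.get("reported_by"),
    })
    return jsonify({"received": True, "open_reports": len(gap_reports)}), 201

if __name__ == "__main__":
    app.run(port=5000)
```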
Lastly, an organization needs a metadata governance structure, which includes an assessment of metadata responsibility, life cycles, and statistics, as well as how metadata are integrated into various business processes.
If you have any questions about this article or would like to discuss this further, we invite you to our new Meet the Data Scientist presentation series where the author(s) will be presenting this topic to DSN readers and members.
Tuesday, June 21
2:00 to 3:00 p.m. EDT
MS Teams – link will be provided to the registrants by email
Register for the Data Science Network's Meet the Data Scientist Presentation. We hope to see you there!