Data Catalogs

Data Catalogs and Datasets

Datasets are the files and tables that data workers need to find and access. They may reside in a data lake, warehouse, master data repository, or any other shared data resource.

Data catalogs help organise all of your metadata in one place. A catalog is combined with data management and search tools that help analysts and other data users find the data they need. Its stakeholders include data analysts, data scientists, data stewards, and other data consumers. A catalog is an inventory of all the available datasets, providing the information needed to evaluate the ‘fitness’ of the data for its intended use cases and decisions. Datasets are linked to rich metadata that helps you find the correct data.

What are Data Catalog tag templates?

Data Catalog tag templates help you create and manage common metadata about data assets in a single location. Tags are attached to the data, which makes the data discoverable in the Data Catalog system.
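
To make the idea concrete, here is a minimal in-memory sketch in Python of how a tag template might constrain tags and how attached tags make datasets searchable. It is an illustration only, not the API of any particular catalog product: the names TagTemplate, Catalog, make_tag, attach and search, the "governance" template, and the example datasets are all assumptions invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TagTemplate:
    """Hypothetical template: a named set of fields every tag must fill in."""
    name: str
    fields: dict  # field name -> expected Python type

    def make_tag(self, **values):
        # Reject unknown or missing fields so every tag has the same shape.
        if set(values) != set(self.fields):
            raise ValueError(f"tag must set exactly: {sorted(self.fields)}")
        for key, expected in self.fields.items():
            if not isinstance(values[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        return {"template": self.name, **values}


@dataclass
class Catalog:
    """Hypothetical in-memory catalog mapping dataset names to their tags."""
    tags: dict = field(default_factory=dict)

    def attach(self, dataset, tag):
        self.tags.setdefault(dataset, []).append(tag)

    def search(self, **criteria):
        # Return datasets that have at least one tag matching every criterion.
        return sorted(
            dataset
            for dataset, tags in self.tags.items()
            if any(all(tag.get(k) == v for k, v in criteria.items()) for tag in tags)
        )


# Example data (made up for illustration).
governance = TagTemplate("governance", {"owner": str, "contains_pii": bool})
catalog = Catalog()
catalog.attach("sales.orders", governance.make_tag(owner="finance", contains_pii=False))
catalog.attach("crm.customers", governance.make_tag(owner="marketing", contains_pii=True))

pii_datasets = catalog.search(contains_pii=True)
print(pii_datasets)  # ['crm.customers']
```

Because every tag is built from the same template, a search such as contains_pii=True means the same thing for every dataset, which is the point of managing the metadata in one place.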

Data may be stored in all kinds of ways: in a relational database, a cloud storage service, a data lake, or simply as files of some type. You need one place to go to find what you are looking for, and a catalog covers all of these different types of data. There are three main users of data catalogs:

  1. Data engineers, data analysts, etc. They take the source data and set it up in the database. They may profile it, look for ‘dirty data’, and clean it up before it ultimately reaches the user.
  2. Data steward. They are like a librarian: they organise the data correctly, identify it through various tags, and analyse KPIs such as its quality, how much it is being used, and where it came from (its lineage). They also perform the data governance role, deciding who can get access to what data.
  3. Data consumer. They may need to pull the data and do some analysis, and they want to do it themselves rather than having to go and ask an IT person. It is much like an online shopping experience: they browse and take the parts of the data they need.
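
The three roles above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a real catalog implementation: the Entry record, the register, govern and fetch functions, the quality measure (share of rows with no missing values), and the sample "orders" data are all hypothetical and chosen only to mirror the roles described.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical catalog record for one dataset."""
    name: str
    rows: list
    lineage: str = "unknown"                         # where the data came from
    allowed_roles: set = field(default_factory=set)  # governance policy
    access_count: int = 0                            # usage KPI

    def quality(self):
        # Share of rows with no missing (None) values: a simple quality KPI.
        if not self.rows:
            return 0.0
        clean = sum(1 for row in self.rows if None not in row.values())
        return clean / len(self.rows)


catalog = {}


def register(name, rows):
    """Data engineer: load a dataset, dropping rows that are entirely empty."""
    cleaned = [r for r in rows if any(v is not None for v in r.values())]
    catalog[name] = Entry(name, cleaned)


def govern(name, lineage, allowed_roles):
    """Data steward: record lineage and decide who may read the dataset."""
    entry = catalog[name]
    entry.lineage = lineage
    entry.allowed_roles = set(allowed_roles)


def fetch(name, role):
    """Data consumer: self-service read, subject to the steward's policy."""
    entry = catalog[name]
    if role not in entry.allowed_roles:
        raise PermissionError(f"{role} may not read {name}")
    entry.access_count += 1
    return entry.rows


# Example data (made up for illustration).
register("orders", [
    {"id": 1, "total": 9.5},
    {"id": 2, "total": None},
    {"id": None, "total": None},   # entirely empty, dropped on registration
])
govern("orders", lineage="exported from the billing system", allowed_roles={"analyst"})
rows = fetch("orders", role="analyst")
print(len(rows), catalog["orders"].quality(), catalog["orders"].access_count)  # 2 0.5 1
```

The consumer never asks anyone for the data directly; the call to fetch succeeds or fails purely on the policy the steward recorded, and each successful read feeds the usage KPI.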

A data catalog should be part of your data pipeline, where it is the last stage of the process.