What metadata to capture?

In two recent LinkedIn posts (here and here), I posed the question of what kind of metadata to register. The underlying premise is that one needs information about the data itself to manage it: metadata is a prerequisite for sound data management.

I received great feedback and suggestions. In this article, I will provide a summary of all the kinds of metadata that were suggested as candidates. Additionally, I will elaborate on several concerns that need to be taken into account when implementing metadata.

Types of metadata

First of all, it is necessary to distinguish different types of metadata. A common classification is business, technical and operational metadata. This classification is also described in DAMA DMBOK.

Initially, I described my interest in metadata that needs to be captured manually, which can be regarded as primarily business metadata. That scope was refined well in the comments: the focus is on metadata that is not intrinsic to the data itself and therefore needs to be attributed explicitly. In other words, it concerns external attributes of data. This also implies that human judgement is required to determine the metadata value; it can’t be derived or generated automatically.

Focus and scope

As I mentioned in my original post, what metadata to register depends on specific circumstances and organisational context. Not every organisation requires the same kind of metadata. Many commenters endorsed this. As with normal data, specific requirements should be a key driver for capturing metadata. Regulatory requirements and risk management were mentioned as examples of such drivers. Not adopting a requirements-driven approach carries the risk of capturing too much metadata. Metadata, just like normal data, needs to be managed. So it is imperative that the metadata that is captured is useful and serves specific purposes.

Although they are required for sustainable metadata management, data governance and metadata tools pose a risk in this respect as well. That risk stems from the capability to define your own custom metadata fields. If that capability is not managed properly (Who is allowed to add new custom fields? What are the requirements for adding custom fields? How do we ascertain that new custom fields align with existing metadata? Etc.), it may lead to an explosion of metadata too.

This is one of the reasons why we adopt a metamodel-driven approach to metadata management at Deltiq. Every type of metadata is managed by way of an integrated metamodel, independent of the physical implementation of that metadata or the tools used to capture and administer it.
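
The governance questions about custom fields raised above can be made concrete with a small sketch. It is only an illustration under assumptions: the class name CustomFieldRegistry, the role name "metadata_steward" and the three checks (an authorised requester, a stated purpose, no duplicate of an existing field) are hypothetical and do not represent any particular tool or the Deltiq metamodel.

```python
from dataclasses import dataclass, field


@dataclass
class CustomFieldRegistry:
    """Hypothetical registry that gates the creation of custom metadata fields."""

    approvers: set[str]  # who is allowed to add new custom fields
    fields: dict[str, str] = field(default_factory=dict)  # field name -> purpose

    def add_field(self, name: str, purpose: str, requested_by: str) -> None:
        key = name.strip().lower()  # normalise so near-duplicates are caught
        if requested_by not in self.approvers:
            raise PermissionError(f"{requested_by} may not add custom fields")
        if not purpose.strip():
            raise ValueError("a custom field needs a stated purpose")
        if key in self.fields:
            raise ValueError(f"field '{key}' already exists")
        self.fields[key] = purpose


registry = CustomFieldRegistry(approvers={"metadata_steward"})
registry.add_field("Retention period", "Regulatory reporting", "metadata_steward")
```

Requests from anyone outside the approver set, without a purpose, or for a name that already exists are rejected, which keeps the set of custom fields tied to explicit requirements.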

List of metadata

Based on all the feedback, a list of metadata types is given below. Where relevant, I have used some of the original comments. Furthermore, I have provided a global definition or description for each type. This list is by no means exhaustive and can and should be refined and expanded for specific applications. Hopefully, it provides a quick reference to start capturing metadata that is relevant to your organisation’s specific context.

Please note that the value for specific metadata itself can be context-specific as well! For example, access policies may differ, depending on the intended use of the data.
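
A minimal sketch of what a context-specific metadata value could look like: the access policy is looked up per combination of data element and intended use. The element name, the uses and the policy texts are all made-up examples, and the restrictive default is an assumption of this sketch.

```python
# Hypothetical access policies for one data element, keyed by intended use.
ACCESS_POLICIES = {
    ("mortgage_sum", "regulatory_reporting"): "full access",
    ("mortgage_sum", "marketing_analytics"): "aggregated values only",
}
DEFAULT_POLICY = "no access"  # assumption: unknown contexts get the strictest policy


def access_policy(element: str, intended_use: str) -> str:
    """Return the policy for this element in this context, or the default."""
    return ACCESS_POLICIES.get((element, intended_use), DEFAULT_POLICY)


print(access_policy("mortgage_sum", "marketing_analytics"))  # aggregated values only
```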

Definition: Description that explains the specific meaning of a data element.

Allowed values: A specification of the values that are allowed for a data element. This may obviously differ from the actual values that are encountered in the data itself. For a Mortgage Sum data element, for example, a value rule may be that any amount must be higher than € 0,-.

Datatype & field length: Specification of whether a data element is a string, integer, date, floating-point number etc., and of its maximum length.

Personal data: Information on whether or not a data element concerns personal data.

Confidentiality or sensitivity classification: Indication, often based on a taxonomy, of the level of confidentiality with which a data element is to be treated, like ‘Public’, ‘Internal use only’, ‘Strictly confidential’.

Data owner, data steward, SME: Allocation of responsibility for different kinds of tasks to a specific organisational position or employee. Data steward: the actor responsible for monitoring data quality; SME: subject matter expert, the go-to person for questions about the data itself or its definition.

Retention requirements: Specification of how long data can or should be retained; archiving and purge policies.

Business processes: Business processes that either create or consume the data.

Collection objective: The original business purpose or objective for which data was captured in the first place.

Collection method: The way data is initially collected, for example via manual data entry, sensors or semi-automatic capture (e.g. manually scanning retail goods).

Allowable use: Conditions that define or explain how data is allowed to be used. Such policies should therefore also make clear how data is not to be used.

Usage, lifetime information: Statistics or information on how (often) data is used and for what purpose.

Legacy vs. active data: Indication of whether or not data is still used.

Delivery agreements, data handshake: Agreements that concern the exchange of data between a supplier of data and a user of data. These can be either internal or external.

Contractual / legal obligations / Digital rights: Rules or agreements that originate from laws, regulations or contracts (e.g. with a commercial data provider) that either limit or enable the processing of data.

Source system / System of record / golden source: Applications or systems that contain the data. A system of record or golden source is the system that is designated or seen as the authoritative source system.

Lineage, horizontal data architecture: Information about the subsequent use of data, where data follows a flow pattern. Data is exchanged between a source and a target where the target can act as a source for a subsequent exchange.

Access policies: Policies, guidelines and/or rules that govern and guide how data can or should be accessed.
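
To give an impression of how a few of the metadata types above could be recorded together, here is a minimal sketch. Everything in it is an assumption made for illustration: the DataElement class, the field selection, the system names and the placeholder values. The allowed-values rule follows the Mortgage Sum example (amount higher than zero), and lineage is modelled as flows in which a target can itself act as a source.

```python
from collections import deque
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable


@dataclass
class DataElement:
    """Hypothetical record holding a handful of the metadata types listed above."""

    name: str
    definition: str
    confidentiality: str  # e.g. 'Public', 'Internal use only'
    data_owner: str
    source_system: str
    is_allowed: Callable[[Decimal], bool]  # the allowed-values rule


mortgage_sum = DataElement(
    name="Mortgage Sum",
    definition="Placeholder definition, for illustration only.",
    confidentiality="Internal use only",
    data_owner="Head of Mortgages",  # made-up position
    source_system="loan_admin",      # made-up system name
    is_allowed=lambda amount: amount > 0,
)

# Lineage as source -> direct targets; a target can be the source of a next flow.
FLOWS = {
    "loan_admin": ["data_warehouse"],
    "data_warehouse": ["risk_reporting", "finance_mart"],
}


def downstream(system: str) -> list[str]:
    """All systems that directly or indirectly receive data from `system`."""
    seen, queue, order = {system}, deque([system]), []
    while queue:
        for target in FLOWS.get(queue.popleft(), []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


print(mortgage_sum.is_allowed(Decimal("250000")))  # True
print(mortgage_sum.is_allowed(Decimal("0")))       # False
print(downstream(mortgage_sum.source_system))
```

In practice, the selection of fields would follow from the specific requirements of the organisation rather than from a fixed template like this one.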