Originally published November 13, 2007
We have spoken about the need for data quality in generating business intelligence in a previous article and nothing illustrates this more than an acronym we are all too familiar with: GIGO – garbage in, garbage out. Yet even before we focus on data quality, we need to step back and address the issue of data integrity, a key concept in information assurance and one with a much wider footprint than data quality.
Without data integrity, we cannot even begin to be concerned with data quality, since a lack of integrity means that we might not have all the data, that we cannot access it either physically or logically, or that we can have no certitude of its condition.
There are many definitions of data integrity. Wikipedia says that “data integrity is the assurance that data is consistent, correct, and accessible.” Webopedia states that data integrity “refers to the validity of the data.” SearchDataCenter.com points out that “in terms of data and network security, (data integrity) is the assurance that information can only be accessed or modified by those authorized to do so.”
Furthermore, as we move into the database world, data integrity comes in many different types: null rules, unique column values, primary key values, referential integrity rules and complex integrity checking.
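To make these database-level rules concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer and purchase tables, their columns and the violates helper are hypothetical names chosen for illustration, and a CHECK constraint stands in for the simplest form of "complex integrity checking"; a production database would define its own schema and rules.

```python
import sqlite3

# In-memory database; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,           -- primary key rule
    email       TEXT NOT NULL UNIQUE           -- null rule and unique column values
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customer(customer_id),  -- referential integrity rule
    amount      REAL NOT NULL CHECK (amount >= 0)  -- a simple business rule
);
""")
conn.execute("INSERT INTO customer (customer_id, email) VALUES (1, 'a@example.com')")
conn.commit()


def violates(sql):
    """Return True if the database rejects the statement on integrity grounds."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


results = {
    "null rule": violates(
        "INSERT INTO customer (customer_id, email) VALUES (2, NULL)"),
    "unique column": violates(
        "INSERT INTO customer (customer_id, email) VALUES (3, 'a@example.com')"),
    "primary key": violates(
        "INSERT INTO customer (customer_id, email) VALUES (1, 'b@example.com')"),
    "referential integrity": violates(
        "INSERT INTO purchase (purchase_id, customer_id, amount) VALUES (1, 99, 10.0)"),
    "check rule": violates(
        "INSERT INTO purchase (purchase_id, customer_id, amount) VALUES (2, 1, -5.0)"),
}

for rule, rejected in results.items():
    print(f"{rule}: {'rejected' if rejected else 'accepted'}")
```

Each statement in results breaks exactly one rule, so the database itself, rather than application code, is what refuses the bad data.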
While these definitions are all relevant, let’s go with the Committee on National Security Systems and its National Information Assurance Glossary. They define data integrity as the “condition existing when data is unchanged from its source and has not been accidentally or maliciously modified, altered, or destroyed.” (National Information Assurance Glossary, CNSS Instruction No. 4009)
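One common way to test whether data is "unchanged from its source" is to record a cryptographic digest when the data is captured and compare against it later. The sketch below assumes SHA-256 from Python's standard hashlib module; the sample record and the function names are hypothetical, and nothing here is prescribed by the CNSS glossary.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, expected_digest: str) -> bool:
    """Compare the current digest with the one recorded at the source."""
    return sha256_digest(data) == expected_digest


# Hypothetical record captured at its source.
source = b"account=1001;balance=250.00"
recorded = sha256_digest(source)

# Later, the same bytes pass the check and an altered copy does not.
tampered = b"account=1001;balance=950.00"
print(is_unchanged(source, recorded))
print(is_unchanged(tampered, recorded))
```

A plain digest is enough to catch accidental change. Guarding against malicious change also requires that the recorded digest be stored where an attacker cannot rewrite it, or that a keyed scheme such as an HMAC be used instead.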
Integrity means wholeness, entirety and soundness. It comes from the Latin for “entire,” and Webster provides three meanings:
Even more telling, it provides the following synonym – honesty – for integrity, and the following synonyms for each of the above meanings: incorruptibility, soundness and completeness.
When looked at under this linguistic magnifying glass, it becomes clear that data integrity is basically about trust in our data.
So what then must we do? Each enterprise must first choose an acceptable working definition for data integrity – one that fits its own culture, mission and priorities. From this base, they must develop a framework within which to address data integrity throughout the organization. In other words, this framework will provide a set of guidelines to answer these key questions:
In developing the framework, it is essential that all the principal aspects of data integrity are addressed. Hence, the physical and logical integrity of the data must be looked at, along with access control, identification and preservation of data integrity for systems of record and, of course, data quality.
By physical integrity, we are referring to the need to assure the physical protection of the data from malicious or accidental harm and from natural disasters. In this context, continuity of operations planning and tasking is an essential component of the data integrity framework, as are the more mundane backup/restore processes that IT has instituted for decades.
Preserving the logical integrity of the data needs a bit more explanation. First, integral data must be transparent. That means that when asked, we should be able to provide information about its original source, time stamps, formats, and other relevant facts related to the data models and individual data elements through the equivalent of data archaeology. Ideally, these would be available on demand to authorized users through an enterprise metadata repository.
Furthermore, it becomes important to track logical data integrity as applications are implemented on architectures where data movement, operations or migrations might alter, truncate, delete or corrupt the data. This is particularly important in the context of systems of record. These are information storage systems that are considered the authoritative data source for a given piece of data or information. The word “authoritative” carries special importance in the legal context. If you submit a tax return to the IRS on paper and they make a data entry mistake as they input your return into an electronic medium, what is the system of record? Should they audit you or request that you pay based on an erroneous calculation? Systems of record are extremely important in many such contexts, and their integrity has to be preserved even as we manipulate and convert the data into other formats or integrate it into other systems for specific applications.
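As an illustration of tracking logical integrity across a data movement, the sketch below compares a system of record with a migrated copy and reports rows that went missing, appeared unexpectedly or were altered. The record layout, the 20-character truncation and the function names are all assumptions made for the example, not a description of any particular migration tool.

```python
import hashlib


def row_fingerprint(row: dict) -> str:
    """Hash a row's fields in a fixed order so equal rows give equal digests."""
    canonical = "|".join(f"{name}={row[name]}" for name in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_copies(source_rows, target_rows, key):
    """Report keys that are missing, unexpected or altered in the target."""
    src = {row[key]: row_fingerprint(row) for row in source_rows}
    tgt = {row[key]: row_fingerprint(row) for row in target_rows}
    return {
        "missing": sorted(set(src) - set(tgt)),
        "unexpected": sorted(set(tgt) - set(src)),
        "altered": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


# Hypothetical system of record.
source = [
    {"id": 1, "name": "Alexandria Fitzgerald-Montgomery"},
    {"id": 2, "name": "Bo Li"},
    {"id": 3, "name": "Chen Wu"},
]

# Hypothetical migrated copy: one name truncated to 20 characters, one row lost.
target = [
    {"id": 1, "name": "Alexandria Fitzgeral"},
    {"id": 2, "name": "Bo Li"},
]

report = compare_copies(source, target, key="id")
print(report)
```

Here the report flags id 3 as missing and id 1 as altered, which is the kind of evidence needed to decide whether the migrated copy can still be trusted in place of the system of record.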
In order to protect data integrity, we also need to establish some controls over access to that data. This, of course, means entering the realm of identity management and data security. It is through identity management that we control who accesses the data, assuring that the person (or system) seeking access is both authentic (they are who they say they are) and authorized (they have the appropriate permission to access the data). Beyond that, we need to address the broader context of security with all its specific ramifications.
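The distinction between authentic and authorized can be sketched in a few lines. The example below is a toy built only on Python's standard library: the USERS table, the user name, the secret and the permission labels are all hypothetical, and a real deployment would rely on an identity management product rather than an in-memory dictionary.

```python
import hashlib
import hmac
import os


def hash_secret(secret: str, salt: bytes) -> bytes:
    """Derive a salted hash so the secret itself is never stored."""
    return hashlib.pbkdf2_hmac("sha256", secret.encode("utf-8"), salt, 100_000)


# Hypothetical user store: one analyst with read-only permission.
_salt = os.urandom(16)
USERS = {
    "analyst": {
        "salt": _salt,
        "hash": hash_secret("correct horse", _salt),
        "permissions": {"read"},
    }
}


def authenticate(user: str, secret: str) -> bool:
    """Authentic: the caller is who they say they are."""
    record = USERS.get(user)
    if record is None:
        return False
    candidate = hash_secret(secret, record["salt"])
    return hmac.compare_digest(candidate, record["hash"])  # constant-time comparison


def authorize(user: str, action: str) -> bool:
    """Authorized: the caller has permission for this action."""
    record = USERS.get(user)
    return record is not None and action in record["permissions"]


def can_access(user: str, secret: str, action: str) -> bool:
    """Both checks must pass before the data is touched."""
    return authenticate(user, secret) and authorize(user, action)


print(can_access("analyst", "correct horse", "read"))
print(can_access("analyst", "correct horse", "write"))
```

Keeping the two checks in separate functions mirrors the text: a valid identity alone does not grant the right to modify data.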
Last but not least, we arrive at data quality. I have left it for the end not to diminish its importance in any way, but because it is an area that we at least tend to understand better. Under the data quality rubric, we must address all key attributes of data: accuracy, consistency, completeness, timeliness, uniqueness and validity. Each one of these must have its own set of rules on how to deal with specific instances or exceptions.
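As a rough sketch of what per-attribute rules can look like, the example below checks a handful of records for completeness, validity, uniqueness and timeliness. The field names, the deliberately simple email pattern and the one-year staleness threshold are assumptions chosen for illustration. Accuracy and consistency are left out because they usually require comparison against a trusted reference source.

```python
import re
from datetime import date, timedelta

# Deliberately simple pattern for the example; real email validation is stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_quality(records, today, max_age_days=365):
    """Return (record index, attribute, message) for every rule violation."""
    issues = []
    seen_ids = set()
    for index, rec in enumerate(records):
        if not rec.get("email"):
            issues.append((index, "completeness", "email is missing"))
        elif not EMAIL_PATTERN.match(rec["email"]):
            issues.append((index, "validity", "email is malformed"))
        if rec["id"] in seen_ids:
            issues.append((index, "uniqueness", "duplicate id"))
        seen_ids.add(rec["id"])
        if today - rec["updated"] > timedelta(days=max_age_days):
            issues.append((index, "timeliness", "record is stale"))
    return issues


# Hypothetical records: one clean, one incomplete, one failing three rules.
records = [
    {"id": 1, "email": "a@example.com", "updated": date(2007, 10, 1)},
    {"id": 2, "email": "", "updated": date(2007, 10, 1)},
    {"id": 2, "email": "not-an-email", "updated": date(2005, 1, 1)},
]

issues = check_quality(records, today=date(2007, 11, 13))
for index, attribute, message in issues:
    print(f"record {index}: {attribute}: {message}")
```

Reporting the attribute alongside each violation makes it possible to route exceptions to whichever rule set governs that attribute.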
Ultimately, a data integrity framework at the enterprise level will have to address any compliance requirements related to data integrity (e.g., HIPAA, Sarbanes-Oxley) as well as any interrelationships of data integrity with privacy or other salient issues.
So before we can aspire to obtain business intelligence from our bits and bytes, we must be sure that our data is sound, incorruptible and complete: in other words, that we can trust it. How else can we be sure that the result of our analysis is really intelligence rather than merely intelligent-looking garbage?
Recent articles by Dr. Ramon Barquin