Legacy data inventory and profiling is a structured and comprehensive way of learning about the corporate data asset. This activity is centered on a professional data analyst who gathers information, runs a variety of reports and ad hoc queries to assess the existing data and creates or updates documentation about the existence, scope, meaning and quality of the data asset.
The resulting documentation (including high-level summaries and easily navigated detail data behavior documentation) should provide answers to the following key questions:
In addition to the technical issues of where the data is located, the lead data analyst should also assess and document the political issues surrounding the data—an issue such as who “owns” any portion of the data or who seeks to limit read-only access to the data.
The meaning of the data is partly determined, by the physical and logical architecture in which it resides. Therefore, if logical data models exist for the data asset (more than one may be necessary in larger, older and complex organizations), then such a data model should be ratified or updated by this review process. This is necessary because businesses often change to meet competitive demands faster than the physical data architecture can keep up with it. For example, the lack of fields for new business categorizations may tempt clever analysts and programmers to re-use an existing field for a secondary business purpose. Such modifications are often unmentioned in the documentation, but can be discovered by thorough data profiling and domain studies.
And where no logical data models exist, the actual data behavior provides a basis (along with other sources of business knowledge) for construction of an “as-is” model of the business. Such models (new or modified and ratified) are essential for evaluating the adequacy of any new application design (package or home-grown) to meet both existing and future business requirements.
The core activity of this data inventory is surveying actual data behavior by inspection—running queries against production (not test) data—production data which has grown over normal business activity. So while original documentation (reflecting the intent of the systems designers) is useful, actual data behavior reveals what has occurred and how things may have changed.
Three Basic Levels of Data Inventory
Level-1 Identifies all the databases (or “data mass” if not on a DBMS) in the organization, with a summary of the number of tables, the business scope, updating applications and read-access requirements.
Level-2 Identifies all the tables in each database: their size, their growth patterns, the business events creating new records, their age distribution, etc.
Level-3 Looks at each column or field in each table: its business meaning, its format, the source of the data in it, its quality and completeness and any lateral relationships with other data fields. This Level-3 survey uses a domain study format to provide a quick grasp of the scope of values found in the field, giving visibility to the anomalous values and the extremes of the collating sequence.
Tools
A technical environment must be established to allow the data analyst unrestricted read-only access to production business data or exact copies (as might be found in some backup files). Much of the survey requires sudden, ad hoc queries against the data to answer new questions which come up in the analysis. Thus, it is important to allow the data analyst to compose such queries through a simple programming language or simple reporting tools.
The initial survey of the data can be accomplished by a studious data analyst without the time and expense of buying high-end data quality tools. A simple query tool or fourth-generation language may be sufficient. The goal is to reduce the cost of posing ad hoc queries or simple reports against the business data. Cooperative measures should obviously be exercised to prevent these queries from disrupting existing business processing. Such a query tool strategy would also place a high value on ease-of-use and re-use of code.
If, after using these techniques for running domain study reports, selected formatted dumps, cross-field consistency tests and referential integrity tests, it becomes obvious that a more sophisticated data quality tool would be helpful, then such a purchase may be warranted.
Data Quality
A comprehensive survey of the data assets must include estimates of data quality. Data quality includes the presence of rows of data for corresponding objects in the environment of the business, the validity, reasonability, precision, accuracy and currency of the data in each field or column. Other things to consider are an assessment of the adequacy of data definitions, the consistency of definitions over time and other dimensions.
Organized Data Documentation
If a metadata repository exists, use it. If none exists in the enterprise, then a simple network of documents may be created to allow the recording of findings of data exploration and data profiling. These documents would be networked from database, table and field details. Other documents may also be warranted, such as a summary of anomalies and exceptions found, with recommendations for remediation.
Proven Benefits
I have applied these techniques repeatedly to a variety of databases for several companies with success. A thorough data profiling analysis uncovers anomalies and errors in the data, as well as “rogue” uses of data fields. For example:
The domain study techniques I have developed are very good at finding anomalies. Data anomalies may either be data errors, or data which accurately describes anomalies in business behavior—either rogue employees being creative, or entrepreneurial. In all cases, such surveys of the business data provide many business users and managers with a better understanding of their data assets and current business behavior.
As the data analyst runs these studies, ad hoc reports and tests of the data asset, he or she develops a “feel” for the quality, behavior and anomalies. This valuable knowledge needs to be properly documented so anyone else using the data is not surprised by the content. This documented knowledge also provides a good foundation for the design of a data warehouse.
As this body of knowledge grows, key metrics of data quality and completeness can be developed, and run on a regular basis to establish a time-series chart of progress.
Conclusion
Building a comprehensive data inventory requires the work and skills of an insightful (and at times cynical) data analyst, who thoroughly examines the content of production databases. It is not only a technical activity, but also requires comparing the data behavior and derived meaning against the expectations of programmers, analysts and business users. It is a business-oriented activity—not hardware oriented.
In spite of the effort, the payback in finding data anomalies, errors and future problems is well worth the effort.
Recent articles by Michael Scofield
Michael is a widely known speaker and author on data quality and semantic data integration. He has held data architecture and data quality management positions in banking, finance and education. He has taught workshops for numerous organizations including the information quality conferences, numerous DAMA chapters, The Data Warehousing Institute, the Institute of Internal Auditors, chapters of the Quality Assurance Association, the Enterprise Data Forum, European Meta-data Conferences, Association of Computing Machinery, numerous DBMS user groups, and business intelligence tool conferences. His articles appear in numerous professional journals, and he writes occasional humor for the Los Angeles Times and other magazines. Michael can be reached at NMScofield@aol.com.