Originally published October 13, 2005
In my previous article, Data Quality: Building a Data Inventory Utilizing Data Profiling, I gave a general overview to some of the tasks and benefits of a comprehensive inventory of data in the enterprise. I divided the general effort into three phases or three levels (each more detailed than the previous) of inventory knowledge.
Three basic levels of data inventory
The three levels of data inventory correspond to three layers of knowledge about business data. They also tend to be done in sequence, left to right (as shown below):
Fig. 1: Summary of three levels of data inventory.
Level-1 identifies all the databases (or “data mass” if not on a DBMS) in the organization, with a summary of the number of tables, the business scope, updating applications, and read-access requirements. This inventory includes major data flows, and any enterprise-owned data residing outside the premises.
Level-2 of the inventory looks at the tables in each database: their size, their growth patterns, the business events creating new records, their age distribution, etc.
Level-3 of the inventory looks at each column or field in each table: its business meaning, its format, the source of the data in it, its quality and completeness and any lateral relationships with other data fields. This Level-3 survey uses a domain study format to provide a quick grasp of the scope of values found in the field, giving visibility to the anomalous values, and the extremes of the collating sequence.
Data Inventory Level-1 Identifying databases
You can’t manage what you can’t detect. You can’t measure what you can’t see. And you can't generally see what you don’t know about, and where to look. So in constructing an inventory of the data assets of an enterprise, one must search far and wide. If there are production system controls in place in your shop, then there is probably a master list of the production programs, and perhaps the files.
If all the data were in orderly relational databases, we would have a fairly simple activity here; we could go to the RDBMS directory and list the database, list the tables, and list the columns. All done! Go to lunch!
Fig: 2: Simple metamodel of an RDBMS database, table, and column.
But what about other files, not sitting on a DBMS? How about legacy flat files? Many of those flat files going through legacy COBOL systems had multiple record types.
Fig. 3: Multiple type records in one physical file.
In the example above (Fig. 3) we see one physical file with multiple record types. (This will look very familiar to older readers!) Record-type A contains data about the whole invoice, including the shipping address and the total invoice amount, and the invoice date. Record type-B contains the detail of each invoice line item, including the product ordered, the quantity, the unit price and the extended price. In traditional systems, the various record types would not always fill up the same number of bytes, so there would be some filler at the end.
This flat file has a logical data architecture. It contains (in this very simple example) two subject entities:
Fig. 4: Logical model of a flat file with two record types.
So we would, from a normalized view, see these two record types as two “table equivalents.” Or we could call them “logical tables."
Politics
Our survey of the enterprise data asset will be enhanced significantly by including the initial political considerations of data asset. That is, to really inventory data, you must see it (the actual data!). This means you must be able to pose read-only queries against the data to profile it and test its [non-null] existence and quality. There are two potential obstacles to doing that; technology and politics.
Some data files are more sensitive (private) than others--for a variety of reasons. And in large, mature bureaucracies, there are people (or departments) that feel they “own” the data in a particular database, table or field. They often don’t want to share it.
I must say here that an astute data professional would never tell someone in the business user community that they “own” some data. No, that is dangerous, because with that illusion of “ownership” comes a feeling that they can do anything they want to with the data (without regard to downstream users or consequences). Business users and analysts are often tempted to alter the meaning of fields, or use a single field for two, diverse purposes. Or, they may expand the range of valid values, or redefine some of those valid codes. In making such changes, they may have not been aware of, or considered the downstream usage of the data. No, we want them to see the data as a corporate resource, and their role as a custodial one--a sense of stewardship of the data for the benefit of the entire enterprise.
Yet, they may politically control the data. Because of this, one of the first things we must do (often with a substantial critical path) is determine whose permission we, as data analysts, must obtain to actually see the contents of the data.
And we don’t want just to see test data, or some sample of the data. We want, ultimately, to have read-only access to all the data. I stress “read-only” because not only do we not intend to change any data (that is not our responsibility or our privilege in an audit or inventory), we also don't want the power to change business production data, lest something does change (through some other cause), and we get blamed.
No, we just wish to look at the data. But we may still need permission. Hence, we must enhance our level-1 metamodel of knowledge about the data, this sense of ownership or power over the data.
Fig. 5: Enhanced metamodel of a level-1 legacy data inventory.
And we need to start to cultivate a relationship with those “owners” of the data to gain their trust. Down the road, in looking at their data, we are going to find anomalies, and possibly even data errors. Sometimes there will be serious data errors. And we may be in the position of having to deliver bad news. We want to already be on their good side long before we have to do that.
How to inventory cascading data
A common scenario in many legacy systems (especially using COBOL on mainframes) is data cascading through a succession of programs and files.
Fig. 6: Cascading files in a COBOL legacy system.
In the above sketch, a daily invoice file is merged with the previous night’s weekly invoice file to create a “week-to-date” invoice file.
By the way, I am using a tape icon because many of these applications began as batch tape systems. They were converted (in a mainframe environment) to disk-based systems (with catalogued flat files replacing the tapes), but there was no other change to the processing. Sometimes the disk files (in lieu of tape) may have been catalogued as generation data files. So they might have similar names, with only the generation number differing.
In many such situations, all these files have the exact same physical format (with one or many record types). Only their scope is different. The scope is, primarily, defined by the time.
But how do we know that for sure? We could examine the code or processing in programs IV201A, and IV305A. First, are there any filters which would exclude any records? Also, are there any processes in the code which would compute or change any of the data coming through? (If not, why didn’t they just use some kind of sort utility?)
If there is any kind of “value added” in these programs, then we must treat each file as a separate item in our inventory. It is much safer.
Again, here is where we find a profound difference between file formats, and files. Many files can share a common file format, but they may have different scopes.
Fig. 7: Metamodel of legacy cascading flat files (with record types)
So the data element is not the data. The file format is not the same as the actual file. This distinction is lost on many junior programmer/analysts. The distinction was also lost on a major CASE (computer assisted software engineering) tool designer in the late 1970’s. They assumed that the file format and the data element (which were all re-usable--a good thing) were all there was to the data. They provided no way of recognizing physical files (multiple files using the same record description) which may have different scopes.
Distributed databases
There is another kind of parallel design where files may mimic each other in format and structure. This is where an organization installs distributed versions of a common system design. A corporate bank can install identical servers in each branch, each containing some basic reference files. Similarly, a manufacturing company can install identical mini-computers in each plant, warehouse or distribution center (below).
Fig. 8: Parallel distributed systems in warehouses.
In such a design strategy, each Warehouse Management System server has a product master file on it. And to make things convenient, each Product Master should probably be identical to the others. (Hopefully, they are updated on a regularly scheduled basis from a common master maintained at corporate headquarters.)
While the Open Order files should probably only contain those orders appropriate for that warehouse, the Product Master will probably contain the entire range of potential products. The Open Order file is transactional in nature, while the Product Master is a slowly-changing entity.
Some questions for data inventory efforts include…
A logical data model of the knowledge about these distributed systems would be as follows:
Fig. 9: Logical data model for distributed system instances.
One question remains: How do we recognize these distributed systems? So each physical instance of single logical design may require an additional entity on our meta-model.
Fig. 10: Enhanced metamodel accomodating physically distributed (but identical) servers.
Off-site, hosted databases
Not all enterprise data resides on premises owned by the enterprise. A variety of data are “outsourced” or even “offshored." Many companies contract their billing to a service bureau where the data resides on another computer owned by the vendor, often at the vendor’s facility. Lately, many enterprise web sites have developed links (for certain business functions) to external web applications hosted on the vendor’s computer, miles away from the enterprise.
This introduces a new meaning of “distributed.” Yet these databases are part of the data assets of the enterprise. Or, if one considers the database (and hardware) owned by the vendor, the data in the database is our corporate asset. This must be recognized in a data inventory of the whole enterprise.
Now, this introduces a new challenge in data asset inventory. We know it is there, but how can we read it? It is rare that the hosting vendor will allow you to pose ad hoc SQL queries against their database (albeit your data).
You may wish to create your own “shadow” or copy database of the database currently residing on a service bureau vendor’s machine. The shadow copy would be on a server on your premises which you own and control. And you should establish a regular, scheduled refresh of your copy database. If a vendor will not accommodate that, something is wrong. Obviously, you would want to have all the tables, if possible.
But there can be a problem here. The vendor may be running your business on a proprietary software package with what the vendor considers a proprietary database design. This was, of course, a bad situation to get into in the first place. But it is your data (even if it is in their logical architecture). How do you get it out? One solution might be to employ some kinds of API’s which are designed around major, “kernel” subject entities (e.g. customer, product, sales transaction event, etc.). Then, you would construct a slightly-less-than-optimally-normalized database containing YOUR data, copied out of the vendor’s system.
The vendor may offer (perhaps at a cost) to create some “survey” reports, which give you a feel for your data on their site (such as the number of records in “their” customer file). Resist this temptation. What you ultimately want (for the data profiling and quality tests which will come later) is a complete and accurate copy of that data (“Your data!”, he shouts), held on a server which you control, on your premises.
Imported data flows from external sources
Many organizations import data on a regular basis. This can be data about their customers from a third party, or data about competitive behavior, market conditions (commodity prices, etc.), the environment of the business (weather, ocean temperatures, traffic conditions, etc.). Much of this data may be held temporarily in some buffering function before being loaded or integrated into your enterprise-owned applications or data warehouses. This data could be imported hourly, daily, weekly or monthly.
One crucial question arises immediately: Is this imported data scrutinized by any automated screening or quality checks? We shall address that later, but if not, it may be prudent to design a production-grade buffer file where the imported data sits on your box. Perhaps, you may wish to store the on a server, retaining the past number of installments or versions for later examination.
Production servers vs. “personal computer” data
So far, we have talked about formal, production databases, generally under the purview of an MIS or IT department. But for the past 15 years, we have seen data assets created and distributed through networks of personal computers (towers and laptops) all over most enterprises. In general, these are beyond the control (and discipline) of MIS. No matter how important such data files may be to the individual and his/her function in the business, they may or may not have been subject to discipline, such as being backed-up and security-protected.
Data files of interest might be customer prospect lists, distribution lists for broadcast e-mails, budgets, simulation models, forecasts, etc. These can be so extensive, that a formal inventory of the enterprise data may have to limit itself to the most critical of these PC files.
Data vs. information
Surveying “data” on personal computers scattered around the enterprise leads us to the next logical question: What about all the non-tabular files on those personal computers? Are they important? If so, are they subject to any formal backup and recovery?
The number of unique items of tabular data are small compared to the number of original created documents (and some spreadsheets) on PC’s around the enterprise. While it may be important to acknowledge these, perhaps as a mere acknowledgement of the existence of the PC on which they sit, and while they are valuable, they may not be subject to the kind of data profiling activities we shall discuss in future chapters of this series. For example, if they are not tabular, we cannot perform the most common data quality tests on them. Yet they are still valuable. Hopefully, the enterprise has an inventory of all its PC’s and laptops. Few enterprises have inventories of their information contents.
Recap overview
A recap of all the different places we would look to find all the fragments of the total enterprise data asset is shown below:
Fig. 11: Recap overview of many possible corporate data assets.
Again, getting to this data to study it often entails some political issues. Overcoming political opposition may take time, thus they should be addressed early in any enterprise data inventory project.
Read-access environment
If we are going to progress further in this inventory project, we must be able to pose ad hoc queries against any of this data asset. We will discuss (particularly in level-3 of the data inventory) a variety of queries and reports that we will want to produce from the data. To accomplish this, we truly need a fourth generation language or a query tool with the following characteristics:
So a technical environment must be created where the data analyst can create these queries, on any of the data, for himself/herself. The creation of a new report should never take longer than five minutes. This is not something which might be handed off to a professional programmer somewhere else. Generally, that could take too long.
With granting read-only permission, and providing ease of query access, the next levels (phases) of the data inventory can proceed.
So here we have a conceptual framework for performing the first step in building an inventory of enterprise data. Just knowing what exists and where to find it, and gaining permission to read it.
Recording the data inventory
As we go around the bureaucracy learning about data assets, databases, files, etc., we need to record our findings in a reliable and consistent manner. This can be done in a crude but effective tabular manner. For each database, data mass or significant data flow, we would record the following information:
Technical information about a database
Political information about a database
Then, we can record the information as we survey the enterprise. The tabular expression (just some of the fields) about databases and “data masses” could appear as follows:

Fig. 12: Table of metadata about databases.
Again, the table count field refers to physical tables, or “virtual” tables--that is, logical segments, record types, etc.
Gaining reliable read-only access
As discussed before, data residing on hardware outside the premises of the enterprise should be copied into it. This is shown on the right side of the sketch below:
Fig. 13: Copying external data back onto in-house platforms.
This shows that we are making in-house copies of external databases (containing data which our enterprise owns), and making in-house copies (on our servers) of “imported” data files. These are brought into the enterprise from external sources that we don’t control.
What’s next?
In future articles, we will look at the level-2 functions of an enterprise data inventory. This involves understanding the behavioral characteristics of each table. These include the size, scope, growth patterns, creation events, aging and subtypes of each physical table or logical “table-equivalent” in the corporate data assets.
Then, we will look at the level-3 analysis of data, through intense data profiling, and data quality assessment of individual fields and columns.
Recent articles by Michael Scofield
Comments
Want to post a comment? Login or become a member today!
Be the first to comment!