Introduction
In my previous article, Enterprise Data Asset Inventory - Level One, we developed an initial survey of all the significant data assets in the enterprise. These data assets included the formal production databases; master files and transaction files of legacy systems; some databases on personal computers not subject to I.T. production disciplines; data owned by your enterprise but outsourced to service bureaus beyond your corporate premises; and data you import from external sources on a regular basis.
Having identified all those data assets, you must manage them. But you cannot manage what you cannot see. Getting to the data becomes very important, and is the focus of this article.
Looking at the Data
This article introduces an idea which may be novel to many programmers, analysts, and even some folk from the data management heritage. We must look at actual business production data. I say that it is novel to many people, because most technology-oriented programmers and analysts are more interested in hardware, and process (application code, screens, and such), than the behavior of the data which is the residue of a live, production application running the business.
In the first stages of a systems development life cycle, inexperience or non-assertive analysts will often spend minimal time in talking to the business user, and in getting an understanding of the business problem. They want to get to programming. They don’t feel productive until they are writing great code. So they are not interested in looking at data (which means posing queries against the databases)--either legacy data (in the applications which are going to be replaced) or the data that begins to emerge from testing the application they are building and testing. Many IT professionals are more interested in process than data.
And, of course, this leads to loneliness of data bigots--data management people of all kinds. We should have a different perspective. Our focus should not be just on metadata, but the business data which the metadata describes. If a data management professional has no interest in posing a simple ad hoc query against production data, they are taking a great risk of missing some important knowledge about the enterprise data asset.
In many large, mature (and complex) enterprises, it is the legacy data underneath legacy applications which may be the only source of knowledge about some obscure business process, business policy, business rules, or entity subtype. So we must become very familiar with the legacy data.
The legacy is relevant
If there is a legacy system (provided that this is not a new enterprise), the data (and its architecture) in that legacy system is an important source of knowledge about how the existing business works. This must be considered even if the business sponsor of the new application or database has a different vision of what the future data architecture and business rules should be. This is because in large, mature enterprises, no single person knows all the detail or the complexity of the business logical data architecture (either legacy or to-be). That complexity must be made visible for everyone to ratify or change. You don’t want to discover complexity after a new database has been designed, built, and code has been written.
So the legacy business data architecture form a baseline (no matter how radical the visionary change) for the future. And many aspects of that architecture which do not change may not be on the consciousness of those visionaries (executives or analysts) designing the future of the business. The legacy must be considered.
So, the legacy databases underneath legacy applications are an important source of knowledge about what has been done in the past (particular at the detail level, and dealing with business anomalies which executives may not know about). So we need to look at that business data, not just the legacy metadata.
Example: Product master file
A good example of the knowledge that can be gleaned from a legacy database may be found in examining a product master file. Products go through a natural life cycle in this company, being first thought up and designed, tested, then finally being manufactured and offered for sale, and later for a variety of business reasons, withdrawn from sales activity. Here are three basic conditions which are indicated by a PRODUCT_STATUS_CODE (short name of PROD_STAT_CD) in the customer master file. (Having such a field may be a miracle for some older companies where the legacy system designer had not thought that such a field would be necessary because it “was intuitive” and well-known around the office what products are available for sale.)
Because there has been some heritage of documentation in this organization, we do have a chunk of data documentation somewhere about this field, PROD_STAT_CD. It looks like this:
Looking ahead, in designing a new Product Management System for this business, we must consider how this table (PROD_MAST) and this field behaved in the past. The best source of that particular knowledge is the legacy data.
So we can pose a simple query against this field in the legacy table. Expressed in English, we want to know what values occur in PROD_STAT_CD, and how often they occur.
Whether you pose the query in SQL, or some other query language, it should be very easy (hopefully trivial) to write, and to get back results which look like this:
Notice that we find several values in this field which were not mentioned in the legacy documentation. Several of them may be considered anomalies. First, we see just two records that were blank; that is curious. Values of “Q” and “V” occur too infrequently to probably be a normal business phenomenon. But the value “I” occurs 492 times; this is probably not just a keying error.
But what does “T” mean? We need then to ask around as it was not mentioned in the original documentation above. That may be a subtype of product that never got into the original documentation.
This illustrates only some of the kinds of discoveries we can make by examining the legacy data.
The central role of the cynical human analyst
The data analysis process may be described (at a high level) in the following sketch:
Fig. 1: Central role of the data analyst
The data analyst should certainly review legacy documentation, but also should look at the “data mass” which is the legacy database. As the analyst discovers anomalies and “outliers” in the legacy data, he or she must communicate with experts on that data--business experts, and application experts (if they are still around).
There are at least two documentary outputs of this analysis process: A logical data model describing the “as-is” business architecture, and a collection of data behavior documentation which describe our findings in the legacy data. A third “anomaly summary” document may be prepared which gives quick visibility to the data anomalies found, and suggests causes or sources.
The legacy data mass
I use the term “data mass” because legacy data may not be in a modern RDBMS. It may be in a wide variety of files, tables, or archaic data management systems or techniques. (Can you say “flat files”?)
Exploration of the data mass requires unfettered query access to this data. The data analyst must be able to pose any ad hoc query against that data deemed appropriate. One cannot always predict what those queries will be. Hence, you will need a query tool which makes it easy to write a new query (often, cannibalizing code already written) and get the results in five to fifteen minutes. No programmer needs to be involved; indeed no programmer should be involved.
In surveying the legacy data, you may want to distinguish between the physical database, and the business data.
Fig. 2: Physical table vs. data elements of business interest.
Some fields in the database are for technical functions (like hashed keys). The only test you may wish to make of them is whether they are truly unique, and truly random (if they are supposed to be). Otherwise, your focus is on the data elements which have business significance and meaning.
On the other hand, some physical fields (of business value) may need to be parsed into sub-fields to glean their true business meaning.
A part number in a manufacturing system….
4092-0091-G-420A-001
…may consist of sub-fields of business significance. Your query tool must be able to separate those sub-fields, and treat them as if they were distinct physical columns (or fields, or data elements).
You may also want to be able to join data in your ad hoc exploration queries across platforms. The production, “system of record” may be on a DB2 platform, while the data warehouse (to which you occasionally may wish to make comparisons) may be in Oracle. You need read access to both. You want to look at data wherever it is.
The legacy data experts
There are generally two kinds of legacy data experts whom you will be talking with during this analysis process. The first is the business person(s) most familiar with the contents of the database. They see the data from the business process point of view. In nearly every large bureaucracy, there is some person or department which is responsible for maintaining the central reference tables, and/or critical central files (e.g. customer master file, product master file, invoices history, etc.). These people are often underestimated (by executives) in their knowledge (often not documented) of the core data which runs the business. Treat them well.
The other kind of expert you will be talking to are those IT personnel (programmers or analysts) who maintain the legacy systems. Some of these legacy programmers may be older (valued for their COBOL skills, for example) and have more history with the enterprise. They know a lot, and they have probably forgotten a lot.
As you explore the data, you are going to find anomalies (as we saw in the product status code tally above). These anomalous values require some explanation; these experts may have the answers. But how you approach them with your findings will influence whether they are threatened (and thus uncooperative) or helpful and supportive. More about that below (“politics”).
The data analysis process
The data analysis process certainly requires considering the legacy documentation (even if obsolete), as well as any other knowledge gathered. But the core of the analysis is actually posing queries against the data. And it is not merely running formatted dumps of the first few rows of a table (albeit having the code to run such a formatted dump on selected, “problem” records will be very useful). You want to consider the full range of values in a column, and this means counting records according to basic behavior and values observed.
For fields with low cardinality (i.e. a modest number of observed or expected values), a simple tally might be useful, as shown in the PROD_STAT_CD example above. Where there are hundreds or thousands of values, a summary approach to understanding the values will be more practical. I like to use Domain Studies, a technique which will be demonstrated in subsequent articles.
But as the data analyst explores the data, new questions come up, often only after initial surveys. The most important questions include:
There may be a succession of questions, none which could be predicted at the start of this exploration process. Hence, the need for the analyst to write his/her own ad hoc queries.
The artifacts of data exploration--new documentation
This data exploration is going to result in the analyst gaining some useful (and unexpected) insights into the behavior of the data, and additionally the behavior of the business which the data describes. This knowledge will have been acquired with considerable effort. And there is some synergy here. The further the analyst goes into the detail, the more interactions are discovered between data elements and columns. The analyst may intuitively start to see patterns or new relationships. That is valuable.
Yet, to help anyone else in the future who may wish to use this business data (if this database is still in production status) we need to document our discoveries. If a logical business data model exists, it may have to be modified. Although the architecture of the physical database may remain fairly constant, a variety of internal and external pressures may force the logical architecture of the business to morph--generally towards more complexity. And this greater complexity (e.g. new entities, new subtypes of existing entities, and perhaps new relationships or altered cardinalities) needs to be documented, ideally as reflected in an updated logical business data model.
Also, for each column (data field) it is appropriate to create some data behavior documentation--a collection of domain studies, at least one page for each field, showing what kinds of values actually appear in each column.
Ah, but this is tedious. It shouldn’t be. One approach is to just create a document in your text editor, and paste your findings into it, perhaps with some chapter organization named after the table, and the fields subordinate to tables. One must always balance the effort of documentation with the reasonable likelihood that someone else (or even yourself six months later) will come back and actually reference it. Lazy analysts sometimes prefer to underestimate the likelihood of someone needing such documentation.
Additionally, you may also need to go back and update the original data documentation (in case anyone is really looking at it). So in the example earlier in this article, we would have to chase down the cause for those 492 records with “I” in PROD_STAT_CD. And if it turns out to be another subtype, we may have to modify the logical data model (if one exists) with this new information about the business architecture.
The politics of data exploration
Digging into the data, you are probably going to discover values in the data which nobody has seen. You may find anomalous values, unexpected extreme values, or unexpected relationships between fields. That newly-discovered knowledge may be valuable. Those discoveries may be, in some circumstances, bad news to some people (such as those responsible for maintaining the application).
How you deliver that news (sometimes bad news) may influence what access (again, “read-only” access) you may have to that or subsequent databases. So you need to be tactful.
The language you use in introducing these discussions may be crucial. Instead of, “I’m finding garbage in your data…” you might depersonalize it with something like, “Help me understand what these values mean in this field. I don’t find any mention of them in the documentation.” This question, as non-threatening as possible, may evoke a response like, “Oh, yes, we added that status code seven years ago, but nobody had time to update the documentation.” Or, you may hear something like, “Yes, I remember the research department wanted to have a new group of products for some test they were doing.”
You never know what you will discover. You may have to show them examples (a custom-written formatted dump) of the records containing the anomalies. And as they verbally enlighten you as to the history of the legacy system and its data, you want to capture that in your documentation. Hence, perhaps a fourth item in documentation (especially if you do not have authority to update the original legacy data documentation) is explanation of anomalies.
These anomalies have significance, and may be part of the functional requirements for any data warehouse, or any new, replacement application.
What’s next?
In my next articles, we will look at the level-2 functions of an enterprise data inventory. This includes surveying table meaning and behavior, such as understanding age distribution of records on a master file, and the growth pattern of each table.
Recent articles by Michael Scofield
Michael is a widely known speaker and author on data quality and semantic data integration. He has held data architecture and data quality management positions in banking, finance and education. He has taught workshops for numerous organizations including the information quality conferences, numerous DAMA chapters, The Data Warehousing Institute, the Institute of Internal Auditors, chapters of the Quality Assurance Association, the Enterprise Data Forum, European Meta-data Conferences, Association of Computing Machinery, numerous DBMS user groups, and business intelligence tool conferences. His articles appear in numerous professional journals, and he writes occasional humor for the Los Angeles Times and other magazines. Michael can be reached at NMScofield@aol.com.