Following a review of the evolution of business needs and technology over the past two decades, the second article in this series introduced the Business Integrated Insight (BI2) architecture as an evolution of data warehousing / business intelligence. Part 3 drilled into the information layer, the Business Information Resource (BIR), and began to describe its structure. This article continues our exploration of the BIR, focusing on the knowledge density axis.
The Information Space
Figure 1: The Three Axes of the BIR Information Space
As described in Part 3 of this series and shown in Figure 1, the independent axes of the three-dimensional space along which information is positioned are:
- Timeliness / Consistency: Describes how information moves from creation for a specific purpose to broader, consistent usage and on to archival or deletion.
- Knowledge Density: Describes the journey of information from soft to hard and the related concept of the amount of knowledge embedded in it.
- Reliance / Usage: Describes the path of information from personal to enterprise usage and beyond and the implications for how far it can be relied upon.
This article deals with the knowledge density axis and the related concept of information structure. The reliance / usage axis will be considered in Part 5.
The KD Axis: Knowledge Density
The knowledge density axis is an enhanced view of the binary hard / soft information distinction first mentioned in Part 1 of this series. Now, here’s an intriguing thought: every piece of hard information you can think of was once soft. Address line in customer database: hard; customer sign-up form: soft. Order line in SAP: hard; customer call, email or whatever placing the order: soft. ATM transaction: hard; customer deciding to use ATM: very soft. The fact is that all hard data originates in the real world, which is far more loosely structured than anything in the hard world of computer data.
This thought may feel somewhat philosophical, but its relevance comes from the fact that all “data processing” (as IT used to be called) begins with a set of steps that gets us from soft information to hard. First, analyze the structure and content of the soft information describing a real-world event. Second, build a model of the key fields and their relationships that can be represented in a database (hard information). Third, create a process to ensure that the fields are filled every time the real-world event occurs. In this way, soft information is transformed into metadata (field names, meanings and relationships) and hard data (values).
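As a rough illustration of these three steps, consider the following minimal Python sketch. The order text, the `ORDER_FIELDS` model and the `harden` function are all hypothetical, invented for this example; real hardening processes are far more elaborate.

```python
import re

# Step 1: a piece of soft information describing a real-world event
# (hypothetical text, invented for illustration).
soft_order = "Hi, this is Ann Lee. Please send me 3 units of item A-100 by Friday. Thanks!"

# Step 2: a model of the key fields. The names and types are the metadata.
ORDER_FIELDS = {"customer": str, "quantity": int, "item_code": str}

# Step 3: a repeatable process that fills the fields each time the event occurs.
def harden(text):
    """Extract the modeled fields from free text; return None if any is missing."""
    customer = re.search(r"this is ([A-Z][a-z]+ [A-Z][a-z]+)", text)
    order = re.search(r"(\d+) units? of item ([A-Z]-\d+)", text)
    if not (customer and order):
        return None
    raw = {
        "customer": customer.group(1),
        "quantity": order.group(1),
        "item_code": order.group(2),
    }
    # Cast each extracted value to the type declared in the model.
    return {name: ORDER_FIELDS[name](value) for name, value in raw.items()}

hard_order = harden(soft_order)
print(hard_order)
```

Note that anything the model did not anticipate, such as the requested delivery day, never reaches the hard record.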
Note that I’m not proposing here that any or all soft information you come across should be transformed into hard information. Far from it; my thesis is the exact opposite. But what I am describing are the fundamentals of the process by which the data used in computer-based applications is constructed. Such hardening of information is a prerequisite for all traditional data processing, which requires very well-defined and understood facts, in largely numerical or categorical form, that can be created, updated, deleted, and manipulated (summarized, averaged, plotted) in precise, well-understood and repeatable ways.
From the above discussion, the knowledge density axis clearly represents a key IT process of structuring information into a form more amenable to computer processing. Elsewhere, I have called this the “structuredness” axis, but the spell-checker objects! The axis does represent the fact that the level of structure increases as we move along it. The label “knowledge density” reflects a perhaps more interesting observation that as we move from soft to hard information, more explicit knowledge is packed into smaller amounts of data. IT generally sees increasing knowledge density as a positive move—processing is faster, storage is smaller and algorithms are simpler. However, the downside of this process is that some of the more tacit knowledge residing in soft information can be easily lost. This happens when such knowledge has not been explicitly recognized in the modeling step, and thus is not captured in the hard data. The negotiations and documents exchanged leading to a large services contract, for example, contain far more information than is represented in the order entry line item. Such knowledge loss explains why it doesn’t make sense to convert all soft information to hard data and throw the original content away.
Soft information exists in varying levels of complexity, hence its label “multiplex” on the KD axis. The complexity ranges from the simplest form of plain text, through formatted text, audio and image to video. A future version of this architecture may find it useful to explicitly separate the textual and audio categories from image and video. At the opposite end of the spectrum, hard information falls into two classes: atomic and derived. Atomic data contains a single piece of information (or fact) per data item and is extensively modeled. It is the most basic and simple form of data, and the most amenable to traditional (numerical) computer processing. Derived data, also usually modeled, typically consists of multiple data items that are derived or summarized from atomic data; summarization may result in data loss.
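The difference between the two classes can be shown with a small Python sketch. The order rows and the `derive_totals` function are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Atomic data: one fact per data item (hypothetical order lines).
atomic_orders = [
    {"order_id": 1, "customer": "C1", "amount": 120.0},
    {"order_id": 2, "customer": "C1", "amount": 80.0},
    {"order_id": 3, "customer": "C2", "amount": 50.0},
]

def derive_totals(rows):
    """Summarize atomic rows into one derived total per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)

derived = derive_totals(atomic_orders)
print(derived)  # {'C1': 200.0, 'C2': 50.0}
```

The individual order amounts cannot be recovered from the derived totals alone, which is the data loss referred to above.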
As previously mentioned, hard data, both atomic and derived, stands separated from the metadata that describes it (and without which, it is meaningless!). But where does metadata reside in this model? Traditional data warehouse architectures place metadata off to one side of the main “business” information. BI2 explicitly defines metadata as a key component of the business information resource. And the compound class on the KD axis is where metadata resides. Strictly speaking, the compound class is defined as information that contains both hard and soft information elements together. Consider a “structured” text format such as XML. Raw textual and other soft information as well as tagged hard information reside in the same store. Tags express metadata and delineate structures within the soft information. Because of these properties, the compound class is ideal for business information that has intermediate or mixed levels of structure, as well as for metadata, which shows similar characteristics.
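To make the compound class concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The complaint document and its tag names are hypothetical, chosen only to show hard and soft elements living side by side.

```python
import xml.etree.ElementTree as ET

# A hypothetical compound document: tags carry metadata and hard values,
# while the untagged text remains soft.
doc = """<complaint id="42">
  <customer>C1</customer>
  <received>2010-05-03</received>
  <body>The delivery arrived late and the box was damaged.
  I would like a <request>refund</request> as soon as possible.</body>
</complaint>"""

root = ET.fromstring(doc)

# Hard elements: addressable by name, one value each.
hard = {
    "id": int(root.get("id")),
    "customer": root.findtext("customer"),
    "received": root.findtext("received"),
    "request": root.findtext("body/request"),
}

# Soft element: the full free text of the body, with tags stripped
# and whitespace normalized.
soft_text = " ".join("".join(root.find("body").itertext()).split())

print(hard)
print(soft_text)
```

The same store thus serves both a precise lookup by field name and a free-text reading of the original content.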
Technology Implications of the KD Axis
From a data warehousing point of view, soft information has long proven something of a challenge. The initial approach is to treat it like any other data source: extract, transform and load any such information required by the business into the warehouse. Job done! But, not so fast. Relational databases (RDBs) provide only limited support for text and BLOB (binary large object) data. Even if we confine ourselves to text, where RDBs have increased search and analysis function over the years, the function and ease of use that users have come to expect from Google-like tools for text is largely lacking. Native XML databases and XML extensions of general purpose RDBs offer more extensive text functionality. But the big problem is that the growth in volumes of soft information, both already seen and predicted for the future, militates strongly against any wholesale copying of soft information.
Content management systems (CMS) take a different approach to data warehouses when providing access to soft information. Rather than copying content from diverse sources, a CMS builds an inverted index containing pointers to all source occurrences of all significant words or phrases. In more advanced tools, various text analytic processes are also performed at this ingestion phase to add meaning via entity extraction, clustering, sentiment analysis, or classification. To some extent, each incoming piece of content is modeled on the fly. All later search and analysis activity occurs against this index with users routed to the required source only as the last step in the process. These indexes are the exact equivalent of metadata in a data warehouse environment, but their manner of use in CMSs shows how widely distributed soft information can be used without extensive copying into an informational environment.
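The core idea of an inverted index can be sketched in a few lines of Python. This is a simplified illustration, not a description of any particular CMS product; the source pointers, texts and stop-word list are hypothetical, and the text analytic steps mentioned above are omitted.

```python
import re
from collections import defaultdict

# Hypothetical content that stays in its source systems; keys act as pointers.
sources = {
    "mail/0017": "Customer C1 complains about a late delivery.",
    "wiki/returns": "Late delivery entitles the customer to a refund.",
    "call/0932": "C2 asked about refund status.",
}

STOP_WORDS = {"a", "about", "the", "to"}

def build_index(docs):
    """Map each significant word to the set of source pointers containing it."""
    index = defaultdict(set)
    for pointer, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(pointer)
    return index

def search(index, *words):
    """Return sorted pointers to sources that contain all the given words."""
    hits = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*hits)) if hits else []

index = build_index(sources)
print(search(index, "late", "delivery"))  # ['mail/0017', 'wiki/returns']
print(search(index, "refund"))            # ['call/0932', 'wiki/returns']
```

Searches run entirely against the index; the source content is touched only when a user follows a returned pointer.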
In real data warehouses, metadata has, despite our best intentions, a peripheral role. Of course, it is used in all queries to identify tables, columns and relationships. However, such metadata resides in the heart of the RDB and is seldom even considered explicitly beyond the modeling stage. And most of us are acutely aware of the long-standing limitations on building and using business meaning, data sourcing and cleansing metadata that plague most data warehouses. As a consequence, analysis of hard information occurs directly against that information itself and relies only marginally on the metadata.
So, why does analytics occur so differently against soft and hard information? While there is a difference at the end-point (hard information analysis goes to the deepest level of detail in the data), the process of searching for relevant information and joining across different sources is very similar. And the CMS approach of extensive use of metadata provides significant advantages in terms of context, agility and ease of use. These considerations lead to the concept of a “Unified Information Store” for soft and hard information and are discussed in depth in my recent white paper [1] on this topic. The conclusion reached there is that while hard and soft information will continue to be best served by different storage options, with neither one being copied to the other, business users do require a consolidated and integrated view of both. This view is best achieved through the novel, direct use of metadata, both what data warehousing traditionally considers metadata and the advanced inverted indexes created in the content management environment.
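One way to picture such a consolidated view is the following Python sketch. It is an assumption-laden illustration rather than the design described in the white paper: the `orders` table, the `content_index` entries and the `customer_view` function are hypothetical, and the entity tags are presumed to have been extracted when the content was indexed.

```python
# Hard information stays in its own store (a hypothetical order table).
orders = [
    {"order_id": 1, "customer": "C1", "amount": 120.0},
    {"order_id": 2, "customer": "C2", "amount": 50.0},
]

# Soft information stays at its sources; only index entries (metadata) are
# held centrally. Each entry points at a source and records extracted entities.
content_index = [
    {"pointer": "mail/0017", "entities": {"customer": "C1"}, "topic": "late delivery"},
    {"pointer": "call/0932", "entities": {"customer": "C2"}, "topic": "refund status"},
    {"pointer": "mail/0021", "entities": {"customer": "C1"}, "topic": "contract renewal"},
]

def customer_view(customer_id):
    """Assemble one view of a customer without copying either store into the other."""
    return {
        "customer": customer_id,
        "orders": [o for o in orders if o["customer"] == customer_id],
        "content": [
            {"pointer": e["pointer"], "topic": e["topic"]}
            for e in content_index
            if e["entities"].get("customer") == customer_id
        ],
    }

view = customer_view("C1")
print(view)
```

The shared customer identifier, present both in the hard data model and in the index entries, is the piece of metadata that joins the two worlds.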
The growing business need for meaningful convergence of hard and soft information has long presented data warehousing with particular challenges. Now, an understanding of the knowledge density axis of the BIR provides the basis for a modern approach to the problem. This approach steps back from copying soft information into the warehouse and instead provides integrated access through an enhanced and expanded set of metadata, which is an integral part of the Business Information Resource.
In the next article in this series, I’ll look at the Reliance / Usage axis.
End Notes:
1. Devlin, B., “Beyond the Data Warehouse: A Unified Information Store for Data and Content,” May 2010.
Recent articles by Barry Devlin