From Business Intelligence to Enterprise IT Architecture, Part 4

Originally published July 7, 2010

Following a review of the evolution of business needs and technology over the past two decades, the second article in this series introduced the Business Integrated Insight (BI2) architecture as an evolution of data warehousing / business intelligence. Part 3 drilled into the information layer, the Business Information Resource (BIR) and began to describe its structure. This article continues our exploration of the BIR, focusing on the knowledge density axis.

The Information Space



Figure 1: The Three Axes of the BIR Information Space

As described in Part 3 of this series and shown in Figure 1, the independent axes of the three-dimensional space along which information is positioned are:
  1. Timeliness / Consistency: Describes how information moves from creation for a specific purpose to broader, consistent usage and on to archival or deletion.
  2. Knowledge Density: Describes the journey of information from soft to hard and the related concept of the amount of knowledge embedded in it.
  3. Reliance / Usage: Describes the path of information from personal to enterprise usage and beyond and the implications for how far it can be relied upon.
This article deals with the knowledge density axis and the related concept of information structure. The reliance / usage axis will be considered in Part 5.

The KD Axis: Knowledge Density

The knowledge density axis is an enhanced view of the binary hard / soft information distinction first mentioned in Part 1 of this series. Now, here’s an intriguing thought: every piece of hard information you can think of was once soft. Address line in customer database: hard; customer sign-up form: soft. Order line in SAP: hard; customer call, email or whatever placing the order: soft. ATM transaction: hard; customer deciding to use ATM: very soft. The fact is that all hard data originates in the real world, which is far more loosely structured than anything in the hard world of computer data. This thought may feel somewhat philosophical, but its relevance comes from the fact that all “data processing” (as IT used to be called) begins with a set of steps that gets us from soft information to hard. First, analyze the structure and content of the soft information describing a real world event. Second, build a model of the key fields and their relationships that can be represented in a database (hard information). Third, create a process to ensure that the fields are filled every time the real world event occurs. In this way, soft information is transformed into metadata (field names, meanings and relationships) and hard data (values).
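To make this hardening process concrete, here is a minimal sketch of the three steps applied to a free-text order message. All field names, the data model and the pattern-matching rules are my own illustrative assumptions, not part of BI2 or any particular product; a real implementation would be far richer.

```python
import re
from dataclasses import dataclass

# Step 2 output: the "hard" model chosen during the modeling step.
@dataclass
class OrderLine:
    customer: str
    quantity: int
    product: str

def harden(soft_text: str) -> OrderLine:
    """Step 3: fill the modeled fields from a soft, free-text message.
    Anything the model did not anticipate (tone, context, caveats)
    is simply not captured -- the tacit knowledge is lost."""
    m = re.search(r"(?P<qty>\d+)\s+(?P<product>[\w-]+)", soft_text)
    customer = re.search(r"from (?P<name>[A-Z][\w']+)", soft_text)
    return OrderLine(
        customer=customer.group("name") if customer else "UNKNOWN",
        quantity=int(m.group("qty")),
        product=m.group("product"),
    )

line = harden("Order from Acme: please ship 12 widgets as soon as you can.")
print(line)  # OrderLine(customer='Acme', quantity=12, product='widgets')
```

Note how the urgency expressed in “as soon as you can” never makes it into the hard data: the model has no field for it.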

Note that I’m not proposing here that any or all soft information you come across should be transformed into hard information. Far from it; my thesis is the exact opposite. But, what I am describing are the fundamentals of the process by which the data used in computer-based applications is constructed. Such hardening of information is a prerequisite for all traditional data processing, which requires very well-defined and understood facts, in largely numerical or categorical form, that can be created, updated, deleted, and manipulated (summarized, averaged, plotted) in precise, well-understood and repeatable ways.

From the above discussion, the knowledge density axis clearly represents a key IT process of structuring information into a form more amenable to computer processing. Elsewhere, I have called this the “structuredness” axis, but the spell-checker objects! The axis does represent the fact that the level of structure increases as we move along it. The label “knowledge density” reflects a perhaps more interesting observation that as we move from soft to hard information, more explicit knowledge is packed into smaller amounts of data. IT generally sees increasing knowledge density as a positive move—processing is faster, storage is smaller and algorithms are simpler. However, the downside of this process is that some of the more tacit knowledge residing in soft information can be easily lost. This happens when such knowledge has not been explicitly recognized in the modeling step, and thus is not captured in the hard data. The negotiations and documents exchanged leading to a large services contract, for example, contain far more information than is represented in the order entry line item. Such knowledge loss explains why it doesn’t make sense to convert all soft information to hard data and throw the original content away.

Soft information exists in varying levels of complexity: hence its label “multiplex” on the KD axis. The complexity ranges from the simplest form of plain text, through formatted text, audio and image to video. A future version of this architecture may find it useful to explicitly separate the textual and audio categories from image and video. At the opposite end of the spectrum, hard information falls into two classes, atomic and derived. Atomic data contains a single piece of information (or fact) per data item and is extensively modeled. It is the most basic and simple form of data, and the most amenable to traditional (numerical) computer processing. Derived data, also usually modeled, typically consists of multiple data items that are derived or summarized from atomic data; the latter process may result in data loss.
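The relationship between the atomic and derived classes, and the data loss that summarization can entail, can be sketched in a few lines. The transactions and the grouping keys below are hypothetical examples of my own, chosen only to illustrate the point:

```python
from collections import defaultdict

# Atomic data: a single fact per data item (hypothetical ATM withdrawals).
atomic = [
    {"account": "A1", "date": "2010-07-01", "amount": 50},
    {"account": "A1", "date": "2010-07-01", "amount": 20},
    {"account": "B2", "date": "2010-07-01", "amount": 100},
]

# Derived data: summarized from atomic data. The individual withdrawals
# (50 and 20) can no longer be recovered from the total -- data loss.
derived = defaultdict(int)
for row in atomic:
    derived[(row["account"], row["date"])] += row["amount"]

print(dict(derived))
# {('A1', '2010-07-01'): 70, ('B2', '2010-07-01'): 100}
```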

As previously mentioned, hard data, both atomic and derived, stands separated from the metadata that describes it (and without which, it is meaningless!). But, where does metadata reside in this model? Traditional data warehouse architectures place metadata off to one side of the main “business” information. BI2 explicitly defines metadata as a key component of the business information resource. And the compound class on the KD axis is where metadata resides. Strictly speaking, the compound class is defined as information that contains both hard and soft information elements together. Consider a “structured” text format such as XML. Raw textual and other soft information as well as tagged hard information reside in the same store. Tags express metadata and delineate structures within the soft information. Because of these properties, the compound class is ideal for business information that has intermediate or mixed levels of structure, as well as for metadata, which shows similar characteristics.
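A small example may help show how the compound class keeps hard and soft elements together. The XML document below is an invented illustration (the tag names and attributes are my assumptions): tagged hard values sit inside soft narrative text, with the tags themselves acting as metadata.

```python
import xml.etree.ElementTree as ET

# A compound document: soft prose with embedded, tagged hard values.
doc = """<note>
  Customer <customer id="C042">Acme Corp</customer> called to order
  <quantity>12</quantity> units of <product sku="W-7">widgets</product>,
  mentioning they were unhappy with last month's delivery delays.
</note>"""

root = ET.fromstring(doc)
# The tags delineate hard values that can be processed conventionally...
hard = {el.tag: (el.text, el.attrib) for el in root}
print(hard["quantity"][0])  # '12'
# ...while the surrounding prose (the complaint about delays) remains
# soft, preserved in the same store rather than being discarded.
```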

Technology Implications of the KD Axis

From a data warehousing point of view, soft information has long proven somewhat of a challenge. The initial approach is to treat it like any other data source—extract, transform and load any such information required by the business into the warehouse. Job done! But, not so fast. Relational databases (RDBs) provide only limited support for text and BLOB (binary large object) data. Even if we confine ourselves to text, where RDBs have added search and analysis functionality over the years, the power and ease of use that users have come to expect from Google-like text tools are largely lacking. Native XML databases and XML extensions of general purpose RDBs offer more extensive text functionality. But the big problem is that the growth in volumes of soft information, already seen and predicted to continue, militates strongly against any wholesale copying of soft information.

Content management systems (CMS) take a different approach from data warehouses when providing access to soft information. Rather than copying content from diverse sources, a CMS builds an inverted index containing pointers to all source occurrences of all significant words or phrases. In more advanced tools, various text analytic processes are also performed at this ingestion phase to add meaning via entity extraction, clustering, sentiment analysis, or classification. To some extent, each incoming piece of content is modeled on the fly. All later search and analysis activity occurs against this index, with users routed to the required source only as the last step in the process. These indexes are the exact equivalent of metadata in a data warehouse environment, but their manner of use in CMSs shows how widely distributed soft information can be used without extensive copying into an informational environment.
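The essence of the inverted-index approach can be sketched in a few lines. The source documents and identifiers below are invented for illustration; real CMS indexes add ranking, phrase handling and the analytic enrichments mentioned above.

```python
import re
from collections import defaultdict

# Hypothetical distributed sources, indexed in place rather than copied.
sources = {
    "crm/note-17": "Acme unhappy about widget delivery delays",
    "mail/msg-88": "Acme ordered twelve widgets last week",
    "wiki/page-3": "Widget packaging and delivery guidelines",
}

# Ingestion: build the inverted index, word -> set of source pointers.
index = defaultdict(set)
for doc_id, text in sources.items():
    for word in re.findall(r"\w+", text.lower()):
        index[word].add(doc_id)

# Search runs against the index alone; the source documents are
# touched only as the last step, when the user follows a pointer.
hits = index["widget"] & index["delivery"]
print(sorted(hits))  # ['crm/note-17', 'wiki/page-3']
```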

In real data warehouses, metadata has, despite our best intentions, a peripheral role. Of course, it is used in all queries to identify tables, columns and relationships. However, such metadata resides in the heart of the RDB and is seldom even considered explicitly beyond the modeling stage. And most of us are acutely aware of the long-standing limitations on building and using business meaning, data sourcing and cleansing metadata that plague most data warehouses. As a consequence, analysis of hard information occurs directly against that information itself and relies only marginally on the metadata.

So, why does analytics occur so differently against soft and hard information? While there is a difference at the end-point—hard information analysis goes to the deepest level of detail in the data—the process of searching for relevant information and joining across different sources is very similar. And the CMS approach of extensive use of metadata provides significant advantages in terms of context, agility and ease of use. These considerations lead to the concept of a “Unified Information Store” for soft and hard information and are discussed in depth in my recent white paper [1] on this topic. The conclusion reached there is that while hard and soft information will continue to be best served by different storage options, with neither one being copied to the other, business users do require a consolidated and integrated view of both. This view is best achieved through the novel, direct use of metadata, both what data warehousing traditionally considers metadata and the advanced inverted indexes created in the content management environment.

Conclusion

The growing business need for meaningful convergence of hard and soft information has long presented data warehousing with particular challenges. Now, an understanding of the knowledge density axis of the BIR provides the basis for a modern approach to the problem. This approach steps back from copying soft information into the warehouse and instead provides integrated access through an enhanced and expanded set of metadata, which is an integral part of the Business Information Resource.

In the next article in this series, I’ll look at the Reliance / Usage axis.

End Notes:
  1. Devlin, B., “Beyond the Data Warehouse: A Unified Information Store for Data and Content,” May 2010.
  • Barry Devlin
    Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

    Barry's interest today covers the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with an holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

    Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.



Comments


Posted August 26, 2010 by Barry Devlin

Hi Alexandru, thanks for your comment, and glad to see you've read my BI Journal article as well.

Well spotted on the layout direction of categories on the Knowledge Density axis!  As I've said in the articles, this is a work in progress and I'm always interested in input on how best to represent the different aspects of the BI2 architecture.  Regarding this axis, I've received conflicting views from different people about the best way to lay it out - particularly as I've expanded the label in the diagram to include "Structure" as well as "Knowledge Density".  The direction of the arrow and the order of the categories therefore need to represent conflicting concepts, and I've been trying different ways of doing this - hence the different representations between this series and the BI Journal article.

I agree that the "Atomic" --> "Multiplex" direction feels more comfortable to BI folks, and it was my first idea in drawing the axis.  But, if you think about it, both Structure and Knowledge Density decrease in this direction: atomic information is more structured and has higher knowledge density than multiplex information.  Also, given the discussion about how multiplex information is the conceptual "source" of all of the other categories, it seems that for human information processing, multiplex information is the way we "naturally" work.  So, for these reasons, I reversed the direction of the arrow and the categories.

Does that make sense?  If you, or any other reader has a view on this, I'd love to hear it.

Regards, Barry.


Posted August 21, 2010 by Alexandru Draghici alexandru.draghici@gmx.de

Dear Barry,

It seems that your "Knowledge Density" axis is reversed: the values should be written the other way around, first atomic, then derived, etc.

In your article in TDWI's Business Intelligence Journal, the axis is correct.

It would be good to replace the diagram.

Regards, Alexandru
