

History, Truth and Data

Originally published August 5, 2009

Everyone who works in data management is aware that there are fundamental aspects of data that we do not fully understand. Normally, these aspects do not immediately impact our daily work, but in the long run, all data professionals would like to see data management on a sounder theoretical footing. The problem, however, is that although we can dimly perceive that there are things about data we do not understand, it can be frustratingly difficult even to describe what it is we do not understand.

In the spirit of trying to get to the point where it might be possible to ask some clear questions, rather than provide any answers, I think it is worthwhile to consider the relationship between history, truth and data from a very general perspective.

What is History?

Let us begin with history. Just as there is a Science channel and a History channel on cable TV, so there is scientific knowledge and historical knowledge. I cannot get into the distinctions between them here, but they have an interesting history, at least in the West. Science started about 2,500 years ago in Greece and has been going in fits and starts ever since. This has given us quite a heritage, much of which we tend to take for granted today, but which has had an enormous effect on our civilization. Unfortunately, the ancient Greeks were anti-historical. They believed that Greece had always existed as Greece and had never really developed or evolved. The Romans thought the same thing about Rome. As a result, the ancient world never produced any theoretical foundation of history in the same way that it bequeathed us the basis of modern science. It is true there were “historians” in the ancient world, but they tended to be more like journalists, recording eyewitness events.

Only in the 18th Century did Westerners realize not only that human societies and institutions evolve, but that you could ask questions about the past and discover, at least sometimes, why this happened. Gibbon’s Decline and Fall of the Roman Empire became famous for demonstrating this approach, not simply for its subject matter.

What this means for us is that the theory of history is vastly less mature than the theory of science. This is not very relevant to economic life if you are dealing with scientific problems rather than with historical problems (e.g., if you are an engineer). Unfortunately, data often seems to be more of a historical problem than a scientific one.

Events

It is widely acknowledged that history consists of events, and in most enterprises, data management begins with events. An event is normally the execution of an instance of a predefined transaction, like a sale. Because of information technology, we can handle millions or more of these per day. However, once the transaction instance is over, it really is over, and in nearly all cases the only residue of it that remains accessible to the enterprise is the data.

Yet there is no assurance that the data for a transaction instance will be correct. When someone buys a pack of chocolate cookies and a pack of vanilla cookies, and the checkout clerk scans them as two packs of chocolate cookies, something has been lost and distorted. But since the event involved has passed into history, we cannot go and observe it again. Science cannot help, since there are no laws of the universe that predict cookie buying in the way we can work out the dates of eclipses in past history. The only residue of the past event that remains embedded and accessible in the present is the data that the checkout clerk generated.

So it appears we have one chance to get the data right, and no recourse if it is wrong. Now, suppose we capture it correctly – what can happen to this data? Basically, it can be preserved, it can be used, it can be lost, it can be added to, and it can be misunderstood. This brings us to one of history’s mysteries – the history of history.

The History of History

One of the fuzzily understood aspects of history is that it has its own history. The way we understand an event like the Vietnam War today is different from the way people understood it 25 years ago, and it will probably be different again in another 25 years. This means that history is always changing, partly because of the way we perceive things, and thus interpret the data, and partly because new data emerges. The fact is that history is simply not constant.

This problem is reflected in data management. First, history was denied altogether, and still is in many places. Databases were designed and built to store only “current” data (whatever that is), and the relational theory had little if anything to say about history. Eventually, the data warehouse revolution forced us to account for history, but all too often, this has been confined to implementing Type II Slowly Changing Dimension designs, rather than having a holistic approach to history. Outside of data warehousing, the bitemporal design strategy has also been helpful. This has focused on maintaining both business effective dates and transaction effective dates. But is it enough?
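Before trying to answer that, it may help to make the bitemporal idea a little more concrete. The following is only an illustrative sketch in Python; the record and field names are invented rather than taken from any standard schema. Each fact carries a business-effective period and a recorded (transaction) period, and can be queried "as of" a date on both timelines:

    from dataclasses import dataclass
    from datetime import date
    from typing import List, Optional

    @dataclass
    class AddressVersion:
        # Hypothetical bitemporal row; names are illustrative only.
        customer_id: int
        address: str
        business_from: date           # when the address became true in the real world
        business_to: Optional[date]   # None means still true
        recorded_from: date           # when this version was stored
        recorded_to: Optional[date]   # None means still the current belief

    def _covers(start: date, end: Optional[date], d: date) -> bool:
        # Half-open period check: start <= d < end (or no end yet).
        return start <= d and (end is None or d < end)

    def as_of(rows: List[AddressVersion], business_date: date, recorded_date: date) -> List[AddressVersion]:
        # Versions believed true on business_date, as the database understood them on recorded_date.
        return [r for r in rows
                if _covers(r.business_from, r.business_to, business_date)
                and _covers(r.recorded_from, r.recorded_to, recorded_date)]

Keeping the two timelines separate means a correction changes only the recorded period: the business dates say when something was true in the world, while the recorded dates say when the database believed it. So the machinery for storing history exists. Is it enough?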

Personally, I doubt that it is. Take metadata, for instance. I have found little exploration of the idea that metadata has to be kept updated to reflect what data was thought to mean at various times and how this changed. Yet I have seen plenty of "overloaded" columns in databases, which obviously meant one thing at one time and another thing at another time. I have also seen computed and derived columns where the underlying calculations or derivations have changed over time for sound business or regulatory reasons. Yet the metadata has not changed in step with this. If we cannot be sure of what the data meant at a particular point in time, we are on very uncertain ground – and that brings us to truth.
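Before leaving metadata, here is one way such time-variant definitions might be kept. Again, this is only a sketch; the table, column, and definitions are invented for illustration. The idea is simply that the recorded meaning of a column can be looked up as of the date a value was captured:

    from dataclasses import dataclass
    from datetime import date
    from typing import List, Optional

    @dataclass
    class ColumnDefinitionVersion:
        # Hypothetical effective-dated metadata entry (names invented).
        table: str
        column: str
        definition: str
        effective_from: date
        effective_to: Optional[date]  # None means this is the current definition

    history: List[ColumnDefinitionVersion] = [
        ColumnDefinitionVersion("orders", "discount_amt",
                                "Flat discount in US dollars",
                                date(2001, 1, 1), date(2006, 7, 1)),
        ColumnDefinitionVersion("orders", "discount_amt",
                                "Discount as a percentage of list price",
                                date(2006, 7, 1), None),
    ]

    def meaning_at(versions, table, column, when):
        # What did this column mean on the given date, according to the metadata?
        for v in versions:
            if (v.table, v.column) == (table, column) \
                    and v.effective_from <= when \
                    and (v.effective_to is None or when < v.effective_to):
                return v.definition
        return None

With something like this in place, a value captured in 2003 can at least be read against the definition that was in force in 2003, rather than against whatever the column happens to mean today.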

And What About Truth?

The issue of historical data gets even more difficult when we consider truth. What is truth – at least, when it comes to data? One answer comes from traditional logic. Logic has been defined as the science which treats of the conceptual representation of the real order (George Hayward Joyce, Principles of Logic, 1908). Within logic, Aristotle defined truth as follows:

To say of what is that it is not, or of what is not that it is, is false, while to say of what is that it is, and of what is not that it is not, is true (Metaphysics 1011b25).

In other words, the data that is under our care is supposed to correspond exactly to things in the real world. This is what is known as the Correspondence Theory of Truth. But how often do we stop to think about just how true our data is? Even when we consider data quality problems, the subject of truth does not come up very often. Perhaps we would feel uncomfortable if it did.

On the other hand, why should we? Aristotle's definition of truth in logic (and thus data) is a throwback to the civilization of ancient Greece, which explicitly affirmed science and implicitly denied history. It is true we can match up our data to a thing that exists in the real world, but it is impossible to match up our data to an event that has passed into history and is beyond our reach to observe and measure. Data is really the only thing that remains from such events. Perhaps, therefore, for historical data, the best we can hope for is that truth is the consistency within data, our fidelity in recording and managing it, and our capacity to truly interpret it.

Yet even this is difficult to say with certainty. What it reveals is that we have remarkably shallow and shaky theoretical foundations for data that relates to historical events. We have some empirical guides that have emerged over the past couple of decades, but astonishingly little beyond that. None of this is trivial. The practical problems of data management in the modern world are so urgent that we really do need help from the academics whose job it is to work these things out. Let us hope it will soon be coming.

  • Malcolm Chisholm

    Malcolm Chisholm, Ph.D., has more than 25 years of experience in enterprise information management and data management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management, and business rules. His experience includes the financial, manufacturing, government, and pharmaceutical industries. He is the author of the books: How to Build a Business Rules Engine; Managing Reference Data in Enterprise Databases; and Definition in Information Management. Malcolm writes numerous articles and is a frequent presenter at industry events. He runs the websites http://www.refdataportal.com; http://www.bizrulesengine.com; and
    http://www.data-definition.com. Malcolm is the winner of the 2011 DAMA International Professional Achievement Award.

    He can be contacted at mchisholm@refdataportal.com.
    Twitter: MDChisholm
    LinkedIn: Malcolm Chisholm



Comments


Posted August 11, 2009 by Seth Grimes grimes@altaplana.com

In my reading, Aristotle did not explicitly affirm science.  He promoted truth by assertion rooted in rationalization rather than in empirical study and experiment.  Aristotelianism, especially as adopted and developed by the church, retarded Western scientific progress for the better part of 2,000 years.


Posted August 11, 2009 by Neil Raden nraden@hiredbrains.com

I typed this once before but it didn't show up. I hope they don't both appear.

First of all, the title was a real grabber. It reminded me of A.J. Ayer's "Language, Truth and Logic." Was that deliberate?

Your version of history, however, reveals a common, Western prejudice. The Chinese were excellent scientists at least 3000 years ago, and the ancient Sumerians were meticulous astronomers over 5000 years ago, giving us the 360-degree circle, 60 seconds and 60 minutes. Why they didn't choose 365 1/4 degrees to a circle I can't say, but Immanuel Velikovsky had a pretty good idea (though the late, great charlatan of science, Carl Sagan, did his best to squelch it).

Part of the problem of assembling facts about history from "events" in operational systems is that not only is too much detail lost in those systems, much of it is never captured, a not-too-distant echo of our resource-constrained, managing-from-scarcity mentality. We have the resources now to capture and understand events and sub-events and keep them. How they are assembled into a tableau of history is, as you point out, subjective, and there are only so many problems we can solve with technology. But if you keep the bits around, you at least have a chance of modeling something that approaches the phenomena you are trying to understand.

Neil Raden

Hired Brains


Posted August 5, 2009 by George Allen

This was very thought provoking, for me.  We deal with a set of artifacts, remnants of some past events, and we gather those together in packages we reveal as truth, but can only be, at best, theories.  Much as archaeologists deal with their own artifacts, we need to develop data models that embrace conjecture, that can posit a variety of "truths" each weighted by various tangible factors, so we can present our knowledge with the caveats that will help the decision makers understand what they see.

And, as archaeologists do, our systems need to "learn" over time, to develop rules that better ferret out the outliers in our data and help us add greater weight to one conclusion over others.

We still deal with GIGO, but we seem to have come to expect absolutism.  Since our data world is based on 0 and 1, black and white, we expect that what we are being presented with is one or the other.  Controls at the point of entry can assist us in containing some level of error, like a 2-dimensional barcode on an item which distinguishes it from all others, thus preventing or signalling duplication if scanned.  But we still have these humans involved, and as computing devices, humans are notoriously fallible.

That's what makes my job so dang interesting. 

George


Posted August 5, 2009 by Benjamin Taub btaub@dataspace.com

Really nicely done, Malcolm.  You've read Herodotus, haven't you?

Are you advocating that the relational model is no longer appropriate for accurate capture of history?

-- Ben
