History, Truth and Data
Originally published August 5, 2009
Everyone who works in data management is aware that there are fundamental aspects of data that we do not fully understand. Normally, these aspects are not anything that immediately impacts our daily work, but in the long run, all data professionals would like to see data management on a sounder theoretical footing. The problem is, however, that although we can dimly perceive that there are things about data that we do not understand, it can be frustratingly difficult to even describe what it is we do not understand.
In the spirit of trying to get to the point where it might be possible to ask some clear questions, rather than provide any answers, I think it is worthwhile to consider the relationship between history, truth and data from a very general perspective.
What is History?
Let us begin with history. Just as there is a Science channel and a History channel on cable TV, so there is scientific knowledge and historical knowledge. I cannot get into the distinctions between them here, but they have an interesting history, at least in the West. Science started about 2,500 years ago in Greece and has been going in fits and starts ever since. This has given us quite a heritage, much of which we tend to take for granted today, but which has had an enormous effect on our civilization. Unfortunately, the ancient Greeks were anti-historical. They believed that Greece always existed as Greece and had never really developed or evolved. The Romans thought the same thing about Rome. As a result, the ancient world never produced any theoretical foundation of history in the same way that it bequeathed us the basis of modern science. It is true there were “historians” in the ancient world, but they usually tended to be more like journalists, recording eyewitness events.
Only in the 18th Century did Westerners realize that not only did human societies and institutions evolve, but that you could ask questions about the past and discover, at least sometimes, why this happened. Gibbon’s Decline and Fall of the Roman Empire became famous for proving this approach, not simply for its subject matter.
What this means for us is that the theory of history is vastly less mature than the theory of science. This is not very relevant to economic life if you are dealing with scientific problems rather than with historical problems (e.g., if you are an engineer). Unfortunately, data often seems to be more of a historical problem than a scientific one.
It is widely acknowledged that history consists of events, and in most enterprises, data management begins with events. An event is normally the execution of an instance of a predefined transaction, like a sale. Because of information technology, we can handle millions, or more, of these per day. However, once the transaction instance is over, it really is over, and the only residue of it that the enterprise still has access to is, in nearly all cases, the data.
Yet there is no assurance that the data for a transaction instance will be correct. When someone buys a pack of chocolate cookies and a pack of vanilla cookies, and the checkout clerk scans them as two packs of chocolate cookies, something has been lost and distorted. But since the event involved has passed into history, we cannot go and observe it again. Science cannot help, since there are no laws of the universe that predict cookie buying, unlike the ways we can work out the dates of eclipses in past history. The only residue of the past event that remains embedded and accessible in the present is the data that the checkout clerk generated.
So it appears we have one chance to get the data right, and no recourse if it is wrong. Now, suppose we capture it correctly – what can happen to this data? Basically, it can be preserved, it can be used, it can be lost, it can be added to, and it can be misunderstood. This brings us to one of history’s mysteries – the history of history.
The History of History
One of the fuzzily understood aspects of history is that it has its own history. The way we understand an event like the Vietnam War today is different from the way people understood it 25 years ago, and will probably be different again in another 25 years. This means that history is always changing, partly because of the way we perceive things, and thus interpret the data, and partly because new data emerges. The fact is that history is simply not constant.
This problem is reflected in data management. First, history was denied altogether, and still is in many places. Databases were designed and built to store only “current” data (whatever that is), and the relational theory had little if anything to say about history. Eventually, the data warehouse revolution forced us to account for history, but all too often, this has been confined to implementing Type II Slowly Changing Dimension designs, rather than having a holistic approach to history. Outside of data warehousing, the bitemporal design strategy has also been helpful. This has focused on maintaining both business effective dates and transaction effective dates. But is it enough?
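As a rough illustration of the bitemporal idea mentioned above, the sketch below keeps two date ranges on every row: one for when a fact was true in the business, and one for when the database asserted it. The product, prices, dates, and the names `PriceVersion` and `price_as_of` are all invented for this example, and real bitemporal designs vary considerably.

```python
from dataclasses import dataclass
from datetime import date

FOREVER = date.max  # open-ended range marker (an assumption of this sketch)


@dataclass(frozen=True)
class PriceVersion:
    product: str
    price: float
    valid_from: date      # business effective date: when the price applied
    valid_to: date        # exclusive end of business validity
    recorded_from: date   # transaction date: when the database began asserting the row
    recorded_to: date     # exclusive; FOREVER means "still asserted"


# Hypothetical history: on 1 March we learn the price really changed on 1 February.
rows = [
    # Original belief, superseded on 1 March but never deleted.
    PriceVersion("cookies", 2.50, date(2009, 1, 1), FOREVER,
                 date(2009, 1, 1), date(2009, 3, 1)),
    # Corrected belief, recorded on 1 March.
    PriceVersion("cookies", 2.50, date(2009, 1, 1), date(2009, 2, 1),
                 date(2009, 3, 1), FOREVER),
    PriceVersion("cookies", 2.75, date(2009, 2, 1), FOREVER,
                 date(2009, 3, 1), FOREVER),
]


def price_as_of(rows, product, business_date, knowledge_date):
    """Price on business_date, as the database believed it on knowledge_date."""
    for row in rows:
        if (row.product == product
                and row.valid_from <= business_date < row.valid_to
                and row.recorded_from <= knowledge_date < row.recorded_to):
            return row.price
    return None


# Same business date, two different states of knowledge.
print(price_as_of(rows, "cookies", date(2009, 2, 15), date(2009, 2, 20)))  # 2.5
print(price_as_of(rows, "cookies", date(2009, 2, 15), date(2009, 3, 10)))  # 2.75
```

The two answers for the same business date show the "history of history" in miniature: what the data said about 15 February depends on when the question is asked.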
Personally, I doubt that it is. Take metadata for instance. I have found little exploration of the idea that metadata has to be kept updated to reflect what data was thought to mean at various times and how this changed. Yet I have seen plenty of "overloaded" columns in databases, which obviously meant one thing at one time and another thing at another time. I have also seen computed and derived columns where the underlying calculations or derivations have changed over time because of sound business or regulatory reasons. Yet the metadata has not changed in step with this. If we cannot be sure of what the data meant at a particular point in time, we are on very uncertain ground – and that brings us to truth.
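One way to picture metadata that changes in step with the data is to date each definition of a column, just as data rows are dated. The sketch below is only an illustration under invented assumptions: the column name `customer.status_code`, its two meanings, and the dates are hypothetical, and it does not describe any particular metadata repository.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical definition history for one "overloaded" column.
# Each entry is (effective_from, meaning), sorted by effective_from.
COLUMN_MEANINGS = {
    "customer.status_code": [
        (date(2001, 1, 1), "Credit rating band assigned by finance"),
        (date(2006, 7, 1), "Marketing segment; credit rating no longer stored here"),
    ],
}


def meaning_as_of(column, on_date):
    """Return what the column was understood to mean on on_date, or None."""
    history = COLUMN_MEANINGS.get(column, [])
    starts = [start for start, _ in history]
    # Index of the last definition whose start date is on or before on_date.
    idx = bisect_right(starts, on_date) - 1
    return history[idx][1] if idx >= 0 else None


print(meaning_as_of("customer.status_code", date(2003, 5, 1)))
print(meaning_as_of("customer.status_code", date(2008, 5, 1)))
```

A value read from the column would then be interpreted using the definition in force on the date the value was recorded, rather than whatever the column means today.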
And What About Truth?
The issue of historical data gets even more difficult when we consider truth. What is truth – at least, when it comes to data? One answer comes from traditional logic. Logic has been defined as the science which treats of the conceptual representation of the real order (George Hayward Joyce, Principles of Logic, 1908). Within logic, Aristotle defined truth as follows:
In other words, the data that is under our care is supposed to correspond exactly to things in the real world. This is what is known as the Correspondence Theory of Truth. But how often do we stop to think about just how true our data is? Even when we consider data quality problems, the subject of truth does not come up very often. Perhaps we would feel uncomfortable if it did.
On the other hand, why should we? Aristotle's definition of truth in logic (and thus data) is a throwback to the civilization of ancient Greece which explicitly affirmed science and implicitly denied history. It is true we can match up our data to a thing that exists in the real world, but it is impossible to match up our data to an event that has passed into history and is beyond our reach to observe and measure. Data is really the only thing that remains from such events. Perhaps, therefore, for historical data, the best we can hope for is that truth is the consistency within data, our fidelity in recording and managing it, and our capacity to truly interpret it.
Yet even this is difficult to say with certainty. What it reveals is that we have remarkably shallow and shaky theoretical foundations for data that relates to historical events. We have some empirical guides that have emerged over the past couple of decades, but astonishingly little beyond that. None of this is trivial. The practical problems of data management in the modern world are so urgent that we really do need help from the academics whose job it is to work these things out. Let us hope it will soon be coming.
Recent articles by Malcolm Chisholm
Copyright 2004 — 2019. Powell Media, LLC. All rights reserved.