Originally published December 20, 2012
On one of our current projects involving the absorption and analysis of a very large number of records, we have discovered some common “failure” patterns in the source data that are somewhat predictable, albeit frustrating, and that may be dependent on each other. For example, some of the records seem to have extra fields and values inserted in the middle of the record. In others, data values are misfielded, with values that should be in one column shifted into the next column. Other times, the field lengths are off, with some value sizes far exceeding what was presented in the metadata.
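To make those patterns concrete, here is a minimal sketch of the kind of structural check that can surface them, assuming a hypothetical pipe-delimited layout; the field names, count, and length limits are invented for illustration, not taken from any real feed.

```python
import csv
from io import StringIO

# Hypothetical layout for a pipe-delimited feed: the field count and
# per-field maximum lengths the metadata promises. All invented.
EXPECTED_FIELDS = 4
MAX_LENGTHS = [10, 30, 30, 8]  # e.g. id, name, address, zip

def check_record(line_no, fields):
    """Flag extra/missing fields and values that overflow the metadata."""
    problems = []
    if len(fields) != EXPECTED_FIELDS:
        problems.append(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    for i, value in enumerate(fields[:EXPECTED_FIELDS]):
        if len(value) > MAX_LENGTHS[i]:
            problems.append(f"field {i} length {len(value)} exceeds {MAX_LENGTHS[i]}")
    return [(line_no, p) for p in problems]

# The second record has an extra field inserted in the middle, which
# also pushes an oversized value into the zip position.
sample = ("123|Smith|12 Elm St|02139\n"
          "124|Jones|Suite 400|99 Long Street Name That Overflows|02140\n")
issues = []
for n, fields in enumerate(csv.reader(StringIO(sample), delimiter="|"), start=1):
    issues.extend(check_record(n, fields))
print(issues)
# [(2, 'expected 4 fields, got 5'), (2, 'field 3 length 34 exceeds 8')]
```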
These are just a few examples, and they are very common; we have experienced the same or similar data mishaps numerous times. They shed some light on the process the data suppliers must have gone through in preparing the data for delivery to the data consumers. Perhaps the data extraction process changed midstream, and different fields were being pulled. Perhaps there are instances of the field separator character inside string values that inadvertently create extra fields, as the sketch below illustrates. Perhaps some field values are enclosed in quotation marks while other similar values are not. These are just a few of the scenarios that might have happened.
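A tiny illustration of the embedded-separator scenario, with invented values: a naive split on the delimiter manufactures an extra field, while a quote-aware parse of the properly quoted record does not.

```python
import csv
from io import StringIO

# Invented records: id, name, address, zip. The address contains the
# field separator (a comma); one copy is quoted, the other is not.
quoted   = '123,Smith,"12 Elm St, Apt 9",02139'
unquoted = '123,Smith,12 Elm St, Apt 9,02139'

# A naive split treats every comma as a separator, so the unquoted
# record gains a fifth field -- the "extra fields inserted in the
# middle of the record" pattern.
print(unquoted.split(","))
# ['123', 'Smith', '12 Elm St', ' Apt 9', '02139']

# A quote-aware parser keeps the embedded comma inside one field.
print(next(csv.reader(StringIO(quoted))))
# ['123', 'Smith', '12 Elm St, Apt 9', '02139']
```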
And none of these scenarios is relevant. First, by the time we have acquired the data, it is far removed from the original source. Second, we have no influence at all over that source. The data extraction was performed for a particular operational purpose, and we are simply analyzing the data for our own task, one that, like so many analyses, is far removed from the original operational activity.
The existence of flawed records in a small data set is sometimes addressable. If you have a well-defined process for data cleansing coupled with data stewards to review the changes, it may be possible to prepare the data for analysis even in the presence of source data flaws. But what happens when the data set is huge? As an example, consider a data set with 5 billion records. If only one half of one percent of the records is known to be flawed, that amounts to 25,000,000 known bad records. Ten years ago most people would have considered that number of records by itself to be a big data set; today we can consider it just the noisy bad records. Actually, let’s ratchet that down to 0.1% (one tenth of one percent) – that is still 5,000,000 presumably unusable records.
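For the record, the arithmetic behind those counts is easy to reproduce:

```python
records = 5_000_000_000  # 5 billion records

# Even tiny error rates leave an enormous absolute number of bad records.
for rate in (0.005, 0.001):  # one half of one percent, one tenth of one percent
    print(f"{rate:.1%} of {records:,} = {int(records * rate):,} bad records")
# 0.5% of 5,000,000,000 = 25,000,000 bad records
# 0.1% of 5,000,000,000 = 5,000,000 bad records
```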
So here is the question: What are we supposed to do with flawed data when the scale of those errors exceeds our ability to deal with them effectively? Here are a few choices: