On one of our current projects involving the ingestion and analysis of a very large number of records, we have discovered some common “failure” patterns in the source data that are somewhat predictable, albeit frustrating. These failures are not uncommon, and they may be somewhat dependent on one another. For example, some of the records seem to have extra fields and values inserted in the middle of the record. In others, data values are misfielded, with values that should be in one column shifted into the next column. In still others, the field lengths are off, with some value sizes far exceeding what was specified in the metadata.
These are just a few examples, and they are actually very common; we have experienced the same or similar data mishaps numerous times. These examples shed some light on the process the data suppliers must have gone through in preparing the data for delivery to the data consumers. Perhaps the data extraction process changed midstream, and different fields were being pulled. Perhaps there are instances of the field separator character inside string values that inadvertently create extra fields. Perhaps some field values are surrounded by quotation marks while other similar values are not. These are just a few of the scenarios that might have happened.
And none of these scenarios is relevant. First of all, by the time we have acquired the data, it is far removed from the original source. Second, we have no influence at all over the original source. The data extraction was performed for a particular operational purpose, and we are just analyzing the data for our own task that, much like so many analyses, is far removed from the original operational activity.
The existence of flawed records in a small data set is sometimes addressable. If you have a well-defined process for data cleansing coupled with data stewards to review the changes, it may be possible to prepare the data for the analysis even in the presence of source data flaws. But what happens when the data set is huge? As an example, consider a data set with 5 billion records. If only one half of one percent of the records is known to be flawed, that amounts to 25,000,000 known bad records. Ten years ago most people would have considered that number of records by itself to be a big data set, and today we can consider it to be just the noisy bad records. Actually, let’s ratchet that down to 0.1% (one tenth of one percent) – that is still 5,000,000 presumably unusable records.
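The back-of-the-envelope arithmetic above is easy to reproduce; a quick sketch (the record count and error rates are just the figures from the example):

```python
# Flawed-record counts from the example: 5 billion records at two error rates.
total_records = 5_000_000_000

for error_rate in (0.005, 0.001):  # 0.5% and 0.1%
    flawed = int(total_records * error_rate)
    print(f"{error_rate:.1%} of {total_records:,} records -> {flawed:,} flawed records")
```

Even the smaller rate leaves millions of records that no manual stewardship process can realistically review.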
So here is the question: What are we supposed to do with flawed data when the scale of those errors exceeds our ability to deal with them effectively? Here are a few choices:
- Fix the errors. As someone who preaches constantly about eliminating the root cause instead of just correcting bad data, it is somewhat disconcerting to suggest actually correcting bad data. Yet if the business situation is sensitive to errors and requires a high degree of data completeness, making those known flaws as right as possible may benefit the outcome in the long run. For example, any healthcare-related management scenario in which lives are put at stake based on the results of an analysis may drive a discrete process for ensuring that the data sets used are as “clean” as possible to prevent medical providers from doing any harm.
- Ignore the errors. Yes, this may sound harsh, but when you consider the situation, you may be constrained in what you are able to do. One cannot “fix” the data if that poses the threat of making records inconsistent with others known to be sound. And if the business scenario is tolerant of some degree of noise, then ignoring the errors may not have a significant impact. As an example, the precision of a recommendation engine may be only minimally improved with a complete data set, but much of the value can be achieved even in the presence of a tolerable percentage of errors.
- Automate conditional use of the data. This third choice seeks to modulate between the previously posed “all-or-nothing” options. Here, if there are enough patterns of failure that can be recognized and that have an intuitive correction, try to employ a manageable set of business rules to be applied when it makes sense.
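One way to frame that third choice is as a manageable list of (detect, repair) business-rule pairs, where a repair is applied only when its pattern is recognized. A minimal sketch, with an illustrative rule and record shape that are my assumptions, not a prescribed design:

```python
# Conditional use of flawed data: apply a repair only when its
# corresponding failure pattern is detected in a record.
def apply_rules(record, rules):
    """Apply each rule whose detect() matches; return the possibly repaired record."""
    for detect, repair in rules:
        if detect(record):
            record = repair(record)
    return record

# Example rule: trim stray whitespace around field values.
rules = [
    (lambda r: any(f != f.strip() for f in r),   # detect
     lambda r: tuple(f.strip() for f in r)),     # repair
]

cleaned = apply_rules((" MD ", "21045 "), rules)  # -> ("MD", "21045")
```

Records that match no rule pass through untouched, which keeps the automated corrections confined to patterns that were deliberately recognized.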
The third choice may be the right one to use in situations similar to what I described earlier – misfielded data, inadvertent extraneous fields, shifted data. For example, if you see that 1% of the time a known U.S. state code appears in a City field and a ZIP code appears in a State field, then some issue must have shifted the data across field boundaries. In this case, you might assume the fields can be moved to their correct locations and that the missing or incorrect City value can be looked up or inferred from corresponding values in other records with the same ZIP code.
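That shifted-field correction can be sketched as a single rule. The record layout, the abbreviated state list, and the ZIP-to-city lookup table below are all illustrative assumptions:

```python
import re

# Hypothetical record layout: (city, state, zip). One correction rule for
# the "shifted fields" pattern: a state code appears in the City field and
# a ZIP code appears in the State field.
US_STATES = {"NY", "CA", "TX", "MD"}          # abbreviated for illustration
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def fix_shifted_record(record, city_by_zip):
    """Return (corrected_record, was_corrected)."""
    city, state, zip_code = record
    if city in US_STATES and ZIP_RE.match(state):
        # Fields appear shifted one position: restore them and infer the
        # missing City from other records sharing the same ZIP code.
        inferred_city = city_by_zip.get(state, "")  # "" when no match exists
        return (inferred_city, city, state), True
    return record, False

# Usage: the lookup table would be built from records believed to be sound.
city_by_zip = {"21045": "Columbia"}
fixed, changed = fix_shifted_record(("MD", "21045", ""), city_by_zip)
```

Note that the repair is only attempted when both signals agree (state code in the wrong field and a well-formed ZIP beside it), which reduces the chance of “correcting” a record that was actually sound.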
The remaining challenge is that automating a massive-scale data correction will require careful logging and tracking of the automated corrections, both to monitor the percentage of time that corrections have been applied and to ensure that the changes have been applied in a consistent manner. This suggests some key directives for what we might call “big data governance,” and we will cycle back to some of these directives in future articles.
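The tracking piece can start very simply: count how often each automated rule fires so the applied-correction rate can be monitored over time. A minimal sketch (the rule name and record counts are illustrative):

```python
from collections import Counter

class CorrectionLog:
    """Track how often each automated correction rule fires."""
    def __init__(self):
        self.total = 0
        self.fired = Counter()

    def record(self, rule_name=None):
        """Call once per record; pass rule_name if a correction was applied."""
        self.total += 1
        if rule_name:
            self.fired[rule_name] += 1

    def rates(self):
        """Fraction of records touched by each rule."""
        return {rule: n / self.total for rule, n in self.fired.items()}

log = CorrectionLog()
for _ in range(998):
    log.record()                   # records passed through untouched
log.record("shifted_fields")       # records repaired by a named rule
log.record("shifted_fields")
# log.rates() -> {"shifted_fields": 0.002}
```

A sudden jump in one rule’s rate is itself a governance signal: the upstream extraction process has probably changed again.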
SOURCE: Considering the Usability of Flawed Data in a Big Data Analysis