Benford’s Law—Information Analysis and BPM
by David Loshin
Originally published April 18, 2005
Sometimes what people perceive to be the truth is less than consistent with reality. In the business intelligence world, situations like these often present opportunities for discovery that lead directly to actionable knowledge. One example is a curious observation (in the 1930s) by a General Electric physicist named Frank Benford that led to his description of a counter-intuitive law of logarithmic sizes associated with numeric distributions. This law, which is now referred to as “Benford’s Law,” states that in data value sets with certain properties, there is a predictable, albeit uneven, distribution of the initial digits of numbers within the set. In other words, in some number distributions, if you analyze the frequency of the leftmost digit of all the numbers, you are much more likely to find the digit ‘1’ than any other digit, followed by ‘2,’ then ‘3,’ etc.
Benford’s Law applies to data sets with these criteria:
For example, consider the closing prices on any given day of all the stocks on the New York Stock Exchange, the populations of every municipal jurisdiction in the United States, or the volumes of all the freshwater lakes in the world. While your intuition might suggest that the distributions of the leftmost digit would be equal, it turns out that each of these data sets, when analyzed, does indeed reflect the Benford distribution.
Ultimately, Benford observed a phenomenon that had been noticed earlier by others, notably 19th-century astronomer Simon Newcomb, that for data sets meeting those criteria, the frequencies of the initial digit generally corresponded to the probability function:
P(dd) = log10(1 + 1/dd),
where dd represents the initial digit(s) and the logarithm is taken base 10. The probabilities for the first digit are shown in Table 1.
Table 1: Benford's Law Probabilities for Frequency Distribution
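The first-digit probabilities follow directly from the formula above. As a minimal sketch, assuming a base-10 logarithm and using a hypothetical helper name, benford_probability, that does not come from the original article, they can be computed like this:

```python
import math


def benford_probability(leading_digits: int) -> float:
    """Probability that a number begins with the given leading digit(s)."""
    if leading_digits < 1:
        raise ValueError("leading digits must be a positive integer")
    return math.log10(1 + 1 / leading_digits)


# Probabilities for the nine possible first digits.
first_digit_table = {d: benford_probability(d) for d in range(1, 10)}

for digit, probability in first_digit_table.items():
    print(f"{digit}: {probability:.1%}")
```

The nine values sum to 1, and they fall from roughly 30.1 percent for ‘1’ to roughly 4.6 percent for ‘9.’ The same function accepts multi-digit prefixes, so benford_probability(10) gives the chance that a number starts with “10.”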
This implies that in a Benford data set, a ‘1’ has a 30 percent chance of being the initial digit, while a ‘9’ has a less than 5 percent chance, as can be seen in Figure 1. This law “works” due to the logarithmic nature of increasing numbers. For example, take the stock prices: there are as many dollar increments in a stock’s price between $10 and $20 as there are between $80 and $90, yet to get from $10 to $20 the price of the stock must double, while to go from $80 to $90, the price must increase by only 12.5 percent. The implication is that it takes longer for the price to double than to increase by a smaller percentage, and consequently the first digit stays at ‘1’ for a longer time than it would at ‘8.’
Figure 1: Initial Digit Distributions Predicted by Benford
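The stock price argument can be checked with a small simulation. The sketch below is illustrative only: it assumes a price that rises from $10 to $100 by a constant 0.1 percent per step (a made-up growth rate, not a figure from the article) and counts the share of steps spent on each first digit. The function names are hypothetical.

```python
from collections import Counter


def first_digit(value: float) -> int:
    """Return the first significant digit of a positive number."""
    # Scientific notation starts with the leading digit,
    # e.g. 12.5 -> '1.250000000000000e+01'.
    return int(f"{value:.15e}"[0])


def dwell_shares(start: float = 10.0, end: float = 100.0,
                 growth_per_step: float = 0.001) -> dict[int, float]:
    """Share of steps spent on each first digit while the price
    grows from `start` up to (but not including) `end`."""
    counts = Counter()
    price = start
    while price < end:
        counts[first_digit(price)] += 1
        price *= 1 + growth_per_step  # same percentage gain every step
    total = sum(counts.values())
    return {digit: counts[digit] / total for digit in range(1, 10)}


shares = dwell_shares()
for digit, share in shares.items():
    print(f"first digit {digit}: {share:.1%} of steps")
```

Under these assumptions the price spends about 30 percent of its steps between $10 and $20 and only about 5 percent between $80 and $90, which mirrors the Benford probabilities for ‘1’ and ‘8.’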
The non-randomness of these digit frequencies has led to some interesting uses, most notably in the areas of auditing and fraud detection. For example, a person intending to commit fraud through “cooking the books” might assume numeric randomness and pepper their incorrect entries with numbers that reflect an equal distribution of first digits. But since genuine figures of this kind meet our specified criteria, even a relatively small number of invalid entries will skew the distribution away from the Benford curve, and will highlight areas for further exploration.
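One way to quantify a skew away from the Benford curve is a chi-square goodness-of-fit statistic on the first digits. The article does not name a specific test, so this choice is an assumption, as are the simulated data and all function names in the sketch below. The “genuine” amounts are drawn log-uniformly (which conforms to Benford’s Law) and the fabricated ones have equally likely first digits.

```python
import math
import random
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_8DF_5PCT = 15.507


def first_digit(value: float) -> int:
    """Return the first significant digit of a positive number."""
    return int(f"{value:.15e}"[0])


def benford_chi_square(values) -> float:
    """Chi-square statistic comparing observed first digits to Benford's Law."""
    counts = Counter(first_digit(v) for v in values)
    n = sum(counts.values())
    statistic = 0.0
    for digit in range(1, 10):
        expected = n * math.log10(1 + 1 / digit)
        statistic += (counts[digit] - expected) ** 2 / expected
    return statistic


rng = random.Random(2005)  # fixed seed so runs are repeatable

# Hypothetical genuine amounts: log-uniform between 1 and 10,000.
genuine = [10 ** rng.uniform(0, 4) for _ in range(50_000)]
# Hypothetical fabricated amounts with equally likely first digits.
fabricated = [rng.randint(100, 999) for _ in range(5_000)]

clean_stat = benford_chi_square(genuine)
mixed_stat = benford_chi_square(genuine + fabricated)

print(f"genuine only:    {clean_stat:.1f}")
print(f"with fabricated: {mixed_stat:.1f}")
print(f"flag for review: {mixed_stat > CHI2_CRITICAL_8DF_5PCT}")
```

A statistic above the critical value is only a signal that the entries deserve a closer look, not proof of fraud. In practice the threshold and the sample sizes would need to be chosen for the data at hand.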
Interestingly, Benford analysis is consistent with the concept of data profiling, introducing an alternate dimension across which numeric data can be subjected to frequency analysis. However, the implications of Benford’s Law open up other curious doors. Certain kinds of time durations meet the Benford criteria, allowing for both information quality analysis and business process improvement opportunities. On a simple level, consider that time durations associated with inbound call center operations should conform to Benford’s Law:
If each of these durations should reflect a Benford distribution, then variance from the Benford distribution should signal an anomaly for analytical review. For example, the time that a customer remains on hold is dictated by two factors: the time that a customer waits before a CSR picks up the line, and the time at which a customer decides to hang up, or “defect.” In the absence of any defection, the times should map to the Benford curve, which reflects the expected distribution of hold times that customers might be willing to tolerate. But defections hovering at a particular hold time can be construed as a “built-in maximum,” which violates one of the Benford criteria. This indicates a failure in the business process to properly support the activity, and possibly a need for additional resources or for better resource training.
The time between product purchase and initial customer contact should also be consistent with the Benford curve. We might expect that a larger number of people will require help at an earlier stage, while those who have survived without making the call are more likely to go longer without calling. So any variations from the Benford curve might indicate a problem that is occurring with unexpected frequency or a product failure occurring faster than expected. Yet again, variance from the natural expectations may indicate a process or product failure requiring more focused attention.
There are multiple ways that a natural logarithmic size law like Benford’s Law can be applied, as well as alternate mathematical functions and laws that can (and should) be integrated into analytical profiling tools, and I will be exploring these in future articles. And there is definite value in collectively exploring how techniques and methods developed for different industries and applications can be abstracted and integrated into the information integration and semantic convergence process.
Copyright 2004 — 2019. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC