
Rick van der Lans

Welcome to my blog where I will talk about a variety of topics related to data warehousing, business intelligence, application integration, and database technology. Currently my special interests include data virtualization, NoSQL technology, and service-oriented architectures. If there are any topics you'd like me to address, send them to me at rick@r20.nl.

About the author

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful annual European Enterprise Data and Business Intelligence Conference held in London. In the summer of 2012 he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.

The 1880 census of the US was taking a long time, much too long for the decision makers of the United States Census Office (USCO). The reason it took so long was that it was a largely manual process; in the end, it took eight years to complete the exercise.

Halfway through the process, in 1884, it was evident that completing it would take far too long. Therefore, one of the USCO's employees was asked to design a machine for the upcoming 1890 census that would make it possible to process the enormous amount of data much faster.

That employee was Herman Hollerith. In fact, William Hunt and Charles Pidgin were asked the same question, and a benchmark was prepared in which all three could demonstrate how fast their solutions were. Coding the data took 144 hours with Hunt's method, 100 hours with Pidgin's method, and 72 hours with Hollerith's. Processing the data took 55, 44, and 5 hours, respectively. The conclusion: Hollerith's solution was by far the fastest and was therefore selected by the USCO.

For the 1890 census, 50,000 men were used to gather the data and put it on punch cards. It was also decided to store many more data attributes: 235 instead of the 6 used in the 1880 census. Hollerith also invented a machine for punching the cards, which made it possible for one person to produce 700 punch cards per day. Because of Hollerith's machines, 6,000,000 persons could be counted per day. His machines reduced a ten-year job to a few months, and in total his inventions led to $5 million in savings.

Hollerith's ideas for automation of the census are described in Patent No. 395,782 of Jan. 8, 1889, which starts with the following sentence: "The herein described method of compiling statistics ..."

Does this all sound familiar? Massive amounts of data, compiling statistics, the need for better performance. To me, it sounds as if Hollerith was working on the first generation of big data systems.

Hollerith started his own company, the Tabulating Machine Company, in 1896. In 1911 it merged with some other companies to form the Computing-Tabulating-Recording Company (CTR), and in 1924 the name CTR was changed to IBM. In other words, IBM has always been in the business of big data, analytics, and appliances.

Why did it take so long before we came up with the term big data when, evidently, we have been developing big data systems since the early days of computing? You could say that the first information processing system was a big data system using analytics. This means that Hollerith, besides being a very successful inventor, can be considered the grandfather of big data.


Posted February 1, 2013 7:00 AM
