Originally published September 22, 2005
Albert Einstein is often credited with saying “Make everything as simple as possible, but not simpler.” This principle holds true for most solutions that any programmer, business analyst, or executive will ever implement. What happens, however, when the gap between simplicity and utility grows too wide to bridge?
Long before the advent of computer systems capable of sorting, indexing, and searching vast amounts of information, governmental organizations and private citizens alike kept bloated and inefficient paper records in whichever manner best suited their individual needs. Illiteracy was rampant, and the United States was fast becoming the cultural melting pot it is today. Many citizens could not speak English. As diversity in language, culture and customs grew, so did the number of ways everyday words and names were spelled.
Many immigrants to the United States had a native language that was not based on Roman characters. To write their names, the names of their relatives, or the cities they arrived from, these immigrants had to make their best guess at how to express their native script in English.
The United States government realized the need to categorize the names of private citizens so that multiple spellings of the same name (e.g. Smith and Smythe) could be grouped. Robert C. Russell of Pittsburgh, Pennsylvania patented an algorithm for indexing English names so that multiple spellings of the same name could be found with only a cursory glance, and the United States Census Bureau later adopted a variant of it. Thus, in 1918, Soundex was born.
Russell knew that letters of the alphabet were phonetically divided into categories. In his patent he describes assigning a numeric value to each category.
Russell also described a few additional rules to complete the indexing.
Today, the United States Government uses a system very similar to Russell’s original design. It simply drops the letters "h", "w" and "y", combines the letters “m” and “n”, drops vowels completely unless the vowel is the initial letter of the word, and removes the rule regarding words that end with “gh”, “s” and “z”.
U.S. Government Soundex Table
Examples:
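As an illustration, the widely documented American Soundex rules can be sketched in Python. This is a minimal sketch, not the government's official implementation. The names `soundex` and `CODES` are chosen here for illustration. The handling of "h" and "w" (skipped without breaking a run of equal codes, while vowels and "y" do break it) and the padding to one letter plus three digits follow the commonly published convention and are assumptions, not details taken from this article.

```python
# Assumed letter groups of the commonly published American Soundex table.
# Letters in the same group share one digit; "m" and "n" are combined.
CODES = {}
for digit, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in group:
        CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Return a 4-character code: the first letter followed by three digits."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]                 # the initial letter is always kept
    prev = CODES.get(letters[0], "")    # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # assumption: H and W do not split a run
        code = CODES.get(ch, "")        # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets the run
    return (result + "000")[:4]         # pad with zeros or truncate


for name in ("Smith", "Smythe", "Robert", "Rupert"):
    print(name, soundex(name))
```

Under these assumptions, Smith and Smythe both index as S530, so the two spellings land in the same group. Robert and Rupert also share a code, R163, which hints at how coarse the grouping can be.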
The foundation and inspiration behind Soundex were solid, but unfortunately the actual algorithm was inadequate in most cases.
Since 1918, several advances on Soundex have been made, all with varying efficiency in different areas. Phonix, Q-gram/N-gram, edit-distance based algorithms, and several proprietary indexing systems have been developed. For the most part, they have been all but replaced by a newer and more powerful indexing system called Double Metaphone.
Lawrence Phillips’ Double Metaphone phonetic matching algorithm was the first sound indexing system to group words not just by spellings, but also by different pronunciations.
Though much more complicated systems have been created since 1918, the dynamic capabilities of Double Metaphone can still be employed in a variety of solutions to create more powerful and effective searches of proper names and short sentences in databases and other information storage media.
Though powerful and far superior to anything else available today, Double Metaphone technology does have its limitations and drawbacks.
Despite its limitations, Double Metaphone technology, which is free to use and completely open source, remains the most flexible and powerful Soundex-style system available today.
Recent articles by Adam Carstensen