An Introduction to Double Metaphone and the Principles Behind Soundex

Originally published September 22, 2005

Albert Einstein wrote “Make everything as simple as possible, but not simpler.” This principle holds true for most solutions that any programmer, business analyst, or executive will ever implement. What happens, however, when the gap between simplicity and utility grows too wide to bridge?

Long before the advent of computer systems capable of sorting, indexing, and searching vast amounts of information, governmental organizations and private citizens alike kept bloated and inefficient paper records in whichever manner best suited their individual needs. Illiteracy was rampant, and the United States was fast becoming the cultural melting pot it is today. Many citizens could not speak English. As diversity in language, culture and customs grew, so did the number of ways in which everyday words were spelled.

Many immigrants to the United States had a native language that was not written in Roman characters. To write their names, the names of their relatives, or the cities they arrived from, these immigrants had to make their best guess at how to render their own script in English.

The United States government realized the need to categorize the names of private citizens in a manner that allowed multiple spellings of the same name (e.g. Smith and Smythe) to be grouped. Robert C. Russell of Pittsburgh, Pennsylvania created an algorithm capable of indexing names in such a way that multiple spellings of the same name could be found with only a cursory glance. Thus, with Russell's patent in 1918, Soundex was born.

Russell knew that letters of the alphabet were phonetically divided into categories. In his patent he describes assigning a numeric value to each category.

  1. Oral resonants a, e, i, o, u, y.
  2. Labials and labio-dentals b, f, p, v.
  3. Gutturals and sibilants c, g, k, q, s, z.
  4. Dental-mutes d, t.
  5. Palatal-fricative l.
  6. Labio-nasal m.
  7. Lingua-nasal n.
  8. Dental fricative r.

Russell also described a few additional rules to complete the indexing:

  1. The initial letter of the word is always kept.
  2. Two consecutive letters that have the same code are considered a single letter (e.g. “bb” is the same as just “b”).
  3. If a word ends with “gh”, “s” or “z”, those letters are discarded.
  4. Only the first occurrence of a vowel (Group 1) is counted.
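
As a rough illustration, the categories and rules listed above can be sketched in Python. This is only one reading of the rules as this article states them, not a transcription of Russell's patent. The function name russell_index is hypothetical, and several details are assumptions: letters missing from the table (such as "h", "j", "w" and "x") are simply skipped, the kept initial letter is shown as a letter rather than a number, and the first vowel is counted wherever it occurs in the word.

```python
# One reading of the eight categories listed in the article.
RUSSELL_GROUPS = {
    "aeiouy": 1,  # oral resonants
    "bfpv": 2,    # labials and labio-dentals
    "cgkqsz": 3,  # gutturals and sibilants
    "dt": 4,      # dental-mutes
    "l": 5,       # palatal-fricative
    "m": 6,       # labio-nasal
    "n": 7,       # lingua-nasal
    "r": 8,       # dental fricative
}
RUSSELL_CODES = {
    letter: code
    for letters, code in RUSSELL_GROUPS.items()
    for letter in letters
}


def russell_index(name):
    """Hypothetical sketch of the rules above; not Russell's exact method."""
    word = "".join(ch for ch in name.lower() if ch.isalpha())
    # Rule 3: discard a trailing "gh", "s" or "z".
    if word.endswith("gh"):
        word = word[:-2]
    elif word.endswith(("s", "z")):
        word = word[:-1]
    if not word:
        return ""

    # Rule 1: the initial letter is always kept (assumption: as a letter).
    result = [word[0].upper()]
    previous = RUSSELL_CODES.get(word[0])
    vowel_seen = previous == 1

    for ch in word[1:]:
        code = RUSSELL_CODES.get(ch)
        if code is None:
            continue  # assumption: letters outside the table are skipped
        if code == previous:
            continue  # Rule 2: consecutive letters with the same code count once
        previous = code
        if code == 1:
            if vowel_seen:
                continue  # Rule 4: only the first vowel is counted
            vowel_seen = True
        result.append(str(code))
    return "".join(result)


print(russell_index("Smith"), russell_index("Smythe"))
```

Under these assumptions, Smith and Smythe receive the same index, which is the grouping behavior the rules are meant to produce.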

Today, the United States Government uses a system very similar to Russell’s original design. It drops the letters “h”, “w” and “y”, combines the letters “m” and “n” into one group, drops vowels completely unless a vowel is the initial letter of the word, and removes the rule regarding words that end with “gh”, “s” and “z”.

U.S. Government Soundex Table

  1. b, f, p, v
  2. c, g, j, k, q, s, x, z
  3. d, t
  4. l
  5. m, n
  6. r

Examples:

  • Johnson = J525
  • Miller = M460
  • Ricardo = R263
  • Peters = P362
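
The table and examples above can be reproduced with a short Python sketch. It follows the commonly described form of the government Soundex code rather than any official source code. The function name soundex and its length parameter are illustrative, and two details not spelled out in this article are assumptions taken from that common description: codes are padded with zeros or truncated to one letter plus three digits, and letters with the same code separated only by "h" or "w" are coded once.

```python
# Letter groups from the U.S. Government Soundex table above.
SOUNDEX_GROUPS = (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
)
SOUNDEX_CODES = {
    letter: digit
    for letters, digit in SOUNDEX_GROUPS
    for letter in letters
}


def soundex(name, length=4):
    """Sketch of a Soundex code: initial letter plus up to three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""

    code = letters[0].upper()  # the initial letter is always kept
    previous = SOUNDEX_CODES.get(letters[0], "")

    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")  # vowels, h, w, y carry no digit
        if digit and digit != previous:
            code += digit
        # Assumption: "h" and "w" do not separate equal codes, vowels do.
        if ch not in "hw":
            previous = digit

    # Assumption: pad with zeros, then truncate to the fixed length.
    return (code + "0" * length)[:length]


for name in ("Johnson", "Miller", "Ricardo", "Peters"):
    print(name, "=", soundex(name))
```

With this sketch, Smith and Smythe also share a single code, which is the grouping the system was designed to achieve.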

The foundation and inspiration behind Soundex was solid, but unfortunately the actual algorithm was inadequate in most cases.

Since 1918, several alternatives to Soundex have been developed, with varying effectiveness in different areas: Phonix, Q-gram/N-gram methods, edit-distance based algorithms, and several proprietary indexing systems. For the most part, they have all but been replaced by a new and powerful indexing system called Double Metaphone.

Lawrence Philips’ Double Metaphone phonetic matching algorithm was the first sound indexing system to group words not just by spellings, but also by different pronunciations.

Though much more complicated systems have been created, the dynamic capabilities of Double Metaphone can still be employed in a variety of solutions to create more powerful and effective searches of proper names and short sentences in databases and other information storage media.

Though powerful and far superior to anything else available today, Double Metaphone technology does have its limitations and drawbacks.

  • Double Metaphone was designed for searching lists of proper names rather than large amounts of text.
  • The ranking ability of Double Metaphone is very poor; however, it is still much more powerful than that of Soundex or any other sound indexing system. Each word produces a primary key and a secondary key, and comparing the keys of two words gives three match levels, described below.
    • (Primary Key = Primary Key) = Strongest Match
    • (Secondary Key = Primary Key) = Normal Match
    • (Primary Key = Secondary Key) = Normal Match
    • (Secondary Key = Secondary Key) = Minimal Match
  • Double Metaphone may not match grossly misspelled words that seriously alter the phonetic structure of the word.
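
The match levels in the list above can be expressed as a small Python helper. This sketch does not implement Double Metaphone itself. It assumes the two keys for each word have already been computed by some Double Metaphone implementation and are passed in as (primary, secondary) pairs. The function name match_level and the key strings in the usage lines are made-up placeholders, not real Double Metaphone output.

```python
from typing import Optional, Tuple

Keys = Tuple[str, str]  # (primary key, secondary key) for one word


def match_level(keys_a: Keys, keys_b: Keys) -> Optional[str]:
    """Rank two words by comparing their precomputed key pairs."""
    primary_a, secondary_a = keys_a
    primary_b, secondary_b = keys_b

    # Empty keys never count as a match.
    if primary_a and primary_a == primary_b:
        return "strongest"
    if secondary_a and secondary_a == primary_b:
        return "normal"
    if primary_a and primary_a == secondary_b:
        return "normal"
    if secondary_a and secondary_a == secondary_b:
        return "minimal"
    return None  # no key in common


# Placeholder keys for illustration only.
print(match_level(("KEY1", "KEY2"), ("KEY1", "KEY3")))
print(match_level(("KEY1", "KEY2"), ("KEY2", "KEY3")))
print(match_level(("KEY1", "KEY2"), ("KEY3", "KEY2")))
print(match_level(("KEY1", "KEY2"), ("KEY3", "KEY4")))
```

Checking the comparisons from strongest to weakest means a pair of words is always reported at the best level it qualifies for.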

Despite its limitations, Double Metaphone technology, which is free to use and completely open source, still holds as the most flexible and powerful Soundex-style indexing system today.

  • Adam Carstensen

    Adam, a Consultant for the Data Management Group, specializes in Web technologies and application development. During his tour of duty in the US Army, Adam designed and implemented complex software solutions for various requirements in support of intelligence operations worldwide and designed an application adopted as a legacy software system for use in worldwide intelligence operations. Adam has held positions in Web development, application development and intelligence.
