Search Engines – GIGO

Originally published July 9, 2009

Everybody knows Yahoo and Google. Everybody can go to the Internet and find just about anything with search engines. Everybody knows the price of Google stock (or at least everyone knows that Google stock is very high). Entire conferences focus on search engines.

But for all the hype surrounding search engines, there is one major Achilles’ heel that leaves every search engine vulnerable. That Achilles’ heel is that a search engine is no more effective and no more useful than the text on which it operates. In other words, search engines have the great vulnerability of GIGO – garbage in, garbage out.

In order to understand this fully, consider what today’s search engine does. The search engine takes a word and finds the word, combinations of the word and permutations of the word. And the search engine does this efficiently and well. There is no argument about that. If you enter “Jane Fonda” into a search engine, you are likely to get Jane Fonda’s web site, the filmography of Jane Fonda, the Jane Fonda total workout web site, and so forth. You may even get references to Peter Fonda and Henry Fonda. And that’s what search engines do well.

So as long as what you want is a simple search, then search engines are your thing. But what if you want to do analytical processing? (Note – while searching is important, a search is different from an analytical process that operates on text.) To give some examples, suppose you want to examine the important dates in Jane Fonda’s life. You have a body of text about Jane Fonda. In one place, it says that Jane Fonda was married in 1967. In another place, it states that Jane Fonda’s first movie was made in September 1963. In another place, it lists the filmography of Jane Fonda and here it states “…10-1-68 – Klute.”

The search engine has a problem here because there is no uniformity of dates. In one place, the date is in one format. In another place, the date is in another format. There is no standardization of data. And so the search engine displays one of its limitations. (Note: data standardization for textual analytics is patent pending. Contact the author for licensing rights.)
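As a rough illustration of what date standardization might look like, the sketch below maps the three date forms mentioned above ("1967", "September 1963" and "10-1-68") onto a single year-month-day style. The function name `standardize_date`, the month-day-year reading of "10-1-68" and the rule that two-digit years belong to the 1900s are all assumptions made for this example. It is not the author's method.

```python
import re

# Month names mapped to month numbers (1-12).
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


def standardize_date(text):
    """Return YYYY, YYYY-MM or YYYY-MM-DD for a recognized date, else None.

    Hypothetical helper: handles only the three forms used in the article.
    """
    s = text.strip().lower()

    # Numeric form such as 10-1-68 or 10/1/1968.
    # Assumption: the order is month-day-year.
    m = re.fullmatch(r"(\d{1,2})[-/](\d{1,2})[-/](\d{2}|\d{4})", s)
    if m:
        month, day, year = (int(g) for g in m.groups())
        if year < 100:
            year += 1900  # assumption: two-digit years are 19xx
        if 1 <= month <= 12 and 1 <= day <= 31:
            return f"{year:04d}-{month:02d}-{day:02d}"
        return None

    # Month name followed by a four-digit year, such as "September 1963".
    m = re.fullmatch(r"([a-z]+)\s+(\d{4})", s)
    if m and m.group(1) in MONTHS:
        return f"{m.group(2)}-{MONTHS[m.group(1)]:02d}"

    # Bare four-digit year, such as "1967".
    if re.fullmatch(r"\d{4}", s):
        return s

    return None
```

Once every date is in the same form, the values can be sorted and compared as plain strings, which is exactly what a literal search cannot do with the raw text.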

But standardization of dates is only the starting point for textual analytics. Consider the reading of numeric values. In the text about Jane Fonda it states that “…gross sales for Klute were thirty million dollars…” The problem is that the value “thirty million dollars” cannot be meaningfully compared to another figure, say “fourteen thousand dollars,” until the written figures are converted to numerics. Business intelligence software understands what to do with “$30,000,000,” but business intelligence software does not understand what to do with “thirty million dollars.” Therefore, in order to prepare the raw text for analytical processing, conversion from text to numerics must be done. (Note: conversion from text to numerics is patent pending. Contact the author for licensing rights.)
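To make the conversion step concrete, here is a minimal sketch that turns written English figures such as "thirty million dollars" into integers. The name `words_to_number` and the small vocabulary it accepts are assumptions for this example, and it covers only simple whole-number phrases.

```python
# Vocabulary for a deliberately small number-word parser.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}
IGNORED = {"and", "dollar", "dollars"}


def words_to_number(phrase):
    """Convert a phrase like 'thirty million dollars' to an int.

    Hypothetical helper: raises ValueError on any word it does not know.
    """
    total = 0    # sum of the completed thousand/million/billion groups
    current = 0  # value of the group currently being read
    for word in phrase.lower().replace("-", " ").split():
        if word in IGNORED:
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word in SCALES:
            total += current * SCALES[word]
            current = 0
        else:
            raise ValueError(f"unrecognized number word: {word!r}")
    return total + current
```

With both figures converted, the comparison `words_to_number("thirty million dollars") > words_to_number("fourteen thousand dollars")` is an ordinary integer comparison.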

And date conversion and conversion from text to numerics only scratch the surface when it comes to addressing the needs of preparing textual data for textual analytics.

As a more sophisticated form of preparation of textual data for textual analytics, consider the need for classifying textual data into broad categories. Looking at Jane Fonda, the body of work left by Jane Fonda could be classified into several broader categories. For example, there may be a category for workout videos. This would include Cathy Smith, Jane Fonda and other notables who have produced fitness and fitness motivation routines on tape or other media.

Then there might be the category of films in which fathers appeared alongside their daughters. This might include Ryan and Tatum O’Neal, Jane and Henry Fonda and others. Then there might be a category of people who actively protested the Vietnam War. This category might include Jerry Rubin, Squeaky Fromme, and Jane Fonda.

There indeed can be many classifications of data and Jane Fonda (like almost everyone) fits into multiple categories. These categories of data are very important in the creation of the structure of textual data for textual analytic processing. (Note: external categorization of text for the purpose of textual analytics is patent pending. Contact the author for licensing.)
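One simple way to picture external categorization is as a lookup table, maintained outside the text, that maps each category to its members. The sketch below tags a piece of text with every category whose members it mentions. The category names, the membership lists (copied from the examples above) and the function `tag_text` are illustrative assumptions, not a description of any real product.

```python
# Externally maintained categories; membership copied from the article's examples.
CATEGORIES = {
    "workout videos": ["Cathy Smith", "Jane Fonda"],
    "father-daughter films": ["Ryan O'Neal", "Tatum O'Neal",
                              "Henry Fonda", "Jane Fonda"],
    "Vietnam War protesters": ["Jerry Rubin", "Squeaky Fromme", "Jane Fonda"],
}


def tag_text(text):
    """Return {category: [members mentioned in text]} for matching categories.

    Matching is a plain case-insensitive substring test, which is an
    assumption made to keep the sketch short.
    """
    lowered = text.lower()
    tags = {}
    for category, members in CATEGORIES.items():
        found = sorted(m for m in members if m.lower() in lowered)
        if found:
            tags[category] = found
    return tags
```

A sentence that mentions only Jane Fonda is tagged with all three categories, which reflects the point above that one person fits into multiple categories.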

Yet another aspect of analyzing text is examining the proximity of words to one another in order to infer meaning. In the case of Jane Fonda, the phrases “war protestor” and “Santa Fe” may both appear. When these two phrases are found far apart, they probably refer to different events. In her early life, Jane Fonda was a war protestor, and the reference is to her trip to Hanoi and other activities. The mention of Santa Fe probably refers to her later life, when she built a home in Santa Fe. So when these phrases are found far apart, they have one set of meanings. But suppose they were found close together. Then the reference is probably to something quite different: the construction workers who protested against the building of Jane Fonda’s house in Santa Fe.

The proximity of words then is an important factor in determining the meaning of those words. In fact, in order to aid the analytics process, proximity variables can be created. A proximity variable is a word that is created whenever two words or phrases are in close proximity to each other. (Note: Proximity variables are patent pending. Contact the author for licensing.) 
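A proximity variable can be sketched as follows: find where each phrase occurs, measure the distance in words between occurrences, and emit a new variable name when the distance falls within a chosen window. The function names, the ten-word default window and the variable name used in the usage note are assumptions for illustration only.

```python
import re


def phrase_positions(tokens, phrase):
    """Return the token indexes at which the phrase starts."""
    words = phrase.lower().split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def proximity_variable(text, phrase_a, phrase_b, name, window=10):
    """Return `name` if the two phrases start within `window` words of
    each other anywhere in the text, otherwise None.

    Hypothetical helper: tokenization is a simple lowercase word split.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    starts_a = phrase_positions(tokens, phrase_a)
    starts_b = phrase_positions(tokens, phrase_b)
    is_close = any(abs(a - b) <= window for a in starts_a for b in starts_b)
    return name if is_close else None
```

For example, calling `proximity_variable(text, "war protestor", "Santa Fe", "santa_fe_house_protest")` would return the variable name only for passages where the two phrases sit near each other, and `None` where they are separated by long stretches of unrelated text.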

In fact, there are many other aspects to the techniques and approaches needed for taking raw textual data and preparing that data for textual analytics. (Note: for a list of those other techniques and approaches not mentioned here that are patent pending, contact the author. Licensing of intellectual property will be considered.)

From these simple examples, it is clear that a search engine is incapable of doing anything other than looking at data literally. When it comes time to look at text in a manner that is suitable for analysis, the data first needs to be edited, or integrated. Then and only then can textual analytics proceed.

If you have read and understood what has been said here, there is one question that should be going through your head: what about data on the Internet? Is the implication that we should be integrating data on the Internet? There are several major challenges in trying to integrate textual data found on the Internet. The first challenge is the volume of data found on the Internet. Any integration is going to have to swallow a LOT of data. The second challenge is that of creating and assigning external categories to data. A categorization that makes sense to one person may not make sense to the next. External categorization is like art – it is subjective. As such, it is questionable whether external categorization can be done for the massive and diverse data found on the Internet.

However, internal corporate data is another matter entirely. First, there is much less data in a corporation than there is on the Internet. While the volume of corporate data is a challenge, it is not an insurmountable one. The second issue is that of external categorization. When external categorization is done for a single corporation, there are far fewer interpretations of the content and structure of the categories than there are in the general population.

Therefore, textual analytics is a real possibility for corporate data while it can probably only be achieved to a limited extent for the Internet.

  • Bill Inmon

    Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

Comments

Posted July 9, 2009 by Charles Knight charles@altsearchengines.com

Re: Share it

May we republish this on AltSearchEngines, with full attribution?

LMK,

Charles Knight, editor

AltSearchEngines.com

charles@altsearchengines.com

Posted July 9, 2009 by George Allen

For the humor alone (Patent Pending) this is worth a read.  Thanks, Bill.  I just came out of a meeting on textual mining of medical record notes and read this.  Made the afternoon much more relaxing.
