For Text Data Quality, Focus on Sources - BeyeNETWORK


 

Podcasts

For Text Data Quality, Focus on Sources


 

Originally published December 9, 2009


Overview

Seth Grimes discusses emerging best practices for selection and processing of text sources that can lead to better text data quality.

Seth Grimes

SOURCE: For Text Data Quality, Focus on Sources

 
 

Comments


Posted December 10, 2009 by Justin Langseth

Seth, very interesting article.  

At Clarabridge we have encountered similar challenges as we have added Social Media-sourced content into Clarabridge text analytics systems for our customers.  

In the Social Media sphere there are multiple related issues.  Advertisements pollute otherwise useful results, but can be largely removed via linguistic detection.  Spam also creeps through the typically used statistical spam filters, but can also be removed by looking for linguistic fingerprints that are characteristic of spam, such as a very low ratio of grammatical connections per sentence and, often, a high average per-sentence word count.
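
As a rough illustration of the second fingerprint (average per-sentence word count), here is a minimal Python sketch. It is not Clarabridge's method: the function names, the naive sentence splitter, and the `max_avg_words` threshold are all assumptions made for the example. The grammatical-connection ratio would need a syntactic parser and is not covered.

```python
import re

def split_sentences(text):
    # Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def avg_words_per_sentence(text):
    # Mean number of whitespace-separated tokens per sentence; 0.0 for empty input.
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

def looks_like_spam(text, max_avg_words=40):
    # max_avg_words is a hypothetical threshold, not a published value;
    # it would need tuning against labeled examples.
    return avg_words_per_sentence(text) > max_avg_words
```

A signal like this would probably work best combined with a statistical filter rather than used alone, since legitimate posts can also contain very long sentences.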

Then there is the issue of off-topic comments.  If someone wants to analyze a mid-Atlantic burger chain, even postings that say something useful about that chain may also talk about other aspects of the person's day-to-day activities.  The burger-related post may be 3 sentences out of a 30-sentence blog that talks about various other companies and activities as well.  This is another area where text analytics can greatly help, separating the on-topic "wheat" (from the specific analyst's perspective, that is) from the unrelated "chaff."
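
The simplest form of this wheat-from-chaff separation can be sketched as keyword filtering at the sentence level. This is only an assumed baseline for illustration; the function name and the term list are hypothetical, and a production system would likely rely on entity recognition or topic classification instead of substring matching.

```python
import re

def on_topic_sentences(text, topic_terms):
    # Keep only the sentences that mention at least one topic term (case-insensitive).
    terms = [t.lower() for t in topic_terms]
    # Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    return [s for s in sentences if any(t in s.lower() for t in terms)]

# Hypothetical usage: pull the burger-related sentence out of a longer post.
post = ("I went to work today. The burger at Example Burger was great. "
        "Then I watched a movie.")
relevant = on_topic_sentences(post, ["burger"])
```

Filtering sentence by sentence, rather than accepting or rejecting the whole post, keeps the on-topic material while dropping the rest of the blog entry.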

So thank you for raising these important concerns.  These are areas where collaboration and best-practice sharing could benefit us all.

Justin Langseth, President & CTO, Clarabridge, Inc. 
