Integrating Unstructured Data into Your Analytic Environment by Kirby Lunger - BeyeNETWORK

Integrating Unstructured Data into Your Analytic Environment

Originally published June 23, 2008

Your corporate reports and dashboards display only transactional database information. You know there is a lot more unstructured information available in your company, but you’re not sure how to integrate it into your analytic environment or how to make it useful to your corporate audience. The good news? You can use a new breed of technologies and techniques to integrate unstructured information into your existing reports and dashboards.

Changes Afoot

An often-quoted statistic holds that only 20% of an organization’s content is available for analysis through a traditional business intelligence (BI) environment, while the other 80% is unstructured or semi-structured and therefore much more difficult to access and integrate. More recent information from TDWI and other research sources puts the unstructured and semi-structured share at a little over 50% of total information sources. Whichever figure you accept, a large part of your organization’s content is not currently in a format that a “traditional” BI, extract, transform and load (ETL), data profiling or data warehouse tool can analyze. Unfortunately, this less structured information is often the most predictive of customer and financial outcomes. For example, a large volume of email in a certain region can reveal a problem with a customer service organization, which will eventually translate into lost customers and poor financial results, whereas standard transactional data might incorrectly (and only after the fact) imply that the problem lay with the sales representative.

The problem is compounded because your analytical environment probably was not intended for a large corporate audience when it was initially built. Most organizations designed and built their performance management environment for fewer than 10% of their employees to use. Now, participants across the information supply chain, including stakeholders such as customers, partners, non-technical managers and operations employees, are asking for access to key enterprise indicators.

The result is a convergence of needs: the need to access additional content types to gain further insight into organizational performance, and the need to democratize information access so that an organization’s community can make better decisions. At first glance, this could imply that you need to spend millions of dollars to upgrade your analytic environment so that people at all levels of the organization can access and analyze these new information sources. This article cannot help with the cost of extending BI or corporate performance management (CPM) software access to more people in your organization, but it will suggest several techniques you can use in your performance management environment to make all types of content readily digestible and understandable across your organization.

The Status Quo

Most organizations have a reporting and analysis platform that presents information in a dashboard format. Dashboards generally address different audiences with various performance management needs:


Figure 1: Dashboard Types by Purpose and Audience [1]

Most dashboards are built using BI or corporate performance management technologies from vendors such as Business Objects, Cognos, Information Builders, Microsoft or Oracle. These technologies all use fairly similar visualization and analysis techniques to present information. Information visualization has come a long way since the early days of the Internet. Visualization techniques are now so commonly accepted that one academic institution has even developed a periodic table of visualization methods!

The point here is that integrating unstructured data into your analytic environment does not require a wholesale rethinking of information display – it just requires using existing techniques in new and interesting ways, possibly for different audiences, to produce more useful action items.

How to Change Your Perspective

Most BI applications start from the assumption that users are accessing the system with a fairly precise query in mind. For example, your customer service manager might use an operational dashboard to look for the average duration of a call in the call center over the last 24 hours. The issue with this assumption is that when people stop thinking about highly structured, transactional content, they often need to ask broader questions. So your customer service manager might instead say, “I see that one of my representatives isn’t keeping up with our maximum call duration requirements. I wonder what’s going on.” In your current environment, this manager might have to call five people and spend time talking with each of them to figure this out. In a dashboard enabled for less precise queries, this manager might be able to spot trends and key words in the rep’s emails and determine that the rep is so good that he is being sent all of the most complex calls, or that his rate of first-call resolution far outpaces his peers’.

The real revolution in information presentation, and what makes this type of analysis possible, is combining the precision of BI querying (generally SQL) with the fuzziness of search technology. This produces a whole new paradigm in two areas: 1) how to perform searches in your BI environment and 2) how to display search results in your BI environment.
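As a minimal sketch of that combination, the Python example below answers the precise question with SQL and then looks for frequent terms in the same representative’s emails. Everything in it is an assumption made for illustration: the `calls` table, the sample emails, the `MAX_CALL_MINUTES` threshold, and the helper names `reps_over_limit` and `top_terms`. A production environment would rely on a real enterprise search engine rather than simple word counts.

```python
import re
import sqlite3
from collections import Counter

MAX_CALL_MINUTES = 8  # hypothetical service-level target

# Hypothetical structured data: one row per call.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (rep TEXT, minutes REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [("ana", 5), ("ana", 6), ("raj", 12), ("raj", 14)],
)

# Hypothetical unstructured data: (rep, email text) pairs.
emails = [
    ("raj", "Escalated billing migration issue resolved on first call"),
    ("raj", "Complex migration case, customer confirmed resolved"),
    ("ana", "Password reset completed"),
]

STOPWORDS = {"a", "and", "of", "on", "the"}


def reps_over_limit(limit):
    """Precise question, answered with SQL: whose average call runs long?"""
    rows = conn.execute(
        "SELECT rep FROM calls GROUP BY rep "
        "HAVING AVG(minutes) > ? ORDER BY rep",
        (limit,),
    )
    return [rep for (rep,) in rows]


def top_terms(rep, count=2):
    """Fuzzy question: which words dominate this rep's emails?"""
    words = Counter()
    for sender, text in emails:
        if sender == rep:
            tokens = re.findall(r"[a-z]+", text.lower())
            words.update(t for t in tokens if t not in STOPWORDS)
    return [term for term, _ in words.most_common(count)]


slow_reps = reps_over_limit(MAX_CALL_MINUTES)
context = {rep: top_terms(rep) for rep in slow_reps}
print(context)
```

The SQL step flags the representative whose calls run long; the term counts then hint at why, which is the kind of context the customer service manager would otherwise have to gather by phone.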

How to Improve Search in Your Analytic Environment

When most of us think about integrating “search” into our analytic solutions, we visualize inserting a search bar into a dashboard and displaying a results list based on the popularity of the documents the search engine returns. This might work well on the public Internet, where every mom-and-pop website is trying to optimize its pages to appear at the top of a web search. It doesn’t work so well inside an organization, where the creator of the Excel file “Forecast 2007” couldn’t care less what its popularity ranking is or even whether other people can find the document. So, very importantly, you need to use a search technology that is optimized for “behind the firewall” search. Some of the BI vendors have established agreements with search vendors to perform this function, or you may need to evaluate stand-alone players in the enterprise search and information access markets to determine which company’s search methodology works best for your specific needs.

How to Improve Unstructured Information Display in Your Analytic Environment

Once you have an adequate search technology, you will need techniques that support exploratory search, generally on large volumes of data, to promote faster understanding of what a data set contains. Commonly used techniques in this area include tag clouds and advanced natural language processing methods. A tag cloud is a weighted set of related tags or information items, typically displayed so that more frequently used tags appear larger. Flickr was one of the first popular websites to use this method. Users can tag photos with certain words and then use a tag cloud to find photos with something in common.


Figure 2: Tag Cloud Example: Flickr Most Popular Photo Category Tag Cloud [2]
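The weighting behind a tag cloud can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how Flickr builds its clouds: the function name `tag_cloud_weights`, the font-size range and the sample tags are all hypothetical, and linear scaling is only one of several reasonable choices.

```python
from collections import Counter


def tag_cloud_weights(tagged_items, min_size=10, max_size=32):
    """Map each tag to a font size that grows with how often it is used."""
    counts = Counter(tag for tags in tagged_items for tag in tags)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    sizes = {}
    for tag, n in counts.items():
        # Linear scaling; if every tag is equally frequent, use the midpoint.
        scale = (n - lo) / span if span else 0.5
        sizes[tag] = round(min_size + scale * (max_size - min_size))
    return sizes


# Hypothetical tags attached to three photos.
photos = [
    ["beach", "sunset"],
    ["beach", "family"],
    ["beach", "sunset", "vacation"],
]
weights = tag_cloud_weights(photos)
print(weights)
```

The same approach applies to terms extracted from emails or documents: the counts become the weights, and the display layer only has to map weights to font sizes.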

At a very high level, natural language processing converts human language into representations that are easier for computers to manipulate, or vice versa. Certain visualization techniques easily translate the results of language processing into analysis. For example, clustering is a statistical technique that is used to group data with similar characteristics. The cluster diagram in Figure 3 shows the results of a survey regarding consumers’ opinions about a certain brand of alcohol.


Figure 3: Representative Cluster Analysis Example: Alcohol Brand Opinions [3]
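To make the idea of grouping data with similar characteristics concrete, here is a minimal k-means sketch in Python. K-means is one common clustering method, chosen here for illustration; the source does not say which method produced Figure 3. The survey scores, the starting centroids and the function name `k_means` are hypothetical.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(d) / len(members) for d in zip(*members))
    return assignments, centroids


# Hypothetical survey answers: (price perception, taste rating) on a 1-10 scale.
responses = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]
labels, centers = k_means(responses, centroids=[responses[0], responses[3]])
print(labels)
```

Each label identifies the group a respondent falls into, and a cluster diagram plots those groups so that similar respondents appear together.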

Another popular natural language processing technique is sentiment analysis, a broad category of analyses that attempt to determine the perspective or attitude of a person or group of people regarding a topic. The chart in Figure 4 shows blogger sentiment mined for the Apple iPhone versus the LG Voyager, which could be very important feedback for these two companies on their competitive landscape.


Figure 4: Representative Sentiment Analysis Example: Blogosphere Opinion on iPhone versus Voyager [4]
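A very simple form of sentiment analysis counts positive and negative words. The Python sketch below assumes a tiny hand-built word list and invented blog posts about two unnamed phones; it is not the method behind Figure 4, and commercial tools use far richer language models. The names `POSITIVE`, `NEGATIVE`, `sentiment_score` and `average_sentiment` are hypothetical.

```python
import re

# Hypothetical, deliberately tiny sentiment lexicon.
POSITIVE = {"love", "great", "fast", "intuitive"}
NEGATIVE = {"hate", "slow", "clunky", "expensive"}


def sentiment_score(text):
    """Return a score from -1.0 (all negative words) to 1.0 (all positive)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def average_sentiment(posts_by_product):
    """Average the per-post scores for each product that has posts."""
    return {
        product: sum(sentiment_score(p) for p in posts) / len(posts)
        for product, posts in posts_by_product.items()
        if posts
    }


# Invented blog posts, for illustration only.
posts = {
    "Phone A": [
        "I love the screen, it is great",
        "Battery is slow to charge but the interface is intuitive",
    ],
    "Phone B": [
        "Clunky menus and an expensive plan",
        "I hate the keyboard",
    ],
}
scores = average_sentiment(posts)
print(scores)
```

Scores like these can be charted per product over time, which is the kind of comparison a sentiment chart presents.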

Other analytic techniques can combine transactional and other types of information in one display; most of these methods have been used successfully for years in business intelligence applications. For example, you could have a search result set that uses traffic lighting to flag the text items most closely related to a numerical result. Another opportunity might be an X-Y scatter plot that shows each salesperson’s sales on one axis and that salesperson’s volume of communication with customers on the other, revealing whether communication frequency correlates with sales outcomes.
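The relationship a scatter plot shows visually can also be summarized with a correlation coefficient. The sketch below computes Pearson’s r in plain Python; the contact counts and sales figures are invented for illustration, and the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented figures: customer contacts per month and sales (in thousands)
# for four salespeople.
contacts = [10, 20, 30, 40]
sales = [100, 180, 330, 390]
r = pearson(contacts, sales)
print(round(r, 2))
```

A value near 1 suggests that salespeople who communicate more also sell more, although correlation alone does not establish which causes which.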

Next Steps

The examples we just walked through represent a small number of the almost countless opportunities for including and displaying information types beyond transactions in your analytical environment. The good news is that you can work incrementally, using functionality already available in most performance management tools to start experimenting with display techniques while you address the longer-term opportunity of integrating additional types of content into your performance management process and platform.


1.  Source: TDWI; Attivio research and analysis
2.  http://www.flickr.com/photos/tags/
3.  http://en.wikipedia.org/wiki/Cluster_analysis_(in_marketing)
4.  http://iphonevsvoyager.parnassusgroup.com/
