Business Intelligence Network business intelligence resources

Blog: Dan E. Linstedt


Data Mining and the Active Data Warehouse

Where is data mining these days? What power can it bring to the table? If I build an Active Data Warehouse, will it come? Is it necessary?

There are many questions floating around these days, and I've written a little bit about this topic in the past. In this post I will discuss some of the newer thoughts on the subject, and push the envelope a little further than we may be comfortable with. This entry is a thought experiment, but it has implications in today's computing arena.

Data Mining has grown up in recent years. It's been around for a long time, but it has become much more "usable" in business users' eyes, and it's beginning to appear as embedded technology. It's now plugged into Teradata RDBMS, Oracle RDBMS, SQL Server 2005 Integration Services and RDBMS, DB2 UDB RDBMS, FirstLogic IQ Suite, SAS ETL and BI tools, and so on - there are too many to list here. The point is that it is beginning to be used to enhance the quality of information.

"According to the example of Baosteel production, this paper introduces the way of using data mining technology -- SAS/EM to discover the rules that we don’t know before and it can improve the quality of products and decrease the cost." (1)

Data mining is not just about data quality; it's also about business process quality, a deeper understanding of our environment, and the quality of our products. In this company's case they concluded that "...How to use data is an important thing that faces everyone. We should apply the data mining technology to more fields." (1)

I would tend to agree. Additional fields (in my mind) include mining active data as it arrives - in context with the strategic data for which the miner has already "learned" or established a knowledge pattern. Other areas may include mining the architecture in which the data sits (i.e., the data model), mining the processes that link the data together to look for flaws or better ways to deal with it, mining the metadata around the data set to establish additional context, and so on.

"In this paper we introduce data quality mining (DQM) as a new and promising data mining approach from the academic and the business point of view. The goal of DQM is to employ data mining methods in order to detect, quantify, explain and correct data quality deficiencies in very large databases. Data quality is crucial for many applications of knowledge discovery in databases (KDD). So a typical application scenario for DQM is to support KDD projects, especially during the initial phases. Moreover, improving data quality is also a burning issue in many areas outside KDD. That is, DQM opens new and promising application fields for data mining methods outside the field of pure data analysis. To give a first impression of a concrete DQM approach, we describe how to employ association rules for the purpose of DQM." (2)

Active Data Warehousing is about integrating the ODS and Data Warehouse into a single instance, a single data store. It's about capturing data as it happens (at the right time), loading it into the warehouse as a statement of fact, and then leveraging that data to make both strategic and tactical decisions in time with the enterprise. Active Data Warehousing also brings in massive sets of information to deal with, making the task doubly difficult. And of course, with an Active Warehouse we also need to make use of data arriving in real time.

One notion I've believed in is something I call Active Mining. Active Mining is the ability to start a neural net, pre-load it with the historical data, and then as data arrives (when it arrives), add it to the neural net already in play. In other words - no waiting, no "re-running" of the mining algorithms to get the result. Of course, in the beginning (depending on how much history is mined at startup), the neural net may need to be shut down and restarted - but as time goes on, less and less correction is necessary.
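As a minimal sketch of this idea, assuming a single logistic neuron as a stand-in for a full neural net and a synthetic record generator in place of real warehouse feeds, the pre-load and the record-by-record update might look like the following. The names `OnlineNeuron` and `make_record` are hypothetical and exist only for illustration.

```python
import math
import random


class OnlineNeuron:
    """A single logistic neuron updated one record at a time (stand-in for a neural net)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Return the estimated probability that the record belongs to class 1."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, features, label):
        """Apply one stochastic gradient step for a single (features, label) record."""
        error = self.predict(features) - label
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        self.bias -= self.learning_rate * error


def make_record(rng):
    """Hypothetical record: the label is 1 when the two features sum to more than 1."""
    features = [rng.random(), rng.random()]
    return features, 1 if features[0] + features[1] > 1.0 else 0


rng = random.Random(42)
model = OnlineNeuron(n_features=2)

# Pre-load: several passes over the historical data.
history = [make_record(rng) for _ in range(2000)]
for _ in range(5):
    for features, label in history:
        model.learn(features, label)

# Active phase: score each arriving record, then fold it into the model
# already in play, with no shutdown and no re-run over the history.
arrivals = [make_record(rng) for _ in range(500)]
hits = 0
for features, label in arrivals:
    hits += int((model.predict(features) >= 0.5) == bool(label))
    model.learn(features, label)

accuracy = hits / len(arrivals)
print(f"accuracy on arriving records: {accuracy:.2f}")
```

Each arriving record is scored first and learned second, so the reported accuracy reflects how well the running model fits data it has not yet seen.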

I believe that active mining will come to the forefront and will be embedded in every process, running through the streams of data that we deal with on a daily basis. That's not to say there's no value in storing the existing level of detail as a statement of fact in the warehouse; there certainly is. But moving forward, dynamically understanding how well the new data fits may become a critical factor in business operations.

Speaking of business operations, there is a company called Purple Insight which has (in my opinion) begun to master the ability to tie data mining and its results to visualization. Using Active Mining to feed a live visualization of the data may also begin to play a powerful role in the future "use" of our information sets.

References:
1. Data Mining Quality Improvement - http://www2.sas.com/proceedings/sugi27/p111-27.pdf
2. Data Quality Mining - http://www.cs.cornell.edu/johannes/papers/dmkd2001-papers/p5_hipp.pdf

  Posted by Dan Linstedt on December 2, 2005 6:03 AM

Comments

I agree with your view that active mining will play an important role in business processes. I would add that, in general, for the type of system you've described, scoring data on the fly is probably more important than continuously or incrementally rebuilding models. What has become more popular is a champion-challenger approach to model management, where the performance of new models (challengers), built against recent data, is compared against the current best (champion) model. If a challenger model performs better, it becomes the new champion. On the scoring front, the integration of data mining in the database kernel allows great flexibility, power, and response time. This is illustrated in this example using Oracle's SQL PREDICTION operator (http://oracledmt.blogspot.com/2006/01/analytics-in-oracle-database_05.html).
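The champion-challenger comparison described in this comment can be sketched roughly as follows. This is an illustration only: it assumes models are plain callables, uses accuracy on recent records as the performance measure, and lets the champion keep its place on ties. The function names, threshold models, and sample records are all hypothetical.

```python
def accuracy(model, records):
    """Fraction of (features, label) records the model classifies correctly."""
    correct = sum(1 for features, label in records if model(features) == label)
    return correct / len(records)


def select_champion(champion, challengers, recent_records):
    """Return (model, score) for the best performer on recent data.

    The current champion is replaced only by a strictly better challenger.
    """
    best_model, best_score = champion, accuracy(champion, recent_records)
    for challenger in challengers:
        score = accuracy(challenger, recent_records)
        if score > best_score:
            best_model, best_score = challenger, score
    return best_model, best_score


# Hypothetical models: each maps a transaction amount to 1 (flag) or 0 (pass).
def champion_model(amount):
    return 1 if amount > 1000 else 0


def challenger_model(amount):
    return 1 if amount > 500 else 0


# Hypothetical recent records as (amount, true_label) pairs.
recent = [(200, 0), (450, 0), (700, 1), (900, 1), (1200, 1), (300, 0)]

winner, score = select_champion(champion_model, [challenger_model], recent)
print(winner.__name__, score)
```

In a real deployment the comparison would typically run on a held-out window of recent data, and the winning model would then be used for on-the-fly scoring.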
