It sounds somewhat counter-intuitive until you realize that in a world of exploding data volumes that need to be analyzed, you have only two choices if you want to maintain a reasonable response time for users: (1) throw lots of hardware at the problem--parallel processing, faster storage, and more--or (2) be a lot cleverer in what you access and when. The first approach is pretty common and, based on recent developments, quite successful. And as we move into solid-state disks (SSDs) and in-memory databases, we'll see even more gains. But let's play with the second option a bit.
How can we minimize access (disk I/O) to the actual data? We can say immediately that the minimum number of times we have to touch the actual data is once! In the case of a data warehouse or mart, that is when we load it. In a traditional row-based RDBMS, that's also when we build any indexes we need to speed access for particular queries or further processing. With column-based databases, we often hear that indexes are no longer needed, or are much reduced--cutting database size, load time and ongoing maintenance costs. And it's certainly true that columnar databases improve query response time. And yet, we might ask (and it applies in the case of row-based databases as well): is there anything else we could do on that single and mandatory access to all the data that could help reduce later data access during analysis?
Infobright's solution is the Knowledge Grid, a set of metadata based on Rough Set theory, generated at load-time and used to limit the range of actual data a query has to retrieve in order to figure out which values match the query conditions. Each 64K-item block of data (a Data Pack) on disk has a set of metadata--maximum and minimum values, sum, count, and so on for numerical items--calculated for it at load-time. At query run-time, these statistics tell the database engine that some data packs are irrelevant because no item in them can meet the query conditions. Other data packs contain only data that meets the query conditions, and if the statistics contain the result needed by the query, the data there need not be accessed either. The remaining data packs contain some data that matches the query and will have to be accessed. Given the right statistics, the amount of disk I/O can be significantly reduced. Infobright also creates metadata for character items at load-time and for joins at query-time.
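To make the idea concrete, here is a minimal Python sketch of that three-way classification for a simple range predicate. The names, thresholds and data are my own illustration, not Infobright's implementation.

```python
from dataclasses import dataclass

@dataclass
class PackStats:
    """Load-time statistics for one block of column values (a hypothetical Data Pack)."""
    min_val: int
    max_val: int
    count: int
    total: int  # sum of the values in the pack

def classify(pack: PackStats, lo: int, hi: int) -> str:
    """Classify a pack against the predicate lo <= value <= hi."""
    if pack.max_val < lo or pack.min_val > hi:
        return "irrelevant"   # no value can match: skip the pack entirely
    if lo <= pack.min_val and pack.max_val <= hi:
        return "relevant"     # every value matches: stats alone may answer COUNT/SUM
    return "suspect"          # some values may match: this pack must be read

# Example: COUNT(*) WHERE value BETWEEN 100 AND 200
packs = [PackStats(0, 50, 65536, 1_600_000),      # irrelevant
         PackStats(120, 180, 65536, 9_800_000),   # relevant: count taken from stats
         PackStats(90, 250, 65536, 11_000_000)]   # suspect: needs disk I/O

answer, to_read = 0, []
for p in packs:
    kind = classify(p, 100, 200)
    if kind == "relevant":
        answer += p.count     # resolved from metadata, no data access
    elif kind == "suspect":
        to_read.append(p)     # only these packs are fetched from disk

print(answer, len(to_read))   # 65536 matched from stats alone, 1 pack left to scan
```

The point of the sketch is simply that the metadata answers as much of the query as it can, so actual I/O is spent only on the "suspect" packs.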
Generalizing from the above, we can begin to imagine other possibilities. What if you didn't load the actual data into the database at all, but just left it where it was and crawled through it to create metadata of a similar nature, allowing irrelevant data for a particular query to be eliminated en masse? Of course, that sounds a bit like the indexing approach used by search engines and extended by Attivio and others to cover relational databases as well. The problem with indexes and similar metadata, though, is that they tend to grow in volume too, until they reach a significant percentage of the actual data size; then we're back to square one.
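As a thought experiment only--this is not any vendor's product--the same pruning idea could be applied to raw files left in place: one crawl records lightweight per-file statistics, and a later query consults only those statistics to decide which files to open at all. A hedged sketch, assuming the data sits in CSV files with a numeric column of interest:

```python
import csv
from pathlib import Path

def crawl(directory: str, column: str) -> dict:
    """One pass over CSV files in place, recording the min/max of one column per file."""
    stats = {}
    for path in Path(directory).glob("*.csv"):
        values = []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                values.append(float(row[column]))
        if values:
            stats[str(path)] = (min(values), max(values))
    return stats

def files_to_scan(stats: dict, lo: float, hi: float) -> list:
    """Return only the files whose value range overlaps [lo, hi]; the rest are skipped."""
    return [path for path, (mn, mx) in stats.items() if mx >= lo and mn <= hi]
```

The open question, of course, is keeping that metadata small and cheap to maintain as the underlying files grow and change.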
My mathematical skills are far too rusty (if they were ever bright and shiny enough in the first place) to know whether Rough Set theory has anything to say about that issue, or how it could be applied beyond the way Infobright has implemented it, but it does seem like an interesting area for exploration as data volumes continue to explode. Any bright PhDs out there like to give it a try?
Posted July 29, 2010 2:03 PM