As you know, Temperature of Data is one of the next "big things" to come from RDBMS engine vendors. In this blog entry we will discuss what temperature of data is and how it ties to DW 2.0. The underlying business problem is that data sets continue to grow and grow, particularly as enterprises come under more scrutiny for compliance and auditing. In fact, I read yesterday (no surprise) that all email trails are subject to subpoena in a court of law, which means that email alone deserves its own compliant data warehouse. But that's a topic for another day.

Temperature of Data, what is it?
In the next generation of database engines, data can be HOT, Medium, Luke-Warm, or Cold. For lack of a better definition, the database engines are working toward the following:

HOT = data that is accessed all the time, or extremely important data requiring sub-second response times. Hot data must reside in RAM continuously, so the more HOT data you have, the larger the RAM requirement - or so one thinks. In-RAM database engines have recently been acquired by most of the leading RDBMS vendors; some are rolling their own. In-RAM engines either require lots of RAM, or require complete compression of every column without sorting, which can be done through indirect access to a hash table of existing values. The problem is that as user requests grow across the business, more and more data becomes "hot", requiring additional RAM to keep it in memory. The only thing that can answer this call is nanotech memory (which I will blog on shortly).
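To make that hash-table compression idea concrete, here is a minimal sketch of my own, in Python - the vendors' actual in-RAM engines are proprietary, so take this as the general technique only. Each distinct value is stored once in a hash table, and the column itself shrinks to an array of small integer codes:

```python
# Minimal sketch of dictionary (hash-table) encoding for an in-RAM column.
# Illustrates the general technique only, not any vendor's engine.

def encode_column(values):
    """Replace each value with a small integer code; store each distinct value once."""
    dictionary = {}          # value -> code (the hash table of existing values)
    codes = []               # the compressed column: one small int per row
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)   # assign the next code; no sorting needed
        codes.append(dictionary[v])
    # reverse lookup table for decoding: code -> value
    decode = [None] * len(dictionary)
    for value, code in dictionary.items():
        decode[code] = value
    return codes, decode

def read_row(codes, decode, row):
    """Indirect access: fetch the row's code, then the value it points to."""
    return decode[codes[row]]

# A low-cardinality column compresses well: 8 rows, only 3 distinct values kept in RAM.
codes, decode = encode_column(["CO", "VT", "CO", "CO", "NY", "VT", "CO", "NY"])
print(codes)                        # [0, 1, 0, 0, 2, 1, 0, 2]
print(read_row(codes, decode, 4))   # NY
```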

HOT Data in DW2.0 is defined as Interactive or Integrated, depending on the need. In DW 2.0 the HOT data must include not only the current transactional data sets, but some level of context as well - which usually means master data sets, pointers to metadata that describe textual data sets, and transactional detail data sets (or, at minimum, descriptive detail data sets).

Medium Data = data that is accessed most of the time, but where response times can be anywhere from 1 second to 10 or 15 seconds. In this category fall aggregation analysis, strategic analysis, and requests for some levels of master data. This is certainly NOT a desirable tier for transactional data, particularly for detection of fraud; fraud detection belongs in the HOT data section. Medium data is usually stored on slower storage, including internal disk, super-fast SAN, or high-speed controller-attached disk - something with lots of I/O channels, high parallelism, and incredibly fast access times.

Medium Data in DW 2.0 is defined to include some master data (that which is not HOT), partial textual data, and partial descriptive data. It may also include partial aggregation layers for "current" snapshots (covering the past 10 minutes and refreshed every 10 minutes) so that information action time is significantly reduced; a rough sketch of such a snapshot layer follows. Medium data sits smack in the middle of the integrated data sets.
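Here is that sketch - a toy "current snapshot" aggregate over a 10-minute window. The table names, columns, and the in-memory SQLite database are my own assumptions for illustration; a real implementation would live inside the RDBMS or an ETL scheduler:

```python
# Sketch of a "current snapshot" aggregation layer on a 10-minute refresh cycle.
# Schema and names are hypothetical; SQLite is used purely so the sketch runs.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_ts REAL, region TEXT, amount REAL);
    CREATE TABLE sales_snapshot (region TEXT PRIMARY KEY, total REAL, refreshed_ts REAL);
""")

def refresh_snapshot(window_seconds=600):
    """Rebuild the snapshot from only the last 10 minutes of transactions."""
    now = time.time()
    conn.execute("DELETE FROM sales_snapshot")
    conn.execute("""
        INSERT INTO sales_snapshot
        SELECT region, SUM(amount), ?
        FROM sales
        WHERE sale_ts >= ?
        GROUP BY region
    """, (now, now - window_seconds))
    conn.commit()

# Queries against sales_snapshot answer in "Medium" time without scanning the
# full transactional table; a scheduler would call refresh_snapshot() every
# 10 minutes to keep information action time low.
refresh_snapshot()
```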

Luke-Warm Data = data that is accessed rarely, where rarely may mean (for example) once every 30 minutes or twice every 4 hours. Access to data like this is usually a SCAN. In other words, luke-warm data is accessed for other reasons - to be aggregated (mostly), or to be mined, cleansed, or updated. Luke-warm data will be tossed out of RAM as soon as the access is complete, as the sketch below illustrates.
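A minimal sketch of that "scan it, then toss it" behavior, using a toy buffer pool of my own invention - real engines do this with scan-resistant buffer replacement policies, not Python dictionaries:

```python
# Toy buffer pool: hot reads are cached (LRU), luke-warm scans stream through
# and are released as soon as the scan is complete.
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.pages = OrderedDict()            # page_id -> data, in LRU order

    def read(self, page_id):
        """Normal (hot/medium) read: keep the page cached, most recently used."""
        if page_id in self.pages:
            data = self.pages.pop(page_id)
        else:
            data = self._fetch_from_disk(page_id)
        self.pages[page_id] = data            # promote to most recently used
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict the least recently used
        return data

    def scan(self, page_ids):
        """Luke-warm access: stream pages through RAM without caching them."""
        for page_id in page_ids:
            if page_id in self.pages:
                yield self.pages[page_id]     # reuse a copy already resident
            else:
                yield self._fetch_from_disk(page_id)  # read, use, discard

    def _fetch_from_disk(self, page_id):
        return f"data-for-page-{page_id}"     # stand-in for real I/O
```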

In DW2.0, luke-warm data sits on slower external disk, possibly over the WAN, or on near-line storage like WAN drives, DASD, or slow SAN drives. The data sets in luke-warm storage are defined for use in contextual queries, strategic long-term queries, or when users are digging for answers overnight.

Cold Data = historical context that is hardly ever accessed, but that, when requested, must have a response time of a couple of minutes. Cold data is typically used by auditors once or twice a year, or at most once or twice a month. Cold data sits on archival storage, like CD drives, WORM drives, and slow tape libraries.

Cold Data in DW 2.0 is considered Archival Data; cold data sits out of play until requested. It might contain the full textual reference data (from emails, Word docs, full images); it might contain full descriptive data like unused addresses, or historical references to data that was "relevant two years ago". Cold data is inactive for the most part, but when needed it can be brought onto near-line or integrated storage temporarily (and automatically) to handle the request.
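That temporary, automatic recall might look something like the following sketch - the directory paths and helper functions are hypothetical stand-ins for a real archive interface (tape library, WORM mount, and so on):

```python
# Sketch of automatic recall of Cold data: stage archival data to near-line
# storage on demand, then release it. Paths and helpers are hypothetical.
import os
import shutil

ARCHIVE_DIR = "/archive"      # hypothetical slow archival storage (tape/WORM mount)
NEARLINE_DIR = "/nearline"    # hypothetical faster near-line staging area

def fetch_cold(table_file):
    """Serve a request for cold data, staging it to near-line storage first."""
    staged = os.path.join(NEARLINE_DIR, table_file)
    if not os.path.exists(staged):
        # Recall from archive only when actually requested; this copy is the
        # "couple of minutes" response-time cost described above.
        shutil.copy(os.path.join(ARCHIVE_DIR, table_file), staged)
    return staged             # queries now run against the near-line copy

def evict_cold(table_file):
    """After the audit request is satisfied, release the near-line copy."""
    staged = os.path.join(NEARLINE_DIR, table_file)
    if os.path.exists(staged):
        os.remove(staged)
```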

There is no fine line between any of these temperatures, and for the most part the boundaries will vary from corporation to corporation until the definitions are nailed down across best practices for active data warehousing and real-time data integration. What is clear is that METADATA MUST be defined in order to "classify" data components within the RDBMS engines, that the thresholds must be defined within the metadata as business rules, and that these rules MUST be dynamic no matter which RDBMS engine you are using.
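As a rough illustration of metadata-driven classification (my own sketch - the threshold values and the access-rate attribute are hypothetical), note how the thresholds live as data rather than code, so they can be reset at run time:

```python
# Sketch of temperature classification driven by metadata. The thresholds live
# in a plain dictionary (standing in for a metadata repository) so a tool - or
# a technical business user - can reset them dynamically without touching code.

thresholds = {                       # accesses per hour; values are hypothetical
    "HOT": 60.0,                     # accessed roughly every minute or more
    "Medium": 6.0,                   # several times an hour
    "Luke-Warm": 0.5,                # every couple of hours
}                                    # anything colder is, well, Cold

def classify(accesses_per_hour, rules=thresholds):
    """Return the temperature band for a data component's observed access rate."""
    for band in ("HOT", "Medium", "Luke-Warm"):
        if accesses_per_hour >= rules[band]:
            return band
    return "Cold"

# The business rules are dynamic: an EII-style tool could write new thresholds
# back to the metadata, and the very next classification pass picks them up.
print(classify(120.0))      # HOT
thresholds["HOT"] = 200.0   # a business user tightens the HOT definition
print(classify(120.0))      # now Medium
```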

Furthermore, the METADATA must be available to BI tool sets and to technical business users. With the help of IT, the technical business users can set and re-set the definitions of HOT, Medium, Luke-Warm, and COLD. This means that a tool like EII is primed to take over this space - it can manage the metadata, interface with business users, and write the thresholds that define where data sits back to the RDBMS technology. The RDBMS engines must then deal with where to put the data and how it fits.

One thing is clear: as we march forward, our data sets will only grow, not shrink. Something to take note of: what exactly is "garbage data", and what does it mean? Can you identify it and remove it from your systems without impacting audits? If you clear it out, are you removing the possibility of tying together or discovering a meaningful relationship across your business that you didn't have before? If you have garbage data, does it mean your business is hemorrhaging money? YES... more on these questions in another blog.

What does this mean to business?
It means higher costs and more hardware - hopefully your organization is already on the road to consolidation, which will help. Hopefully you've already embarked on an SOA initiative, and hopefully you've started your master metadata and Master Data Management initiatives as well - these are all precursors to using this technology effectively.

Questions? Thoughts? I'd love to hear from you.

Cheers,
Dan Linstedt
CTO, Myers-Holum, Inc.
http://www.MyersHolum.com


Posted August 20, 2006 5:12 AM

1 Comment

That's a timely posting! I just got an email last week where I saw these terms for the first time. It talked about high-priority (HOT) and low-priority (COLD) resource pools on a database. Your entry expands on that; I like the idea of dividing HOT and COLD not just on the timeliness of the data but on the type of content as well.
