
Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for Masters students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best, Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

August 2006 Archives

I just blogged on the need for allowing dirty data to flow through to an auditable reporting area. There are a lot of questions about WHERE the dirty data should reside and where it should be cleaned up. In this blog entry we'll dive in and take a look at that in a short, consolidated view. If you are fighting with compliance and auditability at a systems or data level, hopefully this will be helpful.

In Nirvana land, there would never be dirty data. Our source systems would be perfect and would capture only clean data. And if the data wasn't clean, they would go to great lengths to get it that way before releasing it to the rest of the corporation. Of course, in Nirvana land we would have a SINGLE STACK information system: one operational system, one data store holding both the current/transactional and the historical/warehouse-like data.

In Nirvana land, the single system would integrate with web services external to our company. In Nirvana land, this "single system" would really be the be-all, end-all, dare I say it: an Appliance / Platform, pre-configured with all that you need, everything from DSS to source-system OLTP and edit checks to information quality, reporting, and dashboards. In Nirvana land, it would all be... well... nirvana.

But alas, this is not the case. In the real world, the data warehousing industry has for years taught, or preached, that data quality should sit between the source system and whatever the next "step" of integration is: an ODS, an EDW, an ADW, a Kimball bus staging area. From a strategic point of view, this may work nicely, as long as whatever tools are being used to change and alter the data produce the correct audit trail.

However, from a tactical or operational point of view, or from a point of view where we are trying our best to "catch" broken source systems, it doesn't work so well. Cleansing the data, or using data quality to produce a "single version of the truth," can have a negative impact on our ability to be accountable (as business users) and actually get the source systems fixed. That said: where, oh where, do we put the IQ systems?

Between Sources and ODS?
Between Sources and Data Warehouses?
Between ODS and Data Warehouses?
Between Data Warehouses and Data Marts?

And where, oh where, do we put the "bad" records? Do we kick them back to the source system? Do we report on them separately for end users to clean up? Do we leave them in a daily-snapshot set of error tables? Do we let them through our ODS and data warehouse environments to end up in separate data mart areas?

Remember that TRUTH is subjective, and no one can tell whose truth it is, or when. Remember also that auditors have a very different definition of truth: they try to take an objective view of it, one that is traceable and auditable.

Putting DQ or IQ between the sources and the ODS allows Master Data to be built within the ODS. The caveats are that the bad data, the pre-altered data trails, and any audit trails produced by the integration or "data alteration" stream MUST be warehoused somewhere over time. Simply "improving the quality of the data" without showing the errors to the business causes all kinds of heartburn downstream. Kicking bad records back to the source system is fine (for a while), until more code must be hand-written on the source system to take it as a new feed, which drives maintenance costs up through the roof. Having quality, cleansed data in the ODS allows great interaction from a strategic standpoint, and possibly allows transactional feeds back to the source systems; ultimately, though, it greatly increases the complexity of the solution that is handling "BAD DATA" upstream.

Putting DQ or IQ between the sources and the warehouse can (and often does) cause auditability problems. Not many of these ETL or ELT tool sets have the capacity to preserve data "as of before it was changed/aggregated/filtered"; that must be architected, which increases the difficulty of the load routines, not to mention slows down the overall load cycles. Furthermore, bad data must be captured in "error tables," which usually live in the staging area. Sometimes these are only daily refresh: no analytics can be run on these data sets, because they disappear too fast. As we move toward real-time or near-real-time processing, there is NO TIME to process, integrate, and aggregate the data (either between sources and ODS, or between sources and your DW). Real-time dictates that when the data is available, good, bad, or indifferent, it be loaded to your warehouse AS IS, with no delays; otherwise the data decays and becomes useless, or the "attempt to detect fraud" is cleansed away, cleaned up and never detected.

Putting DQ or IQ between the Warehouse and the Data Marts is probably (in my opinion) the best way to handle this. The Data Warehouse then becomes a statement of integrated facts (integrated by business key), and the marts then become specific versions of the truth from which management can begin to dissect where they are losing money.
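To make the placement concrete, here is a minimal sketch in Python of the approach above. All table names, rules, and fields are illustrative assumptions, not part of any specific product: the warehouse lands every record AS IS with audit columns, and cleansing rules are applied only on the way out to a mart, so the dirty records stay queryable.

```python
# Sketch: warehouse keeps raw facts untouched; cleansing happens
# between the warehouse and the mart, with failed rows preserved.
from datetime import datetime, timezone

warehouse = []   # statement of integrated facts: raw records, never altered

def load_to_warehouse(record, source):
    # Land the record exactly as received, with audit columns.
    warehouse.append({
        "data": dict(record),
        "source": source,
        "load_ts": datetime.now(timezone.utc),
    })

def build_mart(rules):
    # Apply "today's" cleansing rules downstream; the raw facts survive.
    mart, errors = [], []
    for row in warehouse:
        cleaned, ok = dict(row["data"]), True
        for rule in rules:
            cleaned, ok = rule(cleaned)
            if not ok:
                errors.append(row)   # still available for pattern analysis
                break
        else:
            mart.append(cleaned)
    return mart, errors

def non_negative_amount(rec):
    # Illustrative business rule: amounts must not be negative.
    return rec, rec.get("amount", 0) >= 0

load_to_warehouse({"order_id": 1, "amount": 100}, "OLTP-A")
load_to_warehouse({"order_id": 2, "amount": -5}, "OLTP-A")  # dirty, kept anyway
mart, errors = build_mart([non_negative_amount])
```

The key design point is that `build_mart` can be re-run with different rules at any time, because the warehouse never discarded or altered the original records.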

This all applies when loading "today's" data; in fact, cleansing with audit trails before reaching either the ODS or the DW applies when loading "today's" data as well. What doesn't fly is the data model: once the data model has been constructed for "cleansed" data, it very rarely allows broken data (or error data) in. In other words, loading historical data wreaks havoc on the data model, or the dirty historical data is cleansed, and through that process the broken business processes and broken business rules are washed away as well.

Now, is there value in cleaning up the source systems? You bet; this is the ultimate goal. Nirvana. Is there value in having cleansed data in your ODS? Absolutely! You get the idea: there's definite value in having cleansed, quality-checked data, but there's also value in running analytics on the dirty data. When did it break the business rule (back in history), and why? What was the pattern? Then the real work begins: what caused the pattern? Is the pattern still happening today? Is the business losing money, or opportunities with customers, because the source system is still broken? And so on.

So next time you think about where to PUT your DQ/IQ solution, consider the impact of NOT looking at the dirty data, and ask the question: instead of kicking the data back to the source system, is there somewhere we can warehouse this information to examine its patterns? I think you'll be surprised at all the business accountability you'll uncover.

Love to hear your thoughts on this.

Dan Linstedt
Myers-Holum, Inc

Posted August 29, 2006 6:37 AM

In my most recent blog entry I discuss temperature related data sets, near the bottom I bring up a lot of questions about large scale data sets and dirty data. Let's pick up where I left off...

One thing is clear: as we march forward, our data sets will only grow, not shrink. Something to take note of: what exactly is "garbage data," and what does it mean? Can you identify it and remove it from your systems without impact to audits? If you clear it out, are you removing the possibility of tying together or discovering a meaningful relationship across your business that you didn't have before? If you have garbage data, does it mean your business is hemorrhaging money?

Data storage is growing, and our requirements to "keep it all" are growing, not shrinking. But is it justified?
In two words: compliance and auditability. With those, we can justify keeping nearly ALL data in the enterprise integrated data store (it's no longer just a data warehouse). But is that enough to keep us out of trouble?

Maybe. Let's take a look at some of the other things causing our data sets to grow:
* Incorporation of Unstructured data, along with pointers to that unstructured data
* Incorporation of data mining activities into our daily lives (mining is no longer for the passive business looking to discover something about itself; mining is now a daily or even near-real-time activity to detect and prevent fraud and abuse).
* Consolidation of massive global enterprise data collection efforts - for a variety of reasons, including safety, redundancy, and consistency.
* Re-introduction of Master Data sets and Master Data Management (the idea was around 30 years ago, but hardware kept consolidation at arm's length).
* Collection and attachment of Master Metadata to tie meaning and definition to all the data and processes within the Integrated Data Store
* The realization that "DIRTY DATA" can tell us as much about how we do business wrong as "CLEAN DATA" can tell us about how we do business right.

That's right, folks: dirty data or garbage data doesn't necessarily mean it can be washed away, tossed aside, or removed from the system anymore. A fully integrated view of the enterprise means: I have a view which tells me what I'm doing WRONG as much as it tells me what I'm doing RIGHT. If we are ever to get to a point where we can consider DATA AS AN ASSET, or undergo asset evaluation, we must be able to quantify and qualify the following:

1. What is generating dirty data?
2. Where is it coming from?
3. Is it (the data) consistently repeatable? (Patterns are the same)
4. Are there patterns in the dirty data that lead to FRAUD??
5. Does the dirty data lead to BROKEN BUSINESS PROCESSES?
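A sketch of how the first three questions might be quantified in practice. The error log, source names, and rule names here are hypothetical; the point is simply that grouping error records by source and broken rule exposes which patterns repeat:

```python
# Sketch: profile dirty-data records to answer "what is generating it,
# where is it coming from, and is it consistently repeatable?"
from collections import Counter

# Hypothetical error log captured alongside the warehouse loads.
error_log = [
    {"source": "billing", "rule": "missing_zip"},
    {"source": "billing", "rule": "missing_zip"},
    {"source": "crm",     "rule": "bad_date"},
    {"source": "billing", "rule": "missing_zip"},
]

# Questions 1 & 2: what is generating dirty data, and where from?
by_source_rule = Counter((e["source"], e["rule"]) for e in error_log)

# Question 3: is the pattern consistently repeatable?
repeatable = {key: n for key, n in by_source_rule.items() if n > 1}
```

Feeding `repeatable` back to the business owners of each source is where the accountability discussed below begins.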

OK, and the KICKER: why must I (the business owner) be accountable?
Well, maybe you want to be accountable, and the business has tied your hands. Maybe you need to be accountable for compliance reasons, maybe you want to meet your metrics on the BAM side of the house, maybe you are looking to improve the rate of return on business processing, maybe you want to lower overhead costs and deliver higher-quality products to the market faster...

The dirty data tells the story. Mining dirty data and understanding its existence over time is just as important as "consolidating, collecting, and producing" high-quality, cleansed data for the business (data that can be used to run the business in day-to-day operations). Dirty data can point out where the business is hemorrhaging money, or perhaps is suffering losses due to unseen fraud.

The moral of the story?
Just because it's dirty doesn't mean it should be thrown out, deleted, or removed from the integrated data store. It should be understood, and business users should be held accountable for the production of the dirty data in the first place. They own the business; therefore they own the dirty data and the processes that produced it. I estimate that if a business undertook answering the questions above in a single project, it could cut operating costs by at least 10% within 3 to 6 months, and that's being conservative.

When dirty data begins to get cleaned up at the source system, at the business processes, and by business users, morale usually improves, excitement improves, new projects are granted, and productivity gains are seen. WHY? Because business users finally feel empowered to take charge and make the system work the way it always should have. Business users see an opportunity to spend less time worrying about getting around dirty data, and more time using "correct" data downstream. They finally feel justified in fixing 25-year-old systems. I was there; I was in an environment where this happened. For the first time in 15 years, we saw more change requests to the source system than had ever been issued in the past.

As the data cycled back through the organization and improved in quality, management took notice of the newly "agreeing" numbers coming through the business reports from the enterprise data integration store. They applauded the efforts, and overhauled business processes from the top down. It was refreshing.

You said that dirty data causes hemorrhaging of money... how does this work?
OK: hidden costs. For example, let's take a look at all the work to put Humpty Dumpty back together again (assuming he's the dirty data)...
1. The source system captures dirty data and tries to reconcile it: it makes changes to the data set (and if it's legacy, these are 15-year-old business rules), breaks compliance, and adjusts the keys so that the record is force-fed into the system to match some "record" the logic says it matches.
2. It then pushes this modified record and its assumed parent downstream, possibly to multiple systems, infecting (like a virus, if you will) all systems that come in contact with this dirty data.
3. On the business side, the processors run their daily (operational) reports, pull the dirty data, and now must spend time looking up the right information from paper files (yes, paper files), or, in some cases of young or small organizations, they scratch their heads and lose the data because they can't do anything with it. Maybe they spend some time trying to correct it manually, everything from making phone calls to sending emails to the client.
4. Finally the business gets a handle on the dirty data and annotates the corrections. WHERE? In a Word doc, or in unstructured emails, but NEVER back in the source system where the data originated. Now the data is spread out like a flashlight beam: the further from the source, the dirtier it gets.
5. When someone tries to build the deliverable, or assemble the services that were ordered, they often deliver the wrong thing, don't meet specs, and so on, causing the customer to come back upset, or come back multiple times to get the error corrected.
6. The dirty data is then pulled into the enterprise warehouse, or enterprise integration store. From here it gets tagged with "today's version of the truth" error-processing rules, and often ends up getting kicked out. Now the business users need to spend more time going back over the error reports, and ALL the emails they had sent to correct the problem in the first place; then, and only then, if they feel this is a repeat customer, or if the money from that customer is great enough, will they correct it in the source system.
7. The rest of the integration stream tries its best to further consolidate dirty data through artificial-intelligence rules and statistical rules to get it into conformed dimensions and altered facts. By now the beam of light has been spread out so far that it is only a dim light on the wall in a very dark room.

This cycle starts all over again when new dirty data is entered, costing the business more time and more money (on an exponential curve). It gets tougher and tougher to fix as business rules change. IF A BUSINESS WANTS OR NEEDS TO BE NIMBLE, it must fix the SOURCE of these problems, but in order to do that, it must undertake a dirty data expedition (project). My good friend Larry English speaks of these things in his TQM courses.

So, in your organization, where does dirty data get put? Out with the garbage? Or is it addressed the way it should be, as an organizational or business process problem?

Stop hemorrhaging money! Learn how to capture the dirty data and get it fixed now, so that over time there is less and less dirty data. Don't just assume that the haystack is all bad and that the data is throwaway; that assumption will lead to bad business decisions in the future.

My firm has a set of industry best practices that we are currently employing at large government, financial, and travel agencies where we can show the return on investment for cleaning up dirty data.

Thoughts? Comments?
Dan Linstedt
CTO, Myers-Holum, Inc

Posted August 20, 2006 5:48 AM

As you know, temperature of data is one of the next "big things" to come from RDBMS engine vendors. In this blog entry we will discuss temperature of data: what it is, and how it ties to DW 2.0. The initial business problem is that data sets are continuing to grow and grow, particularly as enterprises come under more scrutiny for compliance and auditing. In fact, I read yesterday (no surprise) that all email trails are capable of being placed under subpoena in a court of law; this means that email alone deserves its own compliant data warehouse. But that's for another day.

Temperature of Data, what is it?
In the next generation of database engines, data can be HOT, Medium, Luke-Warm, or Cold. For lack of a better definition, the database engines are working on the following:

HOT = data accessed all the time, or extremely important data requiring sub-second response times. Hot data must reside in RAM continuously; the more HOT data, the larger the RAM requirements, or so one thinks. In-RAM database engines have recently been acquired by most of the leading RDBMS technologies; some vendors are rolling their own. In-RAM RDBMS engines require lots of RAM, or require complete compression across every column without sorting, which can be done through indirect access to a hash table of existing values. The problem is: as user requests grow across the business, more and more data becomes "hot," requiring additional RAM resources to keep it in memory. The only thing that can answer this call is nanotech memory (which I will blog on shortly).
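The "hash table of existing values" mentioned above is essentially dictionary encoding. A toy sketch (column contents are illustrative; real engines work on packed binary, not Python lists): each distinct value is stored once in a dictionary, and the column itself holds small integer codes.

```python
# Toy sketch of column compression via a dictionary of existing values:
# the column stores integer codes; the dictionary stores each distinct
# value exactly once.

def dict_encode(column):
    codes, dictionary, index = [], [], {}
    for value in column:
        if value not in index:           # first time we see this value
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])       # store the small code, not the value
    return codes, dictionary

def dict_decode(codes, dictionary):
    # Indirect access: look each code back up in the dictionary.
    return [dictionary[c] for c in codes]

column = ["OPEN", "OPEN", "CLOSED", "OPEN", "CLOSED"]
codes, dictionary = dict_encode(column)
```

For low-cardinality columns this trades a cheap indirection on access for a large reduction in the RAM the column occupies, which is exactly the pressure the paragraph above describes.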

HOT Data in DW2.0 is defined as Interactive or Integrated, depending on the need. In DW 2.0 the HOT data must include not only current transactional data sets, but some level of context as well - which usually means Master Data sets, Pointers to Metadata that describe Textual Data Sets, and Transactional Detail Data Sets, or minimum descriptive detail data sets.

Medium Data = data accessed most of the time, but where response times can be anywhere from 1 second to 10 or 15 seconds. In this category, aggregation analysis, strategic analysis, and some levels of master data can be requested. Certainly this is NOT a desirable area for transactional data, particularly for detection of fraud; detection of fraud should be in the HOT data section. Medium data is usually stored on slower storage, including internal disk, super-fast SAN, or high-speed directly attached disk: something with lots of I/O channels, high parallelism, and incredibly fast access times.

Medium data in DW 2.0 is defined to include some master data (that which is not HOT), partial textual data, and partial descriptive data. It may also include partial aggregation layers for "current" snapshots (within the past 10 minutes, refreshed every 10 minutes) so that information action time is significantly reduced. Medium data sits smack in the middle of the integrated data sets.

Luke-Warm Data = data that is accessed rarely, where "rarely" may be (for example) once every 30 minutes or twice every 4 hours. Data like this is usually SCANNED in nature. In other words, luke-warm data is accessed for other reasons: to be aggregated (mostly), or to be mined, cleansed, or updated. Luke-warm data will be tossed out of RAM as soon as the access is complete.

In DW 2.0, luke-warm data sits on slower external disk, possibly over the WAN, or on near-line storage like WAN drives, DASD, or slow SAN drives. The data sets in luke-warm data are defined for use in contextual queries, strategic long-term queries, or when users are digging for answers overnight.

Cold Data = historical context that is hardly ever accessed, but when requested must have a response time on the order of a couple of minutes. Cold data is typically used by auditors once or twice a year, or even once or twice a month. Cold data sits on archival storage, like CD drives, WORM drives, and slow tape libraries.

Cold Data in DW 2.0 is considered Archival Data; Cold data sits out of play until requested. It might contain the full textual reference data (from emails, word docs, full images), it might contain full descriptive data like unused addresses, or historical references from data that was "relevant two years ago". Cold Data is inactive for the most part, but when needed - can be brought on to near-line or integrated storage temporarily (and automatically) to handle the request.

There is no fine line between any of these types of data, and for the most part it will vary from corporation to corporation until the definitions are nailed down across best practices of active data warehousing and real-time data integration. What is clear is that METADATA MUST be defined in order to "classify" data components within the RDBMS engines; the thresholds that are set must be defined within the metadata as business rules, and these rules MUST be dynamic no matter which RDBMS engine you are using.
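A minimal sketch of what metadata-driven temperature classification could look like. The thresholds here are illustrative assumptions, not definitions from any vendor; the point is that the rules live in a metadata table that business users could reset without touching the engine:

```python
# Sketch: temperature thresholds defined as dynamic metadata rules.
# Each rule: (temperature, minimum accesses per hour to qualify,
# maximum acceptable response time in seconds).
metadata_rules = [
    ("HOT",       60, 1),     # sub-second, constantly accessed
    ("MEDIUM",     4, 15),    # 1-15 second responses
    ("LUKE-WARM",  1, 600),   # scanned occasionally
    ("COLD",       0, 600),   # archival; minutes acceptable
]

def classify(accesses_per_hour):
    # Walk the rules hottest-first; first matching rate wins.
    for temperature, min_rate, _max_response in metadata_rules:
        if accesses_per_hour >= min_rate:
            return temperature
    return "COLD"
```

Because `metadata_rules` is just data, re-setting the HOT threshold is an edit to the metadata, not a code change, which is the dynamism the paragraph above calls for.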

Furthermore, the METADATA must be available to BI tool sets and to technical business users. With the help of IT, the technical business users can set and reset the definitions of HOT, Medium, Luke-Warm, and Cold. This means that a tool like EII is primed to take over this space: it can manage the metadata, interface with business users, and reset or rewrite the thresholds, back to the RDBMS technology, that define where data sits. The RDBMS engines must then deal with where to put this data and how it fits.

One thing is clear: as we march forward, our data sets will only grow, not shrink. Something to take note of: what exactly is "garbage data," and what does it mean? Can you identify it and remove it from your systems without impact to audits? If you clear it out, are you removing the possibility of tying together or discovering a meaningful relationship across your business that you didn't have before? If you have garbage data, does it mean your business is hemorrhaging money? YES... more on these questions in another blog.

What does this mean to business?
It means higher costs and more hardware. Hopefully your organization is already on the road to consolidation; this will help. Hopefully you've already embarked on an SOA initiative, and hopefully you've started your Master Metadata and Master Data Management initiatives as well: these are all precursors to using this technology effectively.

Questions? Thoughts? Love to hear from you.

Dan Linstedt
CTO, Myers-Holum, Inc

Posted August 20, 2006 5:12 AM

MDM data is often dispersed across the organization. This raises the question: how can MDM be a viable asset to the business? Is the master data reused throughout the organization the way it should be? Is it defined in the right context (Master Metadata)? Technically, Master Data should be consolidated into a single global data center. MDM is not a tool, not a toy, not a process: it's a way of doing business that includes tools, best practices, people, governance, metadata, and single answers.

Remember: MASTER DATA is NOT, I repeat, NOT a single version of the facts; rather, it IS a single version of today's corporately accepted TRUTH. Also remember: what's true today was NOT true yesterday (last week, last year, 5 years ago...). Master Data IS auditable as a system of record, but it's one of 3 system-of-record definitions (see my blog on System of Record).

OK, so where does that leave me?
If you've got an SOA effort within your business, then you should have a space in the plan for Master Data, Governance (at different levels), Master Metadata, and EII. It means your EDW is already set up as an Active Data Warehouse and is operating in near real time. It means your operational systems have already placed their data exchanges under web services (or are undergoing this conversion as part of the SOA project).

To get to Master Data, I would strongly suggest that you have already defined enterprise-wide conformed dimensions, or, if you've got a 3rd-normal-form warehouse, that you've already defined enterprise-wide accepted metadata definitions. In other words, I don't believe you can get to Master Data Management successfully without FIRST going through a Master Metadata Management effort.

So what is Master Data, exactly?
As I've suggested in the past, Master Data IS a single, consistent, consolidated version of today's truth, based on a SINGLE BUSINESS KEY, and it includes the descriptive data of that key post-cleansing/aggregation. Keep in mind that Master Data means the data attached to that KEY is at the SAME semantic level of metadata definition. For instance, a Master Data table is different from a conformed dimension in that Master Data tables do NOT and should not contain hierarchies within the same table set. Hierarchical data is at a different semantic grain, and usually is keyed off a different business key; it therefore requires a separate and different Master Data table.

Can the Master Data tables be linked together?
Yes, through many-to-many relationships. Master Data tables should NOT be dependent on parents; each Master Data table should be fully independent, as they stand alone in definitional nature. Master Data only requires the business key; it does NOT require the existence of the parent in order to be "created, used, and referenced." The BUSINESS may require the parent key when enforcing business rules for reporting purposes. Referential integrity should be done through secondary processing, or through EII, or through BI query sets, to determine what "today's" business rule is and whether the data is in error (according to context).

Keep in mind that Master Data CHANGES CONTEXT depending on who's using it!!! That means that Master Metadata must be defined, and metadata definitions overridden at operational levels (as long as there exists a 100% dependency on the parent metadata chain), in order to determine the context of the Master Data set.

For example: a car is a car is a car; it has a VIN. The VIN doesn't change even though the car's color changes, or the seats change, or the radio is swapped out. The CAR CAN EXIST WITHOUT A DRIVER / OWNER! The business rule for "shipping" that car cares about a parent container, and a parent ship for that container. The sales floor cares about the car and the prospective buyer of that car, and the OWNER cares about the car itself. The context of the CAR, and how the master data of the CAR changes, depends on who's using the data and how it's viewed.

Master Data MUST be consolidated in a single data center. If it's to be utilized (for performance purposes) in other systems, it must flow downstream from the central synchronization point to a local copy, and it must be a read-only local copy, to avoid stove-piping the industry again. Master Metadata must also be centralized, AND the master metadata MUST be delivered along with the Master Data in order to make sense of it at run time or access time.

Are there problems with the Master Data centralization effort?
YES: timing, locking, synchronization, and distribution issues abound. However, networks are getting faster, enabling data to centralize better and be utilized all over the world. Here's where EII can help: EII can begin to compress data sets, store data sets in virtual tables on the server, AND, in the near future (EII vendors, are you listening?), manage Master Metadata! EII is in a prime position to solve the problem of connecting Master Data to Master Metadata definitions, and to act as the delivery point for both.

Before you tackle Master Data, please, please, please spend time considering the architecture carefully, and I would suggest that your ADW, EII, metadata, and web-services components be in place for at least one component of the business.

We are a world-class implementation firm that builds solutions scaling to the petabyte range. Come see my MDM night-school course at TDWI next week, followed by my VLDW class the next day.

Thoughts? Questions? Comments?

Dan Linstedt
CTO, Myers-Holum, Inc

Posted August 19, 2006 6:49 AM

Many are now discussing the "Appliance" for enterprise data warehousing. I've blogged in the past about definitions (or lack thereof), and what it means to be an appliance. I ran a Google search on "appliance + data warehousing" and came up with a few vendors and their thoughts on what an appliance is. We'll try to sort through the claims in this entry and drive it back to a simple level of understanding.

Instead of purchasing software, hardware and storage from multiple vendors and utilizing thousands of hours of DBA time to integrate and maintain them, an appliance does the job for you. DATAllegro’s data warehouse appliances bring together a comprehensive set of leading-edge technologies that have been integrated and tuned for world-class performance, out-of-the-box.

The most significant difference is that Calpont's system does not have a database, operating system or disks. Instead Calpont holds all the data in memory (SDRAM). In other words the database is implemented as an "SQL on a chip".

With the Netezza Performance Server® data warehouse appliance, companies across several industries are transforming the way they do business. Now they can make ALL of their data actionable - quickly, simply and affordably - enabling them to do things they weren't able to before. Here are some examples of customer results:

Now, when we take a look at the "APPLIANCE" side of the market, given the definitions I've written in the past, let's keep in mind that something new is brewing: DBMS vendors are now moving toward a mix of components, using high-end parts for super-fast, super-critical data (hot data), low-end parts for slow, non-critical, least-accessed data (cool data), and intermediate pricing and parts for luke-warm data.

Temperature rating of data is just around the corner. The "appliance" platform of the future (including RDBMS, hardware, and possibly data mining applications) will plug in to your network, provide self-configuration, and allow data to be temperature rated. In some cases the temperature of the data will be managed by the RDBMS software itself. The appliance will have LOTS of RAM; as RAM gets cheaper and faster, the appliance of the future will invest in RAM solutions with compression technologies and other components. Calpont is new on the scene, and we will have to wait and see how they handle things.

One of the items I blogged on in the past is Samsung and its use of nanotech to create a 16Gb memory chip: "Samsung Announces It Has Developed World's First 16Gb NAND Memory Chip." What this does is allow "platform/appliance" manufacturers to extend their in-memory reach into huge amounts of storage in tiny spaces, provided they are willing to pay for the expensive technology. Eventually terabytes of storage will be available on memory chips. All data will be RAM-based once nanotech breaks the price barriers and it becomes efficient to produce.

All of that aside, one must look very carefully before deciding on an appliance. Ask the questions: what types of parts are in the platform? What is the MTBF for the parts themselves, including disk, RAM, CPU, motherboard, etc.? How can the appliance be upgraded? Is there a remote monitoring facility that can predict when failures will occur?
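To see why part-level MTBF matters so much, here is a rough back-of-the-envelope sketch. The MTBF figures are illustrative assumptions, not real vendor numbers; the arithmetic (failure rates of components in series add) is the standard reliability approximation:

```python
# Sketch: for components that must all work, failure rates add, so the
# platform's effective MTBF is far lower than any single part's.

def platform_mtbf(part_mtbfs_hours):
    # Combined failure rate is the sum of the individual rates;
    # the platform MTBF is the reciprocal of that sum.
    rate = sum(1.0 / m for m in part_mtbfs_hours)
    return 1.0 / rate

# Hypothetical part MTBFs in hours: 24 disks, RAM, CPU, motherboard.
parts = [1_200_000.0] * 24 + [4_000_000.0, 2_000_000.0, 1_000_000.0]
mtbf = platform_mtbf(parts)
```

With two parts of 100-hour MTBF each, the platform MTBF is 50 hours; with dozens of disks, even million-hour disk MTBFs shrink dramatically, which is why remote monitoring and predictive failure handling belong on the question list above.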

In my mind, EMC has it spot-on. Their disk is an appliance: they use high-quality parts in their high-end systems, they have remote monitoring, they are often on-site to fix a disk before it fails, and so on. RDBMS vendors venturing into the world of appliances can take a lessons-learned approach from EMC. In fact, it would be interesting to see one of these vendors partner with EMC to leverage its entire pre-built monitoring infrastructure. I wonder if EMC could even outsource something like that...

Anyhow, don't be afraid to ask the tough questions: find out where the product roadmap is going, and try to align your goals with a hands-off approach to the "appliance." It should be self-maintaining, self-tuning, and self-upgrading.

I'd love to hear your questions.

Dan Linstedt
CTO, Myers-Holum, Inc

Posted August 19, 2006 6:20 AM