Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW, and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I serve on an academic advisory board for Master's students in IT. I can't wait to hear from you in the comments on my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events, and he is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5 and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

Recently in BI Vendors Category

I've just reviewed some technology from a vendor that manages business and technical metadata as a service platform.  Let me say: I'm impressed.  There are many issues these days with different BI and EDW implementations, specifically around managing, entering, and governing business and technical metadata.  For years I've found myself using Excel and Word to handle these tasks; while those are great tools, they pose problems for the Data Steward in terms of governance, management, and usefulness of metadata.  In this entry, I'll discuss a new solution from IData called the Data Cookbook.

The company URL: http://www.idatainc.com

The product URL: http://www.datacookbook.com/

** NOTICE:  I have not been paid for this review, nor am I under contract with IData in any way at this time.  I believe in this product, and I truly like what I see. **

This software covers the following areas of need:

  • Business Metadata - terms & definitions
  • Technical Metadata - terms & definitions
  • Roles & Responsibilities - governance over all of the above
  • Subscription based - SaaS will do nicely
  • Ubiquitous access - accessible from hand-held computing devices
  • Notification based - updates and changes notify the appropriate individuals
  • Workflow managed - approvals, edits, changes, and publishing status
  • Web services - all metadata available through web services

There are several long-standing needs in the EDW/BI community to enhance or enable the role of the Data Steward.  There is also a need for the Data Steward to share the actual business and technical metadata that comes with every BI and EDW project with everyone else on the implementation team, the business users, and the executive staff.  I've long wondered if or when someone would try to fill this gap, especially during the RAD/JAD sessions that the IT team holds during requirements gathering.  There may be a couple of companies out there, but IData has clearly come to the table at the right time with the right product.

Someone has finally "got it."  With the solution they have produced, the Data Steward's role is made 10x easier, maybe even 50x more effective.  Stewards can finally sit in on meetings and, using a simple web interface, begin entering the business terms and definitions that surface during requirements definition or user interviews.  They can then assign workflow statuses to these terms - such as "needs review," "to be discussed," or "finalized" - and route them to many others in the organization.

Links to these terms can be e-mailed to the CEO, others on the executive staff, or the appropriate business user.  Roles can be assigned to multiple users so they can log in and either edit and update the data, read and approve it, tag it, or debate it.  The Data Cookbook keeps track of all edits, comments, and tags - who, what, and when - so it provides a "data warehouse" of sorts across the business and technical metadata.

I forgot to mention this: the search capability across terms is extensive, and there are access points through a brand-new (just released) web-service API, so different BI tools can "hit" the stored ontology and retrieve these terms and definitions on demand.
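To give a feel for what "on-demand" retrieval through such a web-service API could look like, here is a minimal sketch in Python. The host, endpoint path, query parameter, and response fields are my own assumptions for illustration only; consult the vendor's actual API documentation for the real interface.

```python
# Hypothetical sketch only: the host, path, query parameter, and JSON fields
# below are illustrative assumptions, not the vendor's documented API.
import json
import urllib.parse
import urllib.request

BASE_URL = "https://metadata.example.com/api/v1"  # placeholder service host

def get_term(term_name: str) -> dict:
    """Fetch one business term and its definition/workflow status on demand."""
    url = f"{BASE_URL}/terms?name={urllib.parse.quote(term_name)}"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))

if __name__ == "__main__":
    term = get_term("Net Revenue")
    print(term.get("definition"), "-", term.get("workflow_status"))
```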

The one HUGE difference that I see here is that the Data Cookbook also manages TECHNICAL metadata.  Finally, at last, we can enter transformation rules (in textual format) and requirements for data manipulation right alongside the business definitions - and manage them at the same time.

What's the catch?  There's always a catch...

There really is no catch, if you will.  This is web-service-provisioned software - no installations necessary, no new servers needed; just sign up as a client, provision roles & responsibilities, and away you go.  They also offer some basic reporting, plus some nifty graphs and charts showing which terms are "approved," "edited," and so on - that is, where the terms sit in the workflow process.

I will say this, however: there are some areas for improvement (as with any software, there always are), which they hope to address in the future.  The areas they don't track today include technical data lineage, hooks into data transformation tools to pull out the current transformation rule sets, and ontology-based graphical views of the data.  But again, these are minor needs compared to what they accomplish for businesses and projects today.

Conclusion:

Today the real power is embedded in the usability, and in the fact that they TIE TOGETHER the business and technical metadata definitions and apply roles & responsibilities and workflow management to all of it.  The fact that this is a web interface means that we can enter terms from any portable computing device - from an iPhone to an Android to a mini to a personal computer or laptop.

I no longer need to send my documents all over e-mail (with only a hope that change tracking will be used).  I know that the terms and definitions will be kept up to date, and that e-mails will be sent based on "subscriptions" to term changes.  Finally, I have a working solution that is fully usable within the business.

I recommend that anyone building an EDW or BI project at least look at this solution.  I'm excited to see this technology come to light.

If you have questions or comments, please reply.

Cheers,

Dan Linstedt
DanL@DanLinstedt.com


Posted March 30, 2010 4:10 AM

There are many trends afoot in the industry, but I think two or three of them are here to stay, and quite frankly, they will make a huge impact on the way BI/EDW is done. Some of the ground-shaking, game-changing technologies include solid state disk, database appliances on the cloud, and data services on the cloud.  I've been watching the database market for several years now, and here are a few of my predictions for 2010.

Here's the skinny of it:

1) Solid State Disk.  It's not your grandfather's "RAM on a USB stick" plugged into the computer - it's high-speed, ultra-parallel, screaming-bus-performance RAM.  As one analyst recently put it: "CPUs have gained 5 million times faster performance over the last 30 years, while disk drives have improved only about five times... SSDs offer 150x improvement in random I/Os per second over spinning disk."  (Douglas Henschen, InformationWeek, Oct 26, 2009)

So what does this really mean in business terms?

In business terms, this brings the ability to finally execute high-speed parallel operations for massive BI queries and massive bulk-loading systems.  But wait, there's more (yes, it gets better)...  It also means we can finally take advantage of extreme flexibility in the logical modeling layers and worry less about the physical modeling components... which of course leads us right to the Data Vault modeling methodology, or even 6th normal form (it can become a reality if you like).

In business terms, if we have extreme flexibility in our architecture, we can finally enable IT to become nimble and fast again.  It means we can reduce delivery times and execution cycle times.  It also means we can increase up-time, and - oh wait...  HOT DATA?  Did all of you forget about temperature-driven data?  Shame on you!  Hot data will reside on the SSD devices, allowing hyper-cubes to be built AND allowing DYNAMIC changes to the data models, the queries, and the loading systems.
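To make the "hot data" idea concrete, here is a toy sketch of temperature-driven placement. The tier names and the 30-day threshold are assumptions for illustration; real engines decide this internally or through DBA-defined storage policies.

```python
# Toy illustration of temperature-driven data placement: partitions touched
# recently ("hot") are assigned to an SSD tier, colder ones to spinning disk.
# Tier names and the 30-day window are illustrative assumptions.
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=30)

def choose_tier(last_accessed: datetime, now: datetime) -> str:
    """Return the storage tier a partition should live on."""
    return "ssd_tier" if (now - last_accessed) <= HOT_WINDOW else "sata_tier"

partitions = {
    "sales_2009_q4": datetime(2009, 10, 25),  # queried this week -> hot
    "sales_2008_q1": datetime(2009, 2, 14),   # untouched for months -> cold
}
now = datetime(2009, 10, 28)
for name, last_accessed in partitions.items():
    print(f"{name} -> {choose_tier(last_accessed, now)}")
```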

Why is that important to me?

Well, I'll tell you...  In my opinion, it's important because you should care about the architecture you are using, you should care about the accountability and auditability of the data set, and you should (and usually do) care about the performance of the system you're building your solution on.  And by the way, you're going to have MORE and MORE data to deal with, like it or not: the EDW/BI solutions of today are accountable (they have become a system of record in their own right).

Yeah, OK, so what...  Solid state disk, faster machines, phooey...

You can say what you like, but the IBM Smart Analytics System, Oracle's Exadata, and the Teradata Extreme Performance Appliance 4555 are just a few examples of this notion.  What would happen (do you think) if ParAccel, or Vertica, or Sybase IQ jumped on this ship?  Especially into the Amazon EC2 cloud...  Hmmmm, now it gets interesting!

Which brings me to my next point:

2) Database technology is changing for 2010.  RDBMS engines making use of SSD will change the game - it will be an uneven playing field - but once these devices are made available on the EC2-style cloud (whether it's Amazon or Google providing the services), based mostly on Hadoop, it won't matter.  The RDBMS technology will get faster, and in-company network bandwidth will be OUTSHINED by cloud-provisioned bandwidth.

What?  What does "outshined by cloud bandwidth" mean?  I don't get it...

You will have to get it if you want to lower costs and succeed in business going forward.  What I mean is: data warehousing and BI solutions will be offered as an outsourced service.  That's right, folks - for the first time in DW/BI history, with just these two advancements coupled together, data warehousing and BI as a service will become a reality.  Mark my words!  OK, don't take my word for it...

All you need to do is call up a company called Quantivo.  They are processing/ingesting 7 billion customer records and allowing 2-to-3-second query response times against this sizable data set.  What do they sell?  Packaged data warehousing/BI analytics.  In other words, they load the raw data, they manage their internal data model, they perform the cross-mapping and the indexing, and they build the "data mart" availability of the data set.

So what's the catch, there's always a catch, and I want to know what it is!

Of course there's a catch... today there's an ethical, security, and privacy dilemma.  How do you "outsource" data to a company providing a service that runs in a cloud owned by a third-party company (say Amazon or Google), and keep it all safe and secure?  Well, I would argue there are several layers to making "DW/BI as a service" available.

Layer A is what Quantivo does: they don't keep (or receive) any specific or private information about individual customers that could risk identity theft if the security wall is broken.  They provide a VPN service with dedicated encrypted channels for each customer to get at their own data sets.  They also slice the data sets apart - for each customer they set up different grids in the cloud, ensuring all data stays separate.

Layer B: the hardware or RDBMS vendor offers use of their software/hardware within the cloud as a direct contract or managed service for upgrades; the customer contracts them to execute the cloud installation, and then the customer manages and owns the data, the data modeling, the queries, and all other components within the cloud over VPN.

I think beyond this there may be some additional options for EDW/BI as a service; however, they have to be scrutinized carefully.

Ok, ok - so what's the real business benefit here?

Still don't see it, eh?  Well, here are just a few of the business benefits to be gained by executing plans like this:

  • Lower cost of ownership
  • Faster turn-around time (as in Quantivo's case - they specialize in customer analytics and data mining)
  • More parallelism with "sloppier design" - wait, that's a risk!  Yes, it is... but sloppy designs will get a significant performance boost from SSD and from cloud technology
  • More agile, more nimble (dynamic data marts) - here today, possibly trashed tomorrow, only to build a new one!
  • Cube write-back - yes, because of the nature of cubes, LARGE CUBES on SSD in a CLOUD can now offer FAST write-back. Hey Microsoft, are you listening?
  • More power in the hands of the business user - direct access to create on-demand data mart solutions IN RAM. That's right: SSD on a cloud over VPN lets users PLAY with their data directly (if the outsourced service has implemented it properly). Finally, play with LARGE to HUMONGOUS sets of data on what is essentially a RAM drive! Pay as you go, and pay for only what you need.

I don't mean to dig at Microsoft - I think just the opposite.  They have cube write-back technology today, and their SQL Server 2008 R2 release is INCREDIBLE...  it is screaming fast and super scalable (even on standard 32-bit Windows servers, if you know how to tune it).  I just think it's an easy jump for them to implement something on SSD and in the cloud, and voila - what an INCREDIBLE solution.  It's so easy, I don't know why their gold partners haven't done it yet...  Anyone want to invest in me?  I'll build you a company to do it.

Alright, what are the risks...

The risks include not managing the services effectively and not setting the quotas and policies for each user appropriately (example: a low-rung business user fires off a query that expands the cloud to 12 to 20 machines for an hour, but can't authorize the payment for the service usage).  OK - governance, governance, and more governance...  Oh, did I mention GOVERNANCE over the solution?  Absolutely required.
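As a purely illustrative example of the kind of quota policy I mean, here is a small sketch. The role names, node-hour limits, and simple cost model are my own assumptions, not any vendor's feature.

```python
# Hedged sketch of a pre-execution governance check: before a query is allowed
# to fan out across cloud nodes, verify the requester's node-hour quota.
# Role names, limits, and the simple cost model are illustrative assumptions.
QUOTAS = {"analyst": 4, "power_user": 20, "admin": 100}  # max node-hours per query

def authorize_query(role: str, estimated_nodes: int, estimated_hours: float) -> bool:
    """Allow the query only if its estimated spend fits the role's quota."""
    return estimated_nodes * estimated_hours <= QUOTAS.get(role, 0)

# The scenario above: a low-rung user whose query would expand the cloud to
# 20 machines for an hour gets stopped before the bill is run up.
print(authorize_query("analyst", estimated_nodes=20, estimated_hours=1.0))     # False
print(authorize_query("power_user", estimated_nodes=20, estimated_hours=1.0))  # True
```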

Another risk is letting yourself believe that this is the answer to "all the poor-quality data modeling" or "sloppy ETL" or "poorly behaved BI queries" because the out-of-the-gate performance is incredible.  Wrong, wrong, wrong.  Even though this opportunity affords HUGE gains in performance, it is no excuse for lacking GOVERNANCE and DATA MANAGEMENT best practices.  Your data models can and should be improved, your sloppy ETL can and should be tuned, and your poorly behaved BI queries must be optimized.  Then, instead of seeing only a 10x gain, you will be able to see a 150x gain.

You know what they say: No PAIN, no GAIN... In this case, the pain is paying for the right architecture to be put in place, the right training for performance and tuning, and the right data modeling to be utilized under the covers.

You can find out more about Data Vault (550+ users, 2000 unique visitors a month, 150k hits a month) at http://www.DataVaultInstitute.com

You can find out more about ETL performance and Tuning at http://www.TrainOvation.com, and http://www.GeneseeAcademy.com

OR: Contact me directly: danL@DataVaultInstitute.com  (even if you're not using Data Vault)

Cheers for now,

Dan Linstedt

 


Posted October 28, 2009 12:03 AM
There are quite a few vendors out there who publish industry logical data models.  I've had my fair share of direct, hands-on experience with them over the last 8 to 10 years.  In this entry I will share a few of the pros and cons of this class of models. My objective is more of an inquisitive one, in hopes that you will respond with your experiences; and if you are a vendor, please post your opinions.

First, let's take a look at what these iLDMs are defined to be. Steve Hoberman offered some thoughts on the subject back in 2004: http://www.teradata.com/tdmo/v06n04/Tech2Tech/InsidersWarehouse/DataModels.aspx

The right tools to capture how an organization within a particular industry works, or should work, are essential for in-depth knowledge of your business. These tools should also provide expert modeling structures for your enterprise to use as a guide. An industry logical data model (iLDM) offers this kind of information and guidance.

...The iLDM is a pre-built model that extends the EDM concept. Like the EDM, it is a subject-oriented and integrated view of the information needed for an enterprise data warehouse (EDW) to answer strategic and tactical business questions....

...While it does not contain history, the iLDM represents a snapshot of information (point in time). It also does not contain derived data or other calculations to assist with retrieving data more quickly. All model changes necessary for issues like speed, storage, security, backup and recovery must be added at the physical level....

Interesting... It does not contain history, nor derived data, nor other calculations... However, the iLDMs I've seen have been implemented 1-to-1 with the physical model at client sites, and they DO contain history, DO contain calculations, and so on. So it appears that right out of step 1, vendors are not necessarily following Steve's advice...

In my data modeling days I was taught, as Steve suggests, that logical data models should stay logical - and often do NOT represent the physical (due to partitioning, indexing, data types, super- and sub-types, and so on). However, in my experience, the vendors' professional services organizations appear to want to implement these LDMs physically in a 1-to-1 fashion (meaning 1 logical table = 1 physical table).

Moving on, I find this interesting: http://www.freepatentsonline.com/7359906.html - a patent held by several individuals for a particular LDM. I'll let you decide if this is a good or a bad thing. Let's move on...

Here is more information on iLDM's from another individual: http://www.univdata.com/Portals/9/udm_Products_Repository.pdf

And another: http://publib.boulder.ibm.com/infocenter/rdahelp/v7r5/index.jsp?topic=/com.ibm.datatools.logical.ui.doc/topics/clogmod.html

A logical data model is a model that is not specific to a database that describes things about which an organization wants to collect data, and the relationships among these things.

A logical model contains representations of entities and attributes, relationships, unique identifiers, subtypes and supertypes, and constraints between relationships. A logical model can also contain domain model objects, or reference one or more domain or glossary models. After logical objects and relationships are defined in a logical data model, you can use the workbench to transform the logical model into a database-specific physical representation in the form of a physical data model.

What I find in my experience:

What I've found is that professional services organizations take the logical data models and implement them physically, directly on specific databases, in a 1-to-1 fashion. Of course, they add all the components necessary to make them work in the physical realm. However, I've found both pros and cons to this approach. So let's get to the heart of this entry and discuss (I'd love to hear about your pros and cons, and I'm not here to tell you one way or the other what to do with the LDMs)...

Pros:

  • Coverage of the LDMs is wide
  • Years of industry experience from the people who design these models
  • Some "fast" implementations, if the scope of the project is kept to a minimum
  • As guides go, they are understandable (to a point) and contain many of the complex relationships needed to express information properly
  • They are "mostly" compressed (smaller models, smaller numbers of tables)
  • Easy to "buy" versus "build"

(This is the part where you can add to the list)

Cons:

  • Consulting organizations often don't understand how to implement them properly
  • They have been mapped 1-to-1 to the physical (without the necessary changes)
  • The models are super-typed, and are often at too high a grain for customers to really grasp ALL the elements that are combined within
  • The super-types try to handle too many levels of "grain," mixing corporate customers, for example, with individual customers and changing the outcome of the queries on the reports
  • They often require years of expertise and training to implement properly
  • They are sometimes a nightmare to tune, partition, and get performance from
  • They sometimes require tons of staging areas to "prepare" the data before integrating it into super-typed table structures
  • They nearly always need "extending" and customization, which doesn't fare well with customers, as it becomes a multi-year, high-cost effort - turning into a huge consulting services gig for big dollars

Now are they worth it?
Sometimes they are; other times they fall down. If you're a customer and you're looking at these models, they certainly seem appealing.

What's your experience? *** WARNING: THIS IS MY OPINION ONLY! ***
I've seen these models take 3 to 5 years to implement, modify, and customize correctly. To be fair, I've seen a 50/50 success rate with the implementations: some have worked beautifully, others have failed miserably. Was it because of the model? I can't say. In some environments it was the politics - not the model, not the provider, not the hardware or the database. At other companies it was because the consulting services didn't understand how to integrate, alter, upgrade, and customize. At still other companies it was because of the storage needed in the staging area in order to "cleanse/change/alter and super-type" the data before it was loaded.

The successful implementations I've seen worked because the scope was controlled, the modifications were extremely minimal, and the data quality (from the source system) was extremely high. I've also seen successful implementations when there was only one source system to deal with. When that is the case, cross-mapping the information from the source to the target super-types is fairly easy to do. When multiple source systems are involved, the task becomes tremendously complex very quickly.

My conclusion (this is where I'd like your feedback)
I believe that buying the logical data models can sometimes really help move the project forward, but I also believe that breaking the super-type out into its constituent raw-level components for the physical model is an important act of service to the customer. I believe that if you "customize" the logical data model, you will have trouble when the vendor puts out a new version (one that alters the version you have in place). I've seen cases where a 2-to-3-year bid effort to customize becomes a 7-year services contract in order to maintain the customizations across multiple releases.

Don't get me wrong: LDMs are WONDERFUL GUIDES! But alas, they are "logical" data models, and should remain "logical," as their name implies. In the physical world, dealing with history, separation of grain, functional areas of business (crossing lines of business), and so on is very, very important. Customizations are nearly *ALWAYS* needed in a data warehousing world, even with pre-packaged implementations of "pre-built data warehouse solutions."

Finally, let me ask you data modelers out there: does it really make the physical ETL implementation that much easier if you "super-type" everything (customer, individual, employee, manager, corporation, organization, and so on) into "one single table"? What if each of these entities has its own business key, and the data is clearly at a different grain - what do you do in your LDM to ensure success?
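To illustrate the grain question, here is a tiny Python sketch (the column names and values are hypothetical, not taken from any vendor's model): a single super-typed "party" structure mixes individuals and corporations, so a naive count answers a question nobody asked, while splitting by type restores the grain.

```python
# Toy example of the grain problem with a single super-typed "party" table.
# Column names and values are hypothetical, for illustration only.
party_rows = [
    {"party_key": "P1", "party_type": "INDIVIDUAL",  "business_key": "CUST-001"},
    {"party_key": "P2", "party_type": "INDIVIDUAL",  "business_key": "CUST-002"},
    {"party_key": "P3", "party_type": "CORPORATION", "business_key": "DUNS-555"},
]

# A naive "customer count" silently mixes two different grains:
print("mixed-grain count:", len(party_rows))  # 3 -- but three of what?

# Breaking the super-type back out by its type (and business key) restores
# a clear answer, which is essentially the physical break-out argued for above.
individuals = [r for r in party_rows if r["party_type"] == "INDIVIDUAL"]
corporations = [r for r in party_rows if r["party_type"] == "CORPORATION"]
print("individuals:", len(individuals), "| corporations:", len(corporations))
```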

Thank you for your comments and feedback,
Daniel Linstedt
DanL@DanLinstedt.com


Posted May 22, 2009 4:26 AM

My good friend Richard Winter just published a document about Oracle, Exadata, and scalability.  Don't take this the wrong way, but I believe the findings are lopsided at best.  I hold Richard in the highest regard for his work exercising VLDB systems, but this report is clearly aimed at highlighting what Oracle does best - and it is missing crucial information about very large systems' performance that I've been asking about for years.

The report is here: http://www.oracle.com/corporate/analyst/reports/infrastructure/bi_dw/winter-exadata-performance.pdf

You can read it for yourself.  First, I have to give kudos and credit to Oracle for finally recognizing that InfiniBand networking is needed for high bandwidth, and that high-speed disk (such as internal SATA or SCSI) is also needed for Oracle to perform.  The throughput numbers are impressive.  However, the report fails to test the following components:

1) High-performance batch load.  Where are the performance numbers for high-performance batch loads, or for parallel loads executing against the device?  How many BIG parallel batch loads can execute at once before the upper limits of the machine and Oracle are reached?

2) Performance of near-real-time transaction feeds.  How many feeds can be consumed?  What's the maximum throughput rate?  What's the upper limit on the number of parallel feeds and the number of transactions per second that can be "added" to the data warehouse?

3) Mixed-workload performance tests.  What happens to query performance when either one or both of the above loads take place WHILE querying?  How much is the impact on the system?  What happens to the logs and the temp space?  Do we end up with CPU-bound operations?

These are all things that Richard is very familiar with testing.  I have a feeling that Oracle didn't sanction these tests, or that they were simply "removed" from the paper.  Again, Oracle marketing has stepped forward - the report shows the Exadata appliance in the right light, but it doesn't provide enough information to support sound decision making (in terms of: should we invest in or purchase this appliance or not?).

One more piece I can't understand is the star schema put forward at the end of the report.  What appear to be "dimensions" are EXTREMELY narrow; they almost look like fact tables.  This star does not look like any star I see on customer sites.  The first FACT table appears to house data that is not "fact based," and it is extremely wide.  Of course Oracle will eat this up, as the dimensions can almost be "pinned in RAM."  Where is the "type 2" nature of the data in the dimensions?

Typically we at least see a customer dimension with multiple versions of the customer's address - and then we apply that across millions of customer rows.  But nope - here the fact table is the only one with billions of rows.
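For readers unfamiliar with the term, here is a minimal sketch of what "type 2" behavior in a customer dimension means; the column names and dates are illustrative assumptions, not taken from the report.

```python
# Minimal illustration of a "type 2" customer dimension: each address change
# adds a new row with its own effective-date range, so the dimension itself
# grows large. Column names and dates are illustrative assumptions.
customer_dim = [
    {"customer_sk": 1, "customer_id": "C-100", "address": "12 Oak St",
     "effective_from": "2007-01-01", "effective_to": "2008-06-30", "current": False},
    {"customer_sk": 2, "customer_id": "C-100", "address": "98 Elm Ave",
     "effective_from": "2008-07-01", "effective_to": "9999-12-31", "current": True},
]

def version_as_of(customer_id: str, as_of: str) -> dict:
    """Pick the dimension row that was in effect on the given date (ISO strings)."""
    return next(r for r in customer_dim
                if r["customer_id"] == customer_id
                and r["effective_from"] <= as_of <= r["effective_to"])

# A fact row dated 2008-03-15 should join to the older address version:
print(version_as_of("C-100", "2008-03-15")["address"])  # 12 Oak St
```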

OK, maybe I'm being too harsh, and if so - my apologies. But I'm just really frustrated with the marketing of all of these companies that say "the world's fastest and largest database appliance/engine..." and then fail to include the whole story.

What did you take away from reading the report?  Is it biased?  Is it one-sided?  Or is it spot-on, providing the full answers?

Curious,

Dan Linstedt

 


Posted March 5, 2009 4:16 AM

This entry is a candid (and, mind you, opinionated) look at what I see as the future of transformations themselves.  We will cross several subjects in this entry, as it is meant to be a look at where transformations currently happen, where they need to happen, and what's actually happening in the marketplace.

ETL, or Extract, Transform, and Load, has been around a long, long time.  ELT (sometimes referred to as in-database or push-down processing) is new to the ETL vendor world, but it is a very old concept.  On the other hand, RDBMS vendors have heard the cry and have responded by continually adding new features and functionality for in-database transformation logic.

Now, enter real-time.  OK, EAI (enterprise application integration) and message queuing have both been around a long time too, and they are also growing and changing.  Then along came BPM (business process management), which changed or morphed into BPEL (business process execution language) and BPW (business process workflows) - all of which exist to engage real-time flows and manage transactions at the user level.  Oh yes, I almost forgot: the middle-tier technology known as EII (enterprise information integration), which never really caught on, but is valuable nonetheless when embedded in other technologies like web services and SOA.

Down to brass tacks...

When we look at what's around the corner we have to ask ourselves the following questions:

1. What do compliance and auditability mean for our transformation efforts?

2. What really and truly is so difficult about transforming the data?

3. What do some of these complex business rules look like in transformation logic?

4. WHY do we fundamentally rely on machines and programmatic (static) rules to alter data sets?  In other words, why do we "write" rules into SQL or transformation logic to make data "usable" by the business?

5. Just what is considered "usable data," anyway?

OK, enough of the esoteric stuff - I just thought we needed to ask these questions; of course, if you have concrete answers, I'd love to hear them in your replies to this blog entry.  Now, on to more serious stuff...  where is transformation going to happen?  Especially given ever-growing data sets and ever-decreasing latency of arrival...

I would argue that ETL is still partially viable; however, there comes a time when in-stream transformation simply falls down and is no longer feasible to execute - ESPECIALLY when loading data from the source systems IN to the EDW.  The exception to this rule is when the application is built directly on top of the business process rules application or the business workflow management system.  THEN, as the data is entered and submitted to the application, it is "edited" or transformed before being placed on the transaction stream.

Likewise, this might occur over web services and streaming services for data sets.

Now this raises the question again: WHAT exactly is auditable data?  WHEN is it compliant or auditable - even for the operational systems?  Is it when the user enters the data on the screen?  Is it when it's first captured by the transaction system?

Ok - back to brass tacks.

In order to handle the volumes of data flowing in and out of the EDW, and to decrease loading cycle times, it is absolutely imperative that the business rules or transformation logic be moved downstream of the EDW - that it *NOT* be placed upstream, between the source system and the staging area or EDW (as generally architected).  Placing it upstream causes significant re-engineering costs and creates an ETL bottleneck with larger data sets.

Some of this bottleneck is solved through larger hardware or 64-bit systems.  HOWEVER, that's not enough anymore.

So what are you saying?

By moving the transformations downstream of the EDW (between the EDW and the data marts), we have now created an architectural OPTION.  We can CHOOSE to use ETL or ELT and leverage the RDBMS for transformation - especially if both the EDW and the data marts reside on the same database instance.  This allows us to apply the technology in the right place at the right time.  Furthermore, it makes the data in the EDW more "compliant and auditable" because it is not subject to change before loading.  (See http://www.DataVaultInstitute.com for more information.)
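Here is a minimal sketch of that option in Python, using sqlite3 purely as a stand-in for a shared EDW/data-mart instance; the table names, columns, and currency rule are illustrative assumptions. The same downstream business rule can run either as ETL (rows pulled through the tool) or as ELT (one pushed-down INSERT...SELECT), and the raw EDW rows stay untouched either way.

```python
# Sketch only: sqlite3 stands in for the shared EDW/data-mart database;
# table names, columns, and the currency-conversion rule are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edw_sales (sale_id INTEGER, amount REAL, currency TEXT)")
conn.execute("CREATE TABLE mart_sales_usd (sale_id INTEGER, amount_usd REAL)")
conn.executemany("INSERT INTO edw_sales VALUES (?, ?, ?)",
                 [(1, 100.0, "USD"), (2, 80.0, "EUR")])

RATES = {"USD": 1.0, "EUR": 1.4}  # illustrative fixed rates

def load_etl(db):
    """ETL style: pull rows out, transform in the tool, write them back."""
    rows = db.execute("SELECT sale_id, amount, currency FROM edw_sales").fetchall()
    db.executemany("INSERT INTO mart_sales_usd VALUES (?, ?)",
                   [(sid, amt * RATES[cur]) for sid, amt, cur in rows])

def load_elt(db):
    """ELT style: push the whole transformation down into the database engine."""
    db.execute("""
        INSERT INTO mart_sales_usd (sale_id, amount_usd)
        SELECT sale_id, amount * CASE currency WHEN 'EUR' THEN 1.4 ELSE 1.0 END
        FROM edw_sales
    """)

load_elt(conn)  # either call fills the mart; the EDW rows are not altered
print(conn.execute("SELECT * FROM mart_sales_usd ORDER BY sale_id").fetchall())
```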

Alright - the future stuff... so what do we need from ETL "vendors" in the future?

* ETL vendors must support both ETL and ELT (in-database)

* Fully configurable temporary tables, block-style processing, and in-database control - all from an ETL metadata and visual GUI perspective

* FULL 100% push-down must be supported, and if an "EL" step needs to be added to the chain, so be it - the ETL tool will automatically set that up and do its best to provide 100% push-down where necessary

* For advanced developers, the ability to control "HOW" the push-down will be executed, with full over-rides and step-by-step debugging IN THE DATABASE

* Many more, which I don't have time to post now... these are the major ones.

What does this mean to the Database Vendors?

* Ever-increasing support for "faster API calls"

* More parallel API calls

* Dedicated "step-by-step" debugging interfaces

* A whole lot more in-core coded transformations and complex SQL statements

* MORE batch-oriented SQL statements, where a "batch processing size" can be set and the statements then manage themselves (a sketch of this idea follows this list)

* MORE high-speed interconnection with remote database instances

* MORE metadata, inclusive of versioning of every single piece of executing code

* Versioning of the TABLE structures and INDEXES

* On-the-fly indexing

* Parallel index builds DURING high-speed load or batch operations

* NO MORE "TABLE COPY SWITCHING" for high-volume and high-availability operation
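As promised above, here is a hedged sketch of the "batch processing size" idea: a large in-database operation issued as repeated, size-limited statements rather than one huge transaction. sqlite3 again stands in for the target RDBMS, and the staging table, cutoff, and batch size are my own assumptions.

```python
# Sketch only: sqlite3 stands in for the target RDBMS; the staging table,
# cutoff, and batch size are illustrative assumptions.
import sqlite3

BATCH_SIZE = 1000  # the configurable knob the wish-list item asks vendors to expose

def purge_in_batches(conn: sqlite3.Connection, cutoff_id: int) -> int:
    """Delete old staging rows one batch at a time so locks and logs stay small."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM staging_rows WHERE id IN "
            "(SELECT id FROM staging_rows WHERE id < ? LIMIT ?)",
            (cutoff_id, BATCH_SIZE))
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_rows (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO staging_rows VALUES (?)", [(i,) for i in range(2500)])
print(purge_in_batches(conn, cutoff_id=2000))  # 2000 rows removed, 1000 per batch
```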

 

Please add some of your own thoughts to this party; I'd like to hear what you think.

As always,

Dan Linstedt
DanL@RapidACE.com  - check out a 3D Data Model Visualizer Demo!


Posted February 5, 2009 7:22 AM