Recently a discussion raged on LinkedIn regarding 'ETL tools that support Data Vault OUT OF THE BOX' (link). I gotta be honest: the discussion annoyed me, and I was foolish enough to show it by commenting rather harshly. I would like to apologize to everyone, and especially to Daan.
In this blogpost I would like to explain my point of view regarding this question.
In the above-mentioned discussion I commented very briefly: 'All ETL tools support Data Vault'. Allow me to explain this by paraphrasing an argument Daan also used in the subsequent comments. He mentioned that technology has brought about efficiency gains in the last 20 to 30 years. I agree with that; the data is quite clear about it ;-). Explaining these gains I leave to applied science, but I would like to take one tiny piece of the puzzle and put it in the context of my remark that 'all ETL tools support Data Vault'.
One of the 'variables' in the function of this tremendous leap - in my opinion - is uniformity. Organizing uniform systems (I use the term 'systems' in the broadest sense - People, Technology, Processes) opened the door towards repeatability, predictability, limiting waste and improving quality. In writing this I think Dr. W. Edwards Deming would agree with me.
Now, back to the subject of ETL and Data Vault. With Data Vault we design the system of modeling and logistics of data in advance. Both go hand in hand. What we want to achieve is as much uniformity as we possibly can: uniformity in modeling, balanced with uniformity in loading.
Let me elaborate some more.
In Data Vault - and more generally speaking, in 'systems thinking' - all objects in a system are interrelated. How I construct a data model has a strong impact on the way I (can) construct the loading (ETL). With Data Vault we standardize the data model as much as we can (there are quite some heuristics in Data Vault; it should not be applied in some dogmatic way) into a limited number of constructs (hub, link, sat). But we also design the loading constructs, which are equally limited in number (hub load, link load, sat load). Every load construct has a standardized pattern; see the figure regarding the pattern for a hub load.
If I were to translate this to SQL it would be something like: INSERT <distinct values> INTO HUB WHERE NOT EXISTS in HUB. Of course any ETL tool supports such a simple construct! Data Vaults are thus being built with SSIS, Informatica, InfoSphere, Business Objects, Pentaho, SAS, etc...
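To make that one-liner concrete, here is a minimal sketch of the hub load pattern, run against an in-memory SQLite database via Python. The table and column names (stg_customer, hub_customer, customer_bk) are made up for illustration; real implementations would add hash keys, record sources, and so on.

```python
import sqlite3

# In-memory demo database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (customer_bk TEXT)")
conn.execute("CREATE TABLE hub_customer (customer_bk TEXT PRIMARY KEY, load_dts TEXT)")

# Staged data contains duplicates, and one key is already present in the hub.
conn.executemany("INSERT INTO stg_customer VALUES (?)",
                 [("C001",), ("C002",), ("C002",), ("C003",)])
conn.execute("INSERT INTO hub_customer VALUES ('C001', '2011-09-20')")

# The hub load pattern: insert distinct business keys not yet in the hub.
conn.execute("""
    INSERT INTO hub_customer (customer_bk, load_dts)
    SELECT DISTINCT s.customer_bk, '2011-09-21'
    FROM stg_customer s
    WHERE NOT EXISTS (
        SELECT 1 FROM hub_customer h WHERE h.customer_bk = s.customer_bk
    )
""")

rows = sorted(r[0] for r in conn.execute("SELECT customer_bk FROM hub_customer"))
print(rows)  # ['C001', 'C002', 'C003']
```

Note that the pattern is idempotent: running it twice on the same staging data adds nothing new, which is exactly the kind of predictability that makes automation possible.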
Please be advised that the above is a simplified example; in real life the load patterns are considerably more complex. However, the principles remain unchanged:
- A limited number of loading patterns
- The patterns are standardized in type
- The patterns are simple
- The patterns can be executed asynchronously
- The patterns lend themselves to parallel loading
I would like to summarize the above with two words: uniformity and automation. Because of uniformity in modeling and logistics we open the door towards repeatability/automation, making the system a lot cheaper to maintain, but also easy to change or supplement (testability is designed into the system, as is repairability). Agile software development finds great support in these kinds of systems (this is worthy of an entirely new blogpost ;-)).
We can now design a predictable system for loading data into a data model. We have created a uniform structure for the data in the data warehouse, opening the way for more uniformity towards Kimball datamarts as well (be it in-memory, file-based, virtualised, etc.).
Uniformity and automation have ignited a wave of innovation in the Netherlands: innovation led by independent consultants and consultancy firms - who saw great opportunity in the daily problems they faced - taking data logistics to a new level of automation: metadata-driven ETL (open source example: Quipu; commercial example: WhereScape).
Posted September 21, 2011 3:44 AM