Blog: Ronald Damhof

Ronald Damhof

I have been a BI/DW practitioner for more than 15 years. In the last few years, I have become increasingly annoyed - even frustrated - by the lack of (scientific) rigor in the field of data warehousing and business intelligence. It is not uncommon for the knowledge worker to be disillusioned by the promise of business intelligence and data warehousing because vendors and consulting organizations create their "own" frameworks, definitions, super-duper tools etc.

What the field needs is more connectedness (grounding and objectivity) to the scientific community. The scientific community needs to realize the importance of increasing their level of relevance to the practice of technology.

For the next few years, I have decided to attempt to build a solid bridge between science and technology practitioners. As a dissertation student at the University of Groningen in the Netherlands, I hope to discover ways to accomplish this. With this blog I hope to share some of the things I learn in my search and begin discussions on this topic within the international community.

Your feedback is important to me. Please let me know what you think. My email address is Ronald.damhof@prudenza.nl.

About the author

Ronald Damhof is an information management practitioner with more than 15 years of international experience in the field.

His areas of focus include:

  1. Data management, including data quality, data governance and data warehousing;
  2. Enterprise architectural principles;
  3. Exploiting data to its maximum potential for decision support.
Ronald is an Information Quality Certified Professional (International Association for Information and Data Quality – one of the first 20 to pass this prestigious exam), Certified Data Vault Grandmaster (only person in the world to have this level of certification), and a Certified Scrum Master. He is a strong advocate of agile and lean principles and practices (e.g., Scrum). You can reach him at +31 6 269 671 84, through his website at http://www.prudenza.nl/ or via email at ronald.damhof@prudenza.nl.

This blogpost was inspired by Martijn Evers' blogpost, by Barry Devlin's work on Business Integrated Insight and by some very forward-thinking customers. The subject has been bugging me for some time now.

Back in 2005 I gave the 'Data Warehousing in Depth' masterclasses at the knowledge institute CIBIT in the Netherlands. One part of the class was pondering the future of data warehousing. It was my favorite part, and I remember that I always reminded the class of the root reason for a data warehouse: a physical construct to overcome deficiencies in performance, history, auditability, data quality, usability, ...

And I remember drawing two circles: one for data in the operational environment and one for data in the informational environment. Both represented physical environments; in other words, the data needed to physically move from the operational to the informational environment.

What would be necessary to blend the two environments, I then asked:

  • "Sources need to maintain history and adhere to strict auditability rules"
  • "More and faster hardware and parallelization options in both hardware and software"
  • "A common vocabulary of data, reflected in the operational systems"
  • "Enterprise-wide data like master and reference data needs to be maintained proactively (upstream) and centrally"
  • ...

These are all still very valid points, but I would add a more profound one:

Data Virtualization gurus preach to us the Information Hiding¹ pattern, a pattern perfectly suited for decoupling information systems from their data. These Data Virtualization guys and gals say that the software that supports data virtualization is the new pinnacle for decoupling operational and informational environments and makes the Data Warehouse² (eventually) obsolete.

My opinion: 'they' are right and they are wrong. Yes, Information Hiding is a pattern (there are more) that enables decoupling of the informational environment and the operational environment. But this distinction is somewhat flawed. It is a distinction that originated in the early '90s and reflected the deficiencies of using registered data for decision support activities.

We might wanna make an effort in trying to get rid of these deficiencies....

In the last 20-30 years, the data model and its instantiations (the data) were directly based on the information systems. The data model and the data were by-products. The data model fitted the information system. The informational environment was born to overcome the deficiencies that came with this approach.

I want to make a plea for shifting this process-oriented thinking and designing to data-oriented thinking and designing. Make the data smarter instead of making countless little/big information systems with their own 'data-store'. It is not that odd; information systems and the business processes they support are so much more susceptible to change than the data is. Data is - in its very nature - extremely stable over time.

This plea is not new: start with an Information Model of your business, construct a conceptual model and slowly design your way down to the logical model, the physical model and/or the canonical model³. Information systems are now made to fit the data architecture and not vice versa. These information systems are somewhat decoupled from the data architecture, and they need to use the Information Hiding pattern.

Now, the principle of Information Hiding and the technology that can handle this pattern can flourish. Data Virtualization technology can be used to its full potential, but the same applies to BPM technology or technology which is based on service oriented architectures.
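To make the Information Hiding pattern a bit more tangible, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the `CustomerData` interface, the two store classes and the `greeting` consumer are hypothetical and do not describe any particular data virtualization product. The point is only that the consumer knows the interface and nothing about where or how the data is kept.

```python
from typing import Protocol


class CustomerData(Protocol):
    """Hypothetical interface: all a consumer may know about customer data."""

    def name_of(self, customer_id: int) -> str: ...


class OperationalStore:
    """Hypothetical source that only holds the current value."""

    def __init__(self) -> None:
        self._rows = {1: "Acme BV"}

    def name_of(self, customer_id: int) -> str:
        return self._rows[customer_id]


class HistorizedStore:
    """Hypothetical source that keeps every version and serves the latest."""

    def __init__(self) -> None:
        self._versions = {1: ["Acme", "Acme BV"]}

    def name_of(self, customer_id: int) -> str:
        return self._versions[customer_id][-1]


def greeting(data: CustomerData, customer_id: int) -> str:
    # The consumer depends on the interface only; storage details stay hidden.
    return f"Dear {data.name_of(customer_id)}"


print(greeting(OperationalStore(), 1))
print(greeting(HistorizedStore(), 1))
```

Swapping one store for the other does not touch `greeting` at all, which is the kind of decoupling the pattern is after.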

So, now we have the following list of requirements that are necessary to blend the operational and informational environments:

  1. A data-oriented design and architecture of information systems, as opposed to a process-oriented design and architecture
  2. "Sources need to maintain history and adhere to strict auditability rules"
  3. "More and faster hardware and parallelization options in both hardware and software"
  4. "A common vocabulary of data, reflected in the operational systems"
  5. "Enterprise-wide data like master and reference data needs to be maintained proactively (upstream) and centrally"

And I will add another three:

  6. Centralized design and maintenance of business rules
  7. Data security and privacy laws and regulations are enforced at the data level
  8. An organizing framework for establishing strategy, objectives, and policies for (corporate) data⁴


It is impossible to be completely thorough in this list; there are indeed many more, but this is a blog... not a book ;-)

The criteria mentioned can be mapped to a discipline that is about to reach critical mass in terms of body of knowledge, technology support, rigor in science and relevance in practice: Data Management and Data Governance.

Back in 2005, in my masterclasses, I mused over the future of data warehousing. The future is now, the journey will not be easy, but the rewards are substantial. It truly is the future of the Sense and Respond⁵ organization.

 

Footnotes:

1 David L. Parnas, 1972, On the Criteria To Be Used in Decomposing Systems into Modules

2 Data Virtualization vendors often claim to make the Enterprise Data Warehouse obsolete (I am referring to the Boulder BI Brain Trust meeting at the beginning of this year, where a vendor made this claim). They confuse a technology (data virtualization software) with an architectural construct.

3 A Canonical model is a design pattern used to communicate between different data formats. It needs to be based on the logical model of the organization.

4 Jill Dyche and Evan Levy; note from the blogpost author: I adapted the quote by adding brackets around 'corporate'

5  Stephen H. Haeckel, 1999, Adaptive Enterprise: Creating and Leading 


Posted January 22, 2013 4:53 AM

Something has been going on for some time now, decades even. It all started with the arrival of the Internet, where people voluntarily contributed data to, well, everyone who was interested. Data about themselves, their relationships, their adventures, their careers. This data was shared with the consent of the owner of the data, although not everyone knew what the data was used for. So one might say that there was consent, but not informed consent.

Let's take it a step further and imagine....

What if our data - generated by others - was given back to us and we could consent, in an informed manner, to sharing this data for the greater good? For example: my tax information is my data, it is about me and I want to decide whether or not others can use this data. Or suppose I can get hold of my location/GPS data, showing all my movements. Or my point-of-sale data from the grocery store, showing my eating patterns. Suppose I can even get hold of the data of the last MRI I took, my genome data, the data of my last blood test or even the data of a clinical test I was in?

Imagine....

What if I could decide to contribute this data (consensually) for the public good, while my privacy was still being honoured? What if dozens of people decided to do that? What if millions of people did? Clinical research would never be the same again. We would be able to scan for patterns in seas of data consisting of environmental data and healthcare data. No more clinical trials with just 2000 people and ever smarter statistics. In this setting the healthcare specialists, the quants, the sociologists and the behavioural scientists would have an unprecedented test bed of data. Is there a correlation (or even causality) between aspects of travelling, career, eating patterns, social status and cancer? Suppose even several generations would contribute their data; what would that mean for clinical research? Mindblowing....

In the above I discussed data that is about me, so I should be the one who decides whether or not to share it. But what about data that is ours? In many countries the government heavily sponsors research: research on biology, behavioural science, economic science, climate science, etc. Shouldn't the data generated by this research be in the public domain? I think it should...

What about data created by government - which is us? Data about import and export movements, data regarding employment, schooling, law enforcement, crime, etc.

Imagine....

What would this democratization of data mean with regard to innovation? I think it would truly ignite a burst of possibilities and a huge potential for our general wellbeing. And no - I am not referring to the challenge of marketing handbags to middle-aged ladies (quote somewhat paraphrased from Neil Raden).

No, set the data free to go for the real challenges we face: decreasing poverty, climate control, improving healthcare, scarcity of resources, economic stability and decreasing crime.

This blogpost is hugely inspired by John Wilbanks (google the guy!), by all the Open Data initiatives of the world in which governmental agencies free up their data, and by the technological possibilities of data storage, data deployment, data enrichment, data visualization and advanced analytics. And finally... this blog is inspired by a deeply felt wish and conviction that our field of knowledge (data management and data utilisation) can contribute to a better place for us to live in.


Posted January 4, 2013 8:50 AM

I have always been fascinated by the true origins of modern-day phrases or trends in my domain - Information Management, data management in particular. It is like a challenge I give to myself, a puzzle waiting to be solved. Why, you say? Well, Aristotle said it already:

'If you would understand anything, observe its beginning and its development'.

I tend to first collect the modern-day writings about it, mostly by practitioners. Then I go to the online science libraries and browse through ACM journals, MIS Quarterly journals, European Journal of Information Systems, IBM Systems Journal, Decision Support Systems, Journal of Management Information Systems and lately the journal of Data and Information Quality Research. And I am forgetting a whole lot. But, since the field of information management is a relatively young science, I tend to eventually end up in the more or less classic science domains: psychology, mathematics, engineering, etc. If I took it any further I would probably end up with philosophy and theology ;-) and discover the meaning of life....

Being on such a quest is like opening up an unprecedented series of presents given to me by brilliant men and women. There is so much out there that can easily be applied to other domains, for example, the information management domain.

The same applied to 'Data Quality'. I started with the books of Thomas Redman (aka the data doc) and of course Larry English, Danette McGilvray, David Loshin and Jack Olson; Arkady Maydanchik cannot be missed either. And one cannot overlook the books written by Yang Lee, Richard Wang and Leo Pipino. The majority of these books, however (with the exception of Lee, Wang and Pipino), lack scientific rigor, the kind of Design Research approach introduced by Alan Hevner in 2004 (published in MIS Quarterly). And although this type of research is relatively young, there are many scientifically based papers out there that more or less adhere to several of the Design Research prerequisites that aim for scientific rigor and relevance in practice.

Since 2004 many papers on data quality have been published that are really precious to me, but it just was not good enough for me. I felt I had not reached the true origins yet. So I broadened the scope to 'Quality' in general. Quality in a manufacturing/engineering/services context pointed me in the direction of Shewhart, Deming, Juran, Crosby, Feigenbaum, Ishikawa, and also Peter Drucker. Boy - did I enjoy the writing of these guys (sorry, they were all men).

However, I slowly digressed into various domains that opened up Pandora's box: the domain of coping with change, management theory, decision theory, group processes, system theory, system dynamics and much more. And although I studied economics at a university, this was all new to me.

Still not sure whether I was not paying attention back in college or my university just sucked.

In between I entered the field of Quality Software Management. Not that odd, I would say; on an abstract level one might argue that it is the sum of the above combined with software engineering, my own professional domain and the projects I undertook. Back then I felt (and I still do) that Gerald (Jerry) Weinberg had captured the soul of all these quality people, combined with system theory, system dynamics, software engineering, a profound human perspective and a keen view on leadership and management (and why many current management models are simply dysfunctional).

If anyone really wants to go on a quest regarding 'agile software development': do not bother, just start by reading the books of Jerry Weinberg. You will not find the word 'agile', but you will recognize it.

These books (and he wrote a whole lot) put me on a roller-coaster (which I am still on) that included exploratory testing, self-organizing teams, leadership, Kanban/Scrum/XP, CMMI, Six Sigma, etc...

I have so many books now, so many papers, so many subjects, so many loose ends.....it is ridiculous.

And it all started with 'data quality'....

Am I done yet?

Hell no

Will I ever be done?

Hell no

Is it fun?

Hell yes

I need a second life, and a third...


Posted December 15, 2012 3:50 AM

On Wednesday, November 16th, 2011, Ralph Hughes from Ceregenics was in the Netherlands. Ralph is the author of the book 'Agile Data Warehousing: Delivering World-Class Business Intelligence Systems Using Scrum and XP'. Ralph is currently under contract to write more books on the topic of agility in data warehouse development.

I had been in contact with Ralph for some time; he wanted to know more about data vault, getting the facts, how it is actually used, what customers use it, how they develop and deploy, how it contributes to agility and how it impacted the business.

[Photo #1]

Of course, anything can be explained in writing or conceptually, but 'the proof of the pudding is in the eating'. Opportunity knocked when Ralph was in the Netherlands for his TDWI course on Agile data warehousing. He asked me whether I could arrange some customer visits in Amsterdam: customers that use and deploy Data Vault and have attained a high degree of agility.

Tom Breur and I were hosts for Ralph, and we visited the Free University (a client of mine) and BinckBank (a client of Tom's), both in Amsterdam. Hans Hultgren (Genesee Academy) happened to be in the Netherlands that week and joined us as well. We met with both management and technical team members of the university and BinckBank.

Both clients were particularly interesting because their data warehouses are in production and in a mode of constant change. Both clients showed a remarkable predictability and reliability in coping with these changes. Change equated to 'business as usual'. I remember Ralph asking an engineer 'how long does it take to deploy a new data element to the warehouse?' The engineer replied: 'do you want to know the lead-time including my coffee break?'.

Ralph, Tom, Hans and I were impressed with the accomplishments of these clients in getting their data warehouse deployment under control while constantly delivering value and changes to the business in a predictable fashion.

[Photo #2]

I will not transcribe the whole interview in this blog - that is simply too much - so send me a note if you want to know more. Interesting differences between Free University and BinckBank were that they used different automation techniques and that the level of business key integration differed slightly. Free University used templating (generating XML and importing it into Business Objects Data Services) for data warehouse automation, and its data warehouse was driven by business keys. BinckBank used Quipu for data warehouse automation, and its data warehouse was driven partly by business keys and partly by surrogate keys (see also my presentation at the Data Vault advanced seminar about different Data Vault species). In terms of software development methods, BinckBank used Scrum, and Free University worked waterfall/iterative with lots of lean practises.

I will try to summarize both visits from the perspective of Tom and me, particularly slanted towards Agile software development, by asking my blog readers three questions:

  1. Why is it that you can build and deploy extremely small particles in Data Vault and not in other approaches, without having an increase in the overhead and coordination of these particles? In other words: 'Divide and Conquer to beat the Size / Complexity Dynamic'¹
  2. Why is it that you can re-engineer your existing model and guarantee that the changes remain local? Something that is hugely beneficial in data warehouses that - by definition - grow over time.
  3. Why is it that - as your (Data Vault based) data warehouse grows - your costs grow 'merely' in linear fashion initially, and as you approach the end state marginal growth in cost decreases exponentially (as opposed to exponential cost increase for Kimball warehouses)?

[Photo #3]

I want to thank Free University as well as BinckBank for offering their time, their energy and enthusiasm to the general cause of knowledge sharing. Of course I want to thank Tom Breur and Hans Hultgren for putting in their time as well.

My special thanks, of course, to Ralph Hughes for being an open-minded, inquisitive and knowledgeable peer. It was great being your host in the Netherlands.

 

1 - Gerald M. Weinberg - Quality Software Management - 1992

Photo #1: Left in the corner sits Ralph Hughes, next to him Tom Breur. On the other side the Free University; Jaap Roos (project manager), Dorien Heijting (Data Warehouse Engineer), Erwin Vreeman (Project Lead).

Photo #2: Sitting with the American flag - Ralph Hughes and Hans Hultgren. At the top of the table - BinckBank: Michel Uittenbogaard (Data Warehouse Engineer) and on the right Paul Delgman (BI manager).

Photo #3: Sitting near the window looking down: me, myself and I


Posted December 9, 2011 12:50 AM

Recently a discussion raged on LinkedIn regarding the 'ETL tools that support Data Vault OUT OF THE BOX'. I gotta be honest - I was annoyed by the discussion and was stupid enough to show it by commenting kind of harshly. I would like to apologize to everyone and especially to Daan.

In this blogpost I would like to explain my point of view regarding this question. 

In the above-mentioned discussion I commented very briefly: 'All ETL tools support Data Vault'. Allow me to explain this by paraphrasing an argument that Daan also used in the subsequent comments. He mentioned that technology has brought about efficiency gains in the last 20 to 30 years. I agree with that; the data is quite clear about it ;-). Explaining these gains I leave to applied science, but I would like to take one tiny piece of the puzzle and put it in the context of my remark that 'all ETL tools support Data Vault'.

One of the 'variables' in the function of this tremendous leap - in my opinion - is uniformity. Organizing uniform systems (I use the term 'systems' in the broadest sense - People, Technology, Processes) opened the door towards repeatability, predictability, limiting waste and improving quality. In writing this I think Dr. W. Edwards Deming would agree with me.

Now, back to the subject of ETL and Data Vault. With Data Vault we design the system of modeling and logistics of data in advance. Both go hand in hand. What we want to achieve is uniformity, as much as we possibly can: uniformity in modeling, balanced with uniformity in loading.

Let me elaborate some more.

In Data Vault, and more generally in 'systems thinking', all objects in a system are interrelated. How I construct a data model has a strong impact on the way I (can) construct the loading (ETL). With Data Vault we standardize the data model as much as we can (there are quite a few heuristics in Data Vault; it should not be applied in some dogmatic way) into a limited number of constructs (hub, link, sat). But we also design the loading constructs, which are also extremely limited in number (hubload, linkload, satload). Every load construct has a standardized pattern; see the figure regarding the pattern for a hub load.

[Figure: the loading pattern for a hub]

If I were to translate this to SQL it would be something like: INSERT <distinct values> to HUB where NOT EXIST in HUB. Of course any ETL tool would support such a simple construct! Data Vaults are thus being built with SSIS, Informatica, InfoSphere, Business Objects, Pentaho, SAS, etc.
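For those who like to see it run, here is a minimal sketch of that hub load using Python and an in-memory SQLite database. The table names (`stg_customer`, `hub_customer`), the columns `load_dts` and `record_src` and the sample rows are all assumptions I made up for the illustration; only the INSERT ... WHERE NOT EXISTS shape comes from the pattern described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (customer_no TEXT, name TEXT);
    CREATE TABLE hub_customer (
        customer_no TEXT PRIMARY KEY,   -- the business key
        load_dts    TEXT NOT NULL,      -- assumed audit column: load timestamp
        record_src  TEXT NOT NULL       -- assumed audit column: record source
    );
    INSERT INTO stg_customer VALUES
        ('C1', 'Acme'), ('C1', 'Acme BV'), ('C2', 'Globex');
    INSERT INTO hub_customer VALUES ('C2', '2011-09-01', 'crm');
""")

# INSERT <distinct values> to HUB where NOT EXIST in HUB
HUB_LOAD = """
    INSERT INTO hub_customer (customer_no, load_dts, record_src)
    SELECT DISTINCT s.customer_no, :load_dts, :record_src
    FROM stg_customer AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM hub_customer AS h WHERE h.customer_no = s.customer_no
    )
"""


def load_hub(connection, load_dts, record_src):
    """Insert business keys the hub does not know yet; return how many."""
    cur = connection.execute(
        HUB_LOAD, {"load_dts": load_dts, "record_src": record_src}
    )
    connection.commit()
    return cur.rowcount


print(load_hub(conn, "2011-09-14", "crm"))  # number of new keys
print(load_hub(conn, "2011-09-14", "crm"))  # same load again
```

Running the same load twice adds nothing the second time, which is what makes the pattern safe to repeat.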

Please be advised that the above is a simplified example; in real life the load patterns are considerably more complex. The principles, however, remain unchanged:

- A limited number of loading patterns

- The patterns are standardized in type

- The patterns are simple

- The patterns can be executed asynchronously

- The type of patterns can make use of parallel loading

I would like to summarize the above with two words: uniformity and automation. Because of uniformity in modeling and logistics we open the door towards repeatability/automation. This makes the system a lot cheaper to maintain, but also easy to change or supplement (testability is designed into the system, as well as repairability). Agile software development finds great support in these kinds of systems (this is worthy of an entirely new blogpost ;-)).

We now can design a predictable system of loading data in a data model. We have created a uniform structure of the data in the data warehouse, opening the way for more uniformity towards Kimball datamarts as well (be it in-memory, on file, virtualised, etc..).

Uniformity and automation have ignited a wave of innovation in the Netherlands. Innovation led by independent consultants and consultancy firms - that saw great opportunity in the daily problems they face - to take the data logistics to a new level of automation; metadata driven ETL (example open source: Quipu, example commercial: WhereScape).
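As a rough illustration of what 'metadata driven' can mean, here is a small Python sketch that generates hub load statements from a list of metadata records. It is not how Quipu or WhereScape work internally; the `HubSpec` record, the template and the table and column names are hypothetical and only show the idea that one standardized pattern plus metadata yields all the loads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HubSpec:
    """Hypothetical metadata record: one hub and its staging source."""
    hub_table: str
    staging_table: str
    business_key: str


# One standardized pattern; load_dts and record_src are assumed audit columns.
HUB_TEMPLATE = (
    "INSERT INTO {hub} ({key}, load_dts, record_src)\n"
    "SELECT DISTINCT s.{key}, :load_dts, :record_src\n"
    "FROM {stg} AS s\n"
    "WHERE NOT EXISTS (SELECT 1 FROM {hub} AS h WHERE h.{key} = s.{key})"
)


def render_hub_load(spec: HubSpec) -> str:
    """Fill the single hub-load template with the names from the metadata."""
    return HUB_TEMPLATE.format(
        hub=spec.hub_table, stg=spec.staging_table, key=spec.business_key
    )


SPECS = [
    HubSpec("hub_customer", "stg_customer", "customer_no"),
    HubSpec("hub_product", "stg_product", "product_no"),
]

for spec in SPECS:
    print(render_hub_load(spec))
    print()
```

Adding a hub then means adding one line of metadata, not writing another piece of ETL by hand.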


Posted September 21, 2011 3:44 AM