Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University where I participate on an academic advisory board for Masters Students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

May 2009 Archives

There are quite a few vendors out there who publish industry logical data models.  I've had my fair share of direct, hands-on experience with them over the last 8 to 10 years.  In this entry I will share a few of the pros and cons of this class of models. My objective is more of an inquisitive one, in hopes that you will respond with your experiences - and if you are a vendor, please post your opinions.

First let's take a look at how these iLDMs are defined. Steve Hoberman offered some thoughts on the subject in 2004: http://www.teradata.com/tdmo/v06n04/Tech2Tech/InsidersWarehouse/DataModels.aspx

The right tools to capture how an organization within a particular industry works, or should work, are essential for in-depth knowledge of your business. These tools should also provide expert modeling structures for your enterprise to use as a guide. An industry logical data model (iLDM) offers this kind of information and guidance.

...The iLDM is a pre-built model that extends the EDM concept. Like the EDM, it is a subject-oriented and integrated view of the information needed for an enterprise data warehouse (EDW) to answer strategic and tactical business questions....

...While it does not contain history, the iLDM represents a snapshot of information (point in time). It also does not contain derived data or other calculations to assist with retrieving data more quickly. All model changes necessary for issues like speed, storage, security, backup and recovery must be added at the physical level....

Interesting... It does not contain history, nor derived data, nor other calculations... However, the iLDMs I've seen have been implemented 1-to-1 with the physical model at client sites, and DO contain history, and DO contain calculations, and so on. So right out of the gate, it appears that vendors are not necessarily following Steve's advice...

In my days of data modeling, I was taught, as Steve suggests, that logical data models should stay logical - and oftentimes they do NOT represent the physical (due to partitioning, indexing, data types, super- and sub-types, and so on). In my experience, however, the vendors' professional services teams appear to want to implement these LDMs physically in a 1-to-1 fashion (meaning 1 logical table = 1 physical table).

Moving on, I find this interesting: http://www.freepatentsonline.com/7359906.html - a patent held by several individuals on a particular LDM. I'll let you decide whether this is a good or a bad thing. Let's move on...

Here is more information on iLDMs from another individual: http://www.univdata.com/Portals/9/udm_Products_Repository.pdf

And another: http://publib.boulder.ibm.com/infocenter/rdahelp/v7r5/index.jsp?topic=/com.ibm.datatools.logical.ui.doc/topics/clogmod.html

A logical data model is a model that is not specific to a database that describes things about which an organization wants to collect data, and the relationships among these things.

A logical model contains representations of entities and attributes, relationships, unique identifiers, subtypes and supertypes, and constraints between relationships. A logical model can also contain domain model objects, or reference one or more domain or glossary models. After logical objects and relationships are defined in a logical data model, you can use the workbench to transform the logical model into a database-specific physical representation in the form of a physical data model.

What I find in my experience:

What I've found is that professional services organizations take these logical data models and implement them physically, directly on the specific databases, in a 1-to-1 fashion. Of course, they add all the components necessary to make them work in a physical realm. However, I've found both pros and cons to this approach. So let's get to the heart of this entry and discuss (I'd love to hear from you about your pros and cons, and I'm not here to try to tell you one way or the other what to do with the LDMs)...

Pros:

  • Coverage of the LDMs is wide
  • The people who design these models bring years of industry experience
  • Somewhat "fast" implementation, if the scope of the project is kept to a minimum
  • As guides go, they are understandable (to a point) and contain many of the complex relationships needed to express information properly
  • They are "mostly" compressed (smaller models, fewer tables)
  • Easy to "buy" versus "build"

(This is the part where you can add to the list)

Cons:

  • Consulting organizations often don't understand how to implement them properly
  • They have been mapped 1-to-1 to the physical model (without the necessary changes)
  • The models are supertyped, and are often at too high a grain for customers to really grasp ALL the elements that are combined within them
  • The supertypes try to handle too many levels of "grain" - mixing, for example, corporate customers with individual customers - changing the outcome of the queries on the reports
  • They often require years of expertise and training to implement properly
  • They are sometimes a nightmare to tune, partition, and get performance from
  • They sometimes require tons of staging areas to "prepare" the data before integrating it into supertyped table structures
  • They nearly always need "extending" and customization, which doesn't fare well with customers, as this becomes a multi-year, high-cost effort - turning into a huge consulting services gig for big dollars.

Now are they worth it?
Sometimes they are; other times they fall down. If you're a customer looking at these models, they certainly seem appealing.

What's your experience? *** WARNING: THIS IS MY OPINION ONLY! ***
I've seen these models take 3 to 5 years to implement correctly, and to modify and customize properly. To be fair, I've seen a 50/50 success rate with the implementations: some have worked beautifully, others have failed miserably. Was it because of the model? Can't say. In some environments it was the politics - not the model, not the provider, not the hardware or the database. At other companies it was because the consulting services didn't understand how to integrate, alter, upgrade, and customize. At still others it was because of the storage needed in the staging area in order to "cleanse/change/alter and supertype" the data before it was loaded.

The successful implementations I've seen succeeded because the scope was controlled, the modifications were extremely minimal, and data quality (from the source system) was extremely high. I've also seen successful implementations when there is only one source system to deal with. When that is the case, cross-mapping the information from the source to the target supertypes is fairly easy to do. When there are multiple source systems involved, the task quickly becomes tremendously complex.

My conclusion (this is where I'd like your feedback)
I believe that sometimes buying the logical data models can really help move the project forward, but I also believe that breaking out the supertype into its constituent raw-level components for the physical model is an important act of service to the customer. I believe that if you "customize" the logical data model, you will have trouble when the vendor puts out a new version (one that alters the version you have in place). I've seen cases where a 2- to 3-year bid effort to customize becomes a 7-year services contract in order to maintain customizations across multiple releases.

Don't get me wrong: LDMs are WONDERFUL GUIDES! But alas, they are "logical" data models, and should remain "logical" as their name implies. In the physical world, dealing with history, separation of grain, functional areas of business (crossing lines of business), and so on is very, very important. Customizations are nearly *ALWAYS* needed in a data warehousing world, even with pre-packaged implementations of "pre-built data warehouse solutions."

Finally, let me ask you data modelers out there: does it really make the physical ETL implementation that much easier if you "supertype" everything (customer, individual, employee, manager, corporation, organization, and so on) into "one single table"? What if each of these elements has its own business key, and the data is clearly at different grains - what do you do in your LDM to ensure success?
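To make the question concrete, here is a minimal sketch of the two shapes I'm describing. The names and attributes are hypothetical (mine, not taken from any vendor iLDM); it simply contrasts a single supertyped structure that mixes grains with a physical break-out by business key and grain:

```
# Hypothetical illustration only - names are mine, not from any vendor iLDM.
from dataclasses import dataclass
from typing import Optional

# Shape 1: the supertyped approach - one "party" structure holds every grain.
# Individual customers, corporate customers, employees, etc. all land here,
# so many attributes are optional and the business key differs by party type.
@dataclass
class Party:
    party_id: int                          # surrogate key
    party_type: str                        # 'INDIVIDUAL', 'CORPORATION', 'EMPLOYEE', ...
    ssn: Optional[str] = None              # business key only for individuals
    duns_number: Optional[str] = None      # business key only for corporations
    employee_number: Optional[str] = None  # business key only for employees
    name: str = ""

# Shape 2: break the supertype out by grain and business key for the physical model.
@dataclass
class IndividualCustomer:
    ssn: str          # natural business key, one row per person
    name: str

@dataclass
class CorporateCustomer:
    duns_number: str  # natural business key, one row per legal entity
    legal_name: str
```

With the supertyped shape, every query and every load has to filter on the party type and pick the correct business key, or reports silently mix grains; the broken-out shape makes the keys, the grain, and the row counts unambiguous.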

Thank-you for your comments and feedback,
Daniel Linstedt
DanL@DanLinstedt.com


Posted May 22, 2009 4:26 AM
Permalink | No Comments |

The market has been asking for EDWs to deal with more and more real-time based data.  IT, on the other hand, has become "slower and less agile" as its current system of federated data marts gets larger and larger.  In this entry we will deal with some of the issues and some of the questions, and of course offer an opinion and some insight into dealing with true real-time data sets arriving at the doorstep of the EDW.

First, what exactly is real-time data?

Well, when we put it that way, there is no such thing.  It's always near-real-time to be honest.  If it were true real-time, we'd have the data the instant it is created.

So what is near-real-time data as opposed to batch data?

In my book, near-real-time (or streaming, or semi-streaming) data is anything that arrives at intervals of less than 5 seconds.  Even at a constant arrival of every 5 seconds, I would have to say mini-batch might take place.  Quite honestly, there needs to be a continual flow of transactions; in some of the systems I've dealt with, the transaction rates are usually millisecond based, or even sub-millisecond based.  In any event, near-real-time means (to me) that data arrives too fast to do anything with it (like applying business rules and cleansing), and that if you did apply these things, you would end up with a backed-up pipe on inflow.

Everything else becomes a mini-batch, burst-rate batch, or large-batch based system.

How does this affect me?

Well, it really affects your traditional batch loads - particularly if you "put your business rules upstream" of the data warehouse.  By having business rules (quality, cleansing, data alteration, etc.) upstream of the warehouse, you are immediately introducing a processing bottleneck and "disabling" your system's ability to have near-real-time feeds (as I've defined them).

Example: if the EDW receives 10,000 transactions a second, and it takes your "batch load process" 1 minute to load 80,000 rows, you have a bottleneck.  You simply cannot run near-real-time without one of the following: new hardware, faster hardware, RE-ENGINEERING of the batch process, RE-ARCHITECTURE of the EDW, and so on...  If the reason you can only run 80,000 rows per minute is the business rules upstream, and you throw new hardware at it, you have just instituted a short-term stop-gap measure.  The risk that you will hit this bottleneck again in the very near future is very, very high.  Eventually you cannot afford to throw money at it anymore, and you are stuck re-engineering or re-architecting to solve the problem.
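To put rough numbers on that example (a back-of-the-envelope sketch using only the figures above):

```
# Back-of-the-envelope check using the numbers from the example above.
arrival_rate = 10_000            # transactions per second hitting the EDW
load_rate = 80_000 / 60          # batch process: 80,000 rows per minute ~ 1,333 rows/sec

backlog_growth = arrival_rate - load_rate   # rows queuing up every second
print(f"load rate        : {load_rate:,.0f} rows/sec")           # 1,333 rows/sec
print(f"backlog growth   : {backlog_growth:,.0f} rows/sec")      # 8,667 rows/sec
print(f"after one hour   : {backlog_growth * 3600:,.0f} rows")   # 31,200,000 rows behind

# To keep up at all, the load path has to get faster by roughly:
print(f"needed speed-up  : {arrival_rate / load_rate:.1f}x")     # 7.5x
```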

The business sometimes sees re-engineering and re-architecture as a sign of weak or incompetent IT; other times they call it an "EDW failure" when the costs to re-engineer grow too high.  So they shut us down, hire a new team, and start over.

Well, there's a simple solution to this problem: MOVE YOUR BUSINESS RULES DOWNSTREAM OF YOUR DATA WAREHOUSE!!  Move them to the OUTPUT side of the EDW, between the EDW and the data marts.  Then allow RAW data to arrive and land in the EDW - that is to say, allow the "good, the bad, and the ugly" data in once it is ready.

This is critically important, and it is truly the only way to handle near-real-time data at sub-second arrival latencies.  It is the only way to scale the EDW appropriately.  Furthermore, it is the best method (I believe) for answering auditability and compliance issues in real time.  I implemented systems like this over 10 years ago, and they are still standing, and growing, today without the need for re-architecture or re-design!!  The business is HAPPY with the IT team, and the EDW is a "success" according to the business.
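As a minimal sketch of what "rules downstream of the warehouse" can look like - the function names and the single "rule" below are my own, purely illustrative, not a prescribed implementation:

```
# Illustrative only: function names, the rule, and the in-memory "stores" are hypothetical.

def land_raw(record, edw):
    """Inbound side: no business rules, no cleansing - just land the raw record.
    Keeping this path free of rules is what lets it keep up with near-real-time arrival rates."""
    edw.append(record)            # the good, the bad, and the ugly all land as-is

def publish_to_mart(edw, mart):
    """Outbound side: business rules run between the EDW and the data marts."""
    for record in edw:
        cleaned = dict(record)
        # example rule (assumed, illustrative): standardize the country code
        cleaned["country"] = cleaned.get("country", "UNKNOWN").strip().upper()
        if cleaned["country"]:    # quality gate applied downstream, not upstream
            mart.append(cleaned)

# usage sketch
edw, mart = [], []
land_raw({"customer": "ACME", "country": " us "}, edw)
publish_to_mart(edw, mart)
print(mart)   # [{'customer': 'ACME', 'country': 'US'}]
```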

What interests you?  Do you have "real-time" needs?  How have you been able to successfully meet them?

Thanks,

Dan Linstedt

DanL@GeneseeAcademy.com


Posted May 9, 2009 5:25 AM
Permalink | No Comments |

I've always felt that this blog (with the agreement of Bill Inmon, Shawn Rogers, and Ron Powell) is a place for me to express my guarded, guided, and best possible opinion about vendors, the industry, and industry direction.  I've also long believed that vendors DO HAVE GREAT PRODUCTS, but sometimes their marketing and sales campaigns get a little over-zealous and advertise "features" that simply aren't quite true, or that the product doesn't deliver in the situations they claim it does.

Recently, however, I've been getting "flak" from certain areas of the industry, because some of the vendors who read my blog don't like what I'm saying.  They are exerting indirect pressure on my friends and industry affiliations to "disconnect me, take me off the map," claiming I have no right to share my information on this blog, and questioning whether I should even be an industry analyst in the first place.

This entry is a question to all my readers (vendors included), and I hope you will respond with your thoughts and comments.

I always do my best to represent the vendor and the industry fairly, and when I receive a number of questions from customers about a particular vendor, I do my best to research the information before posting.  Now, I don't always get it right - but I'm not here to "slander" anyone, and I'm not here to outright "bash" anyone either.  I hope that my blog entries have been fair and forward - and oh yes, I'm human, so I'm positive that I have errors and omissions - to which I have always invited vendors and readers alike to post corrections.

By the way, FULL DISCLOSURE:  the ONLY "marketing articles" I've written for money from a vendor are in Teradata Magazine, and for IBI and their SSIS "plug-ins" for data connectivity.  Come to think of it, the ONLY other place I've collected money for "material" is from TDWI, for teaching classes.  In my daily life, I teach a lot of courses professionally around systems design, data warehousing, very large database systems, SEI/CMMI and project management, and so on - but we all have a job, right?  I've not collected or been paid any money by Microsoft, Oracle, Sybase, IBM, Pentaho, Informatica, or other vendors I've worked with and helped in the past - at least not in the context of sitting on their boards (which I don't), writing marketing articles for them, or writing paid research pieces for them.

I've always tried to keep a level head and offer a "no-nonsense, cut through the marketing" approach in all my entries, and I'm sorry if any of these entries appeared otherwise or were mistaken for other kinds of pieces.

Question #1: Does the above disclosure make me biased?  I would have to say yes, to some degree - but then again, BIAS is the way we live our lives.  The act of having a job and earning an income makes all of us biased.  But what's your vote on this?

Question #2: Is this blog a place for professionally guided opinion?  Or should this blog be "controlled" by vendor pressure or any other external pressure?  Should I say what the vendors want me to say?

Question #3: I've always tried to make this blog a place where I write "more than a sound-bite", as I don't like to echo news elements or other pieces.  Should I change the format of this blog to be more "sound-bites" and less thought and consideration?

Question #4: Do any of my readers (other than vendors) feel as though I've intentionally blogged something here with intent to harm?  If so, let's discuss this.

Question #5: Should I continue offering "MY OPINION" (based on free speech) on the blog and continue writing about vendors and disclosing vendor names? Or should I never mention any vendor ever again?

Question #6: TO VENDORS: have I written anything about ANOTHER vendor that you've found interesting or helpful?

Question #7: Should a Vendor have enough power to shut me down, force me to remove blog entries, and try to cut me off from the industry?  (this is what is currently taking shape)

I could clearly see problems and issues if I went around bashing, and offering slander, but I just can't see that this is what the blog is about, and certainly: I do not live my life this way.

So there you have it, I'll put it to my readers - please REPLY with comments so that we can take a vote here.  Tell us all what you think about this.

Thank-you kindly, with all due respect,

Daniel Linstedt


Posted May 7, 2009 8:39 AM
Permalink | 6 Comments |

There are many definitions of governance, compliance, and accountability in the industry, and it seems as though many of us are struggling to define them for the EDW/BI space in some generically acceptable way.  As I've recently researched these subjects (and been involved in them for years), I've noticed a trend: many of the definitions are vertically challenged (if you will).  They focus on a vertical industry rather than on horizontal enterprise data warehousing.

In this entry I'll add my two cents to this noise as just another voice of opinion on these subjects.

Governance can be defined many different ways; there are hundreds of definitions out there, and in the B-Eye-Network Best Practices section of this site there are discussions, white papers, and many other resources that define governance.  My opinion on the matter is as follows:

* Governance for EDW and BI should be about managing, monitoring, optimizing, fixing what's broken, and producing ever-increasing quality of work product.  It involves people, process, and technology - all aspects of the game - and its focus should be horizontally driven across lines of business in order to meet the needs of the enterprise and BI initiatives.

* Compliance for EDW and BI should be about meeting agreements and showing progress in governance - between IT and the internal business, and between the enterprise and its external customers.  Good "compliance" means (in my opinion) meeting service level agreements between organizations, both inter-organization and intra-organization.  Applying compliance in the EDW/BI landscape means upholding standards and delivering quality, auditable information across the business.  Service level agreements should cover the "data" that is delivered, as well as the people and processes used to meet the needs.  Compliance means setting standards that multiple parties agree to, and then reaching those levels of delivery.

 

* Accountability for the EDW and BI team should be about the ability of the business to stand up and take ownership of mistakes made - errors and omissions in meeting both compliance and governance initiatives.  It should be horizontally focused across the enterprise, and should include the data warehouse information delivered in reports on a daily basis.  This includes the raw data store for good-bad-and-ugly data in the EDW, along with the "quality cleansed" data stores used in data marts, and ensuring that numbers match at a corporate level when necessary.  Accountability is all about people taking ownership of their mistakes - admitting errors, omissions, and optimization opportunities at the process level, then applying this to the systems.

These are my introductory thoughts; I would love to hear from you about what you would add, change, or remove from these definitions.

Cheers,

Dan L, DanL@GeneseeAcademy.com


Posted May 6, 2009 6:51 AM
Permalink | No Comments |
This is by no means meant to be an exhaustive list; however, some of you may find it helpful. It's a set of information about very large data warehousing that I've gleaned over the years.  I hope you enjoy it, and please add your own knowledge by commenting on this entry.

Things to do when growing your project from 1TB to 50TB, or 50TB to 500TB, or above.

* Establish governance across the IT staff supporting and maintaining the VLDW

* Get training on performance and tuning for the database you are using

* Ask the vendor (<insert your favorite vendor here>) to show you live, working examples of their database/ETL running at customers with data sets in the size range you are considering

* Read VLDW reports from the web (some of which the vendor will happily provide to you); some cost money, and others are free

* Ensure your throughput rates for ETL/ELT + loaders are 80,000 rows per second on average with 1 KB row sizes (without partitioning/parallelism); anything slower is seriously detrimental to your ability to grow the environment (see the quick arithmetic sketch after this list)

* Check the data model.  If you are coming from a column-based database (perhaps because you outgrew it), then you must map, create, and manage the data model for the relational database you are moving to.   Even if you are using a column-based appliance, you should have a solid data model foundation for logical data representation.  Governance...  the bigger the system gets, the harder it becomes to manage without strict rules, standards, and good data models.

* Be flexible.  Learn to align IT, and to turn IT into a lean machine that can execute rapidly and adapt to business needs.

* Don't be afraid to scale out, or in some cases to scale up (mostly with Big Iron - and no, Big Iron is not dead, far from it).

* Learn the terminology: MPP, SMP, NUMA, clustered, grid, cloud

* Establish a mitigation plan, and a risk analysis plan for "what if this node fails?"
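Here is the quick arithmetic sketch referenced in the throughput bullet above - my own back-of-the-envelope numbers (assuming roughly 1 KB per row, and using the 1.5-billion-row nightly feed mentioned later in this entry as the example volume), not a vendor benchmark:

```
# What 80,000 rows/sec at ~1 KB per row implies, before partitioning/parallelism.
rows_per_sec = 80_000
row_size_bytes = 1_024

throughput_mb_per_sec = rows_per_sec * row_size_bytes / (1024 * 1024)
print(f"sustained load throughput : {throughput_mb_per_sec:,.0f} MB/sec")   # ~78 MB/sec

# Even at that rate, a 1.5-billion-row nightly feed takes a while to load:
nightly_rows = 1_500_000_000
print(f"nightly feed load time    : {nightly_rows / rows_per_sec / 3600:.1f} hours")  # ~5.2 hours
```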

Technically what you can do:

* Test the limits of your machines, networks, disk, CPU, RAM, and so on - understand their maximum throughput, and their average throughput with multiple parallel processes.

* Test the database: how does it perform with queries running at the same time as a large load?  The bigger the system gets, the harder this will be to "manage."  Testing with 100,000 rows of data won't cut it when one feed might deliver 1.5 billion rows every night (see the test-harness sketch after this list).

* Increase I/O disk speed - 300 to 400 Mbit of throughput per second from the server to the disk and back again.  Watch the BUFFERING that occurs on the disk; ensure that the test clears the disk buffer before you run it again, otherwise your results will be invalid.

* Increase the I/O channels. The number of I/O channels can have a huge impact on the performance of very large systems.
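As one possible shape for the concurrent load-plus-query test mentioned above: the harness below is my own minimal sketch using Python's standard library, with SQLite as a stand-in database only so the sketch runs anywhere - point the connect() calls at your real warehouse driver and use realistic volumes, row sizes, and query mixes.

```
# Minimal concurrent load + query timing harness (illustrative only).
# SQLite is a stand-in; swap in your warehouse's driver and realistic data volumes.
import sqlite3, threading, time, random

DB = "vldw_test.db"

def setup():
    con = sqlite3.connect(DB)
    con.execute("CREATE TABLE IF NOT EXISTS fact_sales (id INTEGER, amount REAL)")
    con.commit()
    con.close()

def loader(rows=100_000, batch=10_000):
    """Simulates the large load, committing in batches."""
    con = sqlite3.connect(DB, timeout=30)
    start = time.time()
    data = [(i, random.random() * 100) for i in range(rows)]
    for i in range(0, rows, batch):
        con.executemany("INSERT INTO fact_sales VALUES (?, ?)", data[i:i + batch])
        con.commit()
    con.close()
    print(f"load   : {rows / (time.time() - start):,.0f} rows/sec")

def reporter(queries=20):
    """Simulates report queries running at the same time as the load."""
    con = sqlite3.connect(DB, timeout=30)
    start = time.time()
    for _ in range(queries):
        con.execute("SELECT COUNT(*), AVG(amount) FROM fact_sales").fetchone()
    con.close()
    print(f"query  : {(time.time() - start) / queries * 1000:.1f} ms average")

if __name__ == "__main__":
    setup()
    threads = [threading.Thread(target=loader), threading.Thread(target=reporter)]
    for t in threads: t.start()
    for t in threads: t.join()
```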

Again, these are some of the things you can do today.  I'd love to hear about your environment, along with your results.

 

Cheers,

Dan L  danL@GeneseeAcademy.com


Posted May 6, 2009 1:41 AM
Permalink | No Comments |