Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University where I participate on an academic advisory board for Masters Students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author >

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including: IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog on http://www.b-eye-network.com/blogs/linstedt/.

Recently in BI Vendors Category

We live in a world where video delivery is becoming the norm.  Business users are getting tired of "bar-charts" and "standard reports".  They want interactivity.  While drill-down was an interesting development in interactivity, there doesn't seem to be any major advancement from the BI vendors in years.

With the advent of Flash-delivery, and Microsoft's new Silverlight platform, one would think that BI vendors would have had tremendous advances in technology recently, but no - we're still dealing with the old column based delivery mechanisms, and we think that Pivot tables are "cool"...  Man, we're stuck in the 80's here people...

I am learning Flash, along with Silverlight.  I'm also learning video, interactivity, dynamic graphics, movement, and so on.  Yeah, yeah - I can hear it now: that's old technology, web-designers have been doing this for years!  Yep...  I know.  Why then can't we build BI systems and dashboards that provide this kind of interface for our business users?

Some companies claim to handle queries on the backend against VLDW, but fall down when one of the tables has 1.5 billion rows in it.  Some companies claim the latest in "drill-down" technology.  Ok-fine and dandy.  Some companies claim the latest in 3D bar charts or live graphs!  Still some companies say: we integrate with xyz column and pixel positioning systems... uh-huh....  ok - let's get down to brass tacks:

* I agree we need to deliver valuable information in a format that most business users understand, but I also believe in the power of paradigm shifts.
* Where's the tie of the reporting/BI analytics to the business rules?
* Why can't I walk through my business rule processes in 3D (like a walk down a street) and see specific analytics that make sense to that area of business?
* Where are the truly interactive charts and graphs?

I've said it before, I'll say it again: Hire a game programmer to make BI/Analytics interesting, fun and maybe even addicting!  What?  And disturb the balance?  What balance?  What's the hottest selling game out there (according to informal fads and polls and what I see selling)...  maybe Guitar Hero?  It's on all the platforms.  What does it do that makes you play it for hours on end?

INTERACT...  It gives you a set speed, a set song, a stage, and a fake guitar with 5 buttons on it - you have (what seems like) infinite combinations of notes and speeds of notes to place your fingers on the buttons.  Your skill level determines how fast the game goes.

Now there are more advanced games, like Warcraft, Doom, and so on that make use of more buttons, intellect, thinking, terrain changes (during the game).  And because you are playing against humans, you've got to be good, or get better.

Ok - so maybe the themes aren't right for BI and analytics, but jeepers creepers, when I open up an application and the data sits there - I feel like I'm sitting in an elevator in the 1970's listening to elevator music, waiting to push a floor button.  Dry, Dry, Dry... 

Why not put the "cubes" on a Flash carousel?  Why not have the cubes visualized in 3D and inter-connected?  Why not display data in a 3D sound wave format, where the headquarters is the center of the graph?  Why not be able to fly in to the graph as drill down, fly under, fly through - re-focus the graph on a live grid?  Why not use some scientific style graphing or themed graphing techniques to represent the business and the data in a metaphorical manner?

For instance, what if I'm in the oil & gas industry, and what if I represented my business data and profitability as a land-map graph?  What if oil wells represent business units, and can run-dry if they are not profitable?

I believe that there is power in metadata, and what if the metadata were metaphors for the business - the entire business?  Could we develop 3D visualization techniques across metaphors and make better use of business metadata?  YOU BET!   Ahh-but wait a minute, this might require the business to get better at building, managing, and governing the metadata.  Yep.

But as sure as I sit here, I can tell you that in order for BI to "break out of its shell" and become truly USED (no doubt it's useful), I believe that 3D visualization along the lines of themed, game-play-like consoles may be what's required.

I wonder what would happen if the world's largest company commissioned someone like DreamWorks to develop an interactive scenario game, based on a metaphorical description of their business?

Just an idea folks... think about it.  Can BI be fun in the future?  I would like to think so.  If you've got some themes or ideas for specific lines of business, I'd love to hear about them.

Cheers,
Dan Linstedt
DanL@RapidACE.com
http://www.RapidACE.com - 3D Data Model Visualizer


Posted February 2, 2009 4:04 AM

We live in a world saturated with hand-held devices.  These devices can show TV and streaming movie content, browse the web, and provide interactivity via iconography and touch panels.  Yet we have serious problems with the delivery mechanism of an EDW, and of BI, on these devices.  Yes, we can produce graphs and charts, and web-based reports - but it all appears to be back-ended by large scale systems that must be up all the time, and we have to be connected to the web in order to "work" with our applications.

There are needs out there beyond the connected case, which I will explore in this entry.

I've used my daughter's Nintendo DS.  I've seen my business partner's iPhone look-alike, and of course, I've seen all of this technology out in the field.  What I marvel at is the advancement of database appliances on the server side.  I really like the column based database approach to housing data - especially if it's a small data set (less than 100 Terabytes).  I'm familiar with Netezza, Dataupia, Vertica, ParAccel, and a few others.

I've lately been hearing about a few needs from customers:

1) the need to have a centralized management architecture and framework

2) the need to have centralized governance processes

3) the need to have de-centralized data stores for privacy and ethics reasons (again controlled centrally), acting as a local cache of specific segmented data sets

4) the need to be able to access data on-network, and off the network - and when the network appears or is available, the device automatically synchronizes with the master EDW store.

5) the need to have real-time data, and an operational application directly on TOP of the EDW, so that history is available, but operational activities can take place in keeping the data fresh, or applying it to day to day activities (along the Operational Data Warehousing side).

6) the need for the centralized EDW data store keeping all history for compliance and accountability.

Now I think to myself, with all these wonderful advances in high-speed, MPP, and parallel computing we can easily achieve #6 (VLDW, high speed centralized EDW) etc... And with all these wonderful advances in hand-held devices, why can't we take advantage of them?

Well, I said quite a while ago that I believe individuals in BI will switch their delivery mechanisms to Adobe Flash-like platforms.  We see it now with Microsoft's competing platform, Silverlight, and of course with things like QuickTime and Final Cut Studio.  Interactive video, and user-interfaces on "video like" delivery systems, will begin to make a difference in how we write our applications and interface with our data.

BUT: we still need local data stores.  We've seen a rise of In-Memory databases and Object Oriented data stores, but where-o-where are the column based databases?  With the massive compression ratios they get, the high speed access they have, and the ability to load trickle data quickly (not to mention that adapting the physical data architecture by adding and removing columns is very easy), I would have expected them to have come to market on a hand-held device by now.

These are the requirements I think would be awesome to see one of these column based appliance vendors make available.

1) Column based data stores with in-memory pinning, pre-configured on a FLASH drive that can plug & play with an I-Phone like device

2) Application bundle on-top of the column based database for OLTP purposes.

3) partial historical data store acting as a local cache - available from column based data store

4) minimal configuration parameters, such as HTTPS addresses, and FTPS for auto-synchronization of data set changes when the network is available.

5) simple "on/off" switch to control when synchronization takes place

6) encryption/decryption of the data set both in storage, and in transit - while the device talks to the main EDW mothership.

7) each device data is scrambled by different keys (multiple keys).

8) Flash-based, or interactive video based application (still with forms and such) to collect data and feed it via SOAP/XML or web-service protocol to and from the column based database.

9) additional ability to define application logic with functions embedded in the column based appliance
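
To make requirements 3 through 5 a little more concrete, here is a minimal sketch of a device-side cache that accepts writes off-network and pushes them to the master store when the network comes back and the switch is on. Everything in it is an assumption made for illustration: the `LocalCache` class, its field names, and the `send` callback (standing in for an HTTPS or web-service call) are hypothetical, and encryption and conflict handling are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LocalCache:
    """Hypothetical on-device store that queues changes until a sync is allowed."""

    sync_enabled: bool = True                    # the simple on/off switch (requirement 5)
    rows: dict = field(default_factory=dict)     # local cache of one data segment
    pending: list = field(default_factory=list)  # changes not yet sent to the EDW

    def write(self, key: str, value: dict) -> None:
        # Writes always succeed locally, whether or not a network exists.
        self.rows[key] = value
        self.pending.append((key, value))

    def synchronize(self, network_up: bool, send: Callable[[str, dict], None]) -> int:
        """Push queued changes to the master store; return how many were sent."""
        if not (self.sync_enabled and network_up):
            return 0
        sent = 0
        while self.pending:
            key, value = self.pending[0]
            send(key, value)      # stand-in for the real transport call
            self.pending.pop(0)   # drop the change only after send() returned
            sent += 1
        return sent
```

A change is removed from the queue only after `send` returns, so a failure part-way through leaves the remaining changes queued for the next attempt.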

Now I could be behind the times here - maybe someone out there already has this platform, and if so - I'd really like to hear about it.  If not, then I'd like to hear who may be close to this.  The point is, I can think of at least a dozen of my large clients who can use this functionality today, and I simply don't have a solution to offer them.

Hope to hear from you soon,

Dan Linstedt, danL@geneseeAcademy.com


Posted January 29, 2009 5:09 AM

I've just read Jill Dyche's excellent entry about the value of communities, and I agree - communities are important for us to collaborate and communicate.  It's why I'm happy to be a part of B-Eye-Network.com, and LinkedIn where I belong to a number of communities.  I'd like to take a minute to tell you about other communities that we've launched recently.

By the way, I won't do this very often - most of the time I like to blog about business and IT.  I felt it necessary to let everyone know what I've been up to for the past 8 months, in addition to traveling to Europe to teach the Data Vault.

Now before you start commenting on how this is an advertisement, let me say - yes, it is in a way.  But as always, I am trying to provide the business with serious value.  The value here?  Budgets are being slashed on average of 20%, training budgets are nearly gone, yet new innovation must continue.  You can no longer afford to "go to a class", nor to "pay" to have an instructor come to you. It's tougher and tougher to get budget to attend large conferences.  We are putting content on-line, and making it affordable, so you can continue to learn and innovate at your own pace without the high-cost, and without a lot of expense.  We believe in the community messages, and are fostering three new communities that will really help the DW/BI space.

One major effort centers on e-learning.  You can get to this community at http://inmoninstitute.com - Bill Inmon, Hans Hultgren, and I offer on-line training for you on subjects like DW2.0, Unstructured Data, VLDW, Data Vault, CIF, and business acumen of the DW/BI landscape.  We are posting new material every week.  Register for free, watch a few courses now.

In this day and age where IT budgets are slashed, it becomes nearly impossible to "go to a class" or to bring an instructor in.  You can take these courses at your leisure from your desk, for a very reasonable fee.  Anyhow, the e-learning community is the place to go for new knowledge.  You can register for free, see the quality of the video and sound that we produce, and watch a few free segments.

I also launched a more intense community around Data Vault modeling, called the Data Vault Institute.  The new URL will come soon, but for now, you can reach it at: http://www.danlinstedt.com/datavaultinstitute/  This community is free to register.  In this community, you will find white papers, articles, downloads, and a host of customers and IT folks using the Data Vault and discussing the business practices all over the world.  This community is more than just a technical community, it's for business users too.  Here they can find out the business value of the Data Vault, and why it helps put agility back in to their IT teams.  We will also offer a FREE 3-D data model visualizer for anyone who pays to upgrade as a subscribed member.  Sign up now for free!

Finally, next week, I'm launching another e-learning community - with a focus on Software Tool training.  You can see the "old-site" now, at http://www.trainovation.com - the new site is completely revamped and will be launched next week.  It will start with Informatica based e-learning, my own custom courses finally available on-line for a reasonable cost.  So take a look at this site in about another week or so to get access immediately to new content.  I'm also interested in talking to vendors who want to post their own training courses on-line with us, we have a full production video studio.

If you like this kind of entry please comment and let me know.  If you don't like this kind of entry, like-wise, please let me know.  I am curious about the feedback - and I promise, if you like it, I'll keep you up to date only once every 6 months.

You can email me directly: danL@danLinstedt.com

Cheers,
Dan Linstedt


Posted January 24, 2009 8:22 AM

But very few (if any) actually execute on the vision that I am laying out here. This is a very short entry, but it basically re-iterates some of the points of Dynamic Data Warehousing that I believe to be necessary before it (software/appliance/database) can be labeled as being something like this.

In my definition of dynamic data warehousing the software around the database is an artificially intelligent engine. The database contains metadata about the structure, about the usage of the structure, and versions of all this metadata (producing a structural and usage life-cycle).

In other words, the AI engine is fed or kick started with an ONTOLOGY. The ontology of terms defines the basic data model that is executed underneath. The Ontology is driven by business terms, business definitions, functions, and descriptions (in accordance with OWL ontology). Secondarily, the AI engine is fed many different data points including usage of the ontology:

* SQL Queries
* Loading Code
* Scripting Code
* Application Code
* Web Service Code

And all of the table references/usages/join criteria components within the code.
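
As a rough illustration of what "feeding usage" to such an engine could look like, here is a small sketch that counts table references across a pile of SQL text. It is an assumption on my part, not a description of any product: the `table_usage` function and the regular expression are hypothetical, and a regex only catches simple `FROM` and `JOIN` clauses, where a real engine would need a proper SQL parser.

```python
import re
from collections import Counter

# Naive pattern: an identifier (optionally schema-qualified) right after FROM or JOIN.
# Subqueries, quoted names, and comma-separated table lists are deliberately ignored.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def table_usage(queries):
    """Return a Counter of how often each table name is referenced."""
    usage = Counter()
    for sql in queries:
        usage.update(name.lower() for name in TABLE_REF.findall(sql))
    return usage
```

The same counting could be run over loading, scripting, and application code, which is the point of the list above: the more usage the engine sees, the more it knows about which structures actually matter.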

Dynamic Data Warehousing RESPONDS to changes BY ITSELF. It responds to USAGE controls (ad-hoc queries, repeated queries, and so on). It responds to LOADING controls (changes to structures, appearance of new attributes/fields, changes to loading code, volume and width). It responds to length of processes (metrics driven), and responds to USER DRIVEN ONTOLOGY CHANGES (based on business requirements).

At the end of the day, the AI engine grades changes, and figures out by itself how to a) TUNE the structure, b) ADAPT or CHANGE the structure, including indexing, c) add new elements to the structure, d) retire old elements from the structure, e) OPTIMIZE the modeling paradigm for today's business execution cycles, f) manage and propagate structural changes TO the loading code, TO the queries, and TO the ontologies.
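
To show the flavor of "grading" in the smallest possible form, here is a sketch that turns usage statistics into proposed structural changes. It is a plain rule-based stand-in, not an AI engine, and every name in it is hypothetical: the `ColumnStats` fields, the `propose_changes` function, and the thresholds are assumptions chosen only to illustrate the tune and retire steps.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    name: str
    filter_hits: int     # times the column appeared in a WHERE/JOIN this period
    days_since_use: int  # days since any query or load touched the column
    indexed: bool


def propose_changes(stats, hot_threshold=100, stale_days=180):
    """Return (action, column) proposals; thresholds are illustrative guesses."""
    proposals = []
    for col in stats:
        if col.days_since_use >= stale_days:
            proposals.append(("retire", col.name))     # retire old elements
        elif col.filter_hits >= hot_threshold and not col.indexed:
            proposals.append(("add_index", col.name))  # tune the structure
    return proposals
```

The hard part, and the part I am arguing vendors have not delivered, is everything after this: applying the proposals and propagating them to the loading code, the queries, and the ontology.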

Vendors may claim that they have "Dynamic Data Warehouses" all they want, but until they have automatic detection of structural changes, and automatic propagation of those changes - and these automated systems are associated to/with business ontologies - I will not agree that they in fact have a "Dynamic Data Warehouse".

This is just my opinion. I believe that in order to become more fluid, and more appropriate to the business, and closer to business change, these are the kinds of systems that will evolve in the next 5 to 7 years.

Cheers for now,
Dan Linstedt
Feel free to contact me directly: DanL@GeneseeAcademy.com, we teach custom Informatica courses, DW2.0 and Unstructured Data Courses, Zachman Framework courses, and Data Vault Data Modeling courses.


Posted October 29, 2008 11:31 PM

Column based databases/appliances are making headway in the VLDB/VLDW world. There is no doubt that there are benefits to this approach, but there are also drawbacks. In this entry I explore some of the articles, links, facts and figures - as related to my personal experience. Then I compare what different authors are saying against Row-Based MPP technologies to see what the differences and similarities are. This by no means is a complete research paper, but just a peek into what the future may hold for RDBMS vendors and the new Column based data stores. Of course, Solid state disk, and RAM/Flash based data sets will change things again shortly. I'll also touch on the impacts to Data Modeling and what it may mean going forward.

Let's first set the table by defining what the terms mean:

1) for VLDB/VLDW I'm referring specifically to a 300TB and above system.
2) I'm also referring to LIVE data sets, where it isn't JUST 300TB sitting in a storage disk somewhere, but there's a significant amount of information being loaded AND queried at the same time, utilization is somewhere around 100TB "used/accessed/referenced/loaded" per week.
3) I'm also referring to a MIXED WORKLOAD system, meaning real-time transactions are streaming in, batch loads are occurring, and both tactical and strategic queries are taking place at the same time.

By MPP: I mean Massively Parallel Processing capabilities, like DPF from DB2 UDB (IBM - running shared-nothing architecture), and Teradata with independent nodes to scale out. I'm also referring to these traditional database systems as "row-based" database engines.

For Column Based "appliances" I am referring to Sybase IQ, Vertica, Dataupia, and others which provide column based data storage. NOTE: Netezza is NOT a column based store, rather it is a flat-wide appliance with hardware that figures out exactly what data set you need before hitting disk to retrieve it.

"Thus, one might expect column-stores to perform similarly to a row-store with an index on every column without the corresponding negatives of creating many indices. In fact, this is a common argument we have often heard regarding column-stores and their expected performance relative to carefully designed row-stores -- both approaches provide good read performance, with the column store providing lower total cost of ownership (since you don't have to figure out what indexes to create anymore).

Though this argument sounds reasonable, it is completely incorrect. It is also dangerous since it might cause you to end up choosing a row-store when what you really need is a column-store."

Source: http://www.databasecolumn.com/2008/07/debunking-a-myth-columnstores.html

If you're interested in furthering your knowledge on indexing versus column compression, the article: http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf is a very good source for examining the mathematics behind the tuple sets and joins.

Most of the articles I've located discuss indexing, and differences between indexing and column based tuple access. Unfortunately they don't tend to address the loading speeds and performance of getting the data "IN" to the database in the first place.

Column based data stores bring benefits to the table:
* Rapid Query, less overhead (according to the math that I've read through)
* No need for PHYSICAL data modeling (as long as you don't need/want GOVERNANCE or MANAGEMENT in your data store).
* No "seemingly physical" limit to the number of columns PER TABLE.
* Automatic data compression/removal of duplicates on insert
* IF the grid / cloud computing works properly, then they should be able to scale out
* They appear to achieve anywhere from a 3:1 to a 7:1 compression ratio on the data slammed in to the box.
* Raw data can be loaded quickly (in native format) without "stopping to normalize, or assign sequence number surrogate keys"
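
For readers who have not looked inside one of these engines, here is a toy sketch of why storing data by column helps with compression. It pivots a few rows into columns and run-length encodes each one. This is an illustration under my own assumptions: the function names are hypothetical, run-length encoding is just one common technique, and I am not claiming it is what any particular vendor does or that it produces the ratios quoted above.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row tuples into one list per column."""
    return [list(col) for col in zip(*rows)]


def run_length_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


rows = [
    ("US", "shipped"),
    ("US", "shipped"),
    ("US", "open"),
    ("DE", "open"),
    ("DE", "open"),
    ("DE", "open"),
]
country, status = to_columns(rows)
encoded_country = run_length_encode(country)  # six values become two runs
encoded_status = run_length_encode(status)    # six values become two runs
```

Values in a single column tend to repeat far more than values across a row, which is where the compression comes from. It also hints at the load-side cost in the issues below: the incoming row has to be split apart and each piece placed with its column.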

Now let's take a look at some of the issues that they bring to the table (simple issues)
* Most column based databases have yet to solve massive load performance issues
* Most column based databases have to "STOP" the data stream to compress it, and assign it to the right column post-loading.
* In order to achieve high speed trickle feed (8,000 transactions per second or better) they need to have a significant RAM cache somewhere on one of the nodes to load the data.
* Splitting the data over multiple gridded nodes might take more work than originally thought
* Load balancing with spreading the data set across multiple gridded nodes might be an issue.
* Today, most column based data stores work extremely well on big iron SMP boxes, but struggle to take full advantage of Grid technologies and shared-nothing architectures.
* To handle "BATCH LOADS" most column based data stores use a "staging area" internally to load the batch data, then split it across and push it in to the column database (this may NOT be such a bad thing... we do this in MPP environments too!)
* Column based databases have "come and gone", the only one that has stuck around over the years has been Sybase IQ, and finally for the first time in many years we are beginning to see announcements from the company that they are putting money back into R&D for this product.

Let's take a look at the physical nature of MPP:
Pros:
* Provides mechanisms for governance and management through physical data modeling
* Provides high-speed batch loads, and high-speed trickle feeds (real-time transactions)
* Provides balanced queries, and can easily handle mixed workload components (loading while querying, and both tactical and strategic queries at the same time).
* Has grown up, is based on mature proven technology.
* Scales out very easily, allows MASSIVE sets of data (because it's not locked in to a single SMP environment).

Cons:
* Usually requires good physical data modeling (normalization) in order to load-balance the data sets across the nodes.
* Usually requires a staging area inside the MPP platform before re-distributing the data ** caveat: some MPP platforms have architected their bulk-loaders to overcome this problem.
* Usually requires JOIN INDEXES or some materialized table to assist with the Tuple Joins
* Usually requires column based compression to be turned on by the operator to achieve benefits.
* Requires enough nodes to "split the workload evenly"
* Requires all nodes to be running at the same speed in order to achieve maximum performance gains.
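
The load-balancing point in the cons above is easier to see with numbers. The sketch below assigns rows to nodes by hashing a distribution key, which is the general idea behind shared-nothing distribution. The details are my assumptions for illustration: the `node_for` and `rows_per_node` functions are hypothetical, and MD5 is used only because it is a stable hash in the standard library, not because any MPP vendor uses it.

```python
import hashlib
from collections import Counter


def node_for(key: str, node_count: int) -> int:
    """Map a distribution key to a node number with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def rows_per_node(keys, node_count):
    """Count how many rows land on each node for the given keys."""
    counts = Counter(node_for(k, node_count) for k in keys)
    return [counts.get(n, 0) for n in range(node_count)]


# A high-cardinality key (for example a surrogate key) spreads rows across all nodes.
balanced = rows_per_node([f"order-{i}" for i in range(10_000)], 4)

# A low-cardinality key (two distinct values) can never use more than two nodes.
skewed = rows_per_node(["US" if i % 2 else "DE" for i in range(10_000)], 4)
```

With the skewed key at least two of the four nodes sit idle while the others do all the work, which is why the choice of distribution key (and therefore the physical data model) matters so much on these platforms.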

So these are just a FEW of the points made both for and against column based databases when comparing them to MPP designs. They both work well for their own purpose. Customers of mine continue to look for a "single solution to do it all"; however, today it just doesn't seem possible. This is why (I think) we continue to hear vendors like IBM and Teradata advertise: "we partner with...." fill in the blank of your favorite column based database...

However, watch the vendors closely - this market space is heating up, and over the next year I expect new technologies to be released from all vendors that will converge some functionality and blur the lines between RDBMS MPP, and Column based on a grid.

Thoughts? What do you see in the market?
Dan Linstedt
DanL@DanLinstedt.com


Posted September 21, 2008 2:56 PM
