Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, unstructured data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

Recently in Thought Experiments Category

Let's face it: cloud computing, grid computing, and ubiquitous computing platforms are here to stay.  More and more data will make its way onto these platforms, and enterprises will continue to find themselves in a world of hurt if they suffer security breaches.  If we think today's hackers are bad, just wait...  they're after the mother lode: all customer data, massive identity theft, etc...  I'm not usually one for doom and gloom - after all, we have good resources, excellent security, VPNs, and firewalls, right?   In this entry we'll explore what it *might* take to protect your data in a cloud, distributed, or hosted environment.  It's a thought-provoking future experiment - maybe it would take a black swan?

Imagine your data on a cloud environment, or a hosted BI solutions vendor, or any other number of hosted environments. 

* How do you protect your data from getting stolen?

* How do you trace it if it is stolen?

* How do you track down the invader/hackers?

* How do you (or can you) stall the hackers long enough to take evasive action?

First off: we all know that NO security solution will ever be 100% fail-safe; such a thing simply does not exist.  When someone creates a more secure solution, someone else comes up with a way to break it - that's just the way it goes.  BUT there is such a thing as thinking ahead, thinking outside the box, thinking about what you CAN do to prevent and deter these types of problems - problems which may cost you your business, cost you money (in lawsuits), tangle you up in legalities with governments, and on and on.

There's no question about the issues that can arise if you can't prove you've "done everything in your power" to protect your consumers and customers.  So what can you do?

Cloud computing and hosted environments (that is, if they host your data) present truly unique challenges - everything from shared data on shared machines, to shared data on dedicated machines, to VPN access to non-shared, non-public machines, and so on.

Well, let's blow away the fog in the cloud for a minute and take a look at a simple case:

Suppose you outsource your BI, and the data for that BI, to an analytics-as-a-service firm... So far so good.  The BEST possible protection you can have (and they should tell you this up front) is to NOT release any personally identifiable data to the hosted service or cloud - only release aggregate data: rolled-up trends, trends of trends, and so on.  Then add all the standard, well-known security you can buy, and it's fairly decent.

Now let's say that, for whatever reason, you've outsourced the CLOUD environment and you've uploaded sensitive data...  What can you do?  What kinds of questions should you ask?  What should the vendors be willing to help with?

First, there's not much you can do other than ask the vendors for new features... which leads me to the question: what can / should the vendors do in this new computing arena?

Cloud is interesting: it follows the notion that if you need more computing power, it becomes available on-demand.  Ok, cool.  What if the extra computing power needed was for encryption/decryption of the data?  In other words, I believe that ALL outsourced data should be stored in encrypted format on disk.  That's a good start, especially if each DISK array carries its own encryption/decryption hardware to perform the task on the fly.  This prevents someone from "hot-swapping" a RAID 5 disk containing a copy of sensitive data and taking it home to crack it.   Well - not really: they can still hot-swap it and take it home, but cracking it (with the right algorithm) is another story.

The next step is encryption/decryption at the database level.  I'm assuming that since you're running BI / analytics in a cloud, you'll also need a database engine, right?  Ok, so the database engine should encrypt/decrypt the data as it works with it.  The only place the sensitive data should be visible is in the RAM of the database engine or of the machine on which it currently resides.

Am I saying to encrypt ALL data?  No - that would be ludicrous, and would cost way too much money.  Only encrypt the data that your organization deems sensitive or private in nature.  However, the database engine should make it EASY, with little to no performance hit (thanks to cloud resource availability).
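To make the "encrypt only the sensitive stuff" idea concrete, here's a minimal sketch of what field-level protection could look like before the rows ever leave for the hosted environment. It assumes the third-party Python "cryptography" package, and the column names are invented for illustration - this is not any vendor's API.

```python
# Minimal sketch: encrypt only the columns the organization deems sensitive
# before the rows leave for the hosted/cloud environment.
# Assumes the third-party "cryptography" package; column names are hypothetical.
from cryptography.fernet import Fernet

SENSITIVE_COLUMNS = {"ssn", "email", "date_of_birth"}   # hypothetical list

key = Fernet.generate_key()       # in practice, held in your own key vault
cipher = Fernet(key)

def protect_row(row: dict) -> dict:
    """Return a copy of the row with sensitive values encrypted."""
    out = {}
    for column, value in row.items():
        if column in SENSITIVE_COLUMNS and value is not None:
            out[column] = cipher.encrypt(str(value).encode()).decode()
        else:
            out[column] = value
    return out

row = {"customer_id": 42, "ssn": "123-45-6789", "region": "West"}
print(protect_row(row))   # ssn is ciphertext; customer_id and region stay clear
```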

Next comes the out-of-the-box part...  What if data needed a DONGLE to be utilized properly?  What if the data was sent to your machine and exported to Excel, but to SEE the data it had to be decrypted on your desktop with a public or private key?  Now, this is interesting...  A database manufacturer working with a cloud-based hosted service could produce DONGLEs instead of seat licenses - or instead of selling "software clients" they could sell "decryption" dongles for USB...  Hmmm - interesting.  The dongles would then talk to the cloud and report who viewed what, and when - you don't have a choice (sorry, big brother has been watching for a VERY long time).
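To give a feel for the "decrypt on your desktop with a key" idea, here's a small, hypothetical sketch: the service encrypts an exported value to the analyst's public key, so only the desktop holding the matching private key (imagine it living on the USB dongle) can read it. This uses the generic Python "cryptography" package - no real dongle or vendor product is being described.

```python
# Minimal sketch of "decrypt locally with your key": the export carries
# ciphertext bound to the analyst's public key; only the machine with the
# matching private key (imagined to live on a USB token) can read it.
# Assumes the third-party "cryptography" package; not a real vendor API.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

exported_cell = b"quarterly margin: 18.4%"
ciphertext = public_key.encrypt(exported_cell, oaep)   # what the export carries
plaintext = private_key.decrypt(ciphertext, oaep)      # only works with the token's key
assert plaintext == exported_cell
```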

Let's do one better... What if, just what if, the data itself could be SIGNED - just like a certificate is signed - and this "watermark" went wherever the data went?  It couldn't be removed, and it couldn't be seen, but it would be used as part of the key to decrypt and make sense of the data.  Now, if the data itself were stolen (even in encrypted format) it would be traceable.  No - it wouldn't call home, as there's no "application" for that, but if it showed up somewhere, there would be forensic evidence (like a fingerprint) on the data pointing to its origin...  Now that's some cool science fiction stuff (I wish it were so)...
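The invisible, irremovable watermark is still science fiction, but the traceability half of the idea can be sketched with plain keyed hashing: stamp each outbound data set with a fingerprint tied to its origin, so the owner can later show a leaked batch matches what left their environment. A hypothetical sketch:

```python
# Sketch of the traceability half only: a keyed fingerprint tied to an origin.
# This is ordinary HMAC, not an invisible, irremovable watermark.
import hmac, hashlib, json

ORIGIN_KEY = b"held-by-the-data-owner-only"   # hypothetical per-origin secret

def fingerprint(records: list, origin: str) -> str:
    payload = json.dumps({"origin": origin, "records": records}, sort_keys=True)
    return hmac.new(ORIGIN_KEY, payload.encode(), hashlib.sha256).hexdigest()

batch = [{"customer_id": 42, "spend": 1875.20}]
tag = fingerprint(batch, origin="acme-dw-prod")

# Later, if the batch surfaces where it shouldn't, the owner recomputes the
# fingerprint and shows the data matches what left their environment.
assert hmac.compare_digest(tag, fingerprint(batch, origin="acme-dw-prod"))
```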

Anyhow, back to reality... Dongles provide really good protection mechanisms today, and in fact can also be embedded with a fingerprint reader as part of the authentication mechanism.  This technology exists, and could be put to good use.

In some cases your data is worth more than your gold or your money in the bank, because it represents tomorrow's profitability.  Don't you have the right to ask vendors to help you protect it?  Of course, they have the right to ask you to pay for the service...

Just a thought.  If you have some other cool thoughts, reply in the comments to this blog - I'd love to hear them.

Thanks,
Dan Linstedt
DanL@DanLinstedt.com
http://www.DanLinstedt.com


Posted March 30, 2010 3:05 PM

We live in a world where video delivery is becoming the norm.  Business users are getting tired of "bar charts" and "standard reports".  They want interactivity.  While drill-down was an interesting development in interactivity, there hasn't been any major advancement from the BI vendors in years.

With the advent of Flash delivery and Microsoft's new Silverlight platform, one would think BI vendors would have made tremendous advances in technology recently - but no, we're still dealing with the old column-based delivery mechanisms, and we think pivot tables are "cool"...  Man, we're stuck in the 80's here, people...

I am learning Flash, along with Silverlight.  I'm also learning video, interactivity, dynamic graphics, movement, and so on.  Yea, yea - I can hear it now: that's old technology, web designers have been doing this for years!  Yep...  I know.  Why, then, can't we build BI systems and dashboards that provide this kind of interface for our business users?

Some companies claim to handle queries on the back end against a VLDW, but fall down when one of the tables has 1.5 billion rows in it.  Some companies claim the latest in "drill-down" technology.  Ok - fine and dandy.  Some companies claim the latest in 3D bar charts or live graphs!  Still other companies say: we integrate with xyz column and pixel positioning systems... uh-huh...  Ok - let's get down to brass tacks:

* I agree we need to deliver valuable information in a format that most business users understand, but I also believe in the power of paradigm shifts.
* Where's the tie of the reporting/BI analytics to the business rules?
* Why can't I walk through my business rule processes in 3D (like a walk down a street) and see specific analytics that make sense to that area of business?
* Where are the truly interactive charts and graphs?

I've said it before, and I'll say it again: hire a game programmer to make BI/analytics interesting, fun, and maybe even addicting!  What?  And disturb the balance?  What balance?  What's the hottest-selling game out there (according to informal fads, polls, and what I see selling)?  Maybe Guitar Hero?  It's on all the platforms.  What does it do that makes you play it for hours on end?

INTERACT...  It gives you a set speed, a set song, a stage, and a fake guitar with 5 buttons on it - you have (what seems like) infinite combinations of notes and speeds of notes to place your fingers on the buttons.  Your skill level determines how fast the game goes.

Now there are more advanced games, like Warcraft, Doom, and so on that make use of more buttons, intellect, thinking, terrain changes (during the game).  And because you are playing against humans, you've got to be good, or get better.

Ok - so maybe the themes aren't right for BI and analytics, but jeepers creepers, when I open up an application and the data sits there - I feel like I'm sitting in an elevator in the 1970's listening to elevator music, waiting to push a floor button.  Dry, Dry, Dry... 

Why not put the "cubes" on a Flash carousel?  Why not have the cubes visualized in 3D and interconnected?  Why not display data in a 3D sound-wave format, where headquarters is the center of the graph?  Why not be able to fly into the graph as a drill-down, fly under it, fly through it - re-focus the graph on a live grid?  Why not use some scientific-style or themed graphing techniques to represent the business and the data in a metaphorical manner?

For instance, what if I'm in the oil & gas industry, and what if I represented my business data and profitability as a land-map graph?  What if oil wells represent business units, and can run dry if they are not profitable?
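Just to make the metaphor concrete, here's a rough sketch (using matplotlib, with invented numbers) of business units drawn as "wells" whose size tracks profitability, so an unprofitable unit visibly runs dry:

```python
# Rough sketch of the land-map metaphor: business units as "wells" whose
# bubble size tracks profitability. The figures are invented for illustration.
import matplotlib.pyplot as plt

units = ["Upstream", "Refining", "Retail", "Chemicals"]
x = [1, 3, 5, 7]                   # positions on the "land map"
y = [2, 4, 1, 3]
profit_musd = [120, 15, 60, 2]     # hypothetical profitability, $M

plt.scatter(x, y, s=[max(p, 1) * 20 for p in profit_musd],
            c=profit_musd, cmap="YlOrBr", edgecolors="black")
for xi, yi, name in zip(x, y, units):
    plt.annotate(name, (xi, yi), textcoords="offset points", xytext=(0, 12))
plt.title("Business units as oil wells (bubble size = profitability)")
plt.axis("off")
plt.show()
```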

I believe there is power in metadata.  What if the metadata were metaphors for the business - the entire business?  Could we develop 3D visualization techniques across metaphors and make better use of business metadata?  YOU BET!   Ahh - but wait a minute: this might require the business to get better at building, managing, and governing the metadata.  Yep.

But as sure as I sit here, I can tell you that in order for BI to "break out of its shell" and become truly USED (no doubt it's useful), 3D visualization along with themed, game-play-like consoles may be what's required.

I wonder what would happen if the world's largest company commissioned someone like DreamWorks to develop an interactive scenario game based on a metaphorical description of their business?

Just an idea folks... think about it.  Can BI be fun in the future?  I would like to think so.  If you've got some themes or ideas for specific lines of business, I'd love to hear about them.

Cheers,
Dan Linstedt
DanL@RapidACE.com
http://www.RapidACE.com - 3D Data Model Visualizer


Posted February 2, 2009 4:04 AM

We live in a world proliferated with hand-held devices.  These devices can show TV, stream movie content, browse the web, and provide interactivity via iconography and touch panels.  Yet we have serious problems with the delivery mechanism of an EDW and BI on these devices.  Yes, we can produce graphs, charts, and web-based reports - but it all appears to be back-ended by large-scale systems that must be up all the time, and we have to be connected to the web in order to "work" with our applications.

There are needs out there beyond the always-connected world, which I will explore in this entry.

I've used my daughter's Nintendo DS.  I've seen my business partner's iPhone look-alike, and of course I've seen all of this technology out in the field.  What I marvel at is the advancement of database appliances on the server side.  I really like the column-based database approach to housing data - especially if it's a small data set (less than 100 terabytes).  I'm familiar with Netezza, Dataupia, Vertica, ParAccel, and a few others.

I've lately been hearing about a few needs from customers:

1) the need to have a centralized management architecture and framework

2) the need to have centralized governance processes

3) the need to have de-centralized data stores for privacy and ethics reasons (again controlled centrally), acting as a local cache of specific segmented data sets

4) the need to be able to access data on-network and off-network - and when the network appears or is available, the device automatically synchronizes with the master EDW store (a small sketch of this idea follows the list below)

5) the need to have real-time data, and an operational application directly on TOP of the EDW, so that history is available but operational activities can still take place - keeping the data fresh, or applying it to day-to-day activities (along the operational data warehousing side)

6) the need for the centralized EDW data store to keep all history for compliance and accountability
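To illustrate needs 3 and 4 above, here's a rough sketch of a device-local cache that keeps working offline and drains its changes to the central EDW whenever the network shows up. The endpoint and the network check are placeholders, not any real product's API.

```python
# Rough sketch of an offline-first local cache that auto-syncs to the master
# EDW store when the network is available. The EDW endpoint is hypothetical.
import json, sqlite3, urllib.request

EDW_SYNC_URL = "https://edw.example.com/sync"   # hypothetical endpoint

local = sqlite3.connect("local_cache.db")
local.execute("CREATE TABLE IF NOT EXISTS pending (payload TEXT)")

def record_locally(row: dict) -> None:
    """Always works, connected or not - the row waits in the local cache."""
    local.execute("INSERT INTO pending VALUES (?)", (json.dumps(row),))
    local.commit()

def network_available() -> bool:
    try:
        urllib.request.urlopen(EDW_SYNC_URL, timeout=2)
        return True
    except OSError:
        return False

def sync_if_possible() -> None:
    """Drain the local queue to the master EDW store when a connection exists."""
    if not network_available():
        return
    rows = [json.loads(p) for (p,) in local.execute("SELECT payload FROM pending")]
    body = json.dumps(rows).encode()
    urllib.request.urlopen(urllib.request.Request(EDW_SYNC_URL, data=body), timeout=10)
    local.execute("DELETE FROM pending")
    local.commit()
```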

Now I think to myself: with all these wonderful advances in high-speed, MPP, and parallel computing, we can easily achieve #6 (a VLDW, a high-speed centralized EDW), etc...  And with all these wonderful advances in hand-held devices, why can't we take advantage of them?

Well, I said quite a while ago that I believe BI delivery mechanisms will switch to Adobe Flash-like platforms; we see it now with Microsoft's competing platform, Silverlight, and of course with things like QuickTime and Final Cut Studio.  Interactive video, and user interfaces on "video-like" delivery systems, will begin to make a difference in how we write our applications and interface with our data.

BUT: we still need local data stores.  We've seen a rise of in-memory databases and object-oriented data stores, but where, oh where, are the column-based databases?  With the massive compression ratios they get, the high-speed access they offer, and their ability to load trickle data quickly (not to mention that adapting the physical data architecture by adding and removing columns is very easy), I strongly believed they would have come to market on a hand-held device by now.

These are the requirements I think would be awesome to see one of these column-based appliance vendors make available (a rough sketch of a couple of them follows the list):

1) Column-based data stores with in-memory pinning, pre-configured on a flash drive that can plug & play with an iPhone-like device

2) An application bundle on top of the column-based database for OLTP purposes

3) A partial historical data store acting as a local cache, available from the column-based data store

4) Minimal configuration parameters, such as HTTPS addresses and FTPS, for auto-synchronization of data set changes when the network is available

5) A simple on/off switch to control when synchronization takes place

6) Encryption/decryption of the data set both in storage and in transit, while the device talks to the main EDW mothership

7) Each device's data encrypted with different keys (multiple keys)

8) A Flash-based or interactive-video-based application (still with forms and such) to collect data and feed it via SOAP/XML or a web-service protocol to and from the column-based database

9) The additional ability to define application logic with functions embedded in the column-based appliance
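As a rough illustration of requirements 1 and 6 above, here's a toy sketch of a column store that keeps each column as its own encrypted blob on local flash storage. Real appliances add compression, in-memory pinning, and synchronization; the file layout and key handling here are purely illustrative, and the sketch assumes the third-party Python "cryptography" package.

```python
# Toy column store: each column is a separate, encrypted blob on local storage.
# Assumes the "cryptography" package; the file layout is invented.
import json, pathlib
from cryptography.fernet import Fernet

STORE = pathlib.Path("colstore")
STORE.mkdir(exist_ok=True)
cipher = Fernet(Fernet.generate_key())   # device key; different per device

def write_column(table: str, column: str, values: list) -> None:
    blob = cipher.encrypt(json.dumps(values).encode())
    (STORE / f"{table}.{column}.col").write_bytes(blob)

def read_column(table: str, column: str) -> list:
    blob = (STORE / f"{table}.{column}.col").read_bytes()
    return json.loads(cipher.decrypt(blob))

# Columnar layout: each column is written and scanned independently.
write_column("sales", "region", ["West", "East", "West"])
write_column("sales", "amount", [120.0, 75.5, 210.4])
print(read_column("sales", "amount"))
```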

Now, I could be behind the times here - maybe someone out there already has this platform, and if so, I'd really like to hear about it.  If not, then I'd like to hear who may be close to it.  The point is, I can think of at least a dozen of my large clients who could use this functionality today, and I simply don't have a solution to offer them.

Hope to hear from you soon,

Dan Linstedt, danL@geneseeAcademy.com


Posted January 29, 2009 5:09 AM

In my last entry in this category, I described automorphic data models and how the Data Vault modeling components are among the architectures/data models that will support dynamic adaptation of structure. In this entry I will discuss a little bit about the research I'm currently involved in, and how I am working towards a prototype that makes this technology work.

If you're not interested in the Data Vault model, or you don't care about "Dynamic Data Warehousing" then this entry is not for you.

The Data Vault model reaches the height of flexibility through its Link tables. It is an architecture that is linearly scalable and is based on the same mathematics that MPP is based on. Single Link tables represent associations - concepts linking two or more KEY ideas together at a point within the model. They also represent the GRAIN of those concepts.

Because the Link tables are always many-to-many, they are abstracted away from the traditional relationships (1-to-many, 1-to-1, and many-to-1). The Links become flexible and, in fact, dynamic. By adding strength and confidence ratings to the Link tables, we can begin to gauge the STRENGTH of the relationship over time.
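As a small illustration of the idea, here's what a Link row carrying strength and confidence ratings might look like; the field names are illustrative, not a prescribed standard.

```python
# Sketch of a Link row tying two hub keys together, with strength/confidence
# ratings alongside the usual load metadata. Field names are illustrative.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LinkCustomerProduct:
    link_key: str            # hash or sequence for the association itself
    hub_customer_key: str    # key of the Customer hub
    hub_product_key: str     # key of the Product hub
    load_date: datetime
    record_source: str
    strength: float          # how often the association is observed (0..1)
    confidence: float        # how reliable the observation is (0..1)

row = LinkCustomerProduct("L001", "C042", "P007",
                          datetime.utcnow(), "order_system",
                          strength=0.87, confidence=0.93)
```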

Dynamic mutability of data models is coming. In fact, I'd say it's already here. I'm working in my labs to make it happen, and believe me, it's exciting (only a geek would understand that one...). The ability to:

* Alter the model based on incoming WHERE clauses in queries - we can LEARN from what people are ASKING of the data sets and how they are joining items together (a small sketch of this idea follows the list).
* Alter the model based on incoming transactions in real-time, by examining the METADATA and the relative associativity / proximity to other data elements within the transaction.
* Alter the model based on patterns DISCOVERED within the data set itself - patterns of data which were previously un-connected or not associated.
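Here's the small sketch promised above for the first bullet: scan a query log, count which hub tables analysts keep joining together, and flag heavily co-queried pairs as candidate Links. The regex and the threshold are deliberately simplistic placeholders, and the table names are invented.

```python
# Sketch of query-log mining: find hub tables that are frequently queried
# together and surface the pairs as candidate Links.
import re
from collections import Counter
from itertools import combinations

query_log = [
    "SELECT * FROM hub_customer c JOIN hub_product p ON ... WHERE c.region = 'West'",
    "SELECT * FROM hub_customer c JOIN hub_product p ON ...",
    "SELECT * FROM hub_store s JOIN hub_product p ON ...",
]

pair_counts = Counter()
for sql in query_log:
    hubs = sorted(set(re.findall(r"\bhub_\w+", sql)))
    for pair in combinations(hubs, 2):
        pair_counts[pair] += 1

THRESHOLD = 2   # arbitrary: "asked often enough to matter"
candidates = [pair for pair, n in pair_counts.items() if n >= THRESHOLD]
print(candidates)   # [('hub_customer', 'hub_product')]
```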

The dynamic adaptability of the Data Vault modeling concepts shows up as a result of these discovery processes. I'm NOT saying that we can make machines "think", but I AM suggesting that we can "teach" machines HOW the information is interconnected through auto-discovery processes over time. This mutability of the structure (without losing history) begins to create a "long-term memory store" of the notions and concepts that we've applied to the data over time.

By recording a history of our ACTIONS (what data we load, and how we query it) we can GUIDE the neural network into better decision-making and management of the structures underneath. This ranges from optimization of the model to discovery of new relationships that we may not have considered in the past.

The mining tool is:
* Mining the data set AND
* Mining the ARCHITECTURE
* Mining the queries AND
* Mining the incoming transactions

to make this happen. We've known for a very long time that mining the data can reap benefits, but what we are starting to realize NOW is that mining these other components drives home new benefits we've not considered before. In the Data Vault book (The New Business Supermodel) I show a diagram of convergence (which Bill Inmon has signed off on). Convergence of systems is happening; Dynamic Data Warehousing is happening.

These neural networks work together to achieve a goal: creating and destroying Link tables over time (dynamic mutability of the data model) while leaving the KEYS (Hubs) and the history of the keys (Satellites) intact. Keep in mind that the Satellites surrounding Hubs and Links provide CONTEXT for the keys.

I've already prototyped this experiment at a customer site, where I personally spent time mining the data, the relationships, and the business questions they wanted to ask. I built one new Link table as a result, with a relationship they didn't have before. We used a data mining process to populate the table where strength and confidence were over 80%. The result? The business increased its gross profit by 40%. They opened up a new market of prospects and sales they didn't previously have visibility into.

Again, I'm building new neural nets and new algorithms using traditional off-the-shelf software and existing technology. It can be done; we can "teach" systems at a base level how to interact with us. They still won't think for themselves, but if they can discover relationships that might be important to us, and then alert us to the interesting ones, then we've got a pretty powerful sub-system for back offices.

More on the mathematics behind the Data Vault is on its way. I'll be publishing a white paper on the mathematics behind the Data Vault Methodology and Data Vault Modeling on B-Eye-Network.com very shortly.

Cheers,
Dan Linstedt


Posted August 27, 2008 5:54 AM

I've recently begun research into this area, and am calling it "automorphic data models" rather than dynamic data warehousing, because I think the concept lends itself better to this kind of term. Dynamic Data Warehousing seems to be an overused, slightly abused term in the industry, and raises quite a few questions as to how, and what, it is. Vendors are also using the term to mean different things. We'll let the business and the vendors work out their definition of it over the next few years. I'm going to write exclusively (for a while, in this section) on automorphic data modeling. These entries are aimed at the researchers and the scientific people in the audience.

First, I must apologize to all those who _really_ know this stuff. I am an architect, and an information modeler at heart. I believe these connections exist to the data model architecture I wrote up, called the Data Vault model, because it is based on spatial-temporal mathematics, and because it is based on the "poor man's definition of how the brain MIGHT store/use/retrieve information." Based on these hypotheses, I can see where mathematics of these types apply to the model. I'd love to hear from those of you who know why these theories will or won't work; it will be interesting to see how this progresses.

If we start with Webster's definition of automorphic, we end up with the following:

Patterned after one's self.

The conception which any one frames of another's mind is more or less after the pattern of his own mind, -- is automorphic. --H. Spencer.
http://dictionary.reference.com/browse/automorphic

However, I prefer the mathematical definition of Automorphism:

In mathematics, an automorphism is an isomorphism from a mathematical object to itself. It is, in some sense, symmetry of the object, and a way of mapping the object to itself while preserving all of its structure. The set of all automorphisms of an object forms a group, called the automorphism group. It is, loosely speaking, the symmetry group of the object. http://encyclopedia.thefreedictionary.com/automorphism

Automorphic groups (which are what I'd suggest the Data Vault model is built from):

In mathematics, the general notion of automorphic form is the extension to analytic functions, perhaps of several complex variables, of the theory of modular forms. It is in terms of a Lie group G, to generalize the groups SL2(R) or PSL2(R) of modular forms, and a discrete group Γ, to generalize the modular group, or one of its congruence subgroups. The formulation requires the general notion of factor of automorphy j for Γ, which is a type of 1-cocycle in the language of group cohomology. The values of j may be complex numbers, or in fact complex square matrices, corresponding to the possibility of vector-valued automorphic forms. The cocycle condition imposed on the factor of automorphy is something that can be routinely checked, when j is derived from a Jacobian matrix, by means of the chain rule. http://encyclopedia.thefreedictionary.com/Automorphic+form


Essentially, what we are doing within the Data Vault data model is a form of automorphism. The Data Vault model is actually based on many different components of temporal and spatial mathematics. (I've listed a few of the research papers I used in the 1990's to help me construct the structural integrity of the Data Vault):

1. “Unifying Temporal Data Models via a Conceptual Model”, http://www.cs.arizona.edu/~rts/pubs/ISDec94.pdf
2. “Notions of Upward Compatibility of Temporal Query Language”, http://www.cs.arizona.edu/~rts/pubs/Wirtschafts.pdf
3. “Temporal Data Management”, http://oldwww.cs.aau.dk/research/DP/tdb/TimeCenter/TimeCenterPublications/TR-17.pdf
4. “Spatio-Temporal Data Types: An Approach to Modeling and Querying”, http://web.engr.oregonstate.edu/~erwig/papers/MovingObjects_GEOINF99.pdf
5. “Formal Semantics for Time in Databases”, http://portal.acm.org/citation.cfm?id=319986&coll=portal&dl=ACM&CFID=6511873&CFTOKEN=58729889

The Data Vault model is capable of adapting and changing on the fly, exhibiting the mathematical properties of automorphism: through architecture mining combined with data mining efforts, we can "learn" what architectural flaws exist, where stronger relationships exist, and where the architecture can change itself or re-connect to itself to form a stronger data model.

How does this work?
The Data Vault LINK is made up of vectors. It houses a directional connection to each HUB it is associated with. The vector of that connection can be assigned a strength and confidence coefficient to determine its usefulness within the data set contained within a Link. Mining the data over time can produce a powerful combination of patterns of change, along with the discovery of patterns of association (possibly never known before), or as a result of a pre-known state.

The data mining tool can then be taught either "what to look for", or it can be set off in discovery mode to associate information based on a Data Vault model already constructed (using the existing Data Vault model as a starting point for the learning process), and then determine whether any "undiscovered" relationships exist. Furthermore, the process of mining the data can then be used to assign strength and confidence coefficients to EACH of the vectors in each Link, thus preparing for the architectural mining phase.
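As a rough sketch of what assigning those coefficients could look like, here's an association-rule-style calculation: strength as the share of all observations containing a hub-key pair, and confidence as the conditional frequency given one of the keys. The observation data is invented for illustration, and real mining would of course go far beyond simple counting.

```python
# Sketch: strength/confidence coefficients for one vector in a Link,
# computed association-rule style from observed hub-key co-occurrences.
observations = [                     # each row: hub keys seen together
    {"customer": "C042", "product": "P007"},
    {"customer": "C042", "product": "P007"},
    {"customer": "C042", "product": "P011"},
    {"customer": "C077", "product": "P007"},
]

pair = ("C042", "P007")
n_total = len(observations)
n_pair = sum(1 for o in observations
             if (o["customer"], o["product"]) == pair)
n_customer = sum(1 for o in observations if o["customer"] == pair[0])

strength = n_pair / n_total          # 0.50 here
confidence = n_pair / n_customer     # 0.67 here
print(f"strength={strength:.2f}, confidence={confidence:.2f}")
```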

So how is the Data Vault automorphic?
The Data Vault is connected (within itself) to itself via the Links and the vectors within the Links. Each vector can be considered a component within the mathematical matrix of the automorphic functions. Then the mathematics of "groups" and vector analysis can be applied to dynamically alter the matrix for a potentially different outcome.

Thus, new Links can be constructed on the fly, tested, and then removed (if no real value is perceived by the human on the other end of the computer). Likewise, new Links can be constructed and old linkages removed to produce an auto-morphing data structure - something akin to self-correction. I will NOT go so far as to say it's actually learning, because it (the computer) still will not understand the CONTEXT to which it's applying the changes.

This type of system STILL requires guidance, training, and tweaking from the operators in order to achieve the desired outcomes and modifications to the model that make sense to the business, even if the business itself is commercial or government oriented.

However, this type of system can be applied (easily) directly to the Data Vault modeling constructs in order to achieve a self-changing data store - something that appears to "point out" different facts, or discover unknown relationships, without understanding what it has. The understanding part is still up to the human.

Ok, so how does this benefit business?
Well, if we can spot relationship changes automatically (on a simplistic level), or mathematically identify POSSIBLE inefficiencies in our business, then we might be able to adapt our business based on the information being collected (or in some cases, adapt the source systems to collect information WE MISSED that might be vital to our operations). The data sets, and the architecture of the data sets, can tell us as much about our business as the processes and the business models we use.

You can find out more about the Data Vault model (for free) at: http://www.DanLinstedt.com

Hope this is interesting,
Daniel Linstedt


Posted December 23, 2007 6:32 AM