Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI/CMMI Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, has trained hundreds of industry professionals, and is the voice of Bill Inmon's blog on http://www.b-eye-network.com/blogs/linstedt/.

October 2007 Archives

I've been discussing the dynamic data warehouse (DDW) for quite a while, and I've started discussing the nature of dynamic structure change. There are larger considerations out there that we need to think about before embarking down these paths. That said, there are some applications regarding architectural mining and dynamic structure changes which I wish to discuss here. For those of you in the intelligence sectors of government, research, and defense, this may be of interest (or not). For those of you in traditional DW/BI, the main benefit that dynamic structure change might bring is the ability to adapt faster (on the back end) to the dynamic changes of business. But then again, this technology is years away (for all we know ;)

Dynamic re-structuring of structured data: why would we do it? What is the interest? What are the benefits?

Well, if you're in the intelligence sector, or identity analytics, or defense research, then this may hold some serious value - and perhaps you are already performing these tasks. After all, DARPA began funding nanotech and DNA computing experiments more than ten years ago (at least as far as we can tell publicly). Enough said...

Anyhow, imagine a system beyond master data, where we have structures that house specific "images" of data at a specific point in time. We can then stack those images and slice by time, or... slice by association.

What do you mean, slice by association?
What I mean is: imagine for a minute that a slice in time, combined with a business key and surrounded by specific descriptors, actually establishes a particular context. Now imagine that you have several hundred of these data points (multiple keys across the model), each already imaged by "time", and quite possibly across multiple time frames. Finally, imagine that the business keys are essentially useless, except as hard mechanisms for people to identify the information.

Now you are beginning down the path of something called identity analytics. Surround those keys with the notion of context - using the term loosely, of course. In other words, it is one view of information at a particular point in time; context is how you "rotate" the information to meet the needs of the current end user.

So you're saying "no relationships"?
Yes, that is one of the things I'm saying. Now apply data mining across the information sets and look for previously unknown patterns - but deeper than that, look for abstracted patterns of correlated data, where outliers aggregate in coincidental time frames. This is then presented to an end user in a visual, three-dimensional graph format.

We apply color to "hot", "cold", and lukewarm correlations, and the user applies the human thought process of "interest". By focusing in on the interesting points and applying human logic, we could theoretically surf billions of contextual relationships that would otherwise go unnoticed.

Now, the human interaction establishes (interactively) the points of interest, or the relationships that associate the information to other points of information. Once this is done, a new set of data mining algorithms is run. These algorithms produce a specific answer and test the correlation of information against a more focused lens. This cycle can be run over and over again until the human decides that the relationship is of interest, and only then are the information relationships applied dynamically.
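As a toy sketch of the kind of "hot/lukewarm/cold" scan described above (entirely illustrative Python - the observations, the co-occurrence scoring, and the temperature thresholds are invented for the example, not part of any real identity-analytics engine):

    from collections import defaultdict
    from itertools import combinations
    from typing import Dict, List, Tuple

    # Each observation: (business_key, time_bucket, descriptor)
    observations: List[Tuple[str, str, str]] = [
        ("K-1001", "2007-10-01", "wire-transfer"),
        ("K-2002", "2007-10-01", "wire-transfer"),
        ("K-1001", "2007-10-02", "address-change"),
        ("K-1001", "2007-10-05", "wire-transfer"),
        ("K-2002", "2007-10-05", "wire-transfer"),
        ("K-3003", "2007-10-05", "wire-transfer"),
    ]

    # Group keys that share a time slice and descriptor, then score key pairs
    # by how often they co-occur - no declared relationships anywhere.
    by_slice: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for key, bucket, descriptor in observations:
        by_slice[(bucket, descriptor)].append(key)

    scores: Dict[Tuple[str, str], int] = defaultdict(int)
    for keys in by_slice.values():
        for a, b in combinations(sorted(set(keys)), 2):
            scores[(a, b)] += 1

    def temperature(score: int) -> str:
        # Invented thresholds; a human analyst would tune these interactively.
        return "hot" if score >= 3 else "lukewarm" if score == 2 else "cold"

    for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(pair, temperature(score))

In a real cycle, the analyst would zoom in on the "hot" and "lukewarm" pairs, and the next mining pass would re-score only within that narrowed lens.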

Once this relationship "falls out of interest" it is removed, in favor of a new relationship. In essence, the model becomes a slowly evolving model with human intervention. It's possible that after certain relationships have been identified, the data mining algorithms can be "tuned" to self-modify parts of those relationships.

Well, all of this is just a thought experiment - the only part which may not necessarily be achievable today is the application of these changes to the queries and load routines. Certainly, without human interaction, zooming in on points of interest becomes a difficult task.

Identity analytics plays a role like this, in identifying context from information - then relating different "identities" as associated elements. But that's for another day.

I hope you found this entry interesting; I'd love to hear your thoughts.

Thanks,
Dan Linstedt
DanL@GeneseeAcademy.com


Posted October 30, 2007 2:50 PM

We are nearing the end of the entries I will be making (for now) on the VLDW world. Here I will discuss indexing from a traditional RDBMS engine point of view. "Appliances" are changing some of this as they move into the field. But for now, indexing of large data sets requires some consideration.

When people think of large data sets, they often forget to consider the indexing. They will state to me: I have to move 1.5 billion rows into the RDBMS within 1 hour and 20 minutes. But then they make statements like: I can't get the performance over 30,000 rows per second, or 8,000 rows per second, or otherwise. They don't understand why the table "causes deadlocks", or why they should be concerned about TEMP I/O on load.
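To put those numbers in perspective (simple arithmetic on the figures above, not measurements from any particular project): 1.5 billion rows in 80 minutes works out to roughly 312,500 rows per second. At 30,000 rows per second the same load takes nearly 14 hours, and at 8,000 rows per second it takes more than two days - an order of magnitude or more short of the requirement.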

Well, when you load a table using a bulk loader, you can load it only two ways: FAST or SLOW; these are the only two modes available. You can request a FAST load from a number of loaders, but they will either a) stop, or b) silently switch to slow mode if they run into specific conditions. One of those conditions is indexing, another is clustering, another is constraints - and in some databases DEFAULT VALUES are considered constraints, also causing a slowdown.

Indexes play a huge role in the performance of the queries, and by all means are necessary to make systems perform. In some database engines, indexing has been tremendously improved (within the last year or so) over the previous release. Here's what matters:

1. Don't have indexes on the table/partition during load: disable or remove them, then rebuild them after the load is complete (see the sketch after this list). If you're not familiar with indexing partitions, read up on it. People sometimes forget - they partition the table, but not the index.
2. Partition your indexes.
3. Use RAID 0+1, and move the indexes to different parts of the disk, even across separate I/O channels if you can - unless your disk device has HUGE RAM buffers and your server has a HUGE RAM cache for your database indexing efforts.
4. Set up the indexes to work in parallel. Oftentimes DBAs will tune the table for parallelism, but forget to tune the indexes for parallelism.
5. Switch to COST-BASED OPTIMIZERS, and KEEP YOUR STATISTICS UP TO DATE (as I've mentioned in a previous posting).
6. Always keep your indexes small - no more than 6 fields, no more than 250 bytes wide. Some databases state they won't even index past 250 bytes, except for text fields.
7. Use SEPARATE indexes for text fields - for instance, a UNIQUE index on (TextField, Primary Key).
8. Don't try to index more than one text field at a time.
9. SPLIT the text field out into its own table and join to it (normalize it) - probably your biggest performance gain all around.
10. Increase the parallelism ON THE INDEX.
11. Set the FillFactor to 90% and PCTFREE to 10%, especially in data warehouses - and if you PARTITION your indexes, bump those even higher/lower to match. Data should always be inserted into a NEW partition by time, causing new pieces of the index to be built.
12. DO NOT CLUSTER, EXCEPT maybe for the Primary Key (Surrogate Sequence Number). Why? Because clustering multiplies the I/O of the incoming data by a factor of 2x, 3x, 4x, and more - especially as the data set grows.
13. Use multiple SMALLER indexes to gain coverage across your data; reduce the size of each index but have a few more of them - you'll find even the queries are faster.
14. Increase the amount of RAM that holds the indexes in buffers within the database, INCLUDING BITMAP-type indexes.
15. Increase the read-ahead buffering of the indexes.
16. Increase the BLOCK SIZE of the database; it affects the number of rows per index chain as well as the data.
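To make item 1 (and items 2, 5, and 11) concrete, here is a minimal sketch of the load window described above. It is generic Python over a DB-API connection; the SQL is Oracle-flavored, the table, index, and partition names are invented for the example, and the degree of parallelism is arbitrary - treat it as an outline, not a drop-in script.

    from typing import Iterable, Tuple

    def bulk_load_partition(conn, rows: Iterable[Tuple], partition: str) -> None:
        """Outline of a large-volume load window: get indexes out of the way,
        load with nothing slowing the loader down, then rebuild the index
        partition in parallel and refresh optimizer statistics."""
        cur = conn.cursor()

        # 1. Mark the index partition unusable so the load can take the FAST path.
        cur.execute(f"ALTER INDEX ix_sales_fact_dt MODIFY PARTITION {partition} UNUSABLE")

        # 2. Load the data. A real system would hand this to the bulk loader;
        #    executemany simply stands in for the load step in this sketch.
        cur.executemany(
            "INSERT INTO sales_fact (sale_id, sale_dt, amount) VALUES (:1, :2, :3)",
            list(rows),
        )

        # 3. Rebuild the index partition in parallel, then refresh statistics
        #    so the cost-based optimizer sees the new data.
        cur.execute(f"ALTER INDEX ix_sales_fact_dt REBUILD PARTITION {partition} PARALLEL 8")
        cur.execute(
            "BEGIN DBMS_STATS.GATHER_TABLE_STATS(USER, 'SALES_FACT', "
            f"partname => '{partition}', cascade => TRUE); END;"
        )
        conn.commit()

The key point is the shape of the window: indexes out of the way before the load, rebuilt in parallel afterward, and statistics refreshed so the optimizer sees the new partition.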

IF YOU ARE SWITCHING FROM ONE DBMS VENDOR TO ANOTHER:
**** RE-EVALUATE YOUR INDEXING STRATEGIES ****
Do not assume that the same indexes on the new RDBMS will perform the same as they did on the old RDBMS - it simply is NOT true (in most cases).

Hope this helps,
Daniel Linstedt
http://www.DanLinstedt.com - Data Vault Modeling


Posted October 16, 2007 3:52 AM

I recently attended the Teradata Partners conference, which was a lot of fun. Among the things discussed were governance, data stewardship, and data ownership - and of course, Claudia Imhoff, in her masterful presentation on MDM, talked diligently about SoR, SoE, and a few other acronyms. The gist of the statements (across the board) was that System of Record lines are blurring. Shifting sands, I might say...

In 2006, I blogged on my version of the SoR, and how I believe there are at least three different definitions for it. You can find the entry here. I recently received a good comment about SoE, and how these things need to be separated. The comment discussed the notion of incorporating MDM.

I'd like to keep this blog entry unusually short (for me) - because I believe a summary is in order.

My current thoughts are shifting along with the sands of definition land... but here's my two cents on it:

1) We have systems of entry (SoE), as they are calling them
2) We have systems of record (SoR) - but how these are defined is very murky
3) We have single versions of integrated FACTs (everyone knows how I feel about TRUTH)
4) We have historical views of integrated FACTs
5) We have non-auditable, completely altered data in "single versions of the truth" that are delivered to the business.

I've assigned SoR to the integrated EDW space (the single version of integrated facts), because it's the only place where #3 and #4 exist over a period of time. I've also assigned SoR to an SoE - why? Because frequently the operational systems do both and are responsible for both, and once the data is fed from #5 BACK to the source systems as "clean data", that shifts their definition to become an SoR as well.

Now we have MDM - where really only the master data itself can be considered an SoR for the company. But what does that mean?

So I'll leave you with this tonight - questions that I'll blog on in depth over the next couple of months...
1) What _exactly_ is an SoR?
2) Who is the "expert" that can accurately define a global metadata definition for the term SoR?
3) What differentiates an SoE from an SoR and why would they be separated?
4) Why ISN'T a raw-detailed warehouse (single version of integrated facts) considered an SoR? When might it be?
5) Can a federated star schema - a cleansed/altered data set - be called an SoR when it feeds data back to the operational systems? Is it acting as master data?
6) WHEN does a Master Data "set" of information become qualified to be an SoR?
7) Where's the governance, data stewardship, data management in all of this?

Love to hear your thoughts....

Cheers,
Dan Linstedt
DanL@DanLinstedt.com


Posted October 11, 2007 7:44 PM

I sat down with my good friend Jeff Jonas yesterday and discussed the nature and notion of contextual processing. Jeff is a phenomenal individual, and much smarter than I ever hope to be, but all that aside, we had a wonderful conversation about the nature of processing streaming data (one piece at a time, or possibly multiple pieces in parallel, but separated) and how to focus the notions of context.

How is this related to B.I.?
It has everything to do with Business Intelligence, and how we "experience" and use the patterns within our data sets to make sense of our business, especially in an operational B.I. world.

Processing the context on a streaming basis (as Jeff says) requires the ability to "change" all that we know (perception) at run-time based on new facts arriving on the stream. His statements went a little like this:

1) Imagine we think our friend XYZ is a good person. We just met this person 3 days ago, so we don't know much about them, but they've been nice to us - so our current perception of this individual is: K, U, I, O, T - and so on. We've hung out with them, so we have a whole host of experiences to draw from (mostly fun).
2) Now, 3 days later we find out from another very good friend, someone we've trusted for over 25 years, that this person has done something horrible in the past...

At that instant, considering our relationship with our very good friend, all that we know about person XYZ (perceptually) changes - usually very quickly.

Now, this isn't so bad if we are dealing with one piece of information and a very small series of memories that we are focused on. But now imagine trying to do this at 10,000 transactions per second, with facts arriving in non-sequential order, and then trying to affect data sitting within 100 billion rows in our database...

This brings me to my discussion. From here Jeff and I began discussing HOW this processing needed to take place, and it reminded me of some of the conversations I'm having here at Teradata Partners conference this week.

The questions on the table are:
1) How should the system determine the assigned context for a given fact? Well, we have to let go of the word "context"; from a systems perspective, we have to work with the notion that the data has a strong correlation to a particular STACK or SET of facts/history, or historical knowledge.
2) Once a perspective has been established for that incoming fact, what IMPACT does it or should it have against all the target data, or patterns that are already known? For instance, suppose an area code changes from 720 to 750 (Jeff's example) - what do you need to do to change ALL of the existing phone numbers? Inserting brand new rows isn't always the answer - it would cause too much data change; updating existing information won't work either - it would take too long (a sketch of one alternative follows this list). REMEMBER: 10,000 transactions per second means we have to process this information and execute on the history in millisecond response times.
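To make item 2 concrete, here is a minimal Python sketch (my own illustration - not Jeff's design, and not any vendor's implementation) of one way to avoid touching every historical row: record the change as a rule and apply it as a "lens" at read time. The class and field names are invented for the example.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import List

    @dataclass
    class AreaCodeRule:
        """A rule describing an area-code change, applied lazily at read time."""
        old_code: str
        new_code: str
        effective: datetime

    class PhoneLens:
        """Applies accumulated rules to stored phone numbers instead of
        rewriting billions of historical rows."""
        def __init__(self) -> None:
            self.rules: List[AreaCodeRule] = []

        def add_rule(self, rule: AreaCodeRule) -> None:
            # O(1) "update" for the whole history: just remember the rule.
            self.rules.append(rule)

        def view(self, stored_number: str, as_of: datetime) -> str:
            # Rewrite the number on the way out, only for rules in effect.
            number = stored_number
            for rule in self.rules:
                if as_of >= rule.effective and number.startswith(rule.old_code):
                    number = rule.new_code + number[len(rule.old_code):]
            return number

    lens = PhoneLens()
    lens.add_rule(AreaCodeRule("720", "750", datetime(2007, 10, 1)))
    print(lens.view("7205551234", datetime(2007, 10, 9)))  # -> 7505551234

The write cost stays constant no matter how many historical rows exist; the price is paid, a row at a time, on the read path - which is exactly the kind of trade-off the millisecond budget forces.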

Jeff and I began to discuss the notion of a LENS, through which focus on a particular pattern could be achieved. What's important here is the FOCUS - but again, remember the focus is for _this current piece of information_ and is not necessarily related to other currently arriving information or facts.

Well what the heck does this have to do with B.I.?
You should already be able to see it... In a VLDW where we have huge stores of time-based information, it is nearly impossible (without focus) to find what you're looking for, so the first problem is (again) establishing focus - where oh where does my data FIT? So if you're processing in REAL-TIME, folks, listen up... Once we establish which data sets are affected, we need to understand IN A FRACTION OF A SECOND how to change the "known outcome" on the existing history - oh yes, and by the way, this all has to happen in PARALLEL with all the other arriving facts, or it simply won't be executed in a timely fashion.

Now what else am I saying about ALL THIS DATA we've stored?
HERE IT IS:

* Large volumes of data must be processed and learned from.
* The combined "learned" knowledge (we'll call it an averaged derivation) of a STACK of related information within a topic area IS MORE IMPORTANT than the parts, or all the history and individual facts - but without all the details, we can't create a combined image.
* This combined knowledge element must be used IN CONTEXT, or AS A CONTEXT LENS, to quickly establish the relevance of the incoming information and how it will affect the "next" view or look at the information.

In other words:
* VLDB / VLDW data by itself is important when you're digging for detailed specifics that happened at a specific point in time, but the real value is having a "mined" collective perspective on all that detail that allows us to establish where and how our current "transaction" will affect the outcome.

A 24x7x365 neural network / data mining engine MUST be up and running consistently. It must first be trained, and then constantly adjusted for "drift" off topic, but the neural net should be receiving the transaction inflow for "context" application in order to establish our focus, or put a "lens" of information to our historical data set. This isn't your father's neural net, and not your mother's data mining engine - no... this is a different way of "scoring" parts of interesting history that are within the interested perception bounds (Jeff's term) so that processing of "extraneous noise" is filtered away as one of the first steps.

This data "mining" engine or neural net is highly focused, real-time processing based on transactions, and it houses "the many different lenses" of focus (or combined derivations) of different but interesting views of history, so that based on the incoming transaction - it can change the "lens" to match and see where the impact is.

From a B.I. perspective, I'm also saying that the sum of the whole may be more interesting and more valuable than the sum of the parts, but to get the sum of the whole, we have to have all the parts when we start. So the INTELLIGENT part of Business Intelligence is all about
1) Knowing which patterns are most interesting / most costly to the business - establishing the RIGHT LENS at the right time, and having that lens available ahead of the arrival of the transactions
2) understanding that changing the color of the lens is easy when the transaction arrives, but that over time the "lens" needs to be replaced (due to virtual scratching / shifting of the answer set) and re-aligned with the full set of facts included in the history.
3) real-time transaction processing IS 100% necessary in a VLDW / data warehousing environment.
4) ALL the facts that we collect are important, depending on the "viewing perspective" of the business user.

New kinds of systems like this are in development labs, and I can help you with your efforts (should you so desire) to focus the lens. But it's advances in technology beyond what we have today that make this so interesting.

Food for thought anyhow, I'd love to hear what you have to say.

Cheers,
Dan L
DanL@DanLinstedt.com


Posted October 9, 2007 7:33 AM

Just like there is a danger in over-denormalization (overrunning the block sizes, causing chained rows, and multiplying the cost of reading the data), there is a danger in over-normalizing... Or is there? Lately there has been renewed discussion about column-based solutions coming into play (but that's for another blog). In this blog entry I discuss the dangers of over-normalizing data on a traditional row-based database system, especially as it relates to VLDW and MPP.

The math that works in FAVOR of normalization also works AGAINST it if we over-normalize and re-introduce too many new joins. For instance, going from 3rd normal form to 4th or even 5th normal form in our architectures can cause significant I/O traffic, even in a parallel environment.

What we want (as always with performance and tuning) is a balance.
That magic spot between total normalization and total de-normalization: something that allows our rows (and most-accessed data) to be grouped together, large block sizes to fit more rows in, and then a few (hopefully fewer than 16) joins. With Teradata MPP you can easily achieve up to 32 joins without too much hassle, and in DB2 UDB DPF (the MPP option) you can easily achieve close to 20, maybe 24, joins without too much hassle.

In Oracle 10g on a BIG-IRON SMP machine (32 CPUs, 48GB RAM) you can usually achieve between 5 and 15 joins, depending on the disk and I/O configuration and the sizes of the tables.

Now keep in mind that these are "averaged" numbers, on an "average" VLDW system, executed when there is "average load" on those systems to begin with. Unfortunately I cannot publicly disclose actual numbers or the performance of these types of queries, except to say: when properly tuned for parallelism, they run FAST, and the average table contains 150 million rows at about 1k each...

We are pulling punches here, because some of the tuning that has been done includes things like join indexes, materialized views, or IOTs (index-organized tables) in the cases where the row size is 12 to 20 bytes long... We've also used partitioning and database compression, and turned up the parallelism of the indexes available to the optimizers.

Keep in mind that these large and very large systems have to be tuned in one form or another in order to get these joins working well. But I digress... Back to the point.

If we overly normalize (reduce down to 1 or 2 columns per table), we "double" the I/Os needed to get the data back, not to mention the work that has to be done in memory... and if we DON'T increase the hardware to handle it, we may very well end up with "too much parallelism" that overwhelms the existing hardware, causing multiple threads to wait on the I/O of one or more disks, or wait on the availability of computing resources.

So is it the over-normalization that's the problem? No, not necessarily; in this case it's the inability to process everything in parallel all the time. This is why we are seeing a resurgence in column-based appliances - they overcome some of these problems at the firmware level, at the data-sharing level, and at the processing level. In effect they are "joining" every single column to every other column, in parallel, to reconstruct the rows.

Would I recommend this in your standard RDBMS today?
Absolutely NOT. Over-normalization can detract from performance. What I urge you to do is seek out that sweet spot. Take a very long table with an average row size of 3k to 4k and normalize it, then time the performance of full table scans to a flat file, including the join. Continue to normalize until you've found the sweet spot.

The sweet spot will change based on the BLOCK SIZE selected for the database. As a rule of thumb, for "wider" rows in a normalized table, I shoot for a minimum of 50 to 100 rows per block (block size of 64k). For "smaller" rows in a normalized table, I shoot for a minimum of 400 to 500 rows per block (more if I can get it).
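As a rough back-of-the-envelope (my arithmetic, ignoring block header overhead and free space): rows per block is roughly block size divided by average row width. With a 64k block, hitting 100 rows per block means rows of roughly 650 bytes or narrower; 400 to 500 rows per block implies rows of roughly 130 to 160 bytes; and the 3k-4k rows of the starting table fit only about 16 to 21 rows per block - which is why that wide table is the baseline for the timing exercise above.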

In databases that can only handle 8k block sizes, I focus on the parallelism of the query optimizer. Here's something I can share about SQL Server 2005: upgrade from 2000 (even on 32-bit) to get the performance gains in the optimizer. Also, foreign keys actually perform faster than separate indexes on both tables... It seems the optimization has been greatly improved, especially if the foreign key includes a CLUSTERED column (usually a surrogate sequence number).

Another note about SQL Server 2005: if you have a non-unique index and a clustered primary sequence number, add the primary sequence number to the end of the non-unique index, make the index unique, and then join... it will also be faster in joins.
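A minimal sketch of that index change, driven from Python via pyodbc. The connection string, table, and column names (dbo.CustomerOrders, OrderDate, CustomerOrderID) are placeholders I made up for illustration - substitute your own, and take the T-SQL as the point, not the wrapper.

    import pyodbc  # assumes the pyodbc package and a SQL Server ODBC driver are installed

    # Hypothetical connection string and object names - adjust for your environment.
    conn = pyodbc.connect(
        "DRIVER={SQL Server};SERVER=localhost;DATABASE=DW;Trusted_Connection=yes"
    )
    cur = conn.cursor()

    # Before: a plain non-unique index on the search column.
    cur.execute("DROP INDEX IX_CustomerOrders_OrderDate ON dbo.CustomerOrders")

    # After: append the clustered surrogate key and declare the index UNIQUE,
    # which (per the tip above) tends to join faster on SQL Server 2005.
    cur.execute("""
        CREATE UNIQUE NONCLUSTERED INDEX IX_CustomerOrders_OrderDate
        ON dbo.CustomerOrders (OrderDate, CustomerOrderID)
    """)
    conn.commit()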

By the way, I normally don't use clustered data sets, except when it comes to surrogate ever increasing sequences.

Hope this helps,
Dan L
DanL@RapidACE.com


Posted October 2, 2007 11:49 AM