Blog: Dan E. Linstedt

October 30, 2007
Thoughts about Dynamic Data Warehousing & Context

I've been discussing DDW for quite a while; I've started discussing the nature of dynamic structure change. There are larger considerations out there that we need to think about before embarking down these paths. That said, there are some applications regarding architectural mining and dynamic structure changes which I wish to discuss here. For those of you in the intelligence sectors of government or research and defense this may be of interest (or not). For those of you in traditional DW / BI, the only benefit that dynamic structure change might bring to you is the ability to adapt faster (on the back end) to the dynamic changes of business. But then again, this technology is years away (for all we know ;)

Dynamic re-structuring of structured data, why would we do it? What is the interest? What are the benefits? Well, if you're in the intelligence sector, or identity analytics, or defense research, then this may hold some serious value - and perhaps you are already performing these tasks - after all, DARPA began funding Nanotech and DNA computing experiments over 10 years ago (at least as far as we can tell publicly). Enough said...

Anyhow, imagine a system beyond master data, where we have the structures that house specific "images" of data at a specific point in time. Then we can stack those images and slice by time, or... slice by association.

What do you mean, slice by association? Now you are beginning down the path of something called identity analytics. Surround those keys with the notions of context, of course, using the term loosely - in other words it is one view of information at a particular point in time; context is how you "rotate" the information to meet the needs of the current end user.

So you're saying "no relationships"?
We apply color to "hot", "cold" and lukewarm correlations; the user applies the human thought process of "interest". By focusing in on the points of interest, and applying human logic, we could theoretically surf billions of contextual relationships that would otherwise go unnoticed. Now, the human interaction establishes (interactively) the points of interest or the relationships that associate the information with other points of information. Once this is done, a new set of data mining algorithms is run. These algorithms produce a specific answer, and test the correlation of information through a more focused lens. This cycle can be run over and over again until the human decides that the relationship is of interest, and NOW can apply information relationships dynamically. Once this relationship "falls out of interest" it is removed, in favor of a new relationship. In essence, the model becomes a slowly evolving model with human intervention. It's possible that after certain relationships have been identified, the data mining algorithms can be "tuned" to self-modify parts of those relationships.

Well, all of this is just a thought experiment - the only part which may not necessarily be achievable today is the application of these changes to the queries and load routines. Certainly without human interaction, zooming in to points of interest becomes a difficult task. Identity analytics plays a role like this, in identifying context from information - then relating different "identities" as associated elements. But that's for another day.

I hope you found this entry interesting; I'd love to hear your thoughts.

Thanks,

October 16, 2007
Indexing: VLDW and Data Sets

We are nearing the end of the entries I will be making (for now) on the VLDW world. I will discuss indexing going forward from a traditional RDBMS engine point of view. "Appliances" are changing some of this as they move into the field. But for now, indexing of large data sets requires some consideration.
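To make the interplay between loading and indexing concrete, here is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for a real RDBMS and its bulk loader. The table name, index name, and row shape are hypothetical. It loads rows into a table that has no indexes and builds the index only after the load is complete; at VLDW scale the same idea would be carried out with the engine's own bulk loader and partition-level index rebuilds.

```python
import sqlite3


def load_then_index(rows):
    """Bulk-insert rows into an unindexed table, then build the index."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real RDBMS
    conn.execute(
        "CREATE TABLE sales (id INTEGER, customer_id INTEGER, amount REAL)"
    )
    # Step 1: load with no indexes present, in a single transaction.
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    # Step 2: build the index once, after all rows are in place.
    with conn:
        conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")
    return conn


rows = [(i, i % 100, float(i)) for i in range(10_000)]
conn = load_then_index(rows)
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('sales')")]
print(row_count, index_names)
```

The sketch shows only the ordering of the steps; it makes no claim about timings, which depend heavily on the engine, the loader, and the hardware.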
When people think of large data sets, they often forget to consider the indexing. They will state to me: I have to move 1.5 billion rows into the RDBMS within 1 hour and 20 minutes. But then they make statements like: I can't get the performance over 30,000 rows per second, or 8,000 rows per second, or otherwise. They don't understand why the table "causes deadlocks", or why they should be concerned about TEMP I/O on load.

Well, when you load a table using a bulk loader, you can load it only two ways: FAST or SLOW. These are the only two modes available. You can request FAST load from a number of loaders, but they will either a) stop, or b) silently switch to slow mode if they run into specific conditions. One of those conditions is indexing, another is clustering, another is constraints, and in some databases DEFAULT VALUES are considered constraints, also causing a slowdown.

Indexes play a huge role in the performance of the queries, and by all means are necessary to make systems perform. In some database engines, indexing has been tremendously improved (within the last year or so) over the previous release.

Here's what matters:

1. Don't have indexes on the table / partition during load: disable them or remove them, and rebuild them after the load is complete. If you're not familiar with indexing partitions, read about it and learn about it. People sometimes forget - they partition the table, but not the index.

IF YOU ARE SWITCHING FROM ONE DBMS VENDOR TO ANOTHER:

Hope this helps,

October 11, 2007
System of Entry, System of Record - System of Shifting Sands

I recently attended the Teradata Partners conference, which was a lot of fun. Among the things they discussed were governance, data stewardship, and data ownership - and of course Claudia Imhoff, in her masterful presentation on MDM, talked diligently about SoR, SoE, and a few other acronyms. The gist of the statements (across the board) was that System of Record lines are blurring. Shifting Sands I might say...
In 2006, I blogged on my version of the SoR, and how I believe there are at least three different definitions for it. You can find the entry here. I recently received a good comment about SoE, and how these things need to be separated. The comment discussed the notion of incorporating MDM. I'd like to keep this blog entry unusually short (for me) - because I believe a summary is in order. My current thoughts are shifting along with the sands of definition land... but here's my two cents on it:

1) We have systems of entry (SoE) as they are calling it

I've assigned SoR to an integrated EDW space, a single version of integrated facts, because it's the only place that 3 & 4 exist over a period of time. I've assigned SoR to an SoE - why? Because frequently the operational systems do both, and are responsible for both, and once the data is fed from #5 BACK to source systems as "clean data", that shifts its definition to become an SoR as well.

Now we have MDM - where really only the Master Data itself can be considered an SoR for the company. But what does that mean?

So I'll leave you with this tonight. These are questions that I'll blog on in depth over the next couple of months... Love to hear your thoughts....

Cheers,

October 9, 2007
Context and Perspective

I sat down with my good friend Jeff Jonas yesterday and discussed the nature and notion of contextual processing. Jeff is a phenomenal individual, and much smarter than I ever hope to be, but all that aside, we had a wonderful conversation about the nature of processing streaming data (one piece at a time, or possibly multiple pieces in parallel, but separated) and how to focus the notions of context.

How is this related to B.I.? Processing the context on a streaming basis (as Jeff says) requires the ability to "change" all that we know (perception) at run-time based on new facts arriving on the stream. His statements went a little like this:

1) Imagine we think our friend XYZ is a good person.
We just met this person 3 days ago, so we don't know much about them, but they've been nice to us - so our current perception of this individual is: K, U, I, O, T - and so on. We've hung out with them, so we have a whole host of experiences to draw from (mostly fun).

At that instant, considering our relationship to our very good friend, all that we know about person XYZ (perceptively) changes; usually very quickly. Now, this isn't so bad if we are dealing with one piece of information, and a very small series of memories that we are focused on, but imagine now: trying to do this at 10,000 transactions per second with facts arriving in non-sequential order, and then trying to affect data sitting within 100 billion rows in our database...

This brings me to my discussion. From here Jeff and I began discussing HOW this processing needed to take place, and it reminded me of some of the conversations I'm having here at the Teradata Partners conference this week. The questions on the table are:

Jeff and I began to discuss the notions of a LENS, through which focus on a particular pattern could be achieved. What's important here is the FOCUS - but again, remember the focus is for _this current piece of information_ and is not necessarily related to other currently arriving information or facts.

Well what the heck does this have to do with B.I.?

Now what else am I saying about ALL THIS DATA we've stored?

* Large volumes of data must be processed and learnt from.

In other words: a 24x7x365 neural network / data mining engine MUST be up and running consistently. It must first be trained, and then constantly adjusted for "drift" off topic, but the neural net should be receiving the transaction inflow for "context" application in order to establish our focus, or put a "lens" of information to our historical data set. This isn't your father's neural net, and not your mother's data mining engine - no...
this is a different way of "scoring" parts of interesting history that are within the interested perception bounds (Jeff's term) so that "extraneous noise" is filtered away as one of the first steps. This data "mining" engine or neural net does highly focused, real-time processing based on transactions, and it houses "the many different lenses" of focus (or combined derivations) of different but interesting views of history, so that based on the incoming transaction it can change the "lens" to match and see where the impact is.

From a B.I. perspective, I'm also saying that the sum of the whole may be more interesting and more valuable than the sum of the parts, but to get the sum of the whole, we have to have all the parts when we start.

So the INTELLIGENT part of Business Intelligence is all about

New kinds of systems like this are in development labs, and I can help you with your efforts (should you so desire) to focus the lens. But it's advances in technology beyond what we have today that make this so interesting.

Food for thought anyhow, I'd love to hear what you have to say.

Cheers,

October 2, 2007
Over-Normalization: VLDW and performance of queries

Just like there is a danger in over-denormalization (overrunning the block sizes, causing chained rows, and a multiplier to reading the data), there is a danger in over-normalizing... Or is there? Lately there has been renewed discussion about column-based solutions coming into play (but that's for another blog). In this blog entry I discuss the dangers of over-normalizing data on a traditional row-based database system, especially as it relates to VLDW and MPP.

The math that works in FAVOR of normalization also works AGAINST normalization if we over-normalize and re-introduce too many new joins. For instance, going from 3rd normal form to 4th or even 5th normal form in our architectures can cause significant I/O traffic, even in a parallel environment.
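A rough back-of-envelope sketch of that math, under loudly stated assumptions: rows are fixed width, block headers and fill factors are ignored, every query needs all of the columns back, and each narrow table must repeat a join key whose width (`key_bytes`) is a made-up illustrative value. The function names are hypothetical, and the row count, row size and block size are simply example inputs. It compares the blocks read to scan one wide table with the blocks read across the narrow tables it could be split into.

```python
import math


def blocks_needed(row_count, row_bytes, block_bytes):
    """Blocks required to hold row_count fixed-width rows (no header overhead)."""
    rows_per_block = block_bytes // row_bytes
    if rows_per_block == 0:
        raise ValueError("row does not fit in a single block")
    return math.ceil(row_count / rows_per_block)


def split_blocks(row_count, row_bytes, block_bytes, n_tables, key_bytes):
    """Total blocks when one wide row is split across n_tables narrow tables.

    Each narrow table carries its share of the payload plus a repeated join key.
    """
    narrow_row_bytes = math.ceil(row_bytes / n_tables) + key_bytes
    return n_tables * blocks_needed(row_count, narrow_row_bytes, block_bytes)


ROWS = 150_000_000   # example row count (assumption)
ROW_BYTES = 1024     # example row width (assumption)
BLOCK_BYTES = 65536  # example 64k block (assumption)

wide = blocks_needed(ROWS, ROW_BYTES, BLOCK_BYTES)
narrow = split_blocks(ROWS, ROW_BYTES, BLOCK_BYTES, n_tables=128, key_bytes=8)
print(f"one wide table: {wide:,} blocks")
print(f"128 narrow tables: {narrow:,} blocks ({narrow / wide:.2f}x)")
```

In this toy model the repeated key is what inflates the I/O as the tables get narrower; the join work itself, which the model leaves out entirely, comes on top of that.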
What we want (as always with performance and tuning) is a balance. In Oracle 10g on a BIG-IRON SMP machine (32 CPUs, 48GB RAM) you can usually achieve between 5 and 15 joins, depending on the disk and I/O configuration, and the sizes of the tables. Now keep in mind that these are "averaged" numbers, on an "average" VLDW system, and are executed when there is "average load" on those systems to begin with. Unfortunately I cannot publicly disclose actual numbers, nor the performance of these types of queries, except to say that when properly tuned for parallelism they run FAST, and that the average table contains 150 million rows at about 1k each...

We are pulling punches here, because some of the tuning that has been done includes things like Join Indexes, or materialized views, or IOT (Index-Organized Tables) in the cases where row size is 12 to 20 bytes long... We've also used partitioning, database compression, and turned up the parallelism of the indexes available to the optimizers. Keep in mind that these large and very large systems have to be tuned in one form or another in order to get these joins working well. But I wander.... Back to the point.

If we overly normalize (reduce down to 1 or 2 columns per table), we "double" the I/Os needed to get the data back, not to mention the work that has to be done in memory. If we DON'T increase the hardware to handle it, then we may very well end up with "too much parallelism" that overwhelms the existing hardware, causing multiple threads to "wait" on the I/Os of one or more disks, or wait on the availability of computing resources.

So is it the over-normalization that's the problem? No, not necessarily. In this case it's the inability to process everything in parallel all the time. This is why we are seeing a resurgence in column-based appliances - they overcome some of these problems at the firmware level, at the data sharing level, and at the processing level.
In effect they are "joining" every single column to every other column, in parallel, to reconstruct the rows.

Would I recommend this in your standard RDBMS today?

The sweet spot will change based on the BLOCK SIZE selected for the database. As a rule of thumb, for "wider" rows in a normalized table, I shoot for a minimum of 50 to 100 rows per block (block size of 64k). For "smaller" rows in a normalized table, I shoot for a minimum of 400 to 500 rows per block (more if I can get it). In databases that can only handle 8k block sizes, I focus on the parallelism of the query optimizer.

Here's something I can share about SQL Server 2005: upgrade from 2000 (even in the 32 bit) to get the performance gains in the optimizer. But foreign keys actually perform faster than separate indexes on both tables... It seems the optimization has been greatly improved, especially if the foreign key includes a CLUSTERED column (usually a surrogate sequence number).

Another note about SQL Server 2005: if you have an index (non-unique), and you have a clustered primary sequence number, then add the primary sequence number to the non-unique index (at the end), make the index unique, and then join.... It will also be faster in joins. By the way, I normally don't use clustered data sets, except when it comes to surrogate ever-increasing sequences.

Hope this helps,

October 1, 2007
Mistaken Nano-Identities (just for fun)

Interesting how identities can get confused on the web. For instance, type in: "linstedt" + nano as a search term, and what do you get? Tons of hits for my cousin (Adam Linstedt, or A.D. Linstedt). He's a top research scientist in a major university; he's been a marine biologist, and has now been a PhD micro-biologist for years. He's much smarter than I ever hope to be.

Why the rub? It is my birthday, and I just thought I'd see how many references to my name there are on the web.
To be honest, every year, about this time of year, I get inquiries to headline at conferences having to do with Nano-tech, or Nano-biology. I was recently invited to become a member (of course, for a price) of an "exclusive biotech club"... (these were their words). The problem? I think they wanted to invite my cousin, not me. I think they also didn't do their search properly.

I do have an interest in Nano-Tech, but am by no means an expert - it's just a research interest for me (which is why I haven't written anything on the topic for quite a while now). I have to learn more before I can publish again.

But it being my birthday, I just thought I'd share with you this interesting thought. Typing in my last name alone provides hits for many different people (apparently I'm not as alone as I thought I was).... There's Sharon Linstedt, a writer in Buffalo NY, to whom I think I may be related (not yet confirmed) through one of my great great relatives, who immigrated to Manitowoc, Wisconsin in 1880 (or so)... Then there are a couple of Linstedts in Germany, whom I've not yet contacted to see if there's a relationship. My original family name was spelled "Lindstedt" - somewhere along the line, the "d" in the middle was dropped.

Anyhow, if you're really interested in nano-biology, you'll have to contact my extremely brilliant cousin. You can contact me too - but I'll only re-direct you.

Thanks,