Blog: Dan E. Linstedt

December 23, 2005
Who's on First - Abbott and Costello (parody)

This funny parody was submitted by my good friend Kent Graziano, but we do not know who the original author is. Abbott and Costello ringing your phone this holiday?

Subject: Remember "Who's on first?" Enjoy!

You have to be old enough to remember Abbott and Costello, and too old to REALLY understand computers, to fully appreciate this. AND for those of us who sometimes get frustrated by our computers, please read on...

If Bud Abbott and Lou Costello were alive today, their infamous sketch, "Who's on First?" might have turned out something like this:

COSTELLO CALLS TO BUY A COMPUTER FROM ABBOTT

ABBOTT: Super Duper computer store. May I help you?

????????

(A few days later)

December 16, 2005
VLDW: Clustering Versus MPP

I have the wonderful opportunity to teach a VLDW course at TDWI. I also have the sheer joy of dealing with data that qualifies as VLDW feeds, and with some of the most massive systems in the country, on a consulting basis. But there always seems to be the question: which is better, clustering machines together, or one "big honking box"? (Honking = to push the horn button in the middle of your steering wheel.) Well, as usual, I have a very opinionated stance on this, and have discussed it with Kent Graziano, Richard Winter, and a couple of my other friends.

This entry is based on my experience, and what I've seen. I then speculate on what happens to each scenario as the system grows (again based on experience). If there is anyone out there with a different experience, I'd love to hear their thoughts. Here we go.

So what does VLDW mean anyway?

Let's also define what I mean by Clustering:

Let's define what I mean by MPP:

Ok, what have you seen in the marketplace?

Customer 1: DB X - 45 TB, clustered environment, having trouble with network bottlenecks and I/O synchronization. They have a daily call with the engineering staff of the DB X vendor, with patches written just to keep their DBMS up and running.

Ok, here's my two cents, take it for what it's worth. This is the thought experiment I set out on to find out WHY clustering began to exhibit problems at specific levels of volume when MPP showed no signs of slowing down. Here's what I found, and what I speculate:

And so on. The problem compounds itself in such a way that no amount of money in the world (to throw at the problem) can buy the performance required to handle such large data sets.

A basic tenet in life is to "Divide and Conquer" when we are faced with large problems. We need to learn to apply this to our data warehouses, especially in VLDW. The only way to divide and conquer is to use MPP - OR to really buy a "big-honking SMP machine". What am I saying?
This is why, if you look at a machine that runs as a HUGE SMP (32 to 64 CPUs and 64 GB RAM), you see a super power-horse, and also why these machines cost so much. The company that produces that machine has gone through the trouble of solving these problems (or eliminating them) through hardware BUS architecture. It's only on these machines that DB X has been scaled beyond the TB levels I've put here.

Now, there are a couple of other things I wish to note. First, there are SMP clusters that are rack-mount, where the interconnects are a back-bone and a direct connect across the machines. These hold off the problems, but only for a little while. Second, there are the SMP appliances: when plugged in, they act as an MPP architecture of independent nodes, and they are taking market share from the leading MPP RDBMS vendors.

Dedicated rack-mount SMP clusters that act as a "unit" within an MPP environment (handling only a portion of the overall data set) work REALLY REALLY WELL and are extremely fast. They also offer the benefit of fail-over and recovery at lower cost than a single LARGE SMP unit within an MPP environment.

Mathematically there doesn't seem to be an upper limit to MPP data handling, mostly because adding another "node" to the MPP chain divides the work further, and doesn't necessarily "add" to the complexity because synchronization is not needed.

I'll give you several cases where I've been that have MPP in their environments (these are all commercial environments, as the public sector environments are much larger, but cannot be discussed).

Big problems require big solutions. I'd be happy to speak with you off-line about this information, as I teach VLDW, performance and tuning, as well as systems architecture, design, and scalability for the future.

Bottom line: clustering (the way I've defined it here) is not suggested for your future if you expect large volumes.
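
For readers who like to see the scaling argument as arithmetic, here is a rough toy model in Python. It is only a sketch under loud assumptions, not a benchmark: it assumes a shared-nothing MPP node works on its own 1/n slice of the data with no coordination, while a clustered node does the same slice but the system also pays a fixed synchronization cost for every pair of nodes. The function names, the pairwise-cost formula, and the numbers are all hypothetical and chosen purely for illustration.

```python
def mpp_time(total_work: float, nodes: int) -> float:
    """Shared-nothing MPP (assumed model): every node handles its own
    slice of the work and never synchronizes with its peers."""
    return total_work / nodes


def cluster_time(total_work: float, nodes: int, sync_cost: float) -> float:
    """Clustered system (assumed model): the work is still divided, but
    each pair of nodes adds a fixed synchronization penalty."""
    pairs = nodes * (nodes - 1) / 2  # number of node-to-node relationships
    return total_work / nodes + sync_cost * pairs


if __name__ == "__main__":
    work = 1000.0  # hypothetical units of work
    sync = 0.5     # hypothetical cost per node pair
    print("nodes  mpp      cluster")
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:5d}  {mpp_time(work, n):7.1f}  {cluster_time(work, n, sync):7.1f}")
```

Under these made-up numbers the MPP time keeps falling as nodes are added, while the cluster time falls at first and then climbs again once the pairwise synchronization term outweighs the savings from dividing the work. Real systems will not follow this formula exactly; the point is only the shape of the two curves.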

I welcome any thoughts, critical or otherwise. I'd love to hear about successes in the clustered environment; maybe we can flesh out what's acceptable to cluster.

Thanks,

December 13, 2005
Competitive Decision Time Is Shrinking

We've all heard it, it's there. Most of us know it - yet we refuse to accept it. There are strange happenings within the strategic use of information across the organization. The REAL question going forward will be: what will the value of STRATEGIC data sets be to the organization in the future? The whole question invites the opposite thought process: do a bunch of fast TACTICAL decisions today define the STRATEGIC decision of tomorrow?

In this day and age executives and decision makers are finding less and less time to "decide" what to do strategically with the organization. Yet strategic decisions become more valuable when they are made at the RIGHT TIME. Tactical decision making is on the rise, and in fact is using "learned" information from a strategic base of data (patterns to create knowledge) to TEST tactical decisions. Confused yet? Sorry. Here's an example:

Today, if you tell me you have a strategic plan for the next 10 years, I might begin to question how agile your company is in the face of changing market conditions. I might tell you your company may not be around in 5 years (unless the plan changes as you go along).

What does this mean? I'm not saying that all strategic plans are washed up, nor am I saying that strategic decision making is completely gone, or that it will ever completely go away. I am saying that the nature of strategic decision making is changing - to be more agile. The lines between what's tactical and what's strategic are changing and blurring together. I am saying that strategic and tactical decisions (if made incorrectly, or without enough learned background) cost more today than they did in the past. I am also suggesting that tactical decisions need to be made on the basis of data mining of all that history.

The notions of time are speeding up (see Ray Kurzweil, and The Age of Spiritual Machines). What I would like to know is: in your organization, or companies you've worked in (without mentioning names), what have you seen in regard to their ability to think Tactically vs Strategically? Do they have long-term plans that guide them and are unchanging? It no longer pays to be a dinosaur of giant proportions; it seems to pay better to be more like a body of water - fluid, dynamic, and possibly covering large areas of ground.

Cheers,

Where does EII need to go?

In this entry I will explore some futuristic capabilities (a wish list) of features that I would like to see EII work towards. Real questions are beginning to surface about EII, ETL / ETLT, and EAI. There are other questions about web-services, security, standardization, and the best practices needed for implementation of SOA around the enterprise. Let's take a look at the feature set that may be needed in an EII tool in the near future.

What are some of the business problems that EII solves compared to ETL and EAI?

Technical Problems that EII solves

What we need is a single tool, a single interface, to handle a much broader set of requirements. EII has such a narrow scope right now (because most EII tools are just now coming into the second generation) that additional functionality is necessary to really take a chunk of the market space. For instance, a huge potential exists for a very strong single GUI in an EII tool to manage, maintain, and help define UDDI registries (in other words, manage the web-services through metadata). Today, there appear to be partnerships between EII vendors and "Registry" vendors. This is good, but won't remain a differentiator for long.

Wish list of features

Their ability to truly integrate the enterprise and ALL of its data (not necessarily in volume, but remaining true to the notions of currency) will have a huge impact IF this information can also be managed. Reaching into new domains of information integration will help EII grow into a major player in the implementation space.

SOA is growing, best practices are being developed, and web-services and EII are major players in the success of SOA, particularly when EII can provide the management of the web-services and their metadata. It's a domain that is a natural fit for EII. The EII vendor of the future will "purchase" a registry solution as their own, and will begin to differentiate from other vendors in this area and in what they can do with the metadata.

One of the largest keys to success will be: how does the EII tool tackle the problem of "bringing that management to the end-user?" In other words, can the tool provide enough of an end-user or business user interface to entice metadata management to take place as a natural function of business? The GUI and codeless solutions will become more and more important, and tying the metadata to a master integrated meta-model (single view of the enterprise) will also become paramount to success.

Finally, the EII tool that can communicate bi-directionally with a metadata solution will have tremendous success, as business users see added leverage in utilizing a single GUI to assist with true EII.

Do you agree / disagree? I'd love to hear your thoughts on the matter.

Thanks,

December 8, 2005
Virtual "Data Tables" for EII

There is a new concept on the horizon of EII known as Virtual Tables: structures and temporary data stores that capture data from the sources and refresh it on request. In this entry we will explore the nature of virtual tables and temporary data storage, and the pros and cons of this mechanism.
I'm not sure it belongs in Dynamic Data Warehousing, but it's not your "ordinary" mechanism for data access, therefore it's dynamic in nature - ever changing without management. Without further ado, let's take a look at this concept.

There's a vendor, Ipedo, who has produced a thing called Virtual Tables for EII queries. What does this mean? What does this bring to the table? Who can use it? Should it be implemented across the board? We'll try to answer a few of these questions. All that I blog on here is based on my personal experience and speculation about what the future holds. There's nothing I like more than to have those in the field offer their opinions in response to my blog entries. Thank you to all those who've commented in the past year; I look forward to additional comments on this.

What's a Virtual Table anyway?

What's the power in a virtual table?

Don't get me wrong, a Virtual Table is NOT a replacement for the warehouse. Virtual Tables, when used within the right context, provide the EII tool with a powerful solution.

What are the pros and cons of the Virtual Table within an EII solution?

Cons:

What are some of the challenges?

Bottom line?

Thoughts? Comments? What do you think about a virtual table?

Cheers,

December 2, 2005
Data Mining and the Active Data Warehouse

Where is data mining these days? What power can it bring to the table? If I build an Active Data Warehouse will it come, is it necessary? There are many questions floating around these days, and I've written a little bit about this topic in the past. In this post I will attempt to discuss some of the newer thoughts about this subject, and push the envelope out a little further than maybe we're comfortable with. This entry is a thought experiment, but has implications in today’s computing arena.

Data Mining has grown up over the recent years.
It's been around for a long, long time, but I guess I should say it's become much more "usable" in the business users' eyes, and it's beginning to appear as embedded technology. It's now plugged in to Teradata RDBMS, Oracle RDBMS, SQL Server 2005 Integration Services and RDBMS, DB2 UDB RDBMS, FirstLogic IQ Suite, SAS ETL and BI tools, and so on - there are too many to list here. The point is that it is beginning to be utilized to enhance the quality of information.

"According to the example of Baosteel production, this paper introduces the way of using data mining technology -- SAS/EM to discover the rules that we don’t know before and it can improve the quality of products and decrease the cost." (1)

Data mining is not just about data quality; it's also about business process quality, deeper understanding of our environment, and the quality of our products. In this company's case they concluded that "...How to use data is an important thing that faces everyone. We should apply the data mining technology to more fields." (1)

I would tend to agree. Additional fields (in my mind) include mining active data as it arrives, in context with the strategic data for which a knowledge pattern has already been "learned" or established. Other areas may include mining the architecture in which the data sits (i.e., the data model), mining the processes that link the data together (looking for flaws or better ways to deal with it), mining the metadata around the data set for additional context, and so on.

"In this paper we introduce data quality mining (DQM) as a new and promising data mining approach from the academic and the business point of view. The goal of DQM is to employ data mining methods in order to detect, quantify, explain and correct data quality deficiencies in very large databases. Data quality is crucial for many applications of knowledge discovery in databases (KDD).
So a typical application scenario for DQM is to support KDD projects, especially during the initial phases. Moreover, improving data quality is also a burning issue in many areas outside KDD. That is, DQM opens new and promising application fields for data mining methods outside the field of pure data analysis. To give a first impression of a concrete DQM approach, we describe how to employ association rules for the purpose of DQM." (2)

Active Data Warehousing is about integrating the ODS and Data Warehouse into a single instance, a single data store. It's about capturing data as it happens (at the right time) into the warehouse as a statement of fact, and then leveraging that data to make both strategic and tactical decisions in time with the enterprise. Active Data Warehousing also brings in massive sets of information to deal with, making the task doubly difficult. Of course, with an Active Warehouse we also need to utilize data arriving in real time.

One notion I've believed in is something I call Active Mining. Active Mining is the ability to start a neural net, pre-load it with the historical data, and then as data arrives (when it arrives), add it to the neural net already in play. In other words - no waiting, no "re-running" of the mining algorithms to get the result. Of course in the beginning (or depending on how much history is mined when started), the neural net may need to be shut down and restarted - but as time goes on, less and less correction is necessary.

I believe that active mining will take the forefront and will be embedded in every process through the streams of data that we deal with on a daily basis. That's not to say that there's no value in storing the existing level of detail as a statement of fact in the warehouse; there certainly is value in that. But moving forward, dynamically understanding how well the new data fits may become a critical factor in business operations.
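
To make the Active Mining idea a bit more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than a description of any product: a single logistic neuron stands in for the "neural net", the class and function names (`OnlineNeuron`, `preload`) are hypothetical, and the data is made up. The model is pre-loaded from history once, then each arriving record is first scored (how well does it fit what has been learned?) and then folded into the model, with no re-run over the history.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Record = Tuple[Sequence[float], float]  # (input values, expected 0/1 outcome)


class OnlineNeuron:
    """A single logistic neuron updated one record at a time.

    This is a deliberately tiny stand-in for a neural net, used only to
    illustrate incremental learning.
    """

    def __init__(self, n_inputs: int, learning_rate: float = 0.1) -> None:
        self.weights: List[float] = [0.0] * n_inputs
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: Sequence[float]) -> float:
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, x: Sequence[float], y: float) -> float:
        """Score the record, then fold it into the model.

        Returns the absolute prediction error measured BEFORE the update,
        a rough indicator of how well the new record fits the model.
        """
        error = y - self.predict(x)
        self.bias += self.learning_rate * error
        self.weights = [w + self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        return abs(error)


def preload(model: OnlineNeuron, history: Iterable[Record], passes: int = 200) -> None:
    """Pre-load the model by replaying the historical records several times."""
    records = list(history)
    for _ in range(passes):
        for x, y in records:
            model.learn(x, y)


if __name__ == "__main__":
    # Hypothetical history: the outcome is 1 when the single input is 1.
    history = [([0.0], 0.0), ([1.0], 1.0)]
    model = OnlineNeuron(n_inputs=1)
    preload(model, history)

    # Records arriving "live": each one is scored and absorbed immediately.
    for x, y in [([1.0], 1.0), ([1.0], 0.0)]:
        misfit = model.learn(x, y)
        print(f"record {x} -> {y}: misfit before update = {misfit:.2f}")
```

In this sketch a record that agrees with the history produces a small misfit value and one that contradicts it produces a large one, which is one simple way to watch how well new data fits as it streams in.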

Speaking of business operations, there is a company called Purple Insight which has (in my opinion) begun to master the ability to tie data mining and its results to visualization. Check them out here. Using Active Mining to feed a live visualization of the data may also begin to play a powerful role in the future "use" of our information sets.

References:

DNA Computing & Tic-Tac-Toe

I came across this entry this morning, where DNA computing in enzymes has been activated to play tic-tac-toe. According to the article, the system cannot be beaten. The article also goes on to discuss how the enzymes affect the DNA strands around them, cutting, splicing, and attaching depending on their choice. In this blog posting I will explore what some of the "possible applications" of this technology might be, a simple thought experiment if you will. The article can be found here.

I've spent a lot of time writing about the nature of convergence, and the fact that I believe "wet-technology", the mix between natural world models and our electronic models, is coming together. Nowhere is this more evident than here. In this particular case we have the electronic gates / switches that we normally use to play tic-tac-toe, only they are placed into DNA enzymes. This raises some very interesting questions:

1. How parallel is this DNA computer?

We're always molding our natural world into models that we see fit our needs, for instance moving the tubes into a sequence to represent tic-tac-toe. What if instead, we utilized a single strand of folded DNA in three dimensions to represent a tic-tac-toe board? Could the single solution with a single DNA strand play the game on a much smaller level? This is the type of question that would lead deeper into the Nanohouse abilities: the ability to control a single DNA strand, and utilize a model that already exists to achieve our goals. We would have a much larger scale repeatable model if we could do this.

The thought experiment:

Now suppose we released 120 people and told them to go "take" six cars from each train - the only requirement is that they all need to choose a different 6-car set. This might represent the chemical release to a DNA strand, where each of the "people" or incoming chemical mix matches with a specific DNA place in the chain. By repeating this process, and having the "computer" or the "game" choose other car sets, you've effectively re-created a logic gate computing device at the DNA strand level.

The other thing we've done here is suggest that the computations occur in parallel, and that data sets can be different for each "action" - told to attach itself to DNA at different parts of the strand. We've effectively re-created the possibility to play a very large number of finite "games", all in parallel. Very quickly the "winning pattern" will emerge, and these may become the rules that are applied to the next engine going forward - in other words, spot the "learning pattern". Of course, change the game and we have to start all over again. The learned rules for Tic-Tac-Toe don't necessarily work for checkers or chess.

Some of the other questions still rolling around in my head are:

It seems to me that electron spin computing has a ways to go, and isn't making advances as fast as DNA computing, but that remains to be seen. It also appears to be a more difficult challenge, as DNA molecules are much larger than electron based control at the atomic level. However, I must ask the question: if we can search 10^8 terabytes of DNA solution in 3 seconds, how fast (if it ever can be done) will an electron spin computing device search 10^8 terabytes? I must also ask, is it really worth the cost or difficulty of overcoming its (electron spin) obstacles to make it happen?

I'd love to hear your thoughts and ideas.