Blog: Krish Krishnan

August 31, 2007
Which Appliance fits your needs?

Wow! Take a deep breath. The data warehouse marketplace is getting inundated with all sorts of cool gadgets called Appliances or Accelerators (let us keep it there for now). What do these tools do for you? How different are they, and how do you identify their succinct differences?

If your primary goal is performance and scalability, with cost coming as the last criterion (CFOs, pretend not to hear or read this line), then you can start buying all kinds of fancy appliances to suit all the different needs and keep adding complexity to the architecture of your solution. But you rarely have a situation like this. On the contrary, the focus is to reduce the TCO while improving performance and scalability. Here is where you have to start choosing wisely which group of appliances you need in the toolkit for the data warehouse and the business intelligence and analytical applications it serves.

Let us split this further: separate the requirements of performance and scalability into a data management problem and a data presentation problem. By doing this you isolate the real issues into manageable components, and we can start looking at what the marketplace offers to satisfy those specific problems or requirements. You have data warehouse appliances like Netezza and Dataupia that address the data management problem and can be a backplane to the data presentation problem. Then you have vendor offerings for speeding up the business intelligence portions, like Cognos Celequest and ParAccel, which are accelerators for specific solution stacks. If you have invested in a tool that is not supported by an accelerator, you could still add layers in the data management area to help speed up the presentation layer. The bottom line is that there are distinct differences between the different kinds of "appliances".
Before you rush into a POC or other decisions, we will need to evaluate your requirements thoroughly. Determine the weakest link in your current state architecture, then decide the next steps. In case you are wondering how to determine the first step of this game, a separate whitepaper on the topic, titled 'how to determine whether your data warehouse requires an appliance', is in the works and will be available shortly.

August 25, 2007
The Need for Speed

A very simple word in the English language keeps data warehouse architects, database administrators and data warehouse project managers up at night: Speed. Business users constantly demand more speed in terms of performance and scalability (increased data volumes). A consequence of this is a direct increase in the cost of the data warehouse, whether you add services to augment staff or add hardware and storage. At the end of this cycle, users are left with solutions that perform below their expectations and IT is left with a big gaping budget deficit.

How do we win this constant battle? Here is where the data warehouse appliance comes in handy. Although the technology and the architecture are not new if you consider Teradata an appliance, the rest of the vendor offerings are new and have been introduced in the last five to seven years. The hesitation in the IT user community to adopt the appliance and include it in the technology stack is based on the fact that it is new, and that appliances are built on database platforms which are community developed.

What is it that makes the appliance work better than traditional solutions when we talk about scalability and performance? It is the architecture of the appliance that makes the difference. If the argument is that one can make any database platform scale and perform, the answer is yes. Being a DBA myself, I can see how it can be done.
But when one starts engaging in that exercise, there is an associated cost that is not cheap, and often the end solution does not satisfy the user needs to the fullest extent. By introducing the appliance into the data warehouse architecture, we are not claiming to reduce complexity in your architecture. What we can achieve by adding the appliance to the data warehouse is this: tasks like table load balancing for data volumes, partition pruning, etc. can be achieved simply by using native commands provided in the appliance. In a traditional solution these tasks would need extensive planning and coordination between database administrators, storage administrators and end users, and would require downtime from an end user perspective (though RDBMS technologies can support online table and partition management).

If the question is about maintenance and support of the massively parallel processing (MPP) technology that provides the much needed performance boost, the MPP engine is developed and supported by each appliance vendor like any other technology. With respect to appliance management itself, the vendors provide interfaces to manage the appliance much as you manage databases. With some of these tangible benefits, the short term pain of adopting a new technology will potentially free up the data warehouse IT team to focus on the needs of the core solution in the longer term.

The goal in considering the appliance for the data warehouse architecture should be to provide end users with a performing and scalable solution while controlling the cost of the data warehouse. The idea is to add the appliance, not to replace your existing architecture, unless the existing architecture needs a complete replacement. BI tool vendors are moving to adopt the appliance as the platform to offer low cost packaged solutions. Other tool vendors in the data warehouse space will follow suit in the future.
The appliance will reduce the workload on the data warehouse when incorporated into the solution. It costs less to execute large queries on the appliance than on the data warehouse itself, simply because you leave the data warehouse available for other users. If you start examining the remaining areas and its benefits, you will see that this technology is here to stay.

To summarize, when you quantify the success of your data warehouse project, the business goals that are your requirements for the data warehouse should be met with information delivery that adds value and enables the business to make informed decisions faster. The appliance is an enabler of that goal and definitely not a detriment. The budget aspect and other ideas will be blogged in the days to follow.

August 22, 2007
A Scalable Architecture

Why does scalability need to be considered when you are selecting an architecture for your data warehouse? Whenever you see performance problems in the data warehouse, the knee-jerk reaction is to start tuning the data warehouse RDBMS platform, OS platform, queries and end user applications. While this is a good stopgap, you will start running in circles on the query tuning and platform settings, since tuning for one application or user type will affect someone else adversely. This is a day to day situation faced by all of us in a data warehouse.

August 14, 2007
Why you need Data profiling

An interesting problem that often surfaces in data warehousing and business intelligence activities is the content within the different attributes. Take a scenario of a simple data warehouse solution consisting of customer, product, time, location and transactions. This data model has to accommodate multiple locations and their transactions in a unified presentation to the end business user, as mandated by the business requirements. All of this is fine and dandy.
The data model is approved by the business users in a data governance meeting, metadata definitions are agreed upon and the physical database has been created. Now you load the data warehouse, build your aggregates and summary data and declare that it is ready for QA and UAT. A harried report user calls out an error in the calculations for certain locations. This sets off a chain of investigations, and after hours of time from various team members (not to forget the Starbucks coffee and Krispy Kreme donuts) it is determined that the sales data reported by these locations is at a different level than that of the rest of the locations.

Your first instinct is to start looking at data mapping from source to target, end user training notes, data model reviews, etc. Even after going through them with a fine-tooth comb, you cannot determine how this occurred. The net result is that all the data loaded for these locations has to be scrubbed and reloaded. This is not bad if you have the source data available; otherwise it is a far worse problem.

How could you mitigate these issues? What processes need to be adopted to mitigate the risk? A few simple steps can help mitigate the problem to a large extent:

1. Confirm the business requirements gathered with sample data.

Whatever the steps may be, they should be executed in a proactive fashion. This will alleviate the risks and minimize the need to revisit the issue at a later point, when any mitigation strategy will be expensive.

August 7, 2007
Does data have a lifecycle? Part II

What impact does the data lifecycle within a data warehouse have on your overall costs? Let us start examining this by looking at a scenario:
ABC Corporation deployed an enterprise data warehouse and has been using it for about 6 years. Over time there have been a number of changes to the source systems that feed the data warehouse and to the business rules around data transformation into the data warehouse. Currently the data warehouse is experiencing severe service level agreement issues on data availability and warehouse availability. The initial size of the data warehouse was 500GB, and it has grown to over 5TB in 6 years.

August 6, 2007
Does data have a lifecycle? - Part 1

Data warehouses have been evolving over time, and data within the warehouse will reach a maturity point beyond which it will be obsolete. Is the simplistic answer to just delete or archive this data? No. Just deleting the data creates space but does not solve the underlying problem, which is the value of the data (read attribute in the E-R world) itself.
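As a rough aside on the ABC Corporation scenario above (500GB growing to over 5TB in 6 years), the implied yearly growth can be estimated with a short calculation. This is only a minimal sketch under stated assumptions: the function names are illustrative, 5TB is taken as 5000GB, growth is treated as a constant compound rate, and because the scenario says "over 5TB" the result is a lower bound.

```python
import math


def annual_growth_rate(start_size_gb: float, end_size_gb: float, years: float) -> float:
    """Compound annual growth rate implied by growing from start to end size."""
    return (end_size_gb / start_size_gb) ** (1.0 / years) - 1.0


def doubling_time_years(rate: float) -> float:
    """Years needed for the volume to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + rate)


# Figures from the ABC Corporation scenario; 5TB is taken as 5000GB (assumption).
rate = annual_growth_rate(500.0, 5000.0, 6.0)
print(f"Implied annual growth: {rate:.1%}")
print(f"Doubling time: {doubling_time_years(rate):.1f} years")
```

Under these assumptions the warehouse grew by roughly 47% a year, doubling about every 1.8 years, which is the kind of growth that makes the question of how much of that data still carries value worth asking.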
Sit back and take a look at the data evolution in your data warehouse. When you start examining it, you see that your data architecture has been constantly evolving to accommodate source system changes and end user demands, while portions of the data elements from legacy source systems do not even get to the data warehouse anymore. When you scream, "My goodness, why do we have all this extra data?", you will realize that the data within the data warehouse does lose its value, thereby reaching an end of lifecycle. Now it makes sense all of a sudden that archiving the data is not the solution.

How do you determine the lifecycle of data in the data warehouse? To answer this question, you will need to have information about the following.

Once you have gathered the relevant information about the data, you will compile a findings and recommendations document, meet with the data governance committee and get the final approval on the obsolete data removal. Removing obsolete data, along with its definition, benefits the data warehouse in the following areas.

Remember that this is not a simple task and that it requires elaborate planning and execution. You will be changing the data definitions for dimensions in the data warehouse, thereby requiring testing and user approval before this goes to production. Fortunately for us, from a methodology perspective, Bill Inmon’s DW2.0 will serve as a blueprint in this exercise. For more information on DW2.0 please visit www.inmoncif.com or watch this space in upcoming months for articles around this topic. A second part of this blog will address some technology specific ideas.

August 2, 2007
Data Warehouse Solution Architecture - Why does it matter?

In recent years, there has been a surge in the volume of information that needs to be stored in any organization, primarily due to changes in regulatory policies, and also due to mergers and acquisitions, global business expansions and other activities.
Against this background, organizations have started looking at their current investments in data warehousing and started leveraging the existing solution to accommodate more data. This is the correct approach, in keeping with the definition of the data warehouse as the “single, integrated, non-volatile version of truth.” We all know that every organization engages teams to work on this task, and a lot of planning, development, testing, etc. happens before and after this integration is done.

The question that begs to be answered is: does the data warehouse serve all the facets of the organization with this data, or is this data needed by everybody across the organization for any reason? If the answer is no, then why do we need to load this data into the data warehouse? Why can we not just store it offsite and access it as needed?

A different way to approach this matter is to consider solution architecture for your data warehouse. By this we are not discussing the design methodology, database technology or BI tools. We are discussing the overall approach to solving your data warehouse requirements. Look for further reading on this in an upcoming article in my Business Intelligence Network Data Warehouse Appliance Expert Channel.

Krish