

Blog: Krish Krishnan

Krish Krishnan

"If we knew what it was we were doing, it would not be called research, would it?" - Albert Einstein.

Hello, and welcome to my blog.

I would like to use this blog for constructive communication and exchanges of ideas in the business intelligence community, on topics ranging from data warehousing to SOA to governance, and everything under the umbrella of these subjects.

To maximize this blog's value, it must be an interactive venue. This means your input is vital to the blog's success. All that I ask from this audience is to treat everybody in this blog community and the blog itself with respect.

So let's start blogging and share our ideas, opinions, perspectives and keep the creative juices flowing!

About the author

Krish Krishnan is a worldwide-recognized expert in the strategy, architecture, and implementation of high-performance data warehousing solutions and big data. He is a visionary data warehouse thought leader and is ranked as one of the top data warehouse consultants in the world. As an independent analyst, Krish regularly speaks at leading industry conferences and user groups. He has written prolifically in trade publications and eBooks, contributing over 150 articles, viewpoints, and case studies on big data, business intelligence, data warehousing, data warehouse appliances, and high-performance architectures. He co-authored Building the Unstructured Data Warehouse with Bill Inmon in 2011, and Morgan Kaufmann will publish his first independent writing project, Data Warehousing in the Age of Big Data, in August 2013.

With over 21 years of professional experience, Krish has solved complex solution architecture problems for global Fortune 1000 clients, and has designed and tuned some of the world’s largest data warehouses and business intelligence platforms. He is currently promoting the next generation of data warehousing, focusing on big data, semantic technologies, crowdsourcing, analytics, and platform engineering.

Krish is the president of Sixth Sense Advisors Inc., a Chicago-based company providing independent analyst, management consulting, strategy and innovation advisory, and technology consulting services in big data, data warehousing, and business intelligence. He serves as a technology advisor to several companies and is actively sought after by investors to assess startup companies in data management and associated emerging technology areas. He publishes with BeyeNETWORK.com, where he leads the Data Warehouse Appliances and Architecture Expert Channel.


August 2007 Archives

Wow! Take a deep breath: the data warehouse marketplace is getting inundated with all sorts of cool gadgets, also called appliances or accelerators (let us keep it at that for now). What do these tools do for you? How different are they, and how do you identify their distinct differences?

If your primary goal is focused on performance and scalability, with cost coming in as the last criterion (CFOs, pretend you did not read this line), then you can start buying all kinds of fancy appliances to suit all the different needs and keep adding complexity to your solution architecture. But rarely do you have a situation like this; on the contrary, the focus is to reduce the TCO while improving performance and scalability. This is where you have to choose wisely which group of appliances you need in the toolkit for the data warehouse and the business intelligence and analytical applications it serves.

Let us split this further and separate the requirements of performance and scalability into a data management problem and a data presentation problem. By doing this you isolate the real issues into manageable components, and we can start looking at what the marketplace can offer to satisfy those specific problems or requirements.

You have data warehouse appliances like Netezza and Dataupia that address the data management problem and can serve as a backplane for the data presentation problem. Then you have vendor offerings for speeding up the business intelligence portions, like Cognos Celequest and ParAccel, which are accelerators for specific solution stacks. If you have invested in a tool that is not supported by an accelerator, you can still add layers in the data management area to help speed up the presentation layer.

The bottom line is that there are distinct differences between the different kinds of "appliances." Before you rush into a POC or other decisions, you will need to evaluate your requirements thoroughly. Determine the weakest link in your current-state architecture, then decide the next steps.

In case you are wondering how to determine the first step of this game, a separate whitepaper titled 'How to determine whether your data warehouse requires an appliance' is in the works and will be available shortly.


Posted August 31, 2007 9:13 AM
Permalink | 1 Comment |

A very simple word in the English language keeps data warehouse architects, database administrators, and data warehouse project managers up at night: speed. Business users constantly demand more speed in terms of performance and scalability (increased data volumes). A consequence is a direct increase in the cost of the data warehouse, whether you add services to augment staff or add hardware and storage. At the end of this cycle, users are left with solutions that perform below their expectations, and IT is left with a big, gaping budget deficit. How do we win this constant battle? This is where the data warehouse appliance comes in handy.

Although the technology and the architecture themselves are not new (if you consider Teradata an appliance), the rest of the vendor offerings are new, introduced in the last five to seven years. The hesitation among the IT user community to adopt the appliance and include it in the technology stack stems from the fact that it is new, and that appliances are built on community-developed database platforms.

What is it that makes the appliance work better than traditional solutions when we talk about scalability and performance? It is the architecture of the appliance that makes the difference. If the argument is that one can make any database platform scale and perform, the answer is yes; being a DBA myself, I can see how it can be done. But when one starts engaging in that exercise, there is an associated cost that is not cheap, and often the end solution does not satisfy the user needs to the fullest extent.

By introducing the appliance into the data warehouse architecture, we are not claiming to reduce the complexity of your architecture. What we can achieve by augmenting the data warehouse with the appliance is that tasks like table load balancing for growing data volumes and partition pruning can be accomplished simply by using the native commands the appliance provides. In a traditional solution, the same tasks would need extensive planning and coordination between database administrators, storage administrators, and end users, and would require downtime from an end-user perspective (though RDBMS technologies can support online table and partition management). If the question is about maintaining and supporting the massively parallel processing (MPP) technology that provides the much-needed performance boost, the MPP engine is developed and supported by each appliance vendor like any other technology. As for managing the appliance itself, vendors provide interfaces to manage the appliance much as you manage databases. With these tangible benefits, the short-term pain of adopting and augmenting with a new technology will potentially free up the data warehouse IT team to focus on the needs of the core solution in the longer term.

The goal of considering the appliance in the data warehouse architecture should be to provide end users with a performant, scalable solution while controlling the cost of the data warehouse. The idea is to add the appliance, not replace your existing architecture, unless the existing architecture needs a complete replacement. BI tool vendors are moving to adopt the appliance as the platform for offering low-cost packaged solutions. Other tool vendors in the data warehouse space will follow suit.

When incorporated into the solution, the appliance will reduce the workload on the data warehouse. Executing large queries on the appliance costs less than running them on the data warehouse itself, simply because it leaves the data warehouse available for other users. If you start examining the remaining areas and their benefits, it becomes clear this technology is here to stay.

To summarize, when you quantify the success of your data warehouse project, the business goals that form your requirements should be met with information delivery that adds value and enables the business to make informed decisions faster. The appliance is an enabler of that goal and definitely not a detriment. The budget aspect and other ideas will be blogged about in the days to follow.


Posted August 25, 2007 6:03 PM
Permalink | 1 Comment |

Why does scalability need to be considered when you are selecting an architecture for your data warehouse?

Whenever you see performance problems in the data warehouse, the knee-jerk reaction is to start tuning the RDBMS platform, OS platform, queries, and end-user applications. While this is a good stopgap, you will soon be running in circles on query tuning and platform settings, since tuning for one application or user type will adversely affect someone else. This is a day-to-day situation faced by all of us in a data warehouse.

In hindsight, we all start examining what growth has occurred in the data warehouse, from both a data and a user perspective, and what workload the data warehouse is executing on a daily basis.

This is where the scalability question arises. When you start the design process for a data warehouse, you need to examine the types of queries and applications that will use the solution and what kind of mixed workload you need to anticipate. The reason for this exercise is that once you can predict the overall volumetric growth in terms of data, you can also predict how heavily your infrastructure will be used in supporting the different types of workload, by running sample workload queries and simulating the users.
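As a rough illustration of that exercise, here is a minimal sketch in Python of simulating a mixed workload with a pool of concurrent users and recording response times. The run_query helper and the sample statements are hypothetical placeholders; in a real test they would submit representative SQL from your own workload through the warehouse's driver.

    import concurrent.futures
    import random
    import statistics
    import time

    # Hypothetical stand-in: in practice this would submit the SQL to the
    # warehouse via its ODBC/JDBC driver and return when the result set arrives.
    def run_query(sql):
        time.sleep(random.uniform(0.1, 2.0))

    SAMPLE_WORKLOAD = [
        "SELECT ...  -- short tactical lookup",
        "SELECT ...  -- daily summary report",
        "SELECT ...  -- large historical scan",
    ]

    def simulate(concurrent_users=25, queries_per_user=10):
        timings = []

        def one_user(_):
            for _ in range(queries_per_user):
                start = time.perf_counter()
                run_query(random.choice(SAMPLE_WORKLOAD))
                timings.append(time.perf_counter() - start)

        # Each simulated user runs in its own thread, firing queries back to back.
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrent_users) as pool:
            list(pool.map(one_user, range(concurrent_users)))

        timings.sort()
        print("median response: %.2fs, 95th percentile: %.2fs"
              % (statistics.median(timings), timings[int(len(timings) * 0.95)]))

    simulate()

Repeating the run while stepping up the number of concurrent users and the data volume behind the queries gives a first-order picture of where the current infrastructure stops scaling.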

If you did not go through the scalability exercise when you designed the data warehouse, for whatever reason (best RDBMS, vendor best practices, etc.), then whenever you come across this problem, start running the scalability exercise on your infrastructure. At the end of it you will understand the limitations that infrastructure constraints place on the traditional data warehouse architecture, likely leaving you quite frustrated and your CFO fuming at the dollars already spent and any proposed spend.

An alternative way to approach the infrastructure limitation is to explore newer advances in technology, one of which is the data warehouse appliance. There are multiple articles and white papers on the subject for background reading. The data warehouse appliance is intended to be added to the data warehouse architecture to address the scalability issue; it is built from the ground up to deliver sustained performance at lower cost.

While I'm not saying that by implementing a data warehouse appliance you have a silver bullet for your scalability needs, I am assuring you that it is worth your while to start looking at this addition to your data warehouse architecture to ensure that scalability needs are met.


Posted August 22, 2007 10:10 AM
Permalink | No Comments |

An interesting problem that often surfaces in data warehousing and business intelligence activities is the content within the different attributes.

Take the scenario of a simple data warehouse solution consisting of customer, product, time, location, and transactions. The data model has to accommodate multiple locations and their transactions in a unified presentation to the end business user, as mandated by the business requirements. All of this is fine and dandy.

The data model is approved by the business users in a data governance meeting, metadata definitions are agreed upon, and the physical database is created. Now you load the data warehouse, build your aggregates and summary data, and declare it ready for QA and UAT.

A harried report user calls out an error in the calculations for certain locations. This sets off a chain of investigations, and after hours of time from various team members (not to forget the Starbucks coffee and Krispy Kreme donuts), it is determined that the sales values reported by these locations are at a different level than the rest of the locations.

Your first instinct is to start looking at the data mapping from source to target, end-user training notes, data model reviews, and so on. Even after combing through everything with a fine-tooth comb, you cannot determine how this occurred. Net-net, all the data loaded for these locations has to be scrubbed and reloaded. That is not so bad if you have the source data available; otherwise it is a far worse problem.

How could you mitigate these issues? What processes need to be adopted to mitigate the risk? A few simple steps can help mitigate the problem to a large extent:

1. Confirm the business requirements gathered against sample data.
2. Set up and execute a data profiling activity; tools are available for under $1,000. Profile the data from each input to confirm that values within the data attributes are consistent in type, length, and usage (see the sketch after this list).
3. If the results from the steps above show problems with the data content in attributes, classify the attributes in question and present the problem to the business and data governance teams.
4. Do all these activities in parallel with the data modeling process; this way you can make changes to the data model, if needed, before anything goes to a physical database.
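As a minimal sketch of step 2, assuming the source extracts arrive as delimited files and using only the Python standard library (the file name in the usage comment is illustrative, not tied to any particular profiling tool):

    import csv
    from collections import Counter, defaultdict

    def profile(path, delimiter=","):
        """Report the observed value types and length ranges for every column in a file."""
        types = defaultdict(Counter)
        lengths = defaultdict(lambda: [float("inf"), 0])
        with open(path, newline="") as f:
            for row in csv.DictReader(f, delimiter=delimiter):
                for column, value in row.items():
                    value = (value or "").strip()
                    types[column][infer_type(value)] += 1
                    low, high = lengths[column]
                    lengths[column] = [min(low, len(value)), max(high, len(value))]
        for column in types:
            print(column, dict(types[column]), "length range:", lengths[column])

    def infer_type(value):
        if value == "":
            return "empty"
        try:
            float(value)
            return "numeric"
        except ValueError:
            return "text"

    # Run this against each location's extract and compare the reports; a column
    # whose type mix or length range differs by location is a candidate for review.
    # profile("location_101_sales_extract.csv")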

Whatever steps are executed, they should be done proactively. This will alleviate the risks and minimize the need to revisit the issue at a later point, when any mitigation strategy will be expensive.


Posted August 14, 2007 1:52 PM
Permalink | No Comments |

What impact does the data lifecycle within a data warehouse have on your overall costs? Let us start examining this by looking at a scenario.

ABC Corporation has deployed and been using an enterprise data warehouse for about 6 years. Over time there have been a number of changes to the source systems that feed the data warehouse and to the business rules around data transformation into the warehouse. Currently the data warehouse is experiencing severe service level agreement issues on data availability and warehouse availability. The data warehouse started at 500GB and has grown to over 5TB in 6 years.

At the most recent meeting, the CFO of the corporation asked the IT and IS departments to give him a TCO report on the data warehouse. Based on the total size of the data warehouse and its importance across the company, we can estimate the following (figures shown are sample dollar values):

Initial cost of hardware - $750,000
Initial cost of software (ETL, RDBMS, OS) - $300,000
Initial cost of deployment (services, installation etc) - $1,200,000
Initial cost of backup and recovery solution - $300,000

Ongoing annual maintenance of hardware and software - $100,000
Ongoing annual cost of deployment (upgrades, new programs) - $500,000
Ongoing annual backup costs - $400,000
Ongoing annual spend on storage - $600,000
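A quick back-of-the-envelope calculation on these figures, sketched in Python (the numbers are just the sample values above, not actual client figures):

    initial = {
        "hardware": 750000,
        "software (ETL, RDBMS, OS)": 300000,
        "deployment": 1200000,
        "backup and recovery": 300000,
    }
    annual = {
        "hardware/software maintenance": 100000,
        "deployment (upgrades, new programs)": 500000,
        "backup": 400000,
        "storage": 600000,
    }

    years = 6  # the warehouse has been in production for about 6 years
    initial_total = sum(initial.values())       # $2,550,000
    annual_total = sum(annual.values())         # $1,600,000 per year
    tco = initial_total + years * annual_total  # $12,150,000 over 6 years
    print("Initial: $%s  Annual: $%s  6-year TCO: $%s"
          % (format(initial_total, ","), format(annual_total, ","), format(tco, ",")))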

Looking at the figures above, the ongoing costs alone come to $1.6 million a year, on top of an initial investment of about $2.55 million, for a six-year total north of $12 million. That is a lot of money spent on a solution that cannot sustain performance and meet deadlines on data availability.

Given this situation, when you start looking at reducing costs and increasing data warehouse availability, you start assessing which portions of the data warehouse are becoming expensive to maintain and how to mitigate the performance issues.

On closer examination it is discovered that a significant portion of the data in the data warehouse is not used at all and could be removed completely, including its definitions and metadata. But doing so has an impact on the data movement and loading processes, and you would need to conduct another assessment of data usage to ensure that removing the data does not cause downstream issues. Given all of these complexities, it is clear that, as an initial strategy, archiving the data and removing the metadata definitions for unused data will relieve the immediate pain. The overall TCO question still lingers, and it brings with it an associated problem: how to provide user access to the legacy data and its metadata in an archived state while keeping the overall TCO manageable.

Data lifecycle management within a data warehouse is a topic that will need attention and focus. It becomes a significant exercise considering the impact it has on the overall TCO of the data warehouse. TCO management for the data warehouse will be a topic for discussion another day.


Posted August 7, 2007 12:32 PM
Permalink | No Comments |