Blog: Krish Krishnan

Krish Krishnan

"If we knew what it was we were doing, it would not be called research, would it?" - Albert Einstein.

Hello, and welcome to my blog.

I would like to use this blog for constructive communication and exchanges of ideas in the business intelligence community on topics from data warehousing to SOA to governance, and everything under the umbrella of these subjects.

To maximize this blog's value, it must be an interactive venue. This means your input is vital to the blog's success. All that I ask from this audience is to treat everybody in this blog community and the blog itself with respect.

So let's start blogging and share our ideas, opinions, perspectives and keep the creative juices flowing!

About the author

Krish is a recognized expert worldwide in the strategy, architecture and implementation of high performance data warehousing solutions. He is a visionary data warehouse thought leader and an independent analyst, writing and speaking at industry leading conferences, user groups and trade publications. He has authored two eBooks, more than 75 articles, viewpoints and case studies on business intelligence, data warehousing, and data warehouse appliances and architectures. In his 19 plus years of professional experience, he has been solving complex architecture problems spanning all aspects of data warehousing and business intelligence for Fortune 1000 clients. He has designed and tuned some of the world’s largest data warehouses.

The Vice President of Strategy at Chicago Business Intelligence Group, Krish teaches regularly at TDWI, DAMA, IRM UK and other conferences, and is helping drive and mature the data warehouse appliance market. Krish also serves as Associate Vice President of Programs for DAMA Chicago and is Ethics and Governance Advisor to DAMA International.

Editor's Note: More articles and resources are available in Krish's BeyeNETWORK Expert Channel. Be sure to visit today!

We all know the NY Giants won the recently concluded 2012 Super Bowl. In the preceding weeks, sentiment about Eli Manning had been building steadily on Twitter, and in the end we all know the result.

If you have read James Surowiecki's book The Wisdom of Crowds, you will know the famous example of the power of the crowd demonstrated by Sir Francis Galton. As the story goes, in 1906 he was visiting a livestock fair in England, where he stumbled upon an intriguing contest. An ox was put on display, and the villagers were invited to guess the animal's weight after it was slaughtered and dressed, paying 6 pence to participate. Nearly 800 people participated, but not one person hit the exact mark: 1,198 pounds. Galton collected the answers and computed the statistical mean of these guesses from independent people in the crowd. Astonishingly, the mean of those 800 guesses was 1,197 pounds, accurate to within a fraction of a percent. This marks the first in a series of experiments conducted by scientists to demonstrate the collective intelligence of the crowd.
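To make the averaging idea concrete, here is a minimal Python sketch. Galton's actual guesses are not reproduced in this post, so the guesses below are simulated: the Gaussian noise, the spread of 75 pounds and the fixed seed are all assumptions chosen for illustration. Only the figures 1,198 and 1,197 come from the story above.

```python
import random
import statistics

TRUE_WEIGHT = 1198      # pounds, the dressed weight quoted in the story
REPORTED_MEAN = 1197    # pounds, the crowd mean quoted in the story


def simulate_guesses(n, true_value, spread, seed=0):
    """Draw n independent guesses scattered around true_value.

    Assumption: individual errors are unbiased Gaussian noise.
    Real crowds are messier than this.
    """
    rng = random.Random(seed)
    return [rng.gauss(true_value, spread) for _ in range(n)]


# Relative error of the crowd mean reported in the story, in percent.
reported_error_pct = abs(REPORTED_MEAN - TRUE_WEIGHT) / TRUE_WEIGHT * 100

# Simulated crowd of 800 guessers (hypothetical data).
guesses = simulate_guesses(n=800, true_value=TRUE_WEIGHT, spread=75)
crowd_estimate = statistics.mean(guesses)
crowd_error = abs(crowd_estimate - TRUE_WEIGHT)

# How far off a single guesser is on average.
typical_individual_error = statistics.mean(
    [abs(g - TRUE_WEIGHT) for g in guesses]
)

print(f"Reported crowd error: {reported_error_pct:.3f}%")
print(f"Simulated crowd error: {crowd_error:.1f} lb")
print(f"Typical individual error: {typical_individual_error:.1f} lb")
```

The point of the sketch is that independent errors tend to cancel when averaged, so the crowd's estimate lands much closer to the truth than a typical individual does. It only holds when the guesses are independent and not biased in the same direction.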

What this suggests is that when a set of smart people works on a problem, any problem, the chances of finding a solution are much higher than when a single person attempts the same. Today, contests of this type are run by companies such as Kaggle, 99Designs, Innocentive, CrowdAnalytix and many others, where statisticians and analytics experts compete to solve such problems.

What is the use of these contests and business models? There are several benefits:

  • A crowd can often solve the problem better, and faster
  • The open innovation platform gives you access to more experts than any single consulting engagement can provide
  • Costs can be better managed in an open contest where the solution has a fixed price and timeline

And the list goes on. We will see what challenges arise in this subject in tomorrow's blog.

The topic is deep and wide. Next week at TDWI Las Vegas, I am hosting a night school session on this subject; feel free to attend.

Posted February 7, 2012 6:15 PM

By now all of you have heard Amazon's announcement of DynamoDB, its latest database, which reads like NoSQL + Cassandra + Voldemort + Riak and a lot of other tools thrown together. It is completely hosted in the cloud and scales on demand, with true elastic scalability similar to EC2. Throw a MapReduce interface on top of this and you have a Big Data database that can truly scale.

What sets DynamoDB apart in my simple tests over the past few hours is the simplicity that it brings to Big Data processing. While my tests are not complete yet, initial results are definitely encouraging. As I write this blog, I have also read DataStax's comparison of Cassandra and DynamoDB in "DataStax questions DynamoDB's performance." The comparison is a long post full of technical comparisons around operations per second, but it does not mention cost or how DataStax services are provisioned. If you look at cost, Amazon says the service starts at $1 per gigabyte per month. Data transfer is free for incoming data. It is also free for the first 10 terabytes per month and between AWS services (like Elastic MapReduce and S3). Once you surpass 10 terabytes, taking data out of the service costs $0.12 per gigabyte through 40 terabytes, with lower rates up to 350 terabytes. Throughput capacity is $0.01 per hour for every 10 units of write capacity and $0.01 per hour for every 50 units of read capacity.
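To see what those rates mean for a given workload, here is a rough Python sketch of a monthly cost estimate. It uses only the storage and throughput figures quoted above. The function name, the 720-hour month, the pro-rated billing of partial blocks of capacity units and the sample workload are all assumptions for illustration. Data transfer charges are left out.

```python
# Rates as quoted in this post (launch pricing, not current AWS pricing).
STORAGE_USD_PER_GB_MONTH = 1.00
WRITE_USD_PER_HOUR_PER_10_UNITS = 0.01
READ_USD_PER_HOUR_PER_50_UNITS = 0.01


def estimate_monthly_cost(storage_gb, write_units, read_units, hours=720):
    """Estimate a monthly bill in USD for storage plus provisioned throughput.

    Assumptions: a 720-hour month, capacity billed pro rata rather than
    rounded up to whole blocks, and no data transfer charges.
    """
    storage = storage_gb * STORAGE_USD_PER_GB_MONTH
    writes = (write_units / 10) * WRITE_USD_PER_HOUR_PER_10_UNITS * hours
    reads = (read_units / 50) * READ_USD_PER_HOUR_PER_50_UNITS * hours
    parts = {
        "storage": round(storage, 2),
        "writes": round(writes, 2),
        "reads": round(reads, 2),
    }
    parts["total"] = round(sum(parts.values()), 2)
    return parts


# Hypothetical workload: 100 GB stored, 100 write units, 500 read units.
cost = estimate_monthly_cost(storage_gb=100, write_units=100, read_units=500)
print(cost)
```

For this hypothetical workload the estimate comes to $100 for storage plus $72 each for writes and reads, or $244 for the month. Lowering `hours` or the capacity units for part of the month is a crude way to model dialing capacity down.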

Given how several internet-based service companies have built their business models and found success, they will not hesitate to adopt the DynamoDB platform. With the ability to dial scalability up and down, you can really control costs, which even at steady usage will be much lower than on-site provisioning for these companies. DynamoDB has beta clients such as Elsevier, Formspring and SmugMug, which are definitely encouraging names.

If your organization were to choose a cloud-based services provider for Big Data, Amazon sounds like a logical choice on several fronts. But is your Big Data initiative deployable over the internet? And do you have the staffing to execute the program even if you host the data in the cloud? While you digest more content on DynamoDB beyond this blog, I will go back to running more experiments and will share more information in the next few days on scalability tests and the consistency of the database.

For a fair comparison at the database level, there are also several other NoSQL databases to compare DynamoDB against.

Watch for further information on specifics.

Posted January 19, 2012 9:38 PM

Recently a tweet caught several people's attention: "Eventually, Hadoop will swallow the EDW". Let us be very clear: the EDW will be needed now and in the future. The premise of an EDW is to process and store data for consumption across the enterprise for analytical and reporting purposes. Hadoop is a platform for managing the processing of Big Data. It is not a relational data store, nor is it engineered to replace the EDW. Several people share this misconception, but Hadoop and the EDW are distinct platforms that serve different purposes, and they will be integrated via strong metadata relationships.

It is true that Hadoop is getting several upgrades and new distributors, but this does not mean you can move all your EDW data into that platform. Structured data is best processed on RDBMS platforms.

You can argue that one needs a hammer to drive a nail into the wall, but the type of hammer, the type of nail and the type of wall all matter.

There are several articles on the internet, including presentations from the Hadoop community, on why the EDW is still needed. I urge you to do some research and understand them. Plan on attending TDWI Las Vegas or Chicago this year to learn more about this, or plan to attend Enterprise Data World 2012 in Atlanta. We have several discussions and sessions on this subject.

Bottom line: the EDW is here to stay and is not getting retired soon.

Posted January 11, 2012 9:59 PM

At a recent event where I gave a keynote, an audience member asked why Big Data means processing with patterns. Let us take a step back and analyze this thought. Patterns have always been the way we have learned. Whether it is languages through symbols or patterns in music, the human mind can absorb those patterns and reproduce them. This concept extends to computing too, where we reduce different types of data into binary symbols that are interpreted by the system.

Patterns are what we formed into the thoughts and behaviors that manifested as Big Data, and it is those very same patterns that need to be disambiguated with context. Coming full circle, patterns play an important role in every aspect of data processing.

Pattern processing is intricate and definitely complex, but there are robust techniques to accomplish it. With the advent of parallel processing techniques for large-scale data, pattern-based processing has become more scalable and flexible.

While the subject is not new, thinking about processing complex data from this perspective is one approach to tackling the problem of Big Data processing.

Posted January 4, 2012 9:39 PM

The three big things to tame in unstructured data (Big Data) are volume, velocity and complexity. While we can see infrastructure growing to handle the volume and velocity equations, the third and toughest task is taming complexity.

Complexity comes in a variety of shapes and sizes within the unstructured world. The reason is that all things textual, audio, video and more are based on human reasoning and thinking. The fundamental concept behind human reasoning is that every piece relates to a context. For example, you go to a nice restaurant and order food. More than the food, you relate the restaurant to an occasion, the people you were with and the date on which you went there. If you then write about the food experience, your document will contain more than just the food. If we were to process this as data without the relevant context, it would be pure noise with hidden layers of complexity, due to the different patterns of thought that have gone into the document.

If we take a look at everything we do, without context we are lost. Hence a robust set of contextualized rules is needed to process data in the unstructured world. Textual ETL is one such rules engine that can solve the complexity equation. You can also do the same in Java and MapReduce, though it is very laborious.
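As a toy illustration of what contextualized rules can look like, here is a minimal Python sketch that tags a restaurant review with the contexts mentioned above (food, occasion, people, date). This is not how Textual ETL works internally. The rule set, the `tag_contexts` function and the sample review are all hypothetical, and a real rules engine would need far richer vocabularies and disambiguation.

```python
import re

# Hypothetical context rules: each context label maps to a pattern
# that recognizes terms belonging to that context.
CONTEXT_RULES = {
    "food": re.compile(r"\b(pasta|steak|dessert|wine)\b", re.IGNORECASE),
    "occasion": re.compile(r"\b(anniversary|birthday|graduation)\b", re.IGNORECASE),
    "people": re.compile(r"\b(wife|husband|friends|family|parents)\b", re.IGNORECASE),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def tag_contexts(text):
    """Return a dict of context label -> matched terms found in text."""
    tagged = {}
    for label, pattern in CONTEXT_RULES.items():
        matches = [m.group(0).lower() for m in pattern.finditer(text)]
        if matches:
            tagged[label] = matches
    return tagged


# Made-up sample document, for illustration only.
review = (
    "We went for our anniversary on 12/30/2011 with friends. "
    "The steak was great and the dessert even better."
)
contexts = tag_contexts(review)
for label, terms in contexts.items():
    print(f"{label}: {terms}")
```

Even this small example shows that a "food review" carries occasion, people and date alongside the food itself, which is the context a processing engine has to capture before the text becomes usable data.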



Posted December 30, 2011 9:16 AM