Blog: Barry Devlin http://www.b-eye-network.com/blogs/devlin/ As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein. Copyright 2013 Mon, 13 May 2013 08:41:48 -0700
The Road to Operational Analytics
First, let's define the term. My definition, from two recent white papers (April 2012 and May 2013), is: "Operational analytics is the process of developing optimal or realistic recommendations for real-time, operational decisions based on insights derived through the application of statistical models and analysis against existing and/or simulated future data, and applying these recommendations in real-time interactions." While the language is clearly analytical in tone, the bottom line of the desired business impact is much the same as in the definitions we've seen in the past for the ODS and operational BI: real-time or near real-time decisions embedded into the operational processes of the business.

Anybody who heard me speak in the 1990s or early 2000s will know that I was not a big fan of the ODS. So, what has changed? In short, two things: (1) businesses are more advanced in their BI programs and (2) technology has advanced to the stage where it can support the need for real-time operational-informational integration.

[Image: BI Evolution.jpg]
The evolution of BI can be traced on two fronts shown in the accompanying figure: the behaviors driving business users and the responses required of IT providers. As this evolution proceeds apace, business demands increasing flexibility in what can be done with the data and increasing timeliness in its provision. In Phase I, largely fixed reports are generated perhaps on a weekly schedule from data that IT deem appropriate and furnish in advance. Such reporting is entirely backward looking, describing selected aspects of business performance. Today, few businesses remain in this phase because of its now limited return on investment; most have already moved to Phase II. 

This second phase is characterized by an increasing awareness of the breadth of information available collectively across the wider business and an emerging ability to use information to predict future outcomes. In this phase, IT is highly focused on integrating data from the multiple sources of operational data throughout the company. This is the traditional BI environment, supported by a data warehouse infrastructure. The majority of businesses today are at Phase II in their journey and leaders are beginning to make the transition to Phase III. 

Phase III marks a major step change in decision making support for most organizations. On the business side, the need moves from largely ad hoc, reactive and management driven to a process view, allowing the outcome of predictive analysis to be applied directly, and often in real time, to the business operations. This is the essence of the behavior called operational analytics. In this stage, IT must become highly adaptive in order to anticipate emerging business needs for information. Such a change requires a shift in thinking from separate operational and informational systems to a combined operational-informational environment. This is where the action is today. This is where return on investment for leading businesses is now to be found. And, simply put, this is why operational analytics is making headlines today--many businesses are ready for it; the leaders are already implementing it. 

This leads us to the second contention: that technology has advanced sufficiently to support the need. There are many ways that recent advances in technology can be combined to do this. In the white papers referenced above, one shows how two complementary technologies, IBM DB2 for z/OS and Netezza, can be integrated to meet the requirements. The other shows how the introduction of columnar technology and other performance improvements in DB2 Advanced Enterprise Edition can meet these same needs. Other vendors are improving their offerings in similar directions. 

So, to paraphrase the "Six Million Dollar Man": we have the business waiting. We have the technology. We have the capability to build this... But, wait. There is one more hurdle. Most existing IT architectures strictly separate operational and informational systems based on a data warehouse approach dating back to the mid-1980s. This split is a serious impediment to building this new environment that demands a tight feedback loop between the two environments. Analyses in the informational environment must be transferred instantly into the operational environment to take immediate effect. Outcomes of actions in the operational systems must be copied directly to the informational systems to tune the models there. These requirements are difficult to satisfy in the current architecture; they demand a new approach. This is beginning to emerge, but is by no means widespread yet. I'll be discussing this topic further over the coming weeks.
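
To make the feedback loop a little more tangible, here is a minimal Python sketch. It is purely illustrative and assumes a great deal: the class names, the "offer" example and the acceptance-rate "model" are all hypothetical stand-ins of my own, not taken from any product or from the white papers. The point is only the shape of the loop: outcomes flow from the operational side to the informational side, and the refreshed recommendation takes effect on the very next interaction.

```python
from collections import defaultdict


class InformationalSide:
    """Hypothetical analytic store: keeps outcomes and derives a trivial model."""

    def __init__(self):
        # offer -> [times accepted, times shown]
        self.outcomes = defaultdict(lambda: [0, 0])

    def record(self, offer, accepted):
        stats = self.outcomes[offer]
        stats[0] += int(accepted)
        stats[1] += 1

    def best_offer(self, default):
        """Return the offer with the highest observed acceptance rate."""
        if not self.outcomes:
            return default
        return max(
            self.outcomes,
            key=lambda o: self.outcomes[o][0] / self.outcomes[o][1],
        )


class OperationalSide:
    """Hypothetical transaction system that applies the latest recommendation."""

    def __init__(self, informational, default_offer):
        self.informational = informational
        self.current_offer = default_offer

    def handle_interaction(self, customer_accepts):
        offer = self.current_offer
        accepted = customer_accepts(offer)
        # The outcome is copied directly to the informational side ...
        self.informational.record(offer, accepted)
        # ... and the re-tuned recommendation is applied immediately.
        self.current_offer = self.informational.best_offer(default=offer)
        return offer, accepted


# Seed the informational side with some (made-up) history.
info = InformationalSide()
for offer, accepted in [("upgrade", True), ("upgrade", False),
                        ("discount", True), ("discount", True)]:
    info.record(offer, accepted)

ops = OperationalSide(info, default_offer="upgrade")
# This simulated customer only ever accepts the discount.
first = ops.handle_interaction(lambda offer: offer == "discount")
second = ops.handle_interaction(lambda offer: offer == "discount")
```

In this toy run the first interaction still uses the stale default offer; its rejection is fed back, and the second interaction already uses the better-performing offer. In a strictly separated architecture, that second step would wait for a batch load.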
http://www.b-eye-network.com/blogs/devlin/archives/2013/05/the_road_to_ope.php Analytics Mon, 13 May 2013 08:41:48 -0700
Big Data, All Data, PureData, BLU Data
big data is simply all data".  IBM's April 3 announcement served admirably to reinforce that point of view. Was it a big data announcement, a DB2 announcement, or a hardware announcement?  The short answer is "yes", to all the above and more.

Weaving together a number of threads, Big Blue created a credible storyline that can be summarized in three key thoughts: larger, faster and simpler.  As many of you may know, I worked for IBM until early 2008, so my views on this announcement are informed by my knowledge of how the company works or, perhaps, used to work.  Last Wednesday, I came away impressed.  Here were a number of diverse, individual product developments that conform to a single theme across different lines and businesses.

Take BLU acceleration as a case in point.  The headline, of course, is that DB2 LUW (on Linux, Unix and Windows) 10.5 introduces a hybrid architecture.  Data can be stored in columnar tables with extensive compression, making use of in-memory storage and taking further advantage of parallel and vector processing techniques available on modern processors.  The result is an up to 25% improvement in analytic and reporting performance (and considerably more in specific queries) and up to 90% data compression.  In addition, the elimination of indexes and aggregates considerably reduces the need for manual tuning and maintenance of the database.  This is a direction that has long been shown by small, newer vendors such as ParAccel and Vertica (now part of HP), so it is hardly a surprise.  IBM can claim a technically superior implementation, but more impressive is the successful retrofitting into the existing product base.  Equally impressive are the re-use of the technology in the separate Informix TimeSeries code base to enhance analytics and reporting there too, and the promise that it will be extended to other data workloads in the future.  It seems the product development organization is really pulling together across different product lines.  That's no mean feat within IBM.
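
For readers who want to see why columnar storage and compression go hand in hand, here is a minimal Python sketch. It is emphatically not how BLU is implemented; the sample table, the column names and the choice of run-length encoding are my own illustrative assumptions. It shows only two general properties of a column layout: repeated values sit next to each other and so compress well, and an aggregate over one column never needs to read the others.

```python
from itertools import groupby

# Hypothetical sales rows (illustrative data only).
rows = [
    ("IE", "2013-04-01", 120),
    ("IE", "2013-04-01", 80),
    ("IE", "2013-04-02", 200),
    ("UK", "2013-04-02", 50),
    ("UK", "2013-04-02", 75),
]


def to_columns(rows, names):
    """Pivot row-oriented tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


columns = to_columns(rows, ["country", "day", "amount"])

# Five country values collapse into two runs.
country_rle = rle_encode(columns["country"])

# The aggregate touches only the "amount" column.
total_amount = sum(columns["amount"])
```

Real column stores use far more sophisticated encodings and operate on compressed data directly, but the underlying intuition is the same.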

Another hint at the strength of the development team was the quiet announcement of a technology preview of JSON support in DB2 at the same time as the availability of 10.5.  JSON is one of the darlings of the NoSQL movement that provides significant agility to support unpredictable and changing data needs.  See my May 2012 white paper "Business Intelligence--NoSQL... No Problem" for more details.  As in its support for other NoSQL technologies, such as XML and RDF graph databases, IBM has chosen to incorporate support for JSON into DB2.  There are pros and cons to this approach.  Performance and scalability may not match a pure JSON database, but the ability to take advantage of the ACID and RAS characteristics of an existing, full-feature database like DB2 makes it a good choice where business continuity is a strong requirement.  IBM clearly recognizes that the world of data is no longer all SQL, but that for certain types of non-relational data, the difference is sufficiently small that they can be handled as an adjunct to the relational model through a "subservient" engine, allowing easier joining of NoSQL and SQL data types.  This is a vital consideration for machine-generated data, one of three information domains I've defined in a recent white paper, "The Big Data Zoo--Taming the Beasts".

The announcement didn't ignore the little yellow elephant, either.  The PureData System family has been expanded with the PureData System for Hadoop, with built-in analytics acceleration and archiving, and provides significantly simpler and faster deployment of projects requiring the MapReduce environment.  And InfoSphere BigInsights 2.1 offers the Big SQL interface to Hadoop, an alternative file system, GPFS-FPO, with enhanced security and no single point of failure, as well as high availability.

While the announcement clearly targeted Big Data--at the Speed of Business, the underlying message, as seen above, is much broader.  This view is of an emerging information ecosystem that must be considered from a fully holistic viewpoint.  A key role, and perhaps even the primary role, for BigInsights / Hadoop is in exploratory analytics, where innovative, what-if thinking is given free rein.  But the useful insights gained here must eventually be transferred to production (and back) in a reliable, secure, managed environment--typically a relational database.  This environment must also operate at speed, with large data volumes and with ease of management and use.  These are characteristics that are clearly emphasized in this announcement.  They are also key components of the integrated information platform I described in the Data Zoo white paper already mentioned.  Missing still are some of the integration-oriented aspects such as the comprehensive, cross-platform metadata management, data integration and virtualization required to tie it all together.  IBM has more to do to deliver on the full breadth of this vision, but this announcement is a big step in the right direction.

http://www.b-eye-network.com/blogs/devlin/archives/2013/04/big_data_all_da.php Big data Mon, 08 Apr 2013 09:14:22 -0700
Big Data - Please, Drive a Stake through its Heart!
[Image: stake-in-the-heart.jpg]
Wikibon's lovingly detailed Big Data Vendor Revenue and Market Forecast 2012-2017 provides an excellent list and positioning of players in the "Big Data" market.  Readers may be surprised to see that IBM tops the list as the biggest vendor in the market in 2012 with nearly 12% market share ($1,352 million), more than twice that of the second-placed HP.  Indeed, the names of the top ten in the list--IBM, HP, Teradata, Dell, Oracle, SAP, EMC, Cisco, Microsoft and Accenture--may also raise an eyebrow, given that all of them come from the "old school" of computer companies.  The top contender among the "new school Big Data" vendors is Splunk with revenue of $186 million.

Wikibon openly describes their methodology for calculating these figures, and one could describe it as more art than science, given the reluctance of vendors to share such data.  Furthermore, the authors have also revised their original 2011 market size estimate up from $5.1 to $7.2 billion.  So, one might dispute the figures and placements at length, but it's probably fair to say that this report is among the more useful publicly available data on this market.

Of more concern to me is the big, hairy, ugly question that has bothered me since "Big Data" attained celebrity status: what on earth is it?  Furthermore, how can one evaluate the overall figures with Wikibon's two-part definition: (1) "those data sets whose size, type and speed of creation make them impractical to process and analyze with traditional database technologies and related tools in a cost- or time-effective way" and (2) "requires practitioners to embrace an exploratory and experimental mindset regarding data and analytics... Projects whose processes are informed by this mindset meet Wikibon's definition of Big Data even in cases where some of the tools and technology involved may not".  Part 1 is the fairly widespread definition of "Big Data", and one that is, in my view, so vague as to be meaningless.  Part 2 is certainly creative but poses some interesting questions about how one might reliably access practitioners' mindsets and assess them as exploratory and experimental!  The bottom line of this definition is that if somebody says a dataset or project is "Big Data" then it is so.  I long ago came to the conclusion that, unless somebody can come up with a watertight definition, we should stop talking about "Big Data" and fooling ourselves that we can measure it.  I've said this before, but the term won't go away.  Hence, the reference to killing vampires in the title...

As an alternative, I'd like to point again to a white paper I wrote last year, The Big Data Zoo - Taming the Beasts, where I categorized information/data into three domains: (1) process-mediated data, (2) human-sourced information and (3) machine-generated data, as shown in the accompanying figure.  I suggest that this is a much more clearly defined way of breaking down the universe of information/data and of differentiating between data uses and projects that are part of what you might call classical data processing and those that have emerged or are emerging in the fields that first sprouted the term "Big Data".  These information domains are largely self-describing, relatively well-bounded and group together data that has similar characteristics in terms of structure and volatility.  Size actually has very little to do with it.

[Image: Three information domains.jpg]
Returning to Wikibon's results and their companion piece, Big Data Database Revenue and Market Forecast 2012-2017, in database software, IBM again tops the list with $215 million in SQL-based revenue and is followed by 5 other SQL-based database vendors (SAP, HP, Teradata, EMC and Microsoft) until we reach MarkLogic as the top NoSQL (XML, in fact, so hardly part of the post-2008 NoSQL wave except by self-declaration) vendor with revenue of $43 million in 2012.  Wikibon's "bottom line: the top five vendors have about 2/3rds of the database revenue, all from SQL-only product lines. Wikibon believes that NoSQL vendors will challenge these vendors hard over the next five years. However SQL will continue to retain over half of revenues for the foreseeable future."  I personally don't know on what Wikibon based the growth projections, so I cannot comment, but I do have questions about the 2012 figures themselves, both including and beyond the definition of "Big Data".  Hadoop is not mentioned, and although I agree with its exclusion as a database, many vendors are incorporating it into their database environments by a variety of means.  Is this included or excluded and why?  HP and EMC grab third and fifth positions, based on their Vertica and Greenplum acquisitions respectively.  Judging by the fact they overshadow significant database players like Microsoft and Oracle, it would seem that most or all of their database revenue is classified as "big data".  Is this reasonable?  How did the survey apportion IBM's, Teradata's and Microsoft's database revenue between "big data" and the rest?  Is all of SAP HANA revenue called "big data" simply because it's an in-memory appliance... or how was it split?  And the list goes on...

I'm sure IBM is very happy to be placed top in both listings; I assume the Netezza figures loom large in the database placement.  SAP will be pleased to take second place, based largely on HANA.  HP Vertica can claim top pure play "big data" database.  And Teradata can take pride in its placement, earned I expect, in large part through its Aster acquisition.  And so on...  But the more interesting point is that these are all SQL databases.  The highest-placed NoSQL (in the post-2008 wave sense) is 10gen, with attributed revenue of less than 10% of that attributed to IBM in the "big data" category.  All this will drive marketing machines, but with more heat than light.  Given the underlying dysfunction in the definitions, how will it help businesses who are trying to figure out what to do about the "big data" truck allegedly bearing down on them?

My suggestions are straightforward.  In terms of the three data domains outlined above, be aware that process-mediated data - the well-defined, -structured and -managed data residing in current operational and informational systems - is growing fast and can drive significant new value through operational analytic approaches.  Human-sourced information - currently mostly about social media - and machine-generated data are emerging and rapidly growing sources of knowledge about people's behaviors and intentions.  They enable new, extensive predictive analytics (the successor to data mining) that initially demands flexibility in exploration, such as that offered by Hadoop.  However, they will demand proper integration in the formal data management environment in the medium to long term.  This requires a well-defined and thoroughly thought-out infrastructure and platform strategy that embraces all types of data and processes.  Of all the vendors mentioned above, only IBM and Teradata are attempting to take such a holistic view, in my opinion.  

As for NoSQL databases (however you define them - aren't IMS and IDMS also NoSQL by definition?), I believe the post-2008 NoSQL databases have important roles in the emerging environment.  They certainly drive substantial and long-absent innovation in the relational database market.  In particular, they offer a level of flexibility in database design that is key in emerging markets and applications.  And they have technical characteristics that are very useful in a variety of niches in the market, solving problems with which relational databases struggle.

Interesting data times.  But, let's just quietly drop the "big"...
http://www.b-eye-network.com/blogs/devlin/archives/2013/03/big_data_-_plea_1.php Big data Mon, 04 Mar 2013 08:21:14 -0700
Big Analytics rather than Big Data
[Image: Baby elephant.JPG]
The past year has been dominated by Big Data.  What it might mean and the way you might look at it.  The stories have often revolved around Hadoop and his herd of oddly-named chums.  Vendors and analysts alike have run away and joined this ever-growing and rapidly moving circus.  And yet, as we saw in our own EMA and 9sight Big Data Survey, businesses are on a somewhat different tour.  Of course, they are walking with the elephants, but many so-called Big Data projects have more to do with more traditional data types, i.e. relationally structured, but bigger or requiring faster access.  And in these instances, the need is for Big Analytics, rather than Big Data.  The value comes from what you do with it, not how big it happens to be.

Which brings us to Big Blue.  I've been reading IBM's PureSystems announcement today.  The press release headline trumpets Big Data (as well as Cloud), but the focus from a data aspect is on the deep analysis of highly structured, relational information with a substantial upgrade of the PureData for Analytics System, based on Netezza technology, first announced less than four months ago.  The emphasis on analytics, relational data and the evolving technology is worth exploring.

Back in September 2010, when IBM announced the acquisition of Netezza, there was much speculation about how the Netezza products would be positioned within IBM's data management and data warehousing portfolios that included DB2 (in a number of varieties), TM1 and Informix.  Would the Netezza technology be merged into DB2?  Would it continue as an independent product?  Would it, perhaps, die?  I opined that Netezza, with its hardware-based acceleration, was a good match for IBM who understood the benefits of microcode and dedicated hardware components for specific tasks, such as the field programmable gate array (FPGA), used to minimize the bottleneck between disk and memory.  It seems I was right in that; not only has Netezza survived as an independent platform, as the basis for the PureData System for Analytics, but it is also being integrated behind DB2 for z/OS in the IBM DB2 Analytics Accelerator.

Today's announcement of the PureData System for Analytics N2001 is, at heart, a performance and efficiency upgrade to the original N1001 product, offering a 3x performance improvement and 50% greater capacity for the same power consumption.  The improvements come from a move to smaller, higher capacity and faster disk drives and faster FPGAs.  With a fully loaded system capable of handling a petabyte or more of user data (depending on compression ratio achieved), we are clearly talking big data.  The technology is purely relational.  And a customer example from the State University of New York, Buffalo quotes a reduction in run time for complex analytics on medical records from 27 hours to 12 minutes (the prior platform is not named).  So, this system, like competing Analytic Appliances from other vendors, is fast.  Perhaps we should be using images of cheetahs?

[The photo is from my visit to Addo Game Reserve in South Africa last week.  For concerned animal lovers, she did eventually manage to climb out...]
http://www.b-eye-network.com/blogs/devlin/archives/2013/02/big_analytics_r_1.php Big data Tue, 05 Feb 2013 10:30:36 -0700
The Use and Abuse of Big Data
[Image: MagnifyingGlassDataSpider.jpg]
As we begin a new year, we are promised a move from a focus on the meaning and technology of big data to the useful and worthwhile business applications it may offer.  A timely move indeed.  Hopefully, we'll begin to hear less about analyzing Twitter streams to optimize advertising spend and more about applications with the potential to improve people's lives or the environment.  And even more hopefully, people may begin to consider the risks they run when revealing or gathering personal data on our deeply interconnected Web.

With all of the synchronicity that is the Internet, I came across two articles from the New York Times published last week. The first, by Peter Jaret on January 14, describes how patient records, transcribed and digitized from scrawled (why do they write so poorly?) doctors' notes, anonymized and stored on the Web, can be statistically mined to discover previously unknown side-effects of and interactions between prescribed drugs.  Clearly useful and valuable work.  The second article, three days later by Gina Kolata, revealed how easily a genetics researcher was able to identify five individuals and their extended families by combining publicly-available information from the anonymized 1000 Genomes Project database, a commercial genealogy Web site and Google.  Kolata quotes Amy L. McGuire, a lawyer and ethicist at Baylor College of Medicine in Houston:  "To have the illusion you can fully protect privacy or make data anonymous is no longer a sustainable position".  The underlying genetic data is used in medical research to good effect, of course, but what are the possible consequences for those individuals thus identified as insurance companies, governments or other interested parties make potentially negative assessments based on their once private genomes?

Such occurrences--and there are many of them--should be deeply disturbing to those of us involved in the business of big data and analytics.  Here are doctors, scientists and lawyers--with training in logic, ethics and law--who see the power of analytics to improve the human condition, but who seem to gloss over the wider privacy and security implications of making personal information widely available on the Web.  After all, the limits of data anonymization on the web were being discussed openly as long ago as May 2011 by Pete Warden on the O'Reilly Radar blog.  And as far back as 1997, Prof. Latanya Sweeney, now Director of the Data Privacy Lab at Harvard, could show that the combination of gender, ZIP code and birthdate was unique for 87% of the U.S. population.  
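
The Sweeney result is easy to demonstrate in miniature. The following Python sketch uses a handful of made-up records (the data and the function name are hypothetical, chosen only for illustration) to measure what fraction of a supposedly anonymized dataset is pinned down uniquely by a combination of quasi-identifiers. It says nothing about the real 87% figure; it merely shows the mechanism by which innocuous fields become identifying when combined.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"gender": "F", "zip": "02138", "birthdate": "1961-07-14"},
    {"gender": "F", "zip": "02138", "birthdate": "1961-07-14"},
    {"gender": "M", "zip": "02138", "birthdate": "1975-01-30"},
    {"gender": "F", "zip": "02139", "birthdate": "1988-11-02"},
    {"gender": "M", "zip": "02139", "birthdate": "1975-01-30"},
]


def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` occurs exactly once."""
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# On its own, gender identifies nobody in this sample ...
by_gender = unique_fraction(records, ["gender"])

# ... but the three fields together single out most of the records.
by_all = unique_fraction(records, ["gender", "zip", "birthdate"])
```

Anyone holding a second dataset with the same three fields plus a name, such as a voter roll, could re-attach identities to every record counted as unique here.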

Eben Moglen, professor of law and legal history at Columbia University and Chairman of the Software Freedom Law Center, warned at re:publica Berlin in May 2012 that "media that spies on and data-mines the public is destroying freedom of thought and only this generation, the last to grow up remembering the 'old way', is positioned to save this, humanity's most precious freedom".  With media and medicine, government and retail, telecommunications and finance all gathering hoards of information about us, each for their own allegedly good purpose, the reality is now that the abuse of big data (as opposed to its use) is not only possible, but proceeding apace, even in largely democratic, Western states.

So, given that big data anonymity is "no longer a sustainable position", it should be clear that the analytics possible on today's high-powered computers is a double-edged sword; it serves us poorly to focus only on one, single, razor-sharp edge.  As we evaluate and build useful and worthwhile business analytics applications in this coming year, let us step back, even occasionally, to contemplate whether the profits to be earned or the discoveries to be made are worth the price of human freedom.
http://www.b-eye-network.com/blogs/devlin/archives/2013/01/the_use_and_abu_1.php Big data Wed, 23 Jan 2013 09:06:30 -0700
Big Data and the End of Civilization as We Know It
[Image: Melting-ice-polar-bear Big Data.jpg]
As you may be aware, the world (or civilization, at least) is due to end in a couple of weeks, as the Mayan calendar counts to the last day of this "Sun".  For those of you living beneath a stone, the date / time is 21st December at sunset in the Yucatan... depending on whom you choose to believe.

Big data, conversely, has been heralded by some as the harbinger of a bright, shiny and new world where all things will be possible using the vast quantities of data that are becoming available on the Internet.  Many contend that the transformation has already begun.  We will discuss the more mundane truth of how business is using big data in a joint EMA / 9sight webinar "Big Data Comes of Age" on Thursday, 13th December, 11am PST / 2pm EST / 7pm GMT.

The truth is somewhere in between... as always.  And as year-end approaches, it might be a good time to ponder just where big data is leading us as people and as a society.

There's little doubt that big data--in all its meanings and incarnations--is effecting major changes in advertising and marketing.  Much of what we see in this area is about increasing the efficiency of targeting and conversion.  As Google tracks our searches, Facebook and Twitter our shared opinions, mobile Apps our movements and sellers our purchases, the message we hear is that businesses want to understand us and our needs more clearly, serve us better and ensure that we are increasingly delighted.  However, the reality in the vast majority of cases is that businesses are driven simply by the financial bottom line, on a quarterly or even monthly basis as BI reports are produced and earnings statements released.  Unfortunately, in my opinion, big data is most widely used as the next spin of the "sell more at higher profit" story, or to put it bluntly, driving consumption.

And yet, the other areas of application of big data offer insights about some of the biggest challenges to humanity, such as climate change, energy efficiency, health quality, economic and financial management, and more.  The increasing quantities of data being gathered or available for collection and analysis in all of these areas offer us the opportunity to make a real difference in the lives of humanity, and to avert the catastrophes of which most scientists and philosophers already warn.  That even one of the most data-driven of companies, PricewaterhouseCoopers, warns of impending global catastrophe due to the increased rate and scale of warming--as much as 6 °C--is surely a sign that the writing is on the wall.  To quote their report: "The only way to avoid the pessimistic scenarios will be radical transformations in the ways the global economy currently functions: rapid uptake of renewable energy, sharp falls in fossil fuel use or massive deployment of carbon capture and storage, removal of industrial emissions and halting deforestation... business-as-usual is not an option."

The PwC report does not, unfortunately, make the explicit link between ever increasing consumption of energy and raw materials on which the global economy currently functions and the seeming impossibility of reducing carbon emissions at the required rate to avert the worst possible scenarios.  But big data analysis across both sides of this simple equation could show how to tackle the problem. Big data is about bringing data from widely disparate areas together and discovering new possibilities.  How to consume less but improve living standards.  How to prevent the type of financial behavior that paralyzes international economies.  Of course, all this assumes the business and political will to do so.

Those of us who understand big data technology and promote its use must surely begin to advocate the more responsible and sustainable uses of this powerful technology.  A New Year resolution, perhaps...?

http://www.b-eye-network.com/blogs/devlin/archives/2012/12/big_data_and_th_1.php Big data Tue, 11 Dec 2012 05:12:15 -0700
Business Analytics - Close to your Heart
[Image: IDAA_heart1.jpg]
Well, perhaps not close to your heart, but certainly close to the heartbeat of your business.  This is a key message of an IBM Virtual Event debuting at 10:30 a.m. EST in the U.S. and 10:30 a.m. GMT / 11:30 a.m. CET in Europe on November 28, where I'll talk about modern mission-critical Business Analytics.

For many businesses, embedding operational analytics in the heart of their OLTP (online transaction processing) applications is a key initiative for 2013. The leaders, of course, have already begun.  The old operational data store (ODS) and operational BI were precursors as far back as the mid-90s, attempting to make faster decisions about operational matters.  These initiatives have had their success stories, but they have been limited by a number of factors, both analytical and operational.  The analytical issue has often been the lack of sufficient quantities of transaction and event data to mine effectively.  The operational issue was the difficulty of getting close enough to the near real-time responses required by business users and customers.

Both of these issues are being addressed with today's technologies.  The enormous growth of business on the Web in the past decade has meant that customer behavior can be analyzed through clickstreams within websites and linkages across different websites, call centers and more. Such information, analyzed in combination with transaction data, allows retailers to more effectively cross-sell, hotels to increase room occupancy and telcos to reduce churn.  But, for this blog, and the above event, the more interesting point relates to how to close the real-time gap.

Traditionally, business intelligence operates on data that has been extracted from the operational environment, with analytic outcomes applied back to that environment afterwards. In short, the data is brought to the analytics.  This approach introduces significant delays.  An obvious solution would be to bring the analytics to the data; however, prior technology did not easily allow that.  I discuss this in terms of the mainframe, System z, environment, but the principle applies elsewhere too.

It is an oft-forgotten fact that 70% of all data transactions in the banking, insurance, retail, telecommunications, utilities and government industries still occur on the System z platform, due to its performance, cost, reliability and security characteristics.  The inclusion of the Netezza-powered IBM DB2 Analytics Accelerator within the System z complex creates a system with a dual personality: the transactional performance of the original environment combined with the analytic performance of Netezza required for integrated operational analytics.  With the inclusion of SPSS Predictive Analytics on Linux and Cognos on the z/OS and Linux platforms, the need to move data out of the System z environment is largely eliminated.  More details are to be had in the Virtual Event where IBM's Dan Wardman and David Jeffries will fill in the technical details. See also my White Paper, "Integrating Analytics into the Operational Fabric of Your Business, A combined platform for optimizing analytics and operations".

Irrespective of platform, it is becoming increasingly clear that when it comes to operational decisions, they have to come from the heart rather than the head!


]]>
http://www.b-eye-network.com/blogs/devlin/archives/2012/11/business_analyt_1.php http://www.b-eye-network.com/blogs/devlin/archives/2012/11/business_analyt_1.php Analytics Tue, 27 Nov 2012 00:54:13 -0700
Big Data is Dead - Long Live All Data Big Data RIP tombstone.jpg2012 begins to wind down.  Yes, I know it's still only mid-November, but I find it hard to avoid thinking of year-end when the retail industry has been pushing Christmas for weeks already.  I've been preparing for my keynote at Big Data Deutschland in Frankfurt (20-21 Nov) next week, so it seemed appropriate to share some thinking on where big data is at now.  Also, I've been deeply involved in analyzing the results of the EMA / 9sight big data survey which has just been published.  My bottom line?  Big data is dead!

Of course, I don't mean that literally.  What I'm really trying to do is to get the attention of the marketing folks who have been using and abusing the term, particularly during 2012.  Two very clear results emerge from the big data survey when it comes to real customer projects carrying the moniker big data.  

First, the industry has been besotted by size.  Carefully avoiding now all vaguely salacious phrases, the fact is that size is so relative that calling data big or small is more about bragging or shaming than any measure of real use.  Our survey showed that 60% of respondents were managing less than 100TB of data in total in their organizations, while only 5% stretched beyond a petabyte.   Not all of this data was part of their big data projects; on average, only some 30% was included there.  This strongly suggests that so-called big data technology is being widely used for something other than processing excessively large data volumes.

Second, it's not all about exotic types of data either.  Yes, some 45% of the data sources fall under the category of human-sourced information, which includes social media sources.  But, just over 30% is process-mediated data -- transactional data gathered and created in traditional operational and informational applications.  For a more detailed explanation of these data domains, as I call them, please see my recent White Paper "The Big Data Zoo - Taming the Beasts, The need for an integrated platform for enterprise information".  So, big data projects are addressing a substantial proportion of the data we've known and loved for many years.

You can hear more of the survey results on the EMA / 9sight webinar on Thursday, 13 December, 11 a.m. PST / 2 p.m. EST.

What is actually becoming important as we look towards 2013 is what businesses are really doing with data at the moment that is different from what they've traditionally done.  I believe there are two distinct trends.  One is, of course, business analytics.  This is simply an evolution of traditional BI, with more of an emphasis on exploration (or mining) and less on reporting and dashboards.  The second is more interesting and, potentially, game changing.  This involves the re-integration of operational action taking and informational decision making in customer-facing applications that automatically modify their behavior in real-time in response to rapidly changing market or personal circumstances.

All this says to me that big data as a technological category is becoming an increasingly meaningless name.  Big data is essentially all data.  Is there any chance that the marketing folks can hear me?

]]>
http://www.b-eye-network.com/blogs/devlin/archives/2012/11/big_data_is_dea_1.php http://www.b-eye-network.com/blogs/devlin/archives/2012/11/big_data_is_dea_1.php Big data Tue, 13 Nov 2012 11:11:43 -0700
NoSQL, NewSQL... NonplussedSQL The Briefing Room this week with NuoDB's CEO, Barry Morris.  The product itself is extremely interesting, both in its concept and technology; and will formally launch in the next month or so after a long series of betas.  More about that a little later...

But first, I need to vent!  For some time now, I've been taking an interest in NoSQL because of its positioning in the Big Data space.  I've always had a real problem with the term - whether it means Not SQL or Not Only SQL - because defining anything by what it's not is logical nonsense.  Even the definitive nosql-database.org falls into the trap, listing 122+ examples from "Possibly the oldest NoSQL DB: Adabas" to the wonderfully-named StupidDB.  A simple glance at the list of categories used to classify the entries shows the issue: NoSQL is a catch-all for a potentially endless list of products and tools.  The mere fact that they don't use SQL as an access language is insufficient as a definition.

NewSQL Ecosystem.PNGBut, my irritation now extends to "NewSQL", a term I went a-Googling when I saw that NuoDB is sometimes put in this category.  This picture from Matthew Aslett of 451 Research's presentation was interesting if somewhat disappointing: another gathering of tools with a mixed and overlapping set of characteristics, most of which relate to their storage and underlying processing approaches, rather than anything new about SQL, which is, of course,  at heart a programming language.  So why invent the term NewSQL when the aim is to keep the same syntax?  The term totally misses the real innovation that's going on.

This innovation at a physical storage level has been happening for a number of years now.  Columnar storage on disk, from companies such as Vertica and ParAccel, was the first innovative concept to challenge traditional RDBMS approaches in the mid-2000s.  Not forgetting Sybase IQ from the mid-1990s, which was, of course, column-oriented, but didn't catch the market as the analytic database vendors did later.  With cheaper memory and 64-bit addressing, the move is underway towards using main memory as the physical storage medium and disk as a fallback.  SAP HANA champions this approach at the high end, while various BI tools, such as QlikView and MicroStrategy, hold the lower end.  And don't forget that the world's most unloved (by IT, at least) BI tool, Excel, has always been in-memory!
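
To make the row-versus-column distinction concrete, here is a minimal Python sketch. It is only an illustration of the storage idea, not how Vertica, ParAccel or Sybase IQ are implemented; the sales records and the to_columns helper are invented for the example.

```python
# Hypothetical sales records; field names and values are illustrative only.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records):
    """Pivot a row-oriented list of dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited just to total one field.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: only the "amount" column is read.
total_from_columns = sum(columns["amount"])
```

Both totals are the same, of course; the point is that an analytic query touching one field out of many reads far less data when each column is stored contiguously.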

The other aspect of innovation relates to parallel processing.  Massively parallel processing (MPP) relational databases have been around for many years in the scientific arena and in commercial data warehousing from Teradata (1980s) and IBM DB2 Parallel Edition (1990s).  These powerful, if proprietary, platforms are usually forgotten (or ignored) when NoSQL vendors lament the inability of traditional RDBMSs to scale-out to multiple processors, blithely citing comparisons of their products to MySQL, probably more popular for its price than its technical prowess. Relational databases do indeed run across multiple processors, and must evolve to do so more easily and efficiently as increases in processing power are now coming mainly from increasing the number of cores in processors.  Which finally brings me back to NuoDB.

NuoDB takes a highly innovative, object-oriented, transaction/messaging-system approach to the underlying database processing, eliminating the concept of a single control process responsible for all aspects of database integrity and organization.  Invented by Jim Starkey, an éminence grise of the database industry, the approach is described as elastically scalable - cashing in on the cloud and big data.  It also touts emergent behavior, a concept central to the theory of complex systems.  Together with an in-memory model for data storage, NuoDB appears very well positioned to take advantage of the two key technological advances of recent years mentioned already: extensive memory and multi-core processors.  And all of this behind a traditional SQL interface to maximize use of existing, widespread skills in the database industry.  What more could you ask?

However, it seems there's an added twist.  Apparently, SQL is just a personality the database presents; and is the focus of the initial release.  Morris also claims that NuoDB is able to behave as a document, object or graph database, personalities slated for later releases in 2013 and beyond.  Whether this emerges remains to be seen.  Interestingly, however, when saving to disk, NuoDB stores data in key-value format.

I'll be discussing big data, NoSQL and NewSQL in speaking engagements in Europe in November: the IRM DW&BI Conference in London (5-7 Nov) and Big Data Deutschland in Frankfurt (20-21 Nov).  I look forward to meeting you there!

]]>
http://www.b-eye-network.com/blogs/devlin/archives/2012/10/nosql_newsql_no.php http://www.b-eye-network.com/blogs/devlin/archives/2012/10/nosql_newsql_no.php Big data Thu, 25 Oct 2012 09:45:49 -0700
The Emerging Big Data Ecosystem Integrated information Platform.png Slowly but surely, big data is becoming mainstream.  Of course, if you listened only to the hype from analysts and vendors, you might think this was already the case.  I suspect it's more like teenage sex, more talked about than actually happening.  But it seems we're about to move into the roaring twenties.

I had the pleasure to be invited as the external expert speaker at IBM's PureData launch in Boston this week.  In a theatrical, dry-ice moment, IBM rolled out one of their new PureData machines between the previously available PureFlex and PureApplication models.  However, for me, the launch carried a much more complex and, indeed, subtle message than "here's our new, bright and shiny hardware".  Rather, it played on a set of messages that is gradually moving big data from a specialized and largely standalone concept to an all-embracing, new ecosystem that includes all data and the multifarious ways business needs to use it.

Despite long-running laments to the contrary, IT has had it easy when it comes to data management and governance.  Before you flame me, please read at least the rest of this paragraph.  Since the earliest days of general-purpose business computing in the 1960s, we've worked with a highly modeled and carefully designed representation of reality.  Basically, we've taken the messy, incoherent record of what really happens in the real world and hammered it into relational (and previously popular hierarchical or network) databases.  To do so, we've worked with highly simplified models of the world.  These simplifications range from grossly wrong (all addresses must include a 5-digit zip-code--yes, there are still a few websites that enforce that rule) to obviously naive (multiple purchases by a customer correlate to high loyalty) as well as highly useful to managing and running a business (there exists a single version of the truth for all data).  The value of useful simplifications can be seen in the creation of elegant architectures that enable business and IT to converse constructively about how to build systems the business can use.  They also reduce the complexity of the data systems; one size fits all.  The danger lies in the longer-term rigidity such simplifications can cause.

The data warehouse architecture of the 1980s, to which I was a major contributor, of course, was based largely on the above single-version-of-the-truth simplification.  There's little doubt it has served us well.  But, big data and other trends are forcing us to look again at the underlying assumptions.  And find them lacking. IBM (and it's not alone in this) has recognized that there exist different business use patterns of data which lead to different technology sweet spots.  The fundamental precept is not new, of course.  The division of computing into operational, informational and collaborative is closely related.  The new news is that the usage patterns are non-exclusive and overlapping; and they need to co-exist in any business of reasonable size and complexity.  I can identify four major business patterns: (1) mainstream daily processing, (2) core business monitoring and reporting, (3) real-time operational excellence and (4) data-informed planning and prediction.  And there are surely more.  This week, IBM announced three differently configured models: (1) PureData System for Transactions, (2) for Analytics and (3) Operational Analytics, each based on existing business use patterns and implementation expertise.  Details can be found here.  I imagine we will see further models in the future.

All of this leads to a new architectural picture of the world of data--an integrated information platform, where we deliberately move from a layered paradigm to one of interconnected pillars of information, linked via integration, metadata and virtualization.  A more complete explanation can be found in my white paper, "The Big Data Zoo--Taming the Beasts:  The need for an integrated platform for enterprise information".  As always, feedback is very welcome--questions, compliments and criticisms. ]]>
http://www.b-eye-network.com/blogs/devlin/archives/2012/10/the_emerging_bi.php http://www.b-eye-network.com/blogs/devlin/archives/2012/10/the_emerging_bi.php Big data Fri, 12 Oct 2012 08:10:04 -0700
Big (Data) Wheel Keep on Turning data 1s and 0s.jpgThe alacrity with which analysts, vendors, customers and even the popular press have jumped on the big data bandwagon over the past year or two has been little short of amazing.  Perhaps it was just boredom with ten years of the "relational is the answer; now, what's the question" refrain?  Or maybe the bottom line was an explosion of new business possibilities that emerged in different areas that all had one basic thing in common: a base of new data... as opposed to a new database?

I've commented on a number of occasions that the software technology on which big data is based is rather primitive.  After all, Hadoop and its associated zoo are little more than a framework and a set of software utilities to simplify writing and managing parallel-processing batch applications.  Compare this to the long-standing prevalence of real-time transaction processing in the database world, relational or otherwise.  NoSQL databases perhaps offer more novelty of thinking, especially where there has been innovation around the concept of key-value stores.  At some fundamental level, big data has been less about "volume, velocity and variety"--marketing terms in many ways--and more about simple economics.  The economics of cheap, commodity storage and processors combined with open sourcing of software development.
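
For readers who haven't seen the batch pattern that Hadoop packages up, a single-process Python sketch of map, shuffle and reduce follows. It is an illustration of the programming model only and uses none of Hadoop's actual APIs; the function names and sample lines are invented, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values collected for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run the whole batch: map every line, shuffle, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["big data", "Big wheel", "data data"])
```

Nothing is returned until the whole input has been processed, which is exactly the batch character that contrasts with real-time transaction processing.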

But, the big bandwagon has been rolling and many of us, myself included, have perhaps been too focused on the size and speed of the wagon and paid too little attention to the oxen pulling it.  Oxen?  Actually, I'm referring to the major web denizens, such as Google, Facebook and their ilk.  What alerted me was a recent Wired magazine article, "Google Spans Entire Planet With GPS-Powered Database" and a trail of links therein, particularly "Google's Dremel Makes Big Data Look Small".  Both articles, published in the past two months, make fascinating reading, but the bottom line is that Google and, to some lesser extent, Facebook are upgrading their big data environments to be faster and more responsive.  Unsurprisingly, Google is moving from a batch-oriented paradigm to, wait for it, a database system that preserves update consistency.  Google has been on this journey for three years now and has published research papers as far back as 2010.  Get ready for a new set of buzzwords: Dremel, Caffeine, Pregel and Spanner from Google and Prism from Facebook.

So what does this mean for the rest of us?  In the widespread adoption of the current version of big data technology, the driver has not been so much big data as the commoditization of processing power and computation that has emerged.  Database vendors have reacted by embracing Hadoop as a complementary data source or store to their engines.  The open sourcing of Dremel, if it happens, would signal, I believe, a much more significant change in the database market.  Readers familiar with "The Innovator's Dilemma" by Clayton Christensen, first published in 1997, will probably recognize what would ensue as disruptive innovation, described as "innovation that creates a new market by applying a different set of values, which ultimately (and unexpectedly) overtakes an existing market".  To possibly overstretch the bandwagon analogy, it seems that the bandleader has switched horses; the parade is changing its route.

These developments add a whole new set of future considerations for vendors and implementers of big data solutions, and I'll be exploring them further in speaking engagements in Europe in November: the IRM DW&BI Conference in London (5-7 Nov) and Big Data Deutschland in Frankfurt (20-21 Nov).  I hope to meet at least a few of you there!

"Big wheel keep on turning / Proud Mary keep on burning / And we're rolling, rolling / Rolling on the river" Creedence Clearwater Revival, 1969 ]]>
http://www.b-eye-network.com/blogs/devlin/archives/2012/10/big_data_wheel_1.php http://www.b-eye-network.com/blogs/devlin/archives/2012/10/big_data_wheel_1.php Tue, 02 Oct 2012 03:54:48 -0700
Big Data, Big Money and Big Opportunities piles-of-money.jpgIt often worries me that much of the excitement about big data and business analytics relates to marketing initiatives.  In the greater scheme of life, I personally feel that money spent trying to convince me to drink cola brand A rather than B could be put to better use.  Promoting the health benefits of pure water, maybe.  Tackling real world problems like contaminated water sources, even better.  

I suspect that may be a rather unpopular view in some circles, so in case you're tempted to stop reading now, I'd like to mention upfront a big data survey that is currently open for your input.  Shawn Rogers and John Myers of EMA and I have constructed a short survey to discover what companies are doing with big data and what challenges they are encountering.  We'd be delighted to hear from you.

But, on to big data and big money... and, in particular, off-shore investment money.  Over the weekend, articles in both the Guardian newspaper in the UK and the BBC reported that a tiny global elite of extraordinarily rich people had some $21 trillion in off-shore tax havens as of the end of 2010, an amount equivalent to the US and Japanese economies combined.  The work that estimated the above figure was commissioned by the Tax Justice Network and carried out by former McKinsey & Co. Chief Economist James Henry.  A press release covering the highlights of the report "The Price of Offshore Revisited" notes that Henry "drew on data from the World Bank, the IMF, the United Nations, central banks, the Bank for International Settlements, and national treasuries, and triangulates his results against data reflecting demand for reserve currency and gold, and data on offshore private banking studies by consulting firms and others".  The six-page press release reveals some truly staggering figures and is well worth a read.

You may contest the figures and the conclusions, and many will.  But, as the report says--and this is where we get back on topic with big data--"This scandal is made worse by the fact that [official institutions like the Bank for International Settlements, the IMF, the World Bank, the OECD, and the G20] already have much of the data needed to estimate this sector more carefully".  There is very little of the world's money that is not represented by and moved about as 1s and 0s in financial computing systems.  There is little doubt that this is, indeed, big data and amenable to the collection and processing we talk about and carry out... when we need marketing information.  We can now reliably detect petty fraud on the world's voluminous credit card transactions in flight; so I'm convinced that detecting, storing and analyzing the transactions that moved this wealth off-shore is, technically speaking, a piece of cake.  Perhaps the question is: do we have the will to do so?

I'll leave you with a more positive spin from the report: "From another angle, this study is really good news. The world has just located a huge pile of financial wealth that might be called upon to contribute to the solution of our most pressing global problems. We have an opportunity to think not only about how to prevent some of the abuses that have led to it, but also to think about how best to make use of the untaxed earnings that it generates."

In the meantime, read some of the above coverage (I haven't found a link to the full report) and please take the big data survey, too.

]]>
http://www.b-eye-network.com/blogs/devlin/archives/2012/07/big_data_big_mo_1.php http://www.b-eye-network.com/blogs/devlin/archives/2012/07/big_data_big_mo_1.php Tue, 24 Jul 2012 07:43:21 -0700
Virtualization, federation or just plain access Virtualization-Federation.pngThere are still many illusions and unjustified expectations about big data.  But, one old belief--dating back to the early days of data warehousing--that it has shattered is in a single store that can serve all BI needs.  Given the volumes and variety of big data, any thought of routing it all through a relational database environment just doesn't make sense.  And after the market's brief flirtation with the idea that all data could be handled in Hadoop (doh!), there is a general belief that IT needs to provide some sort of over-arching, integrating view for users across multiple data stores.

Cirro is among the latest players in this field, as I discovered talking to CEO Mark Theissen, previously data warehousing technical lead at Microsoft and a veteran of DATAllegro and Brio.  Its basic value proposition is to offer users self-driven exploration--via Cirro's Excel plug-in and BI tools--of data across a wide variety of platforms via ad hoc federation.  Cirro's starting point is big data scale and performance, offering a data hub with a cost-based federation optimizer, smart caching and a function library of low level MapReduce and SQL functions.  It also offers an optional "multi store" consisting of Hadoop and MySQL components that can be used as a temporary scratchpad area or a data mart.

In our conversation, Theissen declared that Cirro does federation, whereas competitors like Composite and Denodo do virtualization.  The difference, in his view, is that virtualization involves an expensive and time-consuming phase to create a semantic layer, while federation is done on the fly and, in the case of Cirro, using existing metadata from BI tools, databases and so on.  I wish it were that simple to differentiate between these two phrases, which have become a marketing battleground for many of the vendors competing in this field from the majors like IBM and Informatica to the newcomers such as Karmasphere and ClearStory.

I'd like to try to clarify the two terms... again.

The concept of federation (in data) goes back to the mid-1980s with the concept of federating SQL queries against the then-emerging relational databases.  By 1991, IBM's Information Warehouse Framework included access to heterogeneous databases via EDA/SQL from Information Builders.  By the early years of the new millennium, the need to join data from multiple, heterogeneous sources beyond traditional databases was widespread, often described as enterprise information integration (EII).  But, vendor offerings were poorly received, especially in BI, because of concerns about mismatched data meanings, security and query performance.  I consider federation as the basic technology of being able to split up a query in real time into component parts, distribute it to heterogeneous, autonomous sources and retrieve and combine the results.  To do this, access to technical metadata that defines database (or file) locations and structures, data volumes, network performance and more is needed to enable query optimization for access and performance.
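
The split, distribute and combine steps just described can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Cirro or any other vendor implements federation: the two "sources" are an in-memory SQLite table and a CSV string, all names and data are invented, and no query optimization is attempted.

```python
import csv
import io
import sqlite3

# Source 1 (assumed for illustration): a relational database of customers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2 (assumed for illustration): a flat file of orders held outside the database.
ORDERS_CSV = "customer_id,amount\n1,100\n1,50\n2,75\n"


def fetch_customers():
    """Sub-query sent to the relational source: returns {id: name}."""
    return dict(db.execute("SELECT id, name FROM customers"))


def fetch_order_totals():
    """Sub-query sent to the file source: returns {customer_id: total amount}."""
    totals = {}
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        customer_id = int(row["customer_id"])
        totals[customer_id] = totals.get(customer_id, 0.0) + float(row["amount"])
    return totals


def federated_revenue_by_customer():
    """Split the question in two, query each source, then combine the results."""
    names = fetch_customers()
    totals = fetch_order_totals()
    return {names[cid]: total for cid, total in totals.items() if cid in names}


result = federated_revenue_by_customer()
```

Even this toy has to know where each source lives and how it is structured, which is the technical metadata referred to above; a real federation engine would also use volumes and network costs to decide where the join should run.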

Data virtualization, in my view, builds on top of federation with knowledge of the business-related metadata required to address the problem of disparate data meanings, relationships and currencies and deliver high quality results that are meaningful and consistent for the business user submitting the query.  Simply put, there are two ways to address these problems and supply the needed metadata.  The easiest approach is to depend on the business user to understand data consistency and similar quasi-IT issues and to make sensible (in terms of data coherence and reliable results) queries.  The second way is to model the data to some extent upfront and create a semantic layer, as it's often called, that ensures the quality of returned results.

The former approach typically leads to faster, cheaper implementations; the latter to longer-term quality at some upfront cost.  The former works better if you're coming from a big data view point, where much of the data is poorly defined, changing and of questionable accuracy and consistency in any case.  The latter favors enterprise information management where quality and consistency are key.  The reality of today's world, however, is that we need both!
Cirro, with its sights set on big data and its minimal formal structure, strongly favors the first approach.  Allowing, indeed encouraging, users to build their explorations in the freeform environment that is Excel is a strong statement in itself.  It's typically fast, easy and iterative, all highly valued qualities in today's break-neck speed business environment.  However, when you link from there to the (hopefully) high-quality data warehouse, the need for a more formal and modeled approach becomes clear.  
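
As a rough illustration of what the more formal, modeled approach buys you, here is a minimal Python sketch of a semantic layer. It is a simplification under stated assumptions: the source names, field names and unit conversion are all hypothetical, and real products such as Composite or Denodo model far more than simple field mappings.

```python
def identity(value):
    """Pass a value through unchanged."""
    return value


def cents_to_units(value):
    """Convert an amount stored in cents to whole currency units."""
    return value / 100


# Hypothetical semantic layer: for each source, map a physical field name
# to a business name plus the conversion needed to make values comparable.
SEMANTIC_LAYER = {
    "crm": {
        "cust_nm": ("customer_name", identity),
        "rev_cents": ("revenue", cents_to_units),
    },
    "web": {
        "fullName": ("customer_name", identity),
        "total": ("revenue", identity),
    },
}


def to_business_view(source, record):
    """Translate one source record into business terms using the modeled mapping."""
    mapping = SEMANTIC_LAYER[source]
    return {
        business_name: convert(record[field])
        for field, (business_name, convert) in mapping.items()
    }


crm_row = to_business_view("crm", {"cust_nm": "Acme", "rev_cents": 12500})
web_row = to_business_view("web", {"fullName": "Acme", "total": 125.0})
```

The two source records look nothing alike, but both come out with the same business names and units. Building and maintaining that mapping is the upfront cost; without it, the user has to know that one system counts in cents.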

So, which approach to choose?  It depends on your starting point and initial drivers.  And your long-term needs.  Composite, for example, focuses more on the prior creation of business views to shield users from the technical complexity and inconsistencies in typical enterprise data.  Denodo, in contrast, talks of both bottom-up and top-down modeling to address both sets of needs.  In the long run, you'll probably need both approaches: the speed of an ad hoc approach for sandboxing and the quality of semantic modeling for production integration.

]]>
http://www.b-eye-network.com/blogs/devlin/archives/2012/07/virtualization_1.php http://www.b-eye-network.com/blogs/devlin/archives/2012/07/virtualization_1.php Thu, 12 Jul 2012 07:47:56 -0700
Integrating big data and more with your data warehouse AIW.pngIn 1988, I published the first data warehouse architecture.  Its aim was to provide consistent, integrated data to business users in support of cross-enterprise decision making.  Quality and consistency were the key drivers; at that time the major issues were that operational / transactional systems were highly inconsistent and direct access to them was discouraged for reasons of performance and security.  Business users were happy to get whatever consistent view they could, and, in general, wanted to see a stable representation of the business on a monthly, weekly or occasionally daily basis.  This architecture has remained a foundation of business intelligence ever since.

21 years later, in 2009, I introduced Business Integrated Insight (BI2).  With emerging needs like near real-time decision making in operational BI and increasing use of non-traditional data coming from Web 2.0 and other sources, this new architecture had to address a far wider scope than the original data warehouse.  While consistency and integrity remain important considerations, today's business needs are far more about instant access to the ever-changing ebb and flow of trends in sales, manufacturing and more.  It was becoming clear that a new, over-arching architecture was required to cover all the information, processes and people of the business.

Now, three years later, it's clear that traditional BI is racing to keep up with developments in big data, data virtualization and the cloud, mobile computing as well as social networking and collaboration.  All these topics were incorporated in BI2 from the outset.  Now, as the technology moves to the mainstream, we can and must dive deeper into these specific areas.  Big data leads clearly to the impossibility of routing all information through an enterprise data warehouse (EDW).  But, how will that impact our need for consistency and integrity?  I envisage we will move from the old adage of "a single version of the truth" to multiple versions depending on users' needs, with one particular version that I call core "business information" being the source of truth for external reporting and financial governance needs.  

Data virtualization has also become big news in recent years.  In many ways, it's a technology whose time has come.  With the explosion of data volumes and varieties, users need ways to combine data on the fly with confidence and performance.  Data virtualization addresses these needs and is increasingly overlapping with function we traditionally associate with ETL.  The result, data integration, as it's sometimes called, enables us to envisage a future where data is made available to users as they need it, whether real-time or integrated and historicized.  

And, against the background of all this upheaval in data and infrastructure, we also see a new breed of technology-savvy business users moving into positions of power.  These so-called millennials are demanding seamless, mobile access to the information they need, as well as the ability to play with it as required.  The rule of IT over the data and application resources of the organization is coming to an end.  But, that's not to say that IT has no future role.  In fact, I see more of a fully symbiotic partnership between business and IT emerging, a partnership I call the "biz-tech ecosystem".

My 2012 BI2 Seminar in Rome on 11-12 June explores these new directions and provides guidance on their introduction in your existing data warehouse environment.  It also introduces the Advanced Information Warehouse, shown above, as the next step on your journey from a traditional data warehouse to comprehensive business integrated insight. ]]>
http://www.b-eye-network.com/blogs/devlin/archives/2012/05/integrating_big_1.php http://www.b-eye-network.com/blogs/devlin/archives/2012/05/integrating_big_1.php Mon, 28 May 2012 06:20:23 -0700
Next query: NoSQL and Business Intelligence just-say-nosql.pngBusiness intelligence (BI) has long been associated with relational databases and the SQL language.  From the earliest days of data warehousing, the qualities of the relational model have been highly valued in the quest for data consistency and quality.  In addition, it was assumed that business users are comfortable with tables of information.  This has been proven true, especially by spreadsheets, much to IT's chagrin.  Tables are also the lingua franca of BI tools and simple Select / Where queries are familiar to many users.  But, whatever the rationale, the association of BI and SQL is deeply embedded in the minds of most practitioners. So, the question arises--what about NoSQL; how does this relate to BI?  Can it be of use in data warehousing?

Good questions.  But first, you need to know what flavor of NoSQL you're speaking about.  For brevity, I'll focus only on one of the five or so varieties: document-oriented data stores.  (If you are interested in the others, the bigger picture--and a trip to Rome--I propose my two-day seminar there on 11-12 June!)  As I discovered about a year ago in a fascinating conversation with Max Schireson, president of 10gen / MongoDB, in this context a document means neither e-mail contents nor Word documents; it refers to a particular data structure where records consist of an arbitrary set of fields, each identified by a name and value pair, structured in JSON (JavaScript Object Notation) or a similar language.  For more details, refer to my white paper.  So, let me release you from your suspense now.  Can this be of use in BI? The short answer is yes.  But to fully grasp the extent, I'd like to introduce you to two MongoDB customers and how they are easing into BI using NoSQL.
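To make the document structure described above a little more concrete, here is a minimal sketch in plain Python, with no database involved.  The records and their field names are invented for illustration and come from no real customer schema.  It shows two JSON records with different sets of fields living side by side, plus a simple Select / Where style filter over them.

```python
import json

# Hypothetical records for illustration only; the field names are assumptions.
raw = """
[
  {"name": "alice", "channel": "blog", "posts": 42},
  {"name": "bob", "channel": "twitter", "followers": 1800, "topics": ["bi", "nosql"]}
]
"""

documents = json.loads(raw)

# Each document carries its own set of field names: there is no shared schema.
field_sets = [sorted(doc.keys()) for doc in documents]


def where(docs, field, value):
    """Return the documents that have `field` set to `value`.

    A document that lacks the field simply does not match.
    """
    return [d for d in docs if field in d and d[field] == value]


bloggers = where(documents, "channel", "blog")
print(field_sets)
print([d["name"] for d in bloggers])
```

The point of the sketch is only that a query can run across records that do not share the same fields, which is what distinguishes this structure from a relational table.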

I spoke to David Chancogne, CTO of Traackr, a web business measuring the influence of people who blog, tweet and otherwise contribute to the impression the general public forms of brands, products and more on the web.  The goal is to help marketers and advertising agencies track and target such influencers more effectively.  Traackr has built a MongoDB database of the contents of blogs, tweets, etc. and gives its customers reports and analyses of the top influencers in their areas of interest.  Is this BI?  In its broadest sense, yes.  The scope is very specific and the queries pre-defined, but this is still BI at its most basic.  Did Chancogne think of it as BI?  Actually not; it's simply his business to provide analytics to his customers.  Probing a little deeper, I discovered that Traackr is continually trying to optimize its algorithm to rate influence.  They do this by extracting data from their database and playing with it in--wait for it--Excel!  More BI, but, as with many a start-up business before them, the choice of Excel was driven more by familiarity and ease of use.  Generic BI tools that run against a JSON data store, such as Pentaho's NoSQL solution and Nucleon Software's BI Studio, are beginning to appear; they allow generic querying on the data without extracting it to Excel.

A conversation with Julian Browne led to further interesting insights.  Browne is the architect of Priority Moments (a location-aware customer loyalty program that offers discounts at affiliated retailers) at O2, the second-largest provider of mobile/cell phone services in the UK, with more than 20 million customers.  MongoDB was chosen as the platform for this service largely to deal with the complexity and variability of their product catalog.  The challenge is that a bewildering variety of product sets can be offered to different customers, and that variety changes constantly at the whim of marketing.  The absence of a predefined schema, a key characteristic of document-oriented data stores, was a compelling argument for the technology choice.  But, what of BI?  Customer loyalty programs are prime BI territory, of course, and in this case tracking of uptake of offers is vital.  As with Traackr, initial BI was provided through hand-crafted Java programming, although there is growing interest in using the emerging BI tools.  Of more interest, however, is the experimental use of a specific feature of the database that allows a query to be left open: as records arrive in the database, they automatically appear in the result.  That result can be routed to a live HTML5 graph(1), giving real-time feedback to monitor program activity.
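The idea of a query that stays open can be sketched in a few lines of plain Python.  This is a toy imitation of the behavior only, not the database feature itself; the class names, record fields and offers are all hypothetical.  (MongoDB's tailable cursors on capped collections behave along these lines, although I am assuming, not asserting, that this is the feature in use at O2.)

```python
from collections import deque


class OpenQuery:
    """Toy stand-in for a query that stays open: matching records that
    arrive after the query was issued still show up in its results."""

    def __init__(self, predicate):
        self.predicate = predicate
        self._pending = deque()

    def offer(self, record):
        # Called by the store for every record the query should see.
        if self.predicate(record):
            self._pending.append(record)

    def drain(self):
        # Return, and clear, everything that has matched since the last drain.
        items = list(self._pending)
        self._pending.clear()
        return items


class Store:
    """Hypothetical in-memory store that notifies its open queries."""

    def __init__(self):
        self.records = []
        self.open_queries = []

    def open_query(self, predicate):
        query = OpenQuery(predicate)
        # Existing matches are delivered first, as with an ordinary query.
        for record in self.records:
            query.offer(record)
        self.open_queries.append(query)
        return query

    def insert(self, record):
        self.records.append(record)
        for query in self.open_queries:
            query.offer(record)


store = Store()
store.insert({"offer": "coffee", "event": "redeemed"})

redemptions = store.open_query(lambda r: r.get("event") == "redeemed")
first_batch = redemptions.drain()

store.insert({"offer": "cinema", "event": "viewed"})
store.insert({"offer": "cinema", "event": "redeemed"})
second_batch = redemptions.drain()

print(first_batch)
print(second_batch)
```

A live chart would simply call `drain()` on a timer and plot whatever has arrived since the last refresh.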

How would we summarize the situation regarding BI for document-oriented NoSQL databases?  What we see is a fairly recent database technology with its query facilities being used for basic, predefined BI.  As might be expected, more generic tooling for building queries is appearing.  The type of BI supported is focused, application-specific querying and reporting--the type associated with data marts in traditional BI.  This is exactly as we saw in the emergence of BI against relational databases.  Note that some of the querying is being performed against the live operational sources.  Again, we see the similarity with early reporting approaches with similar concerns about performance impacts on operations.  MongoDB addresses this through the creation of eventually consistent replicas.  Nonetheless, the demand for real-time BI continues to grow and certain classes of operational analytics will need such real-time or near real-time access.
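For readers less familiar with eventually consistent replicas, the following is a minimal sketch in plain Python, assuming a single primary and a single replica that catches up only when told to.  The classes and record fields are hypothetical, and real replication runs continuously and is far more involved.  The point is simply that reporting reads are kept off the operational store, at the price of occasionally seeing slightly stale data.

```python
class Primary:
    """Takes operational writes and records them in an ordered log."""

    def __init__(self):
        self.log = []

    def write(self, record):
        self.log.append(record)


class Replica:
    """Serves reporting reads from a copy that catches up only when synced."""

    def __init__(self, primary):
        self.primary = primary
        self.applied = 0      # number of log entries copied so far
        self.records = []

    def sync(self):
        # Copy any log entries not yet applied to this replica.
        new_entries = self.primary.log[self.applied:]
        self.records.extend(new_entries)
        self.applied += len(new_entries)

    def count(self, predicate):
        return sum(1 for r in self.records if predicate(r))


def is_redeemed(record):
    return record["event"] == "redeemed"


primary = Primary()
replica = Replica(primary)

primary.write({"offer": "coffee", "event": "redeemed"})
replica.sync()
primary.write({"offer": "cinema", "event": "redeemed"})

stale = replica.count(is_redeemed)   # the replica has not yet seen the second write
replica.sync()
fresh = replica.count(is_redeemed)   # after syncing, the replica has caught up

print(stale, fresh)
```

Whether the brief lag matters depends on the class of analysis: it is harmless for uptake reports, but the truly real-time cases mentioned above may still need to read the live source.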

Where NoSQL does not play a role in BI is also important.  Enterprise data warehouses (EDW), with their focus on creating consistent, integrated, historical stores of core business information, are set to remain squarely in the relational database world.  But, where operational needs drive the choice of a NoSQL document-oriented data store, it is clear that BI can flourish in this environment too.  See my latest white paper, "Business Intelligence--NoSQL... No Problem", for further details.


(1)  For background on this approach, see hummingbird and data-driven documents. ]]>
http://www.b-eye-network.com/blogs/devlin/archives/2012/05/next_query_nosq_1.php Thu, 17 May 2012 03:37:43 -0700