

Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today covers the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with an holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

Recently in Big data Category

"We are running through the United States with dynamite and rock saws so an algorithm can close the deal three microseconds faster," according to algorithm expert Kevin Slavin at last month's TEDGlobal conference.  Kevin was describing the fact that Spread Networks is building a fiber optic connection to shave three microseconds off the 825 mile trading journey between Chicago and New York.  The above comes courtesy of a thought-provoking article called "When algorithms control the world" by Jane Wakefield on the BBC website.

As someone who follows the data warehousing or big data scene, you're probably familiar with some of the decisions that algorithms are now making, based largely on the increasing volumes of information that we have been gathering, particularly over recent years.  We all know by now that online retailers and search engines use sophisticated algorithms to decide what we see when we go to a web page.  Mostly, we're pretty pleased that what comes up is well-matched to our expectations and that we don't have to plow through a list of irrelevant suggestions.  We're thankful for the reduction in information overload.

But, there are a few downsides to our growing reliance on algorithms and the data they graze upon...


The carving up of the US for fiber highlights the ecological destruction that comes from our ever-growing addiction to information and the speed of getting it.  We tend to sit back comfortably and think that IT is clean and green.  And it probably is greener than most industries, but it is certainly far from environmentally neutral.  We need to pay more attention to our impact--carbon emissions, cable-laying and radio-frequency and microwave pollution are all part of IT's legacy.

But there's a much more subtle downside, and one that has been of concern to me for years.  Just because we can do some particular analysis, does it really make sense?  The classic case is in the use of BI and data mining in the insurance industry.  Large data sets and advanced algorithms allow insurers to discover subtle clues to risk in segments of the population and adjust premiums accordingly.  Now, of course, actuaries have been doing this since the 18th and 19th centuries.  But the principal driver in the past was to derive an equitable spread of risk in a relatively large population, such that the cost of a single event was effectively spread over a significantly larger number of people.  However, data mining allows ever more detailed segmentation of a population, and insurers have responded by identifying particularly high-risk groups and effectively denying them insurance or pricing premiums so high that such people cannot insure their risk.  While in some cases we can argue that this drives behavior changes that reduce overall risk (for example, safer driving practices among young males), in many other instances, no such change is possible (for example, for house owners living on flood plains).  I would argue that excessive use of data mining to segment risk in insurance eventually destroys the possibility of spreading risk equitably and thus undermines the entire industry.
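To make that pooling argument concrete, here is a minimal Python sketch (the segment sizes, loss probabilities and event cost are invented purely for illustration) comparing a single pooled premium with premiums priced per risk segment:

```python
# Illustrative only: invented loss probabilities for three risk segments.
segments = {
    "low_risk":  {"people": 8000, "annual_loss_probability": 0.01},
    "mid_risk":  {"people": 1800, "annual_loss_probability": 0.05},
    "high_risk": {"people": 200,  "annual_loss_probability": 0.50},
}
COST_PER_EVENT = 20_000  # assumed cost of a single insured event

# One pooled premium: expected losses spread over the whole population.
total_people = sum(s["people"] for s in segments.values())
expected_losses = sum(
    s["people"] * s["annual_loss_probability"] * COST_PER_EVENT
    for s in segments.values()
)
pooled_premium = expected_losses / total_people
print(f"Pooled premium for everyone: {pooled_premium:,.0f}")

# Segmented premiums: each group carries exactly its own expected loss.
for name, s in segments.items():
    premium = s["annual_loss_probability"] * COST_PER_EVENT
    print(f"{name}: premium {premium:,.0f}")
```

With these made-up numbers, the pooled premium comes out around 540, while the high-risk segment alone would be priced at 10,000--exactly the kind of gap that prices such people out of cover once segmentation becomes fine enough.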

In a similar manner, the widespread use of sophisticated algorithms and technology to speed trading seems to me to threaten the underlying purpose of futures and other financial markets, which, in my simplistic view, is to enable businesses to effectively fund future purchases or investments.  The fundamental goal of the algorithms and speedy decision making, however, seems to be to maximize profits for traders and investors, without any concern for the overall purpose of the market.  We've seen the results of this dysfunctional behavior over the past few years in the derivatives market, where all sense of proportion and real value was lost in the pursuit of illusory financial gain.

But, it gets worse!  The BBC article reveals how a British company, Epagogix, uses algorithms to predict what makes a hit movie.  Using metrics such as script, plot, stars, location, and the box office takings of similar films, it tries to predict the success of a proposed production.  The problem here, and note that the same applies to book suggestions on Amazon and all similar approaches, is that the algorithm depends on past consumer behavior and predicts future preferences based upon that.  The question is: how do new preferences emerge if the only thing being offered is designed solely to satisfy past preferences?
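Epagogix does not publish its model, so purely as an illustration of the general approach, here is a hypothetical least-squares sketch in Python: fit a simple linear model to invented feature scores and box office takings of past films, then score a proposed production.  The point it demonstrates is that the prediction can only ever reflect what has already succeeded.

```python
import numpy as np

# Invented feature scores for past films: [star_power, plot_familiarity, marketing]
# and their box office takings (in millions).  Purely illustrative data.
past_features = np.array([
    [0.9, 0.8, 0.7],
    [0.4, 0.9, 0.5],
    [0.8, 0.3, 0.9],
    [0.2, 0.6, 0.3],
])
past_takings = np.array([310.0, 180.0, 260.0, 90.0])

# Fit a simple linear model by least squares (with an intercept column).
X = np.hstack([past_features, np.ones((len(past_features), 1))])
coeffs, *_ = np.linalg.lstsq(X, past_takings, rcond=None)

# Score a proposed production: the prediction is entirely a function of how
# closely it resembles what has already done well at the box office.
proposal = np.array([0.7, 0.9, 0.6, 1.0])
print(f"Predicted takings: {proposal @ coeffs:.0f}M")
```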

I would argue that successful business strategy requires a subtle blend of understanding past buyer behavior and offering new possibilities that enable and encourage new behaviors to emerge.  If all that a business offers is that which has been successful in the past, it will be rapidly superseded by new market entrants that are not locked into the past.  The danger of an over-reliance on data mining and algorithms is that business innovation is stifled.  More importantly, for civilization, imagination is suffocated and strangled for lack of new ideas and thoughts.

Do you want to live in such an algorithm-controlled world?  As Wakefield puts it so well: "In reality, our electronic overlords are already taking control, and they are doing it in a far more subtle way than science fiction would have us believe. Their weapon of choice - the algorithm."


Posted August 25, 2011 6:19 AM

In my last post, I discussed some of the key points in the 5th annual Digital Universe study from IDC, released by EMC in June.  Here, I consider a few more: some of the implications of the changes in sourcing for security and privacy, the importance of considering transient data, where volumes are a number of orders of magnitude higher, and a gentle reminder that bigger is not necessarily the nub of the problem.

Let's start with transient data.  IDC notes that "a gigabyte of stored content can generate a petabyte or more of transient data that we typically don't store (e.g., digital TV signals we watch but don't record, voice calls that are made digital in the network backbone for the duration of a call)".  Now, as an old data warehousing geek, that type of statement typically rings alarm bells: what if we miss some business value in the data that we never stored?  How can we ever recheck at a future date the results of an old analysis we made in real-time?  We used to regularly encounter this problem with DW implementations that focused on aggregated data, often because of the cost of storing the detailed data.  Over the years, decreasing storage costs meant that more warehouses moved to storing the detailed data.  But now, it seems like we are facing the problem again.  However, from a gigabyte to a petabyte is a factor of a million!  And, as the study points out, the "growth of the [permanent] digital universe continues to outpace the growth of storage capacity".  So, this is probably a bridge too far for hardware evolution.
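A back-of-the-envelope calculation makes clear why capturing that transient stream is impractical (the 0.1% retention figure below is just an illustrative assumption):

```python
# Back-of-the-envelope: IDC's figure of a petabyte of transient data
# per gigabyte of stored content is a factor of a million.
stored_gb = 1
transient_gb = 1_000_000  # 1 PB expressed in GB

ratio = transient_gb / stored_gb
print(f"Transient-to-stored ratio: {ratio:,.0f}x")

# Even keeping 0.1% of the transient stream would multiply storage needs
# a thousandfold -- while storage capacity already lags data growth.
keep_fraction = 0.001  # illustrative assumption
print(f"Keeping {keep_fraction:.1%} still means "
      f"{ratio * keep_fraction:,.0f}x more storage")
```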

The implication (for me) is that our old paradigm about the need to keep raw, detailed data needs to be reconsidered, at least for certain types of data.  This leads to the point about "big data" and whether the issue is really about size at all.  The focus on size, which is the sound-bite for this study and most of the talk about big data, distracts us from the reality that this expanding universe of data contains some very different types of data to traditional business data and comes from a very different class of sources.  Simplistically, we can see two very different types of big data: (1) human-generated content, such as voice and video and (2) machine metric data such as website server logs and RFID sensor event data.  Both types are clearly big in volume, but in terms of structure, information value per gigabyte, retention needs and more, they are very different beasts.  It is interesting to note that some vendors are beginning to specialize.  Infobright, for example, is focusing on what they call "machine-generated data", a class of big data that is particularly suited to their technical strengths.

Finally, a quick comment on security and privacy.  The study identifies the issues: "Less than a third of the information in the digital universe can be said to have at least minimal security or protection; only about half the information that should be protected is protected."  Given how much information consumers are willing to post on social networking sites or share with businesses in order to get a 1% discount, this is a significant issue that proponents of big data and data warehousing projects must address.  As we bring this data from social networking sources into our internal information-based decision-making systems, we will increasingly expose our business to possible charges of misusing information, exposing personal information, and so on.

There are many more thought-provoking observations in the Digital Universe study.  Well worth a read for anybody considering integrating data warehousing and big data.


Posted August 12, 2011 11:45 AM
I've just been reading the 5th annual Digital Universe study from IDC, released by EMC last month.  This year's study seems to have attracted less media attention than previous versions.  Perhaps we've grown blasé about the huge numbers of bytes involved - 1.8 ZB (zettabytes, or 1.8 trillion gigabytes) in 2011 - or perhaps the fact that the 2011 number is exactly the same as predicted in the previous study is not newsworthy.  However, the subtitle of this year's study, "Extracting Value from Chaos", should place it close to the top of every BI strategist's reading list.  Here, and in my next blog entry, are a few of the key takeaways, some of which have emerged in previous versions of the study, but all of which together reemphasize that business intelligence needs to undergo a radical rebirth over the next couple of years.

1.8 ZB is a big number, but consider that it's also a rapidly growing number, more than doubling every two years.  That's faster than Moore's Law.  By 2015, we're looking at 7.5-8 ZB.  More than 90% of this information is already soft (aka unstructured) and that percentage is growing.  Knowing that the vast majority of this data is generated by individuals and much of that consists of video, image and audio, you may ask: what does this matter to my internal BI environment?  The answer is: it matters a lot!  Because in that vast, swirling and ever-changing cosmic miasma of data there are hidden the occasional nuggets of valuable insight.  And whoever gets to them first - you or your competitors - will potentially gain significant business advantage.
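As a quick sanity check on those figures (the 2011 baseline and the doubling period come from the study; the projection is simple compound growth), a short sketch:

```python
# Project the digital universe forward from the study's 2011 baseline,
# assuming it doubles every two years (the study says slightly faster).
baseline_zb = 1.8   # zettabytes in 2011, per the IDC study
doubling_period_years = 2

for year in (2013, 2015):
    elapsed = year - 2011
    projected = baseline_zb * 2 ** (elapsed / doubling_period_years)
    print(f"{year}: ~{projected:.1f} ZB")

# 2015 comes out around 7.2 ZB; "more than doubling" every two years
# pushes that toward the 7.5-8 ZB the study projects.
```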

With such volumes of information and such rapid growth, it is simply impossible to examine (never mind analyse) it manually.  This demands an automated approach.  Such tools are emerging - for example, facial recognition of photos on Facebook and elsewhere or IBM Watson's extraction of Jeopardy answers from the contents of the Internet.  Conceptually, what such tools do is generate data about data, which, as we know and love in BI, means metadata.  According to IDC, metadata is growing at twice the rate of the digital universe as a whole.  That's more than doubling every year!  

So, while we may well ask what you're doing about managing and storing soft information, an even more pressing question is what are you going to do about metadata?  Of course, the volumes of metadata are probably still relatively small (IDC hasn't published an absolute value), but that growth rate means they will get large, fast.  And we currently have a much more limited infrastructure and weaker methodologies to handle metadata than we've created over the years for data.  Not to mention that the value to be found in the chaos can be discovered only through the lens of the metadata that characterizes the data itself.

For BI, this shift in focus from hard to soft information is only one of the changes we have to manage.  Another major change involves the nature and sources of the hard data itself.  There is a growing quantity of hard data collected from machine sensors as more and more of the physical world goes on the Web.  RFID readers are generating ever increasing volumes of data.  (According to VDC Research, nearly 4 billion RFID tags were sold in 2010, a 35% increase over the previous year.)  From electricity meters to automobiles, intelligent, connected devices are pumping out ever increasing volumes of data that is being used in a wide variety of new applications.  And almost all of these applications can be characterized as operational BI.  So, the move from traditional, tactical BI to the near real-time world of operational BI is accelerating, with all of the challenges that entails.

Next time, I'll be looking at some of the implications of the changes in sourcing for security and privacy, as well as the interesting fact that although the stored digital universe is huge, the transient data volumes are a number of orders of magnitude higher.


Posted July 27, 2011 8:32 AM
A chat with Max Schireson, President of 10gen, makers of MongoDB (from "humongous database"), yesterday provided some food for thought on the topic of our assumptions about the best database choices for different applications.  Such thinking is particularly relevant for BI at the moment, as the database choices expand rapidly.

But first, for traditionalist BI readers, a brief introduction to MongoDB, which is one of the growing horde of so-called NoSQL "databases", some of which have very few of the characteristics of databases.  NoSQL stores come in half a dozen generic classes, and MongoDB is in the class of "document stores" along with tools such as Apache's CouchDB and Terrastore.  Documents?  If you're confused, you are not alone.  In this context, we don't mean textual documents used by humans, but rather use the word in a programming sense as a collection of data items stored together for convenience and ease of processing, generally without a predefined schema.  From a database point of view, such a document is rather like the pre-First Normal Form set of data fields that users present to you as what they need in their application.  Think, for example, of an order consisting of an order header and multiple line items.  In the relational paradigm, you'll make two (or more) tables and join them via foreign keys. In a document paradigm, you'll keep all the data related to that order in one document.
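As a concrete sketch of those two representations of the order, using plain Python data structures rather than any particular database (all names and values here are invented for illustration):

```python
# Relational-style: the order split across two "tables", linked by a key.
order_headers = [
    {"order_id": 1001, "customer": "ACME Corp", "order_date": "2011-06-15"},
]
order_lines = [
    {"order_id": 1001, "line_no": 1, "sku": "WIDGET-A", "qty": 10, "price": 4.50},
    {"order_id": 1001, "line_no": 2, "sku": "WIDGET-B", "qty": 2,  "price": 19.99},
]
# Reassembling the order means a join on order_id.
lines_for_1001 = [line for line in order_lines if line["order_id"] == 1001]

# Document-style: everything about the order kept together, no join needed.
order_document = {
    "order_id": 1001,
    "customer": "ACME Corp",
    "order_date": "2011-06-15",
    "lines": [
        {"sku": "WIDGET-A", "qty": 10, "price": 4.50},
        {"sku": "WIDGET-B", "qty": 2,  "price": 19.99},
    ],
}
```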

Two characteristics--the lack of a predefined schema and the absence of joins--are very attractive in certain situations, and these turn out to be key design points for MongoDB. The lack of a schema makes it very easy to add new data fields to an existing database without having to reload the old data; so if you are in an emerging industry or application space, especially where data volumes are large, this is very attractive.  The absence of joins also plays well for large data volumes; if you have to shard your data over multiple servers, joins can be very expensive.  So, MongoDB, like most NoSQL tools, plays strongly in the Web space with companies needing fast processing of large volumes of data with emergent processing needs.  Sample customers include Shutterfly, foursquare, Intuit, IGN Entertainment, Craigslist and Disney.  In many cases, the use of the database would be classed as operational by most BI experts.  However, there are some customers using it for real-time analytics, and that leads us to the question of using non-relational databases for BI.
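A minimal sketch of that schema flexibility, assuming a MongoDB instance on localhost and the current pymongo driver (the database, collection and field names are invented):

```python
from pymongo import MongoClient  # assumes the pymongo driver is installed

# Assumes a MongoDB instance running locally; names are purely illustrative.
client = MongoClient("localhost", 27017)
orders = client.shop_demo.orders

# Documents in the same collection need not share a schema: a new field can
# simply appear on new documents, with no reload of the existing data.
orders.insert_one({"order_id": 1001, "customer": "ACME Corp", "total": 84.98})
orders.insert_one({"order_id": 1002, "customer": "Initech", "total": 19.99,
                   "coupon_code": "SUMMER11"})  # new field, no migration step
```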

When considering implementing Operational BI solutions, many implementers first think of copying the operational data to an operational data store (ODS), data warehouse or data mart and analysing it there.  They are immediately faced with the problem of how to update the informational environment fast enough to satisfy the timeliness requirement of the users.  As that approaches real-time, traditional ETL tools begin to struggle.  Furthermore, in the case of the data warehouse, the question arises of the level of consistency among these real-time updates and between the updates and the existing content.  The way MongoDB is used points immediately to an alternative, viable approach--go directly against the operational data.

As always, there are pros and cons.  Avoiding storing and maintaining a second copy of large volumes of data is always a good thing.  And if the analysis doesn't require joining with data from another source, using the original source data can be advantageous.  There are always questions about performance impacts on the operational source, and sometimes security implications as well.  However, the main question is around the types of query possible against a NoSQL store in general or a document-oriented database in this case.  It is generally accepted that normalizing data in a relational database leads to a more query-neutral structure, allowing a wider variety of queries to be handled.  On the other hand, as we saw with the emergence of dimensional schemas and now columnar databases, query performance against normalized databases often leaves much to be desired.  In the case of Operational BI, however, most experience indicates that the queries are usually relatively simple, and closely related to the primary access paths used operationally for the data concerned.  The experience with MongoDB bears this out, at least in the initial analyses users have required.
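To illustrate what such a simple query, closely aligned with the primary access path, might look like when run directly against the operational store, here is a hypothetical pymongo sketch (collection and field names are invented, and again a local MongoDB instance is assumed):

```python
from datetime import datetime, timedelta
from pymongo import MongoClient  # assumes the pymongo driver is installed

# Hypothetical operational collection; names and fields are invented.
orders = MongoClient("localhost", 27017).shop_demo.orders

# A typical operational BI question: how many orders, and what revenue,
# in the last hour -- asked directly of the operational data, no ETL copy.
since = datetime.utcnow() - timedelta(hours=1)
recent = list(orders.find({"order_date": {"$gte": since}}))
revenue = sum(order.get("total", 0) for order in recent)
print(f"Orders in the last hour: {len(recent)}, revenue: {revenue:.2f}")
```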

I'd be interested to hear details of readers' experience with analytics use of this and other non-relational approaches.


Posted June 17, 2011 5:50 AM

Having keynoted, spoken at and attended the inaugural O'Reilly Media Strata Conference in Santa Clara over the past few days, I wanted to share a few observations.

With over 1,200 attendees, the buzz was palpable.  This was one of the most energized data conferences I've attended in at least a decade.  Whether it was the tag line "Making Data Work", the fact it was an O'Reilly event or something else, it was clear that the conference captured the interest of the data community. 

The topics on the agenda were strongly oriented towards data science, "big data" and the softer (aka less structured) types of information.  This led me to expect that I'd be an almost lone voice for traditional data warehousing topics and thoughts.  I was wrong.  While there certainly were lots of experts in data analysis and Hadoop, there was no shortage of both speakers and attendees who did understand many of the principles of cleansing, consistency and control at the heart of data warehousing.

Given the agenda, I was also expecting to be somewhat of the "elder lemon" of the conference.  Unfortunately (in my personal view), in this I was correct.  It looked to me that the median age was well south of thirty, although I've done no data analysis to validate that impression.  Another observation, which was a bit more concerning, was that the gender balance of the audience was about the same as I've seen at data warehouse conferences since the mid-90s: a mid-90s percentage of males.  It seems that data remains largely a masculine topic.

The sponsor / vendor exhibitor list was also very interesting.  Only a few of the exhibitors were ones that turn up at traditional data warehouse conferences.  Of course, the new "big data" vendors were there in force, as well as a few information providers.  Of the relational database vendors, only ParAccel and AsterData were represented.  Jaspersoft and Pentaho represented the open source BI vendors, while Pervasive and Tableau rounded out the vendors I recognized from the BI space.

As a final point, I note that the next Strata Conference has already been announced: 19-21 September in New York.  Wish I could be there!


Posted February 3, 2011 7:02 PM


