Blog: Barry Devlin

Barry Devlin

As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein.

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

Barry's interest today spans the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with a holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

Editor's Note: Find more articles and resources in Barry's BeyeNETWORK Expert Channel and blog. Be sure to visit today!

Recently in Big data Category

"Social networks already know who you know", "recommendation engines get much smarter", "early detection mitigates catastrophes".  Three of ten ways big data is creating the science fiction future.  These types of headlines appeal to the geek optimists in many of us.  We think that mitigating a catastrophe is certainly a good thing.  That smarter recommendations on whom we should connect with and what we might be interested in buying could probably save us time, that most precious of commodities.  Most of us have grown up with a belief system that science and, by extension, technology and computers, are a sine qua non in today's world.  In truth, the world we live in today could not exist without them.

But, at what cost?

Three further headlines from the same blog: "surveillance gets really Orwellian", "doctors can make sense of your genome--and so can insurers", "dating sites that can predict when you're lying".  Perhaps these items give pause for thought.  Security cameras lurk in every corridor and public place.  And, as of last August, the NYPD has been monitoring Facebook and Twitter.  Even in our bedrooms, smart phones can be turned on remotely to monitor our most intimate indiscretions.  It's open season on our actions and communications.  Our genomes are fast becoming public property, ostensibly for our better health management; but, clearly, for better risk management--read profit--for insurance companies.  Even our thinking is being analyzed.

We're fast reaching 1984 some 30 years later than George Orwell imagined.  At least in our ability to monitor the actions, communications, genetic makeup and thoughts of an ever-increasing swathe of humanity.  As BI experts and data scientists, we celebrate our ability to gather and analyze ever more data with ever more sophistication and ever deeper granularity.  For marketeers, Utopia is a segment of one whose buying behavior is predictable with certainty.  As traders on the commodities or currency markets, our algorithms gamble on the Brownian motion of microscopic movements in prices.  For insurers, statistical averaging of risk across populations gives way to cherry picking the low-risk individuals for discounted premiums.

Am I overly pessimistic or even paranoid in imagining that big data brings risks at least as large as the benefits it promises?  Are the petabytes and exabytes of information we're gathering, storing and analyzing open to misuse?  We celebrate the role of social networking in pro-democracy movements around the world, imagining that tweets and texts are unassailable weapons for freedom, forgetting that the networks that carry them are run by big businesses whose bottom line is profit.  We reveal the secrets of our lives in dribs and drabs, in recordable phone conversations and even through the GPS tracking of our smart phones, oblivious that the technology exists to piece all the clues together, Sherlock Holmes-like, given sufficient time and money.

In my last post, I challenged us to take a step back and apply human insight to the results of big data analysis rather than take the results from statistical analyses at face value, to question the sources and play with other possible explanations before jumping to conclusions.  Now, knowing how fallible your own interpretation of big data may be, please give some consideration to the possibility that others, particularly those in positions of power, such as governments and businesses, can accidentally or deliberately misinterpret or misuse the big data resource.

But what can we do as an industry?  As individual analysts, consultants, data administrators and more?  At the very least, we can revisit the privacy and security controls we build into our systems.  Take a look at "Why you can't really anonymize your data" by Pete Warden and begin pressing the industry and academia to search for new solutions.  Look again at your business processes and evaluate if and how the use of big data subverts the intentions or ethics of how you work.  And, finally, reread George Orwell's "1984".

Posted February 17, 2012 2:47 AM
Now, I may be accused of getting up on my soap box in this first post of 2012, but... a few recent articles on the topic of big data / predictive analytics have really got me thinking.  Well, worrying, to be more precise.  My worry is that there seems to be a growing belief in the somehow magical properties of big data and a corresponding deification of those on the leading edge of working with big data and predictive analytics.  What's going on?

The first article I came across was "So, What's Your Algorithm?" by Dennis Berman in the Wall Street Journal.  He wrote on January 4th, "We are ruined by our own biases. When making decisions, we see what we want, ignore probabilities, and minimize risks that uproot our hopes.  What's worse, 'we are often confident even when we are wrong,' writes Daniel Kahneman, in his masterful new book on psychology and economics called 'Thinking, Fast and Slow.'  An objective observer, he writes, 'is more likely to detect our errors than we are.'"

I've read no more than the first couple of chapters of Kahneman's book (courtesy of Amazon Kindle samples), so I don't know what he concludes as a solution to the problem posed above--that we are deceived by our own inner brain processes.  However, my intuitive reaction to Berman's solution was visceral: how can he possibly suggest that the objective observer advocated by Kahneman could be provided by analytics over big data sets?  In truth, the error Berman makes is blatantly obvious in the title of the article... it always is somebody's algorithm.

The point is not that analytics and big data are useless.  Far from it.  They can most certainly detect far more subtle patterns in far larger and statistically more significant data sets than most or even all human minds can.  But, the question of what is a significant pattern and, more importantly, what it might mean remains the preserve of human insight.  (I use the term "insight" here to mean a balanced judgment combining both rationality and intuition.)  So, the role of such systems as objective observer for the detection and possible elimination of human error is, to me, both incorrect and objectionable.  It merely elevates the writer of the algorithm to the status of omniscient god.  And not only omniscient, but also often invisible.

Which brings me to the second article that got me thinking... rather negatively, it so happens.  "This Is Generation Flux: Meet The Pioneers Of The New (And Chaotic) Frontier Of Business" by Robert Safian was published by Fast Company magazine on January 9th.  The breathless tone, the quirky black and white photos and the personal success stories all contribute to a sense (for me, anyway) of awe in which we are asked to perceive these people.  The premise that the new frontier of business is chaotic is worthy of deep consideration and, in my opinion, is quite likely to be true.  But, the treatment is, as Scott Davis of Lyzasoft opined, "more Madison Avenue than Harvard Business Review".  It is quite clear that each of the pioneers here has made significant contributions to the use of big data and analytics in a rapidly changing business world.  However, the converging personal views of seven pioneering people--presumably chosen for their common views on the topic--hardly constitute a well-founded, thought-out theoretical rationale for concluding that big data and predictive analytics are the only, or even a suitable, solution for managing chaos in business.

As big data peaks on the hype curve this year (or has it done so already?), it will be vital that we in the Business Intelligence world step back and balance the unbridled enthusiasm and optimism of the above two articles with a large dollop of cold, hard realism based on our many years' experience of trying to garner value from "big data".  (Since its birth, BI has always been on the edge of data bigger than could be comfortably handled by the technology of the time.)  So, here are three questions you might consider asking the next big data pioneer who is preaching about their latest discovery:  What is the provenance of the data you used--its sources, how it was collected/generated, privacy and usage conditions?  Can you explain in layman's terms the algorithm you used (recall that a key cause of the 2008 financial crash was apparently that none of the executives understood the trading algorithms)?  Can you give me two alternative explanations that might also fit the data values observed?

Big data and predictive analytics should be causing us to think about new possibilities and old explanations.  They should be challenging us to exercise our own insight.  Unfortunately, it appears that they may be tempting some of us to do the exact opposite: trust the computer output or the data science gurus more than we trust ourselves.  "Caveat decernor" to coin a phrase in something akin to pig Latin--let the decision maker beware!

See also: "What is the Importance and Value of Big Data? Part 2 of Big Data: Giant Wave or Huge Precipice?"

Posted January 16, 2012 8:28 AM
"Seven Faces of Data - Rethinking data's basic characteristics" - new White Paper by Dr. Barry Devlin.

We live in a time when data volumes are growing faster than Moore's Law and the variety of structures and sources has expanded far beyond those that IT has experience of managing.  It is simultaneously an era when our businesses and our daily lives have become intimately dependent on such data being trustworthy, consistent, timely and correct.  And yet, our thinking about and tools for managing data quality in the broadest sense of the word remain rooted in a traditional understanding of what data is and how it works.  It is surely time for some new thinking.

A fascinating discussion with Dan Graham of Teradata over a couple of beers in February last at Strata in Santa Clara ended up in a picture of something called a "Data Equalizer" drawn on a napkin.  As often happens after a few beers, one thing led to another...

The napkin picture led me to take a look at the characteristics of data in the light of the rapid, ongoing change in the volumes, varieties and velocity we're seeing in the context of Big Data.  A survey of data-centric sources of information revealed almost thirty data characteristics considered interesting by different experts.  Such a list is too cumbersome to use and I narrowed it down based on two criteria.  First was the practical usefulness of the characteristic: how does the trait help IT make decisions on how to store, manage and use such data?  What can users expect of this data based on its traits?  Second, can the trait actually be measured?

The outcome was seven fundamental traits of data structure, composition and use that enable IT professionals to examine existing and new data sources and respond to the opportunities and challenges posed by new business demands and novel technological advances.  These traits can help answer fundamental questions about how and where data should be stored and how it should be protected.  And they suggest how it can be securely made available to business users in a timely manner.

So what is the "Data Equalizer"?  It's a tool that graphically portrays the overall tone and character of a dataset, so that IT professionals can quickly evaluate the data management needs of a specific set of data.  More generally, it clarifies how technologies such as relational databases and Hadoop, for example, can be positioned relative to one another and how the data warehouse is likely to evolve as the central integrating hub in a heterogeneous, distributed and expanding data environment.

Understanding the fundamental characteristics of data today is becoming an essential first step in defining a data architecture and building an appropriate data store.  The emerging architecture for data is almost certainly heterogeneous and distributed.  There is simply too large a volume and too wide a variety to insist that it all must be copied into a single format or store.  The long-standing default decision--a relational database--may not always be appropriate for every application or decision-support need in the face of these surging data volumes and growing variety of data sources.  The challenge for the evolving data warehouse will be to ensure that we retain a core set of information to ensure homogeneous and integrated business usage.  For this core business information, the relational model will remain central and likely mandatory; it is the only approach that has the theoretical and practical schema needed to link such core data to other stores.

"Seven Faces of Data - Rethinking data's basic characteristics" - new White Paper by Dr. Barry Devlin (sponsored by Teradata)

Posted November 17, 2011 6:07 AM
In the Information part of Information Technology, Big Data is the Big Hit of 2011.  It's also a wonderful phrase to play with: take the "big", place it in front of a few other words and suddenly you have a strapline... or a blog title!  So, is it a big change for IT, or is it just big hype?

There's no doubt in my mind that big data describes a real and novel phenomenon; unfortunately, there are also many existing and well-understood phenomena in the world of business intelligence and data warehousing that are getting sucked into marketing stories and, indeed, even into respectable articles about big data.

The recent McKinsey Quarterly article "Are you ready for the era of 'big data'?" (registration required) opens with the following example: "The top marketing executive at a sizable US retailer recently [discovered that a] major competitor was steadily gaining market share across a range of profitable segments...  [This] competitor had made massive investments in its ability to collect, integrate, and analyze data from each store and every sales unit and had used this ability to run myriad real-world experiments.  At the same time, it had linked this information to suppliers' databases, making it possible to adjust prices in real time, to reorder hot-selling items automatically, and to shift items from store to store easily.  By constantly testing, bundling, synthesizing, and making information instantly available across the organization... the rival company had become a different, far nimbler type of business.  What this executive team had witnessed first hand was the game-changing effects of big data" [my emphasis].

With all due respect to the authors, I believe that anybody who has been involved in business intelligence over the past ten years will be underwhelmed by this story.  It is almost entirely a scenario, and a common one at that, describing a pervasive data warehousing implementation and operational BI excellence.  I suspect that this example was tagged as big data because of the reference to running myriad real-world experiments.  This is a behavior often associated with big data; however, on its own, it is generally not a sufficient characteristic.

The remainder of the article provides many interesting examples and possible consequences, both beneficial and cautionary, of using big data.  For the business executive, it clearly whets the appetite.  But, from an IT perspective, it misses a key aspect--a viable definition of what big data really is.  This is hardly surprising; big data has reached the point on the hype curve where definitions are considered unnecessary.  We all seem to have an assumed definition that neatly meets our needs, be it selling a product or initiating a project.  Hear me clearly, though.  Despite the hype, there is something real going on here.  And it's fundamentally about the underlying characteristics of the information involved; characteristics that differ significantly from the data we in IT have stored and used over the years.

I contend that there are four types of information that together make up big data:
1.    Machine-generated data, such as RFID data, physical measurements and geolocation data, from monitoring devices
2.    Computer log data, such as clickstreams
3.    Textual social media information from sources such as Twitter and Facebook
4.    Multimedia social and other information from the likes of Flickr and YouTube

They are as different from traditional transactional data (the mainstay of BI) as they are from one another.  They have little in common, beyond their volume.  How business extracts value from them and how IT processes them vary widely.

While closely related to traditional BI and data warehousing, big data projects require additional and often very different skills in business and IT.  Their value is first to drive innovative change in business processes; only afterwards can their use become ongoing and operational.  These are topics I'll return to in the coming months.  But, in the meantime, join me for my webinar "Big Data Drives Tomorrow's Business Intelligence" on 25th October for further insights in this rapidly evolving area.

Posted October 21, 2011 11:03 AM
Or, to be more precise, a pair of jeans on you!

Katia Moskvitch, writing for the BBC News website last week, caught my attention with this question: "What if those new jeans you've just bought start tweeting about your location as you cross London Bridge?"  Those of us who've been following the uptake of RFID technology and the big data surge know that she's stretching the point a bit--RFID devices don't yet tweet and the chance of meeting a wild RFID reader on London Bridge is still low.  But we also know that she's close enough to the coming reality that many marketers and advertisers are beginning to envisage.  And make lots of money from...

The Internet of Things (IoT) is already becoming a reality as far as machines go.  Smartphones, tablets and laptops lead the way, of course.  But automobiles and buildings, fridges and washing machines are not far behind.  And the ultimate vision is that every item of any value can be tagged with an RFID device and tracked wherever a reader exists.  Moskvitch quotes Gerald Santucci, head of the networked enterprise and RFID unit at the European Commission: "The IoT challenge is likely to grow both in scale and complexity as seven billion humans are expected to coexist with 70 billion machines and perhaps 70,000 billion 'smart things'".

From a BI point of view, that adds up to big data--very big data.  It also points to a type of data to which we've had only limited exposure in BI in the past.  The data generated from the IoT can be classified as (potentially) high-volume, raw micro-event data keyed by location, time and device ID.  Beyond its volume, such data poses interesting issues for traditional BI thinking.  

While BI implementations have typically invested much time and effort in cleansing data on loading, this raw IoT data is likely to come largely directly from the machine sources to the (big data) BI environment, rather than through operational systems that create a context for data gathered in traditional business operations.  And while current BI systems do deal with machine-generated data from devices such as ATMs, manufacturing machines and telephone exchanges, for example, these sources are highly controlled, internally managed, fixed and relatively few in number in comparison to IoT sources.  IoT data will require very different modeling and analysis approaches to today's BI.
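
To make the shape of such data concrete, here is a minimal sketch in Python of a raw micro-event keyed by location, time and device ID.  The class name, field names and sample values are all hypothetical, invented for illustration; real RFID or sensor feeds will differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MicroEvent:
    """One raw reading from a tagged item (all field names are hypothetical)."""
    device_id: str         # e.g. an RFID tag identifier
    reader_location: str   # where the reader that saw the tag is installed
    seen_at: datetime      # when the reading was taken


def group_by_device(events):
    """Index events by device ID, with each device's readings ordered by time."""
    index = defaultdict(list)
    for event in events:
        index[event.device_id].append(event)
    for readings in index.values():
        readings.sort(key=lambda e: e.seen_at)
    return dict(index)


# Made-up sample readings, arriving out of time order as raw feeds often do.
events = [
    MicroEvent("tag-001", "store-exit", datetime(2011, 9, 20, 10, 5, tzinfo=timezone.utc)),
    MicroEvent("tag-002", "store-exit", datetime(2011, 9, 20, 10, 7, tzinfo=timezone.utc)),
    MicroEvent("tag-001", "bridge-north", datetime(2011, 9, 20, 9, 55, tzinfo=timezone.utc)),
]
trails = group_by_device(events)
```

Note that nothing in the record says who was carrying the item; the event describes a device, a place and a moment, and nothing more.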

But perhaps the most interesting dilemma is presented by the fact that we will be dealing directly with devices rather than people, which is really what interests marketing.  Yes, we will receive lots of information about where and when, but the question of who will be a matter of extrapolation.  Apart from fraud and crime, of which there will be myriad opportunities, the fact is that, other than implanted devices, the relationship between a device and a person is loose and variable.  To return to that RFID tag in the young lady's jeans above, linked via a credit card to a particular person at time of purchase, we can instantly see at least a dozen ways in which we could misidentify the person whose behavior we think we're tracking.  Even working at a statistical level, there may be issues.

And then there are the privacy issues that arise.  I'll return to that topic in another post.

Posted September 29, 2011 4:39 AM

