Blog: Wayne Eckerson

Wayne Eckerson

Welcome to Wayne's World, my blog that illuminates the latest thinking about how to deliver insights from business data and celebrates out-of-the-box thinkers and doers in the business intelligence (BI), performance management and data warehousing (DW) fields. Tune in here if you want to keep abreast of the latest trends, techniques, and technologies in this dynamic industry.

About the author

Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is currently director of BI Leadership Research, an education and research service run by TechTarget that provides objective, vendor neutral content to business intelligence (BI) professionals worldwide. Wayne’s consulting company, BI Leader Consulting, provides strategic planning, architectural reviews, internal workshops, and long-term mentoring to both user and vendor organizations. For many years, Wayne served as director of education and research at The Data Warehousing Institute (TDWI) where he oversaw the company’s content and training programs and chaired its BI Executive Summit. He can be reached by email at weckerson@techtarget.com.

Recently in Big Data Analytics Category

Advanced analytics promises to unlock hidden potential in organizational data. If that's the case, why have so few organizations embraced advanced analytics in a serious way? Most organizations have dabbled with advanced analytics, but outside of credit card companies, online retailers, and government intelligence agencies, few have invested sufficient resources to turn analytics into a core competency.

Advanced analytics refers to the use of machine learning algorithms to unearth patterns and relationships in large volumes of complex data. It's best applied where organizations face resource constraints (e.g., time, money, labor) and the value of the output justifies the investment. (See "What is Analytics and Why Should You Care?" and "Advanced Analytics: Where Do You Start?")

Once an organization decides to invest in advanced analytics, it faces many challenges. To succeed with advanced analytics, organizations must have the right culture, people, organization, architecture, and data. (See figure 1.) This is a tall order. This article examines the "soft stuff" required to implement analytics--the culture, people, and organization--the first three dimensions of the analytical framework in figure 1. A subsequent article will examine the "hard stuff"--the architecture, tools, and data.

Figure 1. Framework for Implementing Advanced Analytics

The Right Culture

Culture refers to the rules--both written and unwritten--for how things get done in an organization. These rules emanate from two places: 1) the words and actions of top executives and 2) organizational inertia and behavioral norms of middle management and their subordinates (i.e., "the way we've always done it.") Analytics, like any new information technology, requires executives and middle managers to make conscious choices about how work gets done.

Executives. For advanced analytics to succeed, top executives must first establish a fact-based decision making culture and then adhere to it themselves. Executives must consciously change the way they make decisions. Rather than rely on gut feel alone, executives must make decisions based on facts or intuition validated by data. They must designate authorized data sources for decision making and establish common metrics for measuring performance. They must also hold individuals accountable for outcomes at all levels of the organization.

Executives also need to evangelize the value and importance of fact-based decision making and the need for a performance-driven culture. They need to recruit like-minded executives and continuously reinforce the message that the organization "runs on data." Most importantly, they not only must "talk the talk," they must "walk the walk." They need to hold themselves accountable for performance outcomes and use certifiable information sources, not resort to their trusted analyst to deliver the data view they desire. Executives who don't follow their own rules send a cultural signal that this analytics fad will pass and so it's "business as usual."

Managers and Organizational Inertia. Mid-level managers often pose the biggest obstacle to implementing new information technologies because their authority and influence stem from their ability to control the flow of information, both up and down the organizational ladder. Mid-level managers have to buy into new ways of capturing and using information for the program to succeed. If they don't, they, too, will send the wrong signals to lower-level workers. To overcome organizational inertia, executives need to establish new incentives for mid-level managers and hold them accountable for performance metrics aligned with strategic goals for decision making and the use of information.

The Right People

It's impossible to do advanced analytics without analysts. That's obvious. But hiring the right analysts and creating an environment for them to thrive is not easy.

Analysts are a rare breed. They are critical thinkers who need to understand a business process inside and out, along with the data that supports it. They also must be computer-literate and know how to use various data access, analysis, and presentation tools to do their jobs. Compared to other employees, they are generally more passionate about what they do, more committed to the success of the organization, more curious about how things work, and more eager to tackle new challenges.

But not all analysts do the same kind of work, and it's important to know the differences. There are four major types of analysts:


  • Super Users. These are tech-savvy business users who gravitate to the reporting and analysis tools deployed by the business intelligence (BI) team. These analysts quickly become the "go to" people in each department for anyone who needs an ad hoc report or dashboard but doesn't want to wait for the BI team. While super users don't normally do advanced analytics, they play an important role because they offload ad hoc reporting requirements from more skilled analysts.

  • Business Analysts. These are Excel jockeys whom executives and managers call on to create and evaluate plans, crunch numbers, and generally answer any question that can't be addressed by a standard report or dashboard. With training, they can also create analytical models.

  • Analytical Modelers. These analysts have formal training in statistics and a data mining workbench, such as those from IBM (i.e., SPSS) or SAS. They build descriptive and predictive models that are the heart and soul of advanced analytics.

  • Data Scientists. These analysts specialize in analyzing unstructured data, such as Web traffic and social media. They write Java and other programs to run against Hadoop and NoSQL databases and know how to write efficient MapReduce jobs that run in "big data" environments. (A sketch of such a job appears after this list.)
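To make the data scientist's work concrete, here is a minimal sketch of the kind of MapReduce job they write, using Hadoop's Java API (org.apache.hadoop.mapreduce). It is the classic word-count pattern applied to web-log text; the class name and input/output paths are hypothetical.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical word-count job: the mapper emits (term, 1) pairs, the reducer sums them.
public class TermCount {

    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text term = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                term.set(tokens.nextToken().toLowerCase());
                context.write(term, ONE);          // emit (term, 1)
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();                // add up the 1s for this term
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "term count");
        job.setJarByClass(TermCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);    // combine locally to cut shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., a web-log directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The point is less the counting itself than the shape of the work: the data scientist has to think in key-value pairs and decide what the map and reduce steps should be before the cluster can parallelize anything.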

Where You Find Them. Most organizations struggle to find skilled analysts. Many super users and business analysts are self-taught Excel jockeys, essentially tech-savvy business people who aren't afraid to learn new software tools to do their jobs. Many business school graduates fill this role, often as a stepping stone to management positions. Conversely, a few business-savvy technologists can grow into this role, including data analysts and report developers who have a proclivity toward business and working with business people.

Analytical modelers and data scientists require more training and skills. These analysts generally have a background in statistics or number crunching. Statisticians with business knowledge or social scientists with computer skills tend to excel in these roles. Given advances in data mining workbenches, it's not critical that analytical modelers know how to write SQL or code in C, as in the past. However, data scientists aren't so lucky. Since Hadoop is an early stage technology, data scientists need to know the basics of parallel processing and how to write Java and other programs in MapReduce. As such, they are in high demand right now.

The Right Organization

Business analysts play a key role in any advanced analytics initiative. Given the skills required to build predictive models, analysts are not cheap to hire or easy to retain. Thus, building the right analytical organization is key to attracting and retaining skilled analysts.

Today, most analysts are hired by department heads (e.g., finance, marketing, sales, or operations) and labor away in isolation at the departmental level. Unless given enough new challenges and opportunities for advancement, analysts are easy targets for recruiters.

Analytics Center of Excellence. The best way to attract and retain analysts is to create an Analytics Center of Excellence. This is a corporate group that oversees and manages all business analysts in an organization. The Center of Excellence provides a sense of community among analysts and enables them to regularly exchange ideas and knowledge. The Center also provides a career path for analysts so they are less tempted to look elsewhere to advance their careers. Finally, the Center pairs new analysts with veterans who can give them the mentoring and training they need to excel in their new position.

The key with an Analytics Center of Excellence is to balance central management with process expertise. Nearly all analysts should be embedded in departments and work side by side with business people on a daily basis. This enables analysts to learn business processes and data at a granular level while immersing the business in analytical techniques and approaches. At the same time, the analyst needs to work closely with other analysts in the organization to reinforce the notion that they are part of a larger analytical community.

The best way to accommodate these twin needs is by creating a matrixed analytical team. Analysts should report directly to department heads and indirectly to a corporate director of analytics, or vice versa. In either case, the analyst should physically reside in their assigned department most or all days of the week, while participating in daily "stand up" meetings with other analysts to share ideas and issues, as well as regular off-site meetings to build camaraderie and develop plans. The corporate director of analytics needs to work closely with department heads to balance local and enterprise analytical requirements.

Summary

Advanced analytics is a technical discipline. Yet, some of the keys to its success involve non-technical facets, such as culture, people, and organization. For an analytics initiative to thrive in an organization, executives must create a fact-based decision making culture, hire the right people, and create an analytics center of excellence that attracts, trains, and retains skilled analysts.


Posted November 7, 2011 9:45 AM


I don't think I've ever seen a market consolidate as fast as the analytic platform market.

By definition, an analytic platform is an integrated hardware and software data management system geared to query processing and analytics that offers dramatically higher price-performance than general purpose systems. After talking with numerous customers of these systems, I am convinced they represent game-changing technology. As such, major database vendors have been tripping over themselves to gain the upper hand in this multi-billion dollar market.

Rapid Fire Acquisitions. Microsoft made the first move when it purchased Datallegro in July 2008. But it's taken two years for Microsoft to port the technology to Windows and SQL Server, so, ironically, it finds itself trailing the leaders. Last May, SAP acquired Sybase, largely for its mobile technology, but also for its Sybase IQ analytic platform, which has long been the leading column-store database on the market and has done especially well in financial services. And SAP is sparking tremendous interest within its installed base for HANA, an in-memory appliance designed to accelerate query performance of SAP BW and other analytic applications.

Two months after SAP acquired Sybase, EMC snapped up massively parallel processing (MPP) database vendor Greenplum, and reportedly has done an excellent job executing new deals. Two months later, in September 2010, IBM purchased the leading pureplay, Netezza, in an all-cash deal worth $1.8 billion that could be a boon to Netezza if IBM can clearly differentiate among its multiple data warehousing offerings and execute well in the field.

And last month, Hewlett-Packard, whose NeoView analytic platform died ingloriously last fall, scooped up Vertica, a market-leading columnar database with many interesting scalability and availability features. And finally, Teradata this week announced it was purchasing Aster Data, an MPP shared-nothing database with rich SQL MapReduce functions that can perform deep analytics on both structured and unstructured data.

So, in the past nine months, the world's biggest high tech companies purchased five of the leading, pureplay analytic platforms. This rapid pace of consolidation is dizzying!

Consolidation Drivers

Fear and Loathing. Part of this consolidation frenzy is driven by fear. Namely, fear of being left out of the market. And perhaps fear of Oracle, whose own analytic platform, Exadata, has gathered significant market momentum, knocking unsuspecting rivals back on their heels. Although pricey, Exadata not only fuels game-changing analytic performance, it now also supports transaction applications--a one-stop database engine that competitors may have difficulty derailing (unless Oracle shoots itself in the foot with uncompromising terms for licensing, maintenance, and proofs of concept).

Core Competencies. These analytic platform vendors are now carving out market niches where they can outshine the rest:

  • Oracle: a high-performance, hybrid analytic/transaction system.

  • SAP: in-memory acceleration (HANA) plus a mature columnar database that supports real-time analytics and complex event processing.

  • EMC Greenplum: complex analytics against petabytes of data.

  • Aster Data: analytic applications in which SQL MapReduce is an advantage.

  • Teradata: mixed workload management capabilities and workload-specific analytic appliances.

  • IBM Netezza: simplicity, fast deployments, and quick ROI.

  • Vertica: scalability, reliability, and availability, now that other vendors have added columnar storage and processing capabilities.

  • Microsoft: its PDW, along with a series of data mart appliances and a BI appliance.

Pureplays Looking for Cover. The rush of acquisitions leaves a number of viable pureplays out in the cold. Without a big partner, these vendors will need to clearly articulate their positioning and work hard to gain beachheads within customer accounts. ParAccel, for example, is eyeing Fortune 100 companies with complex analytic requirements, targeting financial services where it says Sybase IQ is easy pickings. Dataupia is seeking cover in companies that have tens to hundreds of petabytes to query and store. Kognitio likes its chances with flexible cloud-based offerings that customers can bring inhouse if desired. InfoBright is targeting the open source MySQL market, while Sand Technology touts its columnar compression, data mart synchronization, and text parsing capabilities. Ingres is pursuing the open source data warehousing market, and its new Vectorwise technology makes it a formidable in-memory analytics processing platform.

Despite the rapid consolidation of the analytic platforms market, there is still obviously lots of choice left for customers eager to cash in on the benefits of purpose-built analytical machines that deliver dramatically higher price-performance than database management systems of the past. Although the action was fast and furious in 2010, the race has only just begun. So, fasten your seat belts as players jockey for position in the sprint to the finish.


Posted March 8, 2011 8:20 AM

As companies grapple with the gargantuan task of processing and analyzing "big data," certain technologies have captured the industry limelight, namely massively parallel processing (MPP) databases, such as those from Aster Data and Greenplum; data warehousing appliances, such as those from Teradata, Netezza, and Oracle; and, most recently, Hadoop, an open source distributed file system that uses the MapReduce programming model to process key-value data in parallel across large numbers of commodity servers.

SMP Machines. Missing in action from this list is the venerable symmetric multiprocessing (SMP) machine, which parallelizes operations across multiple CPUs (or cores). The industry today seems to favor "scale out" parallel processing approaches (where processes run across commodity servers) rather than "scale up" approaches (where processes run on a single server). However, with the advent of multi-core servers that today can pack upwards of 48 cores in a single box, the traditional SMP approach is worth a second look for processing big data analytics jobs.

The benefits of applying parallel processing within a single server rather than across multiple servers are obvious: reduced processing complexity and a smaller server footprint. Why buy 40 servers when one will do? MPP systems require more boxes, which require more space, cooling, and electricity. Also, distributing data across multiple nodes chews up valuable processing time, and overcoming node failures--which are more common when you string together dozens, hundreds, or even thousands of servers into a single, coordinated system--adds overhead and reduces performance.

Multi-Core CPUs. Moreover, since chipmakers maxed out the processing frequency of individual CPUs in 2004, the only way they can deliver improved performance is by packing more cores into a single chip. Chipmakers started with two-core chips, then quad-cores, and now eight- and 16-core chips are becoming commonplace.

Unfortunately, few software programs that could benefit from parallelized operations have been redesigned to exploit the tremendous amount of power and memory available within multi-core servers. Big data analytics applications are especially good candidates for thread-level parallel processing. As developers recognize the untold power lurking within their commodity servers, I suspect that next year SMP processing will gain an equivalent share of attention among big data analytics proselytizers.
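As a hypothetical illustration of what thread-level "scale up" parallelism looks like, here is a minimal plain-Java sketch (assuming Java 8 or later for the lambdas): one large in-memory array is partitioned across all available cores on a single SMP box, each thread aggregates its own slice, and the partial results are combined at the end.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy "scale up" aggregation: split one in-memory array across all cores of a single server.
public class ScaleUpAggregate {

    public static void main(String[] args) throws Exception {
        double[] amounts = new double[10_000_000];
        for (int i = 0; i < amounts.length; i++) amounts[i] = Math.random() * 100;

        int cores = Runtime.getRuntime().availableProcessors();  // e.g., 48 on a big SMP server
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        CompletionService<Double> results = new ExecutorCompletionService<>(pool);

        int chunk = (amounts.length + cores - 1) / cores;
        for (int c = 0; c < cores; c++) {
            final int start = c * chunk;
            final int end = Math.min(start + chunk, amounts.length);
            results.submit(() -> {
                double sum = 0;
                for (int i = start; i < end; i++) sum += amounts[i];  // each thread scans its slice
                return sum;
            });
        }

        double total = 0;
        for (int c = 0; c < cores; c++) total += results.take().get();  // combine partial sums
        pool.shutdown();

        System.out.printf("total = %.2f across %d cores%n", total, cores);
    }
}
```

The same shape of work on an MPP system would require partitioning the data across nodes and shipping partial results over the network; on an SMP box it is just threads sharing memory.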

Pervasive DataRush

One company that is at the forefront of exploiting multi-core chips for analytics is Pervasive Software, a $50 million software company that is best known for its Pervasive Integration ETL software (which it acquired from Data Junction) and Pervasive PSQL, its embedded database (a.k.a. Btrieve).

In 2009, Pervasive released a new product, called Pervasive DataRush, a parallel dataflow platform designed to accelerate performance for data preparation and analytics tasks. It fully leverages the parallel processing capabilities of multi-core processors and SMP machines, making it unnecessary to implement clusters (or MPP grids) to achieve suitable performance when processing and analyzing moderate to heavy volumes of data.

Sweet Spot. As a parallel data flow engine, Pervasive DataRush is often used today to power batch processing jobs, and is particularly well suited to running data preparation tasks (e.g. sorting, deduplicating, aggregating, cleansing, joining, loading, validating) and machine learning programs, such as fuzzy matching algorithms.
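As a hypothetical illustration of the kind of fuzzy-matching logic such an engine might run in a deduplication job (this is not Pervasive's code), here is a plain-Java Levenshtein edit distance used to flag two customer names as a probable duplicate.

```java
// Classic Levenshtein edit distance, a common building block for fuzzy matching.
public class FuzzyMatch {

    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + subst);      // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String a = "Jonathan Smythe", b = "Jonathon Smith";       // hypothetical customer records
        int distance = editDistance(a.toLowerCase(), b.toLowerCase());
        // Treat names within a small edit distance as a probable duplicate.
        System.out.println(a + " vs " + b + " -> distance " + distance +
                (distance <= 3 ? " (probable match)" : " (no match)"));
    }
}
```

Run serially against millions of record pairs this is slow, which is exactly why a data preparation engine wants to spread such comparisons across every available core.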

Today, DataRush will outperform Hadoop on complex processing jobs that address data volumes ranging from 500GB to tens of terabytes. It is not yet geared to handling hundreds of terabytes to petabytes of data, which is the territory of MPP systems and Hadoop. However, as chipmakers continue to add more cores to their chips, and when Pervasive releases DataRush 5.0 later this year with support for small clusters, DataRush's high-end scalability will continue to increase.

Architecture. DataRush is not a database; it's a development environment and execution engine that runs in a Java Virtual Machine. Its Eclipse-based development environment provides a library of parallel operators for developers to create parallel dataflow programs. Although developers need to understand the basics of parallel operations--such as when it makes sense to partition data and/or processes based on the nature of their application--DataRush handles all the underlying details of managing threads and processes across one or more cores to maximize utilization and performance. As you add cores, DataRush automatically readjusts the underlying parallelism without forcing the developer to recompile the application.
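DataRush's operator library is proprietary, so the following is only a hypothetical plain-Java sketch of the dataflow idea it embodies, not its actual API: each stage runs on its own thread and streams records to the next stage through a bounded queue, so reading, cleansing, and aggregating overlap in time instead of running one after another.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical three-stage dataflow pipeline (not DataRush's API): read -> cleanse -> count.
public class DataflowSketch {

    private static final String END = "__END__";   // sentinel marking end of stream

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> raw = new ArrayBlockingQueue<>(1024);
        BlockingQueue<String> clean = new ArrayBlockingQueue<>(1024);

        Thread reader = new Thread(() -> {          // stage 1: produce records
            try {
                for (int i = 0; i < 1_000_000; i++) raw.put("  record-" + i + "  ");
                raw.put(END);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread transformer = new Thread(() -> {     // stage 2: cleanse each record
            try {
                for (String rec = raw.take(); !rec.equals(END); rec = raw.take()) {
                    clean.put(rec.trim().toUpperCase());
                }
                clean.put(END);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread aggregator = new Thread(() -> {      // stage 3: count what arrives
            try {
                long count = 0;
                for (String rec = clean.take(); !rec.equals(END); rec = clean.take()) count++;
                System.out.println("records processed: " + count);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        reader.start(); transformer.start(); aggregator.start();
        reader.join(); transformer.join(); aggregator.join();
    }
}
```

In a real dataflow engine each stage would itself be parallelized across cores and the wiring generated from a graph of operators; the sketch only shows why a pipeline of streaming stages keeps every core busy without the developer managing threads by hand.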

Versus Hadoop. To run DataRush, you feed the execution engine formatted flat files or database records and it executes the various steps in the dataflow and spits out a data set. As such, it's more flexible than Hadoop, which requires data to be structured as key-value pairs and partitioned across servers, and MapReduce, which forces developers to use one type of programming model for executing programs. DataRush also doesn't have the overhead of Hadoop, which requires each data element to be duplicated in multiple nodes for failover purposes and requires lots of processing to support data movement and exchange across nodes. But like Hadoop, it's focused on running predefined programs in batch jobs, not ad hoc queries.

Competitors. Perhaps the closest competitors to Pervasive DataRush are Ab Initio, a parallelizable ETL tool, and Syncsort, a high-speed sorting engine. But these tools were developed before the advent of multi-core processing and don't exploit it to the same degree as DataRush. Plus, DataRush is not focused just on back-end processing, but can handle front-end analytic processing as well. Its dataflow development environment and engine are generic. DataRush actually makes a good complement to MPP databases, which often suffer from a data loading bottleneck. When used as a transformation and loading engine, DataRush can achieve 2TB/hour throughput, according to company officials.

Despite all the current hype about MPP and scale-out architectures, it could be that scale-up architectures that fully exploit multi-core chips and SMP machines will win the race for mainstream analytics computing. Although you can't apply DataRush to existing analytic applications (you have to rewrite them), it will make a lot of sense to employ it for most new big data analytics applications.


Posted January 4, 2011 11:17 AM