Blog: Lou Agosta

Lou Agosta

Greetings and welcome to my blog focusing on reengineering healthcare using information technology. The commitment is to provide an engaging mixture of brainstorming, blue sky speculation and business intelligence vision with real world experiences – including those reported by you, the reader-participant – about what works and what doesn't in using healthcare information technology (HIT) to optimize consumer, provider and payer processes in healthcare. Keeping in mind that sometimes a scalpel, not a hammer, is the tool of choice, the approach is to take a stand for new possibilities in the face of entrenched mediocrity, to do so without tilting at windmills, and to follow the line of least resistance to getting the job done – a healthcare system that works for us all. So let me invite you to HIT me with your best shot at LAgosta@acm.org.

About the author

Lou Agosta is an independent industry analyst, specializing in data warehousing, data mining and data quality. A former industry analyst at Giga Information Group, Agosta has published extensively on industry trends in data warehousing, business and information technology. He is currently focusing on the challenge of transforming America’s healthcare system using information technology (HIT). He can be reached at LAgosta@acm.org.


Recently in Agile Methods Category

Datameer takes its name from the sea - the sea of data - as in the French la mer or the German das Meer.


I caught up with Ajay Anand, CEO, and Stefan Groschupf, CTO. Ajay earned his stripes as Director of Cloud Computing and Hadoop at Yahoo. Stefan is a long-time open source consultant and advocate, and a cloud computing architect who comes from EMI Music.


Datameer is aligning with the two trends of Big Data and Open Source. You do not need an industry analyst to tell you that data volumes continue to grow, with unstructured data growing at a CAGR of almost 62% and structured data growing more slowly but at a still substantial 22% (according to IDC). Meanwhile, open source has never looked better as a cost-effective enabler of infrastructure.
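To put those growth rates in perspective, a quick compounding calculation shows how fast the gap opens up. Here is a minimal sketch in Python (the 62% and 22% CAGR figures are IDC's; the one-petabyte starting volume and five-year horizon are assumptions chosen purely for illustration):

```python
# Back-of-the-envelope compounding of IDC's growth rates.
# Starting volume (1 PB) and the 5-year horizon are illustrative assumptions.

def compound(volume_pb: float, cagr: float, years: int) -> float:
    """Apply a compound annual growth rate for the given number of years."""
    return volume_pb * (1 + cagr) ** years

start_pb = 1.0  # assumed starting volume in petabytes
for label, cagr in [("unstructured (62% CAGR)", 0.62), ("structured (22% CAGR)", 0.22)]:
    print(f"{label}: {start_pb:.0f} PB grows to {compound(start_pb, cagr, 5):.1f} PB in 5 years")

# Roughly: unstructured data grows about 11x in five years, structured about 2.7x.
```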


The product beta launched in April with McAfee, nurago, a leading financial services company, and a major telecommunications service provider; the summer promises to deliver early adopters, with the gold product shipping in the autumn. (Schedule is subject to change without notice.)


The value proposition of Datameer Analytics Solution (DAS) is to help users perform advanced analytics and data mining with no more expertise than is required of a reasonably competent user of an Excel spreadsheet.


As is often the case, the back story is the story. The underlying technology is Hadoop, an open source framework for highly distributed data systems. It includes both storage technology and execution capabilities, making it a kind of distributed operating system that provides a high level of virtualization.

Unlike a relational database, where a search means chasing up and down a B-tree index, Hadoop performs some of the work upfront, sorting the data and performing streaming data manipulation. This is definitely not efficient for small, gigabyte-scale volumes of data. But when the data gets big - really big - multiple terabytes and petabytes - the search and data manipulation functions enjoy an order of magnitude performance improvement.

The search and manipulation are enabled by the MapReduce algorithm, made famous by the Google implementation as well as the Aster Data implementation of it. Hadoop's implementation is, of course, open source. MapReduce takes a user-defined map function and a user-defined reduce function and handles the exchange of key-value pairs between them, executing the grouping, shuffling, and aggregation at a low level that you do not want to have to code yourself. Hence the need for, and value in, a tool such as DAS: it generates the "assembly level" code required to answer the business and data mining questions that the business wants to ask of the data. In this regard, DAS functions rather like a Cognos or BusinessObjects front-end in that it presents a simple interface in comparison to all the work being done "under the hood". Clients who have to deal with a sea of data now have another option for boiling the ocean without getting steamed up over it.
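To make the map/reduce division of labor concrete, here is a minimal word-count sketch in plain Python. It illustrates the programming model only, not Hadoop itself: the user supplies the two functions, and the grouping ("shuffle") step shown in the middle is exactly the low-level work the framework - or a tool like DAS - handles for you. All names here are illustrative.

```python
from collections import defaultdict

# Minimal illustration of the MapReduce programming model (not Hadoop itself):
# the user supplies map and reduce functions; the framework handles the
# grouping ("shuffle") of key-value pairs in between.

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in a line of text."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: aggregate all the counts emitted for a single word."""
    return (word, sum(counts))

def run_mapreduce(lines):
    # Shuffle phase: group every emitted value by its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return [reduce_fn(key, values) for key, values in groups.items()]

if __name__ == "__main__":
    sample = ["a sea of data", "a sea of big data"]
    print(run_mapreduce(sample))
    # [('a', 2), ('sea', 2), ('of', 2), ('data', 2), ('big', 1)]
```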


Posted April 15, 2010 9:21 AM

I caught up with Ben Werther, Director of Product Marketing, for a conversation about business developments at Greenplum and Greenplum's major new release.


According to Ben, Greenplum has now surpassed 100 enterprise customers and is enjoying revenue growth of about 100%, albeit from a revenue base that befits a company of relatively modest size. They also claim to be adding new enterprise customers faster than either Teradata or Netezza.


What is particularly interesting to me is that, with its MAD methodology, Greenplum is building an agile approach to development that directly addresses the high performance of its massively parallel processing capabilities. This is an emerging trend in high end parallel databases that is receiving new impetus. More on this shortly. Meanwhile, release 4.0 includes enterprise class DBMS functionality such as:

- Complex query optimization
- Data loading
- Workload management
- Fault tolerance
- Embedded languages/analytics
- Third-party ISV certification
- Administration and monitoring

From the perspective of real world data center operations, the workload management features are often neglected but are on the critical path for successful operations and growth. Dynamic query balancing is a method used on mainframes for the most demanding workloads, and Greenplum has innovated in this area, with its solution now being "patent pending".
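For a sense of what workload management looks like from the administrator's chair, here is a short sketch of carving work into resource queues from a Python client, using psycopg2 against Greenplum's PostgreSQL-compatible interface. The queue names, limits, role names, and connection string are hypothetical, and the exact resource queue options available vary by Greenplum release, so treat this as an illustration rather than a recipe.

```python
import psycopg2

# Hypothetical sketch: separate short interactive queries from heavy batch work
# by assigning database roles to resource queues. Names, limits, and the DSN
# are illustrative; check the Greenplum documentation for your release.

DDL = [
    # Limit concurrent ad hoc statements and give them medium priority.
    "CREATE RESOURCE QUEUE adhoc_queue WITH (ACTIVE_STATEMENTS=10, PRIORITY=MEDIUM);",
    # Batch/ETL work gets fewer concurrent slots at low priority.
    "CREATE RESOURCE QUEUE batch_queue WITH (ACTIVE_STATEMENTS=3, PRIORITY=LOW);",
    # Bind roles to the queues that should govern their queries.
    "ALTER ROLE analyst RESOURCE QUEUE adhoc_queue;",
    "ALTER ROLE etl_user RESOURCE QUEUE batch_queue;",
]

def apply_workload_policy(dsn: str) -> None:
    """Apply the resource queue definitions above to a Greenplum system."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            for statement in DDL:
                cur.execute(statement)

if __name__ == "__main__":
    apply_workload_policy("dbname=warehouse user=gpadmin host=master-host")
```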


Just in case scheduling does not turn you on, a sexier initiative is to be found in fault tolerance. Given that Greenplum is an elephant hunter, favoring large, high end installations, this is news you can use. Greenplum Database 4.0 enhances fault tolerance using a self-healing physical block replication architecture. Key benefits of this architecture are:

- Automatic failure detection and failover to mirror segments
- Fast differential recovery and catch-up (while fully online / read-write)
- Improved write performance and reduced network load

Greenplum has also made it easier to update single rows against on-going queries. While data warehouses are mostly inquiry-intensive, it has been a well-known secret that update activity is common in many data warehousing scenarios, driven by business changes to dimensions and hierarchies.


At the same time, Greenplum is announcing a new product - Chorus - aimed at the enterprise data cloud market. Public cloud computing has the buzz. What is less well appreciated is that much of the growth is in enterprise cloud computing - clouds of networked data stores with (relatively) user-friendly front ends within the (virtual) four walls of a global enterprise such as a telecommunications company, bank, or related firm.


[Figure: Greenplum Chorus and the Enterprise Data Cloud]

The figure shows the Enterprise Data Cloud schematically, with the Greenplum database on top of the virtualized commodity hardware, operating system, public Internet tunnel, and Chorus abstraction layer. Chorus aims to be the source of all the raw data (often 10X the size of the EDW); to provide a self-service infrastructure supporting multiple marts and sandboxes; and, finally, to furnish rapid analytic iteration and a business-led solution. Chorus enables security, providing extensive, granular access control over who is authorized to view and subscribe to data within Chorus, and collaboration, facilitating the publishing, discovery, and sharing of data and insight using a social computing model that feels familiar and easy to use. Chorus takes a data-centric approach, focusing on the tooling necessary to manage the flow and provenance of data sets as they are created and shared within a company.

One more thing. Even given the blazingly fast performance of massively parallel processing data warehousing, heterogeneous data requires management. It is becoming an increasingly critical skill to surround one's data with a usable, flexible method of data management and make it accessible. Without a logical, rational method of organizing data, the result is just more proliferating, disconnected islands of information. Greenplum's solution to this challenge? Get MAD!

Of course, this is a pun, standing for a platform capable of supporting the magnetic, agile, and deep principles of MAD Skills. "Magnetic" does not refer to disk, though there is plenty of that. The approach conforms to data warehousing orthodoxy in one respect only - it agrees to get all the data into one repository; but it does not subscribe to the view that it must all be conformed or rendered consistent. This is where the "agile" comes in - deploying a flexible, stage-by-stage process, and in parallel. A laboratory approach to data analysis is encouraged, with cleansing and structuring being staged within the same repository. Analysts are given their own "sandbox" in which to explore and test hypotheses about buying behavior, trends, and so on. Successful solutions are generalized as best practices. In effect, given the advances in technology, the operational data store is a kludge that is no longer required.

Regarding the "deep," advanced statistical methods are driven close to the data. For example, one Greenplum customer had to calculate an ordinary least squares regression (OLS is a method of fitting a line to data) by exporting the data into the statistical language R for calculation and then importing the results back, a process that required several hours. This regression was moved into the database thanks to the capability of Greenplum and ran significantly faster due to much less data movement.

In another example involving highly distributed data assembled by Chorus, T-Mobile pulled together data from a number of large untapped sources (cell phone towers, etc.), as well as data in the EDW and other source systems, to build a new analytic sandbox; ran a number of analyses, including generating a social graph from call detail records and subscriber data; and discovered that T-Mobile subscribers were seven times more likely to churn if someone in their immediate network left for another service provider. This work would ordinarily require months of effort just to provision databases and discover and assemble the data sources, but was completed within two weeks while deploying a one petabyte production instance of Greenplum Database and Greenplum Chorus.

As the performance bar goes up, methodologies and architectures (such as Chorus) are required to sprint ahead in order to keep up. As already noted and in summary, with its MAD methodology, Greenplum is building an agile approach to development that promises to keep up with the high performance bar of its massively parallel processing capabilities. An easy prediction to make is that the competitors already know about it and are already doing it. Really!?
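To see why pushing OLS into the database avoids moving the data, note that a simple linear regression can be computed from a handful of running sums - the count of rows and the sums of x, y, x*y, and x*x - which an MPP database can accumulate in parallel where the data lives, instead of exporting the raw rows to R. Here is a minimal sketch of that arithmetic in Python (illustrative only; the data values are made up, and the customer example above used Greenplum's in-database analytics rather than this code):

```python
# Simple one-variable ordinary least squares from sufficient statistics.
# The point: only these few running sums need to leave the database,
# not the raw rows - which is why in-database OLS avoids bulk data movement.

def ols_from_sums(n, sum_x, sum_y, sum_xy, sum_xx):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept

# Made-up data; in practice each segment of an MPP database would accumulate
# its partial sums in parallel and the master would combine them.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

slope, intercept = ols_from_sums(
    len(xs),
    sum(xs),
    sum(ys),
    sum(x * y for x, y in zip(xs, ys)),
    sum(x * x for x in xs),
)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")  # roughly slope 2, intercept 0
```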



Posted April 14, 2010 7:46 AM