Greetings and welcome to my blog focusing on reengineering healthcare using information technology. The commitment is to provide an engaging mixture of brainstorming, blue sky speculation and business intelligence vision with real world experiences – including those reported by you, the reader-participant – about what works and what doesn't in using healthcare information technology (HIT) to optimize consumer, provider and payer processes in healthcare. Keeping in mind that sometimes a scalpel, not a hammer, is the tool of choice, the approach is to take a stand for new possibilities in the face of entrenched mediocrity, to do so without tilting at windmills and to follow the line of least resistance to getting the job done – a healthcare system that works for us all. So let me invite you to HIT me with your best shot at LAgosta@acm.org.
Lou Agosta is an independent industry analyst, specializing in data warehousing, data mining and data quality. A former industry analyst at Giga Information Group, Agosta has published extensively on industry trends in data warehousing, business and information technology. He is currently focusing on the challenge of transforming America’s healthcare system using information technology (HIT). He can be reached at LAgosta@acm.org.
Predictive analytics has been Page One news for many years
in the business and popular press; and I caught up with Predixion Software principals
Jamie McLennan (CTO) and Simon Arkell (CEO) shortly prior to the launch earlier
this month. Marketers, managers, and analysts alike continue to lust after
golden nuggets of information hidden amidst the voluminous river of data
flowing across business systems in the interest of revenue opportunities. Competitive
opportunities require fast time to results and competition for clients remains
intense when they can be enticed to engage.
Predixion represents yet another decisive step in the march
of predictive analytics in two tension-laden but related directions. First, the
march down market towards wider deployment of predictive analytics is enabled
by deployment in the cloud for a modest monthly fee. Second, the march up
market towards engaging and solving sophisticated predictive problems is enabled
by innovations in implementing advanced algorithms from Predixion in readily
usable contexts such as Excel and PowerPivot. Predixion Insight offers a
wizard-driven plug-in for Excel and PowerPivot that uses a cloud-based
predictive engine and storage.
The intersection of predictive analytics and cloud computing
forms a compelling value proposition. The initial effort on the part of the
business analyst is modest in comparison with provisioning an in-house
predictive analytics platform. The starting price is reportedly $99 per seat
per month - though be sure to ask about additional fees that may apply. The
business need has never been greater. In survey after survey, nearly 80% of
enterprise clients acknowledge having a data warehouse in production. What is
the obvious next step, given a repository of clean, consistent data about
customers, products, markets, and promotions? Create a monetary opportunity
using advanced analytics without having to hire a cadre of statistical PhDs.
Predixion Insight advances the use of cloud computing to
deliver innovations in algorithms in advanced analytics. Wizards are provided
to analyze key influencers, detect categories, fill from example, forecast,
highlight exceptions, run a prediction calculator, and
perform shopping basket analysis. While the Microsoft BI platform is not
the only choice for advances in usability, those marching to the beat of the
usability drum acknowledge the leadership in ease of deployment wizards and
implementation practices.
Finally,
notwithstanding the authentic advances being delivered by Predixion and its
competitors, a few words of caution about predictive analytics. Discovering and
determining meaning is a business task, not a statistical one. There are
no spurious relationships between variables, only spurious interpretations. How
to guard against misinterpretation? Deep experience and knowledge of human
behavior (customer), the business (service), and the market (intersection of
latter two). The more challenging (competitive, immature, commodity-like, etc.)
the business environment, the more worthwhile are the advantages attainable
through predictive analytics. The more challenging the business environment,
the more important is planning, prototyping, and perseverance. The more
challenging the business environment, the more valuable is management
perspective and depth of understanding. Finally, prospective clients
will still need to be rigorous about data preparation and quality. Much of the
narrative about ease of use and lowering the implementation bar should not
obscure the back-story that many of the challenges in predictive analytics
relate to the accuracy, consistency, and timeliness of the data. You cannot
shrink wrap knowledge of your business. However, the market basket analysis
algorithm and wizards that I saw demoed by Predixion come as close to doing so
as is humanly possible so far.
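For readers who want to see what a market basket wizard is computing behind the scenes, here is a minimal sketch in Python. It is not Predixion's algorithm, which I have not seen; it is the textbook support, confidence, and lift calculation for pairs of items, and the function name `pair_rules` and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.3):
    """Return (antecedent, consequent, support, confidence, lift) for item pairs.

    Textbook association-rule measures only; an illustrative assumption,
    not any vendor's implementation.
    """
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    # How many baskets contain each item, and each pair of items.
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two directional rules.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            rules.append((x, y, support, confidence, lift))
    # Strongest lift first; ties broken alphabetically for stable output.
    return sorted(rules, key=lambda r: (-r[4], r[0], r[1]))


# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
    ["bread", "butter", "jam"],
]
rules = pair_rules(baskets, min_support=0.4)
for x, y, support, confidence, lift in rules:
    print(f"{x} -> {y}: support={support:.2f} "
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than chance would predict. Whether that relationship means anything is still a business question, which is the point of the caution above.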
What happens when an
irresistible force meets an immovable object? We are about to find out. The
irresistible force of BI, eDiscovery, compliance, fraud detection, governance,
risk management, and other analytic and regulatory mandates is heading straight
toward the immovable rock of year-to-year 10% reductions in information
technology budgets.
The convergence of the markets
for structured and unstructured data has been heralded many times, but maybe
the time has come. We think that the new generation of solutions with
increasing overlap of structured and unstructured data and multi-functionality
will emerge and that BMMsoft EDMT® Server is the pioneer in that space. Looking
into the crystal ball, what will happen is that an increasing overlap already
underway will disrupt incumbents across these diverse markets.
The worldwide BI market as defined by Gartner is sized at
$8.8B.[1]
Realistically that includes a lot of Business Objects (SAP), SAS applications,
and IBM solutions, so the database part of that is probably closer to $6
billion.[2] The document management software market is estimated
at nearly $3 billion.[3] While email
archiving is relatively new and growing rapidly due to new federal
regulations, it has now reached the $1 billion "take off" point. In short, at
nearly $10 billion total, a product that addressed requirements across all
three of these markets with a reasonable prospect of response from even one
third of the enterprises would have an outside boundary of over $3 billion.
This is a substantial market under any interpretation.
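The back-of-the-envelope sizing above can be checked in a few lines of Python. The dollar figures are the ones quoted in the text; the variable names are mine, and the one-third response rate is the text's own assumption, not a forecast.

```python
# Market estimates in billions of US dollars, as quoted in the text.
segments = {
    "data warehousing (database part of BI)": 6.0,
    "document management": 3.0,
    "email archiving": 1.0,
}

total = sum(segments.values())

# The text assumes a response from one third of enterprises.
response_rate = 1 / 3
outside_boundary = total * response_rate

print(f"Combined market: ${total:.1f}B")
print(f"Outside boundary at one third: ${outside_boundary:.2f}B")
```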
In the meantime, the existing
markets for these three classes of products are fragmented into silos of the
traditional data warehousing vendors, email archiving, and document management,
the latter sometimes including compliance and governance software. The first
are well known in the market - extending from such stalwarts as HP, IBM,
Oracle, Microsoft, SAP, to data warehousing appliances and column-oriented
databases - and will not be repeated here (though one new development will be
noted below). Document management systems include IBM FileNet Business Process
Manager (www.ibm.com), EMC Documentum (www.emc.com), OpenText LiveLink ECM
(www.opentext.com), and Autonomy Cardiff Liquid Office (www.cardiff.com).
Strictly speaking, risk management is considered a separate market from
document management. Risk management and compliance offerings include Aventis (www.aventis.com),
BWise (www.bWise.com), Cura (www.curasoftware.com), Protiviti (www.protiviti.com),
Compliance 360 (www.compliance360.com) and IBM, which has at least two
offerings, one based on Lotus Notes and one based on FileNet. This list is
partial and could easily be expanded with many best of breed offerings. The
result? Fragmentation. Diversity, though not in a positive sense. Many
offerings instead of a comprehensive approach to unified access and unified
analysis.
Five years from now data will
be as heterogeneous as ever and the uses of data even more so, but individual
products - single instance products, not solutions - will characterize a
transformed market for database management that traverses the boundaries between
email archiving, document management, and data warehousing with agility that is
only dreamt about in today's world. Video clips are now common on social
networking sites such as Facebook and YouTube. Corporate sponsorship of such
opportunities for viral marketing is becoming more common. The requirement to
track and manage product brands and images will necessitate the archiving of
such material, so multi-media (image/video/audio) are being added to the mix.
This future is being driven and
realized by the imperative for business transparency, risk management and
compliance, and growing regulatory requirements layered on top of existing
business intelligence and document management requirements. Still, document
management is distinct from workflow. If an enterprise needs workflow, then it
will continue to require a special purpose document management system. Workflow
was invented by FileNet in 1985; FileNet was acquired by IBM in 2006 and continues to lead
the pack where detailed step-by-step process engineering is required. Elaborate
rules-engines for enterprise decision management are different from compliance.
If an enterprise requires a rule-engine for compliance and governance, then it
will need a special purpose compliance, risk management, and governance system.
Such solutions would be over-kill for those enterprises that require email
archiving for eDiscovery, document management for first order compliance, and
cross references to transactional data in the data warehouse. While the future
is uncertain, one of the vendors to watch is BMMsoft.
BMMsoft has put together a
product delivering functionality across these three previously unrelated silos
- data warehousing, eDiscovery (e-mail), and document management - and able to
be purchased as an EDMT® Server - a single part number from BMMsoft
(EDMT stands for "E-Mail, Documents, Media, Transactions"). The database "under
the hood" is Sybase IQ, a column-oriented data store with a proven track record
and several large objectively audited benchmarks. The latest of these weighs in
at 1000 terabytes - a petabyte - and was audited by Francois Raab, the same
professional who audits the TPC.org benchmarks.[4] The business need is real and based on customer
acceptance. So is the product.
The three keys to connect and
make intelligible the data from the three different sources are:
1) extreme scalability to
handle the data volumes - this is where a column-oriented database would come
in handy since the storage compaction is intrinsic and prior to the additional
compression that could be applied;
2) parallel, real-time high
performance ETL functionality to load all the data; and finally
3) search capabilities that
enable high performance inquiries against the data.
Such unified access to diverse
data types, intelligently connected by metadata, is also sometimes described as
a "data mashup."
A part of the challenge that a
start up - and up start - such as BMMsoft will face is building credibility,
which BMMsoft has already solved with numerous client installations in
production and success stories. In the case of BMMsoft EDMT® Server there is
another consideration: metadata is an
underestimated and underdeveloped opportunity. Innovations in metadata
make possible many applications that require cross referencing emails,
documents, and the transactional data. For example, fraud detection, threat
identification, enhanced customer relations - all require navigating across the
different data types. Metadata makes that possible. That is not an easy problem
to solve; and BMMsoft has demonstrated significant progress with it. Second,
the column-oriented database is intrinsically skinny in terms of data storage
in comparison with the standard relational database, which continues to be
challenged by database obesity. As data warehouses scale up, the cost of
storage technology becomes a disproportionately large part of the price of the
entire system. Note that for the column-oriented approach proportional cost
savings come into view and are realized. Third, this also has significant
performance implications, since if there is less data - in terms of volume
points - to manage, then it is faster to do so. So when all the reasons are
considered, the claims are quite modest, or at least in line with common sense.
The wonder is that no one thought of it sooner.
When you think about it for a
minute, there is every reason that an underlying database should be capable of
storing a variety of different data types and doing so intelligently. The
latter intelligence is the "secret sauce" that differentiates BMMsoft. The
relationships between the different types of data are built as the data is
being loaded by BMMsoft using multiple software technology patents. The column-orientation of the underlying data
store - Sybase IQ - intrinsically condenses the amount of space required to
persist the information, yielding up to an order of magnitude - more typically
a factor of two or three - in storage savings, even prior to the application of
formal compression algorithms. This fights database obesity across all segments
- email, document, media, transactional (structured) data warehousing
information. This means that the application that lives off of the underlying
data is able to take advantage of performance improvements since less data is
being stored and more being fetched with every data retrieval. For those
enterprises with a commitment to installed Oracle or MySQL infrastructure,
BMMsoft provides investment protection. The EDMT® Server also runs on
Oracle, Netezza and MySQL and can be easily ported to any other relational
database.
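Why does column orientation save space even before formal compression is applied? A toy illustration in Python may help. This is a sketch of the general idea only: values in a single column tend to repeat, so storing a column as runs of (value, count) is compact. It is not a description of how Sybase IQ stores data internally, and the sample rows and the helper `run_length_encode` are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] runs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


# Hypothetical archive records: (date, state, item type).
rows = [
    ("2009-01-05", "IL", "email"),
    ("2009-01-05", "IL", "email"),
    ("2009-01-05", "IL", "document"),
    ("2009-01-06", "IL", "document"),
    ("2009-01-06", "NY", "document"),
    ("2009-01-06", "NY", "email"),
]

# Row store: every value of every row is written out.
stored_row_values = len(rows) * len(rows[0])

# Column store: pivot the rows into columns, then encode each column.
columns = [list(col) for col in zip(*rows)]
encoded = [run_length_encode(col) for col in columns]
stored_column_runs = sum(len(runs) for runs in encoded)

print(f"row layout stores {stored_row_values} values")
print(f"column layout stores {stored_column_runs} runs")
```

On six rows the difference is small. On billions of rows with low-cardinality columns such as dates, states, and types, this kind of repetition is where the storage savings come from.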
Thus, BMMsoft is a triple
threat and is able to function as a standalone product addressing data
warehousing, email archiving, and document management requirements as separate
silos. But just as importantly, for those enterprises that need or want to
compete with advanced applications in fraud detection, security threat
assessment, customer data mining beyond structured data, BMMsoft offers the
infrastructure and application to do so. For example, the ability to perform
cross-analysis between securities traded on the stock market and those
companies named in email and voice mail (remember multimedia handling) will
immediately provide a short list for follow-up detection of ongoing insider trading
or other fraudulent schemes. While hindsight is 20-20, a similar method of
identifying emerging patterns through cross-analysis would have been useful
in surfacing the 8 billion dollar Société Générale fraud, Madoff's nonexistent
options plays at the basis of the pyramid, the Georgia Tech shooter, and
relevant chatter that shows up prior to terrorist attacks. Going forward, this
technology is distinct in that it can be deployed on a small, medium, or large
scale to highlight emerging hot spots that require attention.
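The cross-analysis described above, matching securities traded against companies named in messages, can be sketched in Python as a simple join across two data types. Everything here is hypothetical: the function `cross_reference`, the ticker symbols, and the sample data are mine, and a production system would rely on metadata and indexing where this sketch uses naive substring matching.

```python
from collections import defaultdict


def cross_reference(trades, messages, company_names):
    """Map each trader to the traded symbols whose company name
    also appears in that same person's messages."""
    mentioned = defaultdict(set)
    for sender, text in messages:
        lowered = text.lower()
        for symbol, name in company_names.items():
            if name.lower() in lowered:
                mentioned[sender].add(symbol)

    hits = defaultdict(set)
    for trader, symbol in trades:
        if symbol in mentioned[trader]:
            hits[trader].add(symbol)
    return {trader: sorted(symbols) for trader, symbols in hits.items()}


# Hypothetical reference data, trades, and archived email.
company_names = {"ACME": "Acme Corp", "GLBX": "Globex"}
trades = [("alice", "ACME"), ("bob", "GLBX"), ("alice", "GLBX")]
messages = [
    ("alice", "Heard Acme Corp will announce a merger on Friday."),
    ("bob", "Lunch on Tuesday?"),
]

short_list = cross_reference(trades, messages, company_names)
print(short_list)
```

The output is a short list for human follow-up and nothing more. A match between a trade and a message is a reason to look closer, not evidence of wrongdoing.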
One may object - but won't the
competition be able to reverse engineer the functionality and provide something
similar using different methods? Of course, eventually every innovation will be
competitively attacked by some more-or-less effective "work around." Read the
prospectus - new start ups and existing software laboratories at HP, IBM, etc.
will eventually produce innovations that challenge the contender. However, that
could require three to five years. Then there is the matter of bringing it to
market. IBM provides an example, based on publicly available news reports. IBM
went out and purchased FileNet for about $1.6 billion. FileNet is a great
company, which virtually invented workflow, and if one requires advanced
workflow capabilities, it is hands down a good choice. However, it does not do
data warehousing or email archiving. Because FileNet is a subsidiary of IBM that delivers
substantial revenue to the "mother ship," the executives in charge will set a
high bar on any IBM innovations which combine email archiving and structured
data warehousing with document management. In short, IBM is faced with the
classic innovator's dilemma.[5] The price points
that interest it - both internally and externally - are further up on the curve
than the deals that BMMsoft will be able to complete. Given that BMMsoft has
established presence in the market, it has a good chance of marching up market,
displacing the installed, legacy solutions as it goes. This happened before in
the client server revolution when IBM mainframe deals at the several million
dollar price point were undercut by a copy of PowerBuilder and a copy of
Sybase, albeit a different version of the database. Given that BMMsoft has a
head-start, it is exploiting first mover advantages and building an installed
base that will be challenged only with great difficulty. The relevance of such technology in the context of healthcare information technology (HIT) will be explored in a pending post. Please stand by for update - and keep in touch!
[2] For an alternative point of view see an IDC forecast (published 2007) that pegs the Data
Warehouse management/platforms market as approx $8.97B in 2010
Datawatch provides an ingenious solution to information
management, integration, and synthesis by working from the outside inwards. Datawatch's
Monarch technology reverse engineers the information in the text files that
would otherwise be sent to be printed as a hardcopy, using the text file as
input to drive further processing, aggregation, calculation, and transformation
of data into usable information. The text files, PDFs, spreadsheets, and
related printer input become new data sources. With no rekeying of data and no
programming, business analysts have a new data source to build bridges between silos
of data in previously disparate systems and attain new levels of data
integration and cohesion.
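To give a feel for what report mining involves, here is a minimal sketch in Python that turns a print-style text report back into records. It is not how Monarch works internally, which I have not seen; the report layout, the regular expression, and the function `mine_report` are hypothetical and chosen only to illustrate the outside-in idea.

```python
import re

# Hypothetical print-image report of the kind sent to a printer.
REPORT = """\
SURGERY SCHEDULE                      PAGE 1
LOCATION: OR-1
SURGEON        PROCEDURE         MINUTES
Smith          Appendectomy           60
Jones          Knee Scope             45
LOCATION: OR-2
SURGEON        PROCEDURE         MINUTES
Smith          Hernia Repair          90
"""

# A detail line is: a name, 2+ spaces, a procedure, 2+ spaces, a number.
DETAIL = re.compile(
    r"^(?P<surgeon>\S+)\s{2,}(?P<procedure>.+?)\s{2,}(?P<minutes>\d+)\s*$"
)


def mine_report(text):
    """Extract detail lines, carrying the group header down to each row."""
    records, location = [], None
    for line in text.splitlines():
        if line.startswith("LOCATION:"):
            location = line.split(":", 1)[1].strip()
            continue
        match = DETAIL.match(line)
        if match and location is not None:
            records.append({
                "location": location,
                "surgeon": match["surgeon"],
                "procedure": match["procedure"],
                "minutes": int(match["minutes"]),
            })
    return records


records = mine_report(REPORT)

# Once the report is data again, aggregation is straightforward.
minutes_by_location = {}
for rec in records:
    minutes_by_location[rec["location"]] = (
        minutes_by_location.get(rec["location"], 0) + rec["minutes"]
    )
print(minutes_by_location)
```

Page titles and column headings fall away because they do not match the detail pattern, while the location header is carried down to each row. A report mining tool lets an analyst define that kind of template without writing the code.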
For those enterprises running an ERP system for back office
billing such as SAP or a hospital information
system (HIS) such as Meditech, the task of getting the data out of the system
using proprietary SAP coding or the native MUMPS data store can be a high bar,
requiring custom coding. Datawatch intelligently zooms through the existing
externalization of the data in the reports, making short work of opening up
otherwise proprietary systems.
Note that a trade-off is implied here. If your reporting is
a strong point, Datawatch can take an installation to the next level, enabling
coordination and collaboration, breaking down barriers between reporting silos
that were previously impossible to bridge and doing so with velocity. Programming is not
needed, and the level of difficulty is comparable to that of managing an Excel
spreadsheet, which puts it within reach of a smart business analyst. However, if the reports are
inaccurate or even junk, even Datawatch cannot spin the straw into gold. You will still have to
fix the data at its source.
Naturally, cross functional report mining works well in most
verticals extending from finance to retail, from manufacturing to media, from
the public sector to not for profit organizations. However, what makes
healthcare a particularly inviting target is the relatively late and still
on-going adoption of data warehousing combined with the immediate need to
report on numerous clinical, quality and financial metrics such as the pending
"Meaningful Use" metrics created via the HITECH Act. This is not a tutorial on
meaningful use; however, further details can be found in a related article
entitled "Game on! Healthcare IT Proposed Criteria on 'Meaningful Use' Weigh in at 556 Pages."
One of the goals of "meaningful use" in HIT
is to combine clinical information with financial data in order to drive
improvements in quality care, patient safety and operational efficiency while simultaneously
optimizing cost control and reduction. The use of report mining and integration
of disparate sources also allows the healthcare industry to migrate towards a pay-for-performance
model, whereby providers will be reimbursed based on the quality and efficiency
of care provided. However, financial, quality, clinical metrics and the
evolving P4P models all require cross functional reporting from multiple
systems. Even for many modern hospital information systems (HIS) that is a high
bar. For those enterprises without an enterprise-wide data warehousing
solution, no one is proposing to wait three to five years for a multi-step
installation prior to learning the needed data still requires customization. In
the interim, Datawatch has a feasible approach worth investigating.
In conversations with Datawatch executives John Kitchen (SVP
Marketing) and Tom Callahan (Healthcare Product Manager), I learned that Datawatch
has more than 1,000 organizations in the healthcare sector using Datawatch
technology. Datawatch is surely a well kept secret, at least up until now. This
is a substantial resource for best practices, methods and models, and lessons
learned in the healthcare area. Datawatch can leverage these resources to its
advantage and the benefit of its clients. While this is not a recommendation to
buy or sell any security (or product), as a publicly traded firm, Datawatch is well
positioned to benefit as the healthcare market continues its expansion. Datawatch
provides a compelling business case with favorable ROI from the time of
installation to the delivery of problem-solving value for the end user client.
The level of IT support required by Datawatch is minimal, and sophisticated client
departments have sometimes gone directly to Datawatch to get the job done.
Let's end with a client success story in HIT. Michele Clark, Hospital Revenue Business
Analyst, Los Angeles-based Good Samaritan Hospital, comments on the application
of Datawatch's Monarch Pro: "We simply
run certain reports from MEDITECH's
scheduling module, containing data for surgeries already scheduled, by
location, by surgeon. We then bring those reports into Monarch Pro. Then, in
conjunction with its powerful calculated fields, Monarch allows us to report on
room utilization, block time usage and estimated times for various surgical
procedures. The flexibility of Monarch to integrate data from other sources
results in a customized, consolidated dataset in Monarch. We can then analyze,
filter and summarize the data in a variety of ways to maximize the efficiency
of our operating room resources. Thanks to Monarch, we have dramatically improved
the utilization of our operating rooms, can more easily match available
surgeons with required upcoming procedures, and better manage surgeon time and resources.
Our patients are receiving the outstanding standard of care they expect, while
we make the most of our surgical resources. This kind of resource efficiency is
talked about a lot in the healthcare community. With Monarch, we are achieving it." This makes Datawatch one to watch.
Datameer takes its name from the sea - the sea of data - as in the French la mer or German, das Meer.
I caught up with Ajay Anand, CEO, and Stefan Groschupf, CTO. Ajay earned his stripes as Director of Cloud Computing and Hadoop at Yahoo. Stefan is a long-time open source consultant and advocate, and a cloud computing architect from EMI Music.
Datameer is aligning with
the two trends of Big Data and Open Source. You do not need an industry analyst to tell you that data volumes continue to grow, with unstructured data growing at a rate of almost 62% CAGR and structured less, but a still substantial 22% (according to IDC). Meanwhile, open source has never looked better as a cost effective enabler of infrastructure.
The product beta is launched in April with McAfee, nurago, a leading financial services company, and a major telecommunications service provider, with the summer promising to deliver early adopters and the gold product shipping in the autumn. (Schedule is subject to change without notice.)
The value proposition of Datameer Analytics Solution (DAS) is helping users perform advanced analytics and data mining with the same level of expertise required for a reasonably competent user of an Excel spreadsheet.
As is often the case, the back story is the story. The underlying technology is Hadoop. Hadoop is an open source framework for highly distributed systems of data. It includes both storage technology and execution capabilities, making it a kind of distributed operating system, providing a high level of virtualization. Unlike a relational database where search requires chasing up and down a binary tree, Hadoop performs some of the work upfront, sorting the data and performing streaming data manipulation. This is definitely not efficient for small gigabyte volumes of data. But when the data gets big - really big - like multiple terabytes and petabytes, then the search and data manipulation functions enjoy an order of magnitude performance improvement. The search and manipulation are enabled by the MapReduce algorithm. MapReduce has been made famous by the Google implementation as well as the Aster Data implementation of it. Of course, Hadoop is open source. MapReduce takes a user defined mapping function and a user defined reduce function and performs key-value pair exchange, executing a process of grouping, reducing, and aggregation at a low level that you do not want to have to code yourself. Hence, the need for and value in a tool such as DAS. It generates the assembly level code required to answer the business and data mining questions that the business wants to ask of the data. In this regard, DAS functions rather like a Cognos or BusinessObjects front-end in that it presents a simple interface in comparison to all the work being done "under the hood". Clients who have to deal with a sea of data now have another option for boiling the ocean without getting steamed up over it.
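For those who have not seen the map, group, and reduce steps spelled out, here is a minimal single-machine sketch in Python. It imitates the shape of the MapReduce programming model only; it is not Hadoop's API, it distributes nothing, and the names `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a mapper over every record, group the emitted values by key,
    then run a reducer over each group."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: values with the same key end up together.
            groups[key].append(value)
    # Reduce phase: one call per key, in sorted key order.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line):
    """User-defined map function: emit (word, 1) for each word."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(key, values):
    """User-defined reduce function: total the counts for one word."""
    return sum(values)


counts = map_reduce(["sea of data", "a sea of big data"],
                    word_mapper, count_reducer)
print(counts)
```

In a real cluster the map calls run on the nodes where the data lives and the grouping step moves data across the network, which is the low-level plumbing a front-end tool spares the analyst.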
I caught up with Ben Werther, Director of Product Marketing, for a conversation about business developments at Greenplum and Greenplum's major new release.
According to Ben, Greenplum has now surpassed more than 100 enterprise customers and is enjoying revenue growth of about 100%, albeit from a revenue base that befits a company of relatively modest size. They also claim to be adding new enterprise customers faster than either Teradata or Netezza.
What is particularly interesting to me is that with its MAD methodology Greenplum is building an agile approach to development that directly addresses the high performance of its massively parallel processing capabilities. This is an emerging trend in high end parallel databases that is receiving new impetus. More on this shortly. Meanwhile, release 4.0 includes enterprise class DBMS functionality such as -
-Complex query optimization
-Data loading
-Workload Management
-Fault-Tolerance
-Embedded languages/analytics
-3rd Party ISV certification
-Administration and Monitoring
From the perspective of real world data center operations, the workload management features are often neglected but are critical path for successful operations and growth. Dynamic query balancing is a method used on mainframes for the most demanding workloads, and Greenplum has innovated in this area, with its solution now being "patent pending".
Just in case scheduling does not turn you on, a more sexy initiative is to be found in fault tolerance. Given that Greenplum is an elephant hunter, favoring large and high end installations, this is news you can use. Greenplum Database 4.0 enhances fault tolerance using a self-healing physical block replication architecture. Key benefits of this architecture are:
-Automatic failure detection and failover to mirror segments
-Fast differential recovery and catchup (while fully online / read-write)
-Improved write performance and reduced network load
Greenplum has also made it easier to update single rows against on-going queries. While data warehouses are mostly inquiry-intensive, it has been a well known secret that update activity is common in many data warehousing scenarios, driven by business changes to dimensions and hierarchies.
At the same time, Greenplum is announcing a new product - Chorus - aimed at the enterprise data cloud market. Public cloud computing has the buzz. What is less well appreciated is that much of the growth is in enterprise cloud computing - clouds of networked data stores with (relatively) user friendly frontends within the (virtual) four walls of a global enterprise such as a telecommunications company, bank, or related firm.
[Figure: Enterprise Data Cloud]
This shows the Enterprise Data Cloud schematically with the Greenplum database on top of the virtualized commodity hardware, operating system, public Internet tunnel, and Chorus abstraction layer. Chorus aims at being the source of all the raw data (often 10X size of the EDW); providing a self-service infrastructure to support multiple marts and sandboxes; and, finally, furnishing a rapid analytic iteration and business-led solution. Chorus enables security, providing extensive, granular access control over who is authorized to view and subscribe to data within Chorus; and collaboration, facilitating the publishing, discovery, and sharing of data and insight using a social computing model that appears familiar and easy-to-use. Chorus takes a data-centric approach, focusing on the necessary tooling to manage the flow and provenance of data sets as they are created/shared within a company.
One more thing. Even given the blazingly fast performance of massively parallel processing data warehousing, heterogeneous data requires management. It is becoming an increasingly critical skill to surround one's data and make it accessible with a usable, flexible method of data management. Without a logical, rational method of organizing data, the result is just more proliferating, disconnected islands of information. Greenplum's solution to this challenge? Get MAD!
Of course, this is a pun, standing for a platform capable of supporting the magnetic, agile, and deep principles of MAD Skills. "Magnetic" does not refer to disk, though there is plenty of that. The magnetic principle conforms to data warehousing orthodoxy in one respect only - it agrees to get all the data into one repository; but it does not subscribe to the view that it must all be conformed or rendered consistent. This is where the "agile" comes in - deploying a flexible, stage-by-stage process and in parallel. A laboratory approach to data analysis is encouraged with cleansing and structuring being staged within the same repository. Analysts are given their own "sandbox" in which to explore and test out hypotheses about buying behavior, trends, and so on. Successful solutions are generalized as best practices. In effect, given the advances in technology, the operational data store is a kludge that is no longer required. Regarding the "deep," advanced statistical methods are driven close to the data. For example, one Greenplum customer had to calculate an ordinary least squares regression (OLS is a method of fitting a curve to data) by exporting the data into the statistical language R for calculation and then importing it back, a process that required several hours. This regression was moved into the database thanks to the capability of Greenplum and ran significantly faster due to much less data movement. In another example involving highly distributed data assembled by Chorus, T-Mobile assembled data from a number of large untapped sources (cell phone towers, etc.), as well as data in the EDW and other source systems, to build a new analytic sandbox; ran a number of analyses including generating a social graph from call detail records and subscriber data; and discovered behavior where T-Mobile subscribers were seven times more likely to churn if someone in their immediate network left for another service provider.
This work would ordinarily require months of effort just to provision databases and discover and assemble the data sources, but was completed within two weeks while deploying a one petabyte production instance of Greenplum Database and Greenplum Chorus. As the performance bar goes up, methodologies and architectures (such as Chorus) are required to sprint ahead in order to keep up. As already noted and in summary, with its MAD methodology, Greenplum is building an agile approach to development that promises to keep up with the high performance bar of its massively parallel processing capabilities. An easy prediction to make is that the competitors already know about it and are already doing it. Really!?
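A footnote on the in-database regression mentioned above: one reason ordinary least squares lends itself to a parallel database is that a simple regression needs only a handful of running sums, which each data segment can compute locally before the small results are combined. The Python sketch below illustrates that idea. It is an assumption about the general technique, not Greenplum's implementation, and the function names and sample points are hypothetical.

```python
def partial_sums(points):
    """Per-segment sufficient statistics for simple linear regression."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Merge the statistics of two segments; only five numbers move."""
    return tuple(p + q for p, q in zip(a, b))


def ols_from_sums(sums):
    """Closed-form slope and intercept from the combined sums."""
    n, sx, sy, sxx, sxy = sums
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data split across two segments; all points lie on y = 2x + 1.
segment_a = [(1, 3), (2, 5)]
segment_b = [(3, 7), (4, 9)]

slope, intercept = ols_from_sums(
    combine(partial_sums(segment_a), partial_sums(segment_b))
)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
```

The raw rows never leave their segment. Compare that with exporting the whole table to an external statistics package and importing the answer back.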