Managing Enterprise Data in a Time of Change: From MDM to Columnar to Open Source
Originally published April 22, 2010
Heterogeneity is alive and well in organizations that need to manage large volumes of data. Those volumes, and the desire for immediate response to business events, are necessitating a variety of specialized platforms for handling specific workloads while maintaining a sense of consistency across the enterprise architecture.
Master Data Management
At the core of such a strategy must be master data management (MDM). Organizational master data will have a high degree of reuse across a number of systems. While independent operation of the systems is to be somewhat expected, they must work from a common data foundation. Master data management systems build that common foundation, not by managing transactions, but by building an enterprise-adjudicated superset of attributes at a low granularity for business “subject areas” like customer, product, item, part, store, organization, etc.
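As a rough illustration of an "enterprise-adjudicated superset of attributes," the sketch below merges two records for the same customer into one master record. The system names, field names, and the priority rule are all hypothetical; real MDM products use far richer matching and adjudication logic.

```python
# Hypothetical source records for one customer; the system names and
# field names are illustrative assumptions, not from any real product.
crm_record = {"customer_id": "C-1001", "name": "Acme Corp", "phone": "555-0100"}
billing_record = {"customer_id": "C-1001", "name": "ACME Corporation", "credit_limit": 50000}

# Assumed adjudication rule: sources listed earlier win when values conflict.
SOURCE_PRIORITY = ["crm", "billing"]


def build_master_record(records_by_source, priority):
    """Merge per-system records into one superset of attributes."""
    master = {}
    # Apply lowest-priority sources first so higher-priority values overwrite.
    for source in reversed(priority):
        master.update(records_by_source.get(source, {}))
    return master


master = build_master_record(
    {"crm": crm_record, "billing": billing_record}, SOURCE_PRIORITY
)
print(master)
```

The master record carries every attribute any system knows about, with conflicts (here, the name) settled by the assumed priority rule.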
Master data, built once and used in multiple systems, can then be spread to the platforms, operational and post-operational, that need it for their specialized processing. One such system is the enterprise data warehouse. The data warehouse provides most post-operational processing in many organizations. Even as functions (such as master data management) are being moved to the operational world, the data warehouse will remain very important in the grand scheme of data management for the foreseeable future. Reporting, compliance, and queries that need large amounts of cross-functional data will utilize the data warehouse.
However, that data warehouse could have a different profile than the in-house, enterprise-managed, closed-source, disk-based, row-oriented DBMS from one of the major providers, and it could be multiple databases. Analytic specialist databases are absorbing workloads on the edge of the data warehouse. Either type of system could be housed and managed in-house or in one of the emerging viable offsite options such as in the cloud, managed at a data facility or in a software-as-a-service fashion.
Cloud Computing
Cloud computing in particular is a term being widely adopted and an approach that is taking on increasing credibility with newer data players such as Amazon, Salesforce.com and Google taking the forefront. It’s a delivery model with hosted software paid for on a subscription basis. Various levels of dedication of machines, disks and physical space are possible. It’s like web hosting where there are virtual private servers, or you may be sharing some space and you don’t know or care where it is as long as it’s reliable. The great news with cloud is you can get started quickly with any system profile and move if it isn’t a good fit.
Some organization charters will disallow such arrangements, mainly due to security concerns. These concerns are quickly eroding as providers are shoring up user access services (such as immediate removal of access when an employee leaves the company), regulatory compliance, tested recovery procedures and service plans that ensure quick response if/when companies need to react quickly to urgent security measures and investigative matters.
Fault tolerance, high availability, on-demand capacity and other nonfunctional requirements are further concerns with these on-demand platforms, and they are well worth deep research before an enterprise platform is placed there.
Appliances
Regardless of where it resides, the platform for the data warehouse type of workload could be the black-box data appliance, a platform that requires little tinkering. Saturated with CPUs, and with low relative overall cost, the appliance market is stronger than ever. Appliances can reside in the cloud. They can also be column-based.
Columnar Databases
Most databases we’ve used over the years have been “row-based.” This refers to the layout of the records on the data blocks. Within a block, in row-based, all columns of the row are stored in column order. This repeats for as many rows as can be fit into the block, which is the unit of input-output (prefetch operations notwithstanding). Columnar databases only store one column at a time. Blocks are filled with a column of values in the order of their corresponding rows. A multi-column query will have an operation, either early or late in the processing, to “glue” the columns together to simulate the row in the result set.
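The difference between the two layouts can be sketched in a few lines. This is a toy model only: the table, the column names, and the `glue` helper are hypothetical, and real databases work with fixed-size blocks rather than Python lists.

```python
# A tiny hypothetical table: (customer_id, name, balance).
rows = [
    (1, "Acme", 250.0),
    (2, "Birch", 410.5),
    (3, "Cole", 99.9),
]
columns = ("customer_id", "name", "balance")

# Row-based layout: all columns of a row sit together, in column order,
# and this repeats row after row.
row_layout = [value for row in rows for value in row]

# Columnar layout: each column's values sit together, in row order.
column_layout = {
    name: [row[i] for row in rows] for i, name in enumerate(columns)
}


def glue(layout, wanted):
    """Stitch the requested columns back into rows by row position."""
    return list(zip(*(layout[name] for name in wanted)))


# A query on two columns never touches the "name" values.
result = glue(column_layout, ["customer_id", "balance"])
print(result)
```

Because values are kept in row order within each column, the position of a value is enough to reassemble the row; no key needs to be stored alongside it in this sketch.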
Columnar databases speak directly to the bottleneck that has formed in our input-output operations over the years. While transistors per chip and disk density have improved tremendously over the years, disk speed has not, due to the physical arm movement required. Our designs have changed little to accommodate this reality. Random disk I/O has improved only modestly compared to sequential I/O because of all the disk head movement it requires. Columnar databases get the CPU busier by not flooding memory and cache with unnecessary columns; only the columns that are interesting to the query are analyzed.
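A back-of-the-envelope calculation shows why skipping unneeded columns matters. The figures below (50 columns of 8 bytes each, one million rows, a query touching 3 columns) are assumptions chosen for illustration, and the estimate ignores compression, indexes and block overhead.

```python
def bytes_scanned(num_rows, column_widths, wanted, columnar):
    """Rough bytes read by a full scan under the stated assumptions."""
    if columnar:
        # Only the columns the query asks for are read.
        per_row = sum(column_widths[name] for name in wanted)
    else:
        # A row store reads every column of every row it scans.
        per_row = sum(column_widths.values())
    return num_rows * per_row


# Hypothetical table: 50 columns, 8 bytes each.
widths = {f"col{i}": 8 for i in range(50)}
wanted = ["col0", "col7", "col19"]

row_bytes = bytes_scanned(1_000_000, widths, wanted, columnar=False)
col_bytes = bytes_scanned(1_000_000, widths, wanted, columnar=True)
print(row_bytes, col_bytes)  # 400000000 24000000
```

Under these assumptions the columnar scan reads 24 MB instead of 400 MB. A query that needs most of the columns would see little or no benefit.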
One of the great advantages to columnar is compression. There is normal compression of values, but there is also the ability to just say the equivalent of “rows 1 to 100 contain the value 123” which is more efficient than repeating 123 100 times. A great indexing strategy on a row-based system mitigates some of the columnar advantages, but columnar has its preferred analytic workloads and many organizations are finding edge applications for columnar databases.
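The "rows 1 to 100 contain the value 123" idea is commonly known as run-length encoding. The following is a minimal sketch of that technique, not the scheme of any particular columnar product; the function names are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# A column where 123 repeats 100 times, followed by three 456s.
column = [123] * 100 + [456] * 3
encoded = rle_encode(column)
print(encoded)  # [(123, 100), (456, 3)]
assert rle_decode(encoded) == column
```

This works well in a columnar layout because each block holds values from a single column, so long runs of identical values are more likely than in a block of mixed-column rows.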
Open Source
Then, there’s open source, in which you trade off some robustness and culture for price (and source code if you value that!). The research pipeline has no shortage of other platform dynamics we’ll need to consider, such as solid state disk, phase change memory and RDF triple stores.
Heterogeneous Data Management Future
A heterogeneous data management future is certain, and there are many variables to understand in the system architecture for any systems being considered. With heterogeneity comes the increased need for integration and the acceptance of a virtualized and operational business intelligence environment, perhaps supported by an integration bus. The data management future brings confusion and opportunity. Look for those opportunities, perhaps on the edges of your enterprise workload, to bring in the platform with characteristics of cloud, open source, appliance model, memory efficiency, or columnar orientation.
Recent articles by William McKnight
Copyright 2004 — 2019. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC