Oops! The input is malformed!
Originally published September 19, 2011
Continuing my series on technologies that are changing our view of the data warehouse, this article focuses on columnar databases. Unlike Hadoop, described in the first article in this series, the best-fit workloads for columnar databases do compete with workloads in the data warehouse. However, because database management systems are increasingly adding columnar storage possibilities, columnar databases may be viewed not as competing with the data warehouse, but as finding their functionality embedded in the data warehouse. This will create new decision points for data warehouse designers everywhere. Columnar – whether the storage mechanism of a separate analytical database or the chosen storage for some columns or tables of an existing data warehouse – can dramatically improve the performance of certain queries and advance any company on its path to complete information utilization.
Columnar storage reduces input/output (I/O), which has gradually become the unquestioned bottleneck in analytic systems today. As you will see, columnar databases not only greatly limit the volume of storage going through the I/O channels, but also help ensure that whatever does go through I/O is very useful to building the query results. The I/O has become the bottleneck over the years due to increasing data sizes and the overwhelming need to consume that data. All data that is part of the I/O consumes resources that other I/Os cannot consume. If that data is page metadata or columns clearly uninteresting to the query, it is still consuming I/O resource. Hence, the I/Os bottleneck.
The most significant difference between row-based and columnar storage is that columnar storage “breaks apart” the row into its respective columns and treats the columns as entities in their own right, including for input and output. This means the values of only one column are retrieved in a single I/O. The data manager of the DBMS can now choose the “single” it wants rather than being forced to buy the whole “album.”
Each column store (each vendor has a different name for them) also lacks the “ID Map” used for page navigation that is crucial in a row-based page. Due to the homogeneity of data on the columnar page, value-level metadata is not required. The corresponding row is determined by the position of the value on the page. The third record’s value is in the third position and so on. This is true for each column store of the table. This ensures the row will be “glued together” correctly at the end of the query and it matches column four of row 45 with column five of row 45. The columnar optimizer will plan for the “gluing” together of the result set late in the query operation.
Page sizes have increased over the years in an effort to get more data into each I/O. Column databases have increased the amount of data in the I/O by providing more flexible I/O, the level of which is determined by query context. A row-based database retrieves the values of the projection (WHERE clause) and selection (SELECT list) at the same time since both occupy the same row. The column database has to break this up. It needs to build a select list from the projection criteria. However, if it exploits parallelism well, it can ramp up selection once minimal projection is complete. And once it gets well into the query, it can be performing I/O on both at the level of a row-based prefetch (hundreds of pages at once).
In row-based stores, choosing a table scan for a query is a multi-faceted decision that results in the expensive operation of performing I/O on all pages for the table. A single column, at a fraction of the table width, can be scanned with much less concern for inefficiency. For various reasons like this, the columnar optimizer, tuned for a columnar database, operates with greater efficiency as well. The homogenous nature of the data in a columnar structure makes many compression techniques available that would not make sense in a row-based structure. Foremost among the techniques is dictionary compression.
Dictionary compression enforces the previously mentioned feature of the data pages that all values are a consistent length. Variable-length fields could obviously compromise this so instead of an actual value, a token from a separate dictionary is used. This holds true for long character fields as well because it offers the advantage of trimming insignificant characters from the end of the character string. Consequently, full use of dictionary compression treats long characters just like variable length characters. Dictionaries can be kept at the column store level or, for those database management systems that break up the column into multiple files, at the column file level. They can even be kept at the page level. Some will keep dictionaries at various levels depending on the characteristics of the data.
Some will take the dictionary concept a step further and tokenize parts of a key, especially a character string like “John” when you have Johnson, Johnstone, Johns, Johnathan, etc., on a page/file/column store. A related compression technique to dictionary compression is to store just the difference the actual values are from a representative value. This works particularly well for large numeric values in a tight range.
Another technique made possible by columnar is run-length encoding. Stored alongside values in columns with lots of repeating values is a starting row number and either an ending row number or a number representing how many values from this point are the same value. Although this introduces the need to read some page overhead to get to a certain value, the compression can be tremendous for a column with a high level of repeat in its values. Compression also alleviates I/O bottlenecks by putting more valuable data into an I/O.
Columnar structure adds a significant tool to the arsenal that, at the least, provides significant change to a data warehouse and, at the most, takes workload off of the data warehouse and into an alternative structure. Some will duplicate data across row- and column-based structures to get the benefits of both, depending on the query. Columnar database benefits are especially significant with larger amounts of data, large scans and I/O bound queries. They also have unique abilities to compress their data at the cost of more expensive load operations.
SOURCE: Columnar Databases
Recent articles by William McKnight