It was columnar day for me at SQL PASS on Wednesday. On Tuesday, Microsoft announced that Denali, the, the code name for its next release, would have a columnar data store option. My talk was on columnar databases Wednesday. Here are some of the details I shared about Denali's column store, which has a project name of Apollo. If you're interested in columnar databases in general, see my blog entries here.
In Denali, there will be an index type of "COLUMN STORE". I find this to be an interesting use of index, because the resultant data stores that are created are not like traditional indexes. However, Microsoft has never been a conformist organization. The column stores are non-clustered indexes. No indexes can be created on top of the column stores.
Where the column store is like an index is that you need the accompanying row-wise database. The column stores are not created mutually exclusive to the row-wise database. To my knowledge, this is the only database that requires this. I don't expect this to be a long-term requirement. While this may seem like it's expanding your storage needs (and it is), it may not be as much as you initially think because some non-clustered indexes might become redundant in this architecture.
The good news about this is that the optimizer has been updated to route queries to the column stores or the row store accordingly. This could prove to be a competitive differentiator. Few other database systems have this. An intelligent hybrid optimizer will be key to the success of databases that are at least partly columnar.
Apollo's vectors (per my Sybase IQ language in my earlier posts) are called column segments, although there can be multiple segments per column, as explained below. You can only have one column store index per table, but you can name as many columns as you want. Of course, it doesn't matter what order you use because each column forms an independent segment. Only single-column segments are supported in Apollo.
Apollo leverages the intellectual property, patents, and good experiences that Microsoft has had with Vertipaq, the client-side, columnar, in-memory structure used with PowerPivot. Columnar remains the preferred, future, and only, format for Vertipaq.
In Apollo, no inserts, updates or deletes are allowed on the tables that have a COLUMN STORE (this is the part of the talk where I did a mock exit). You can, however, do incremental loads and you can switch partitions to add data. You can also sandwich DISABLE and REBUILD of the segments around your updates. I expect this will improve over time.
As long as I'm on limitations, the columns selected have data type restrictions. The columns must be integer, real, string, money, datetime or a decimal that is 18 digits or less. No other data types are supported. [11/19 update: Apollo will support all the date and time types (including datetime) except that datetimeoffset is only supported for precisions of 0 to 2.]
As we know, getting more (relevant) information in the I/O is one of the advantages of columnar data store. Microsoft has taken this to a new level. While data is still stored in pages (blocks), the unit of I/O is actually one million data values. That forms a "segment." You read it right - the unit of I/O is not a certain number of "K" but has to do with the NUMBER of data values. Inside those pages, the data is stored in blobs. Bitmapping is part of the storage somehow as well although columnar data page layouts are not public information. Neither is how it's doing materialization. As for I/O, compression algorithms have been reengineered for columnar. These are not the same compression algorithms from the row-wise database.
If there are more than a million data values represented, there will be multiple segments. Apollo uses run-length encoding ("row-range" in this post) and the 'million' refers to the number of corresponding rows covered by the data values. If there are 500,000 unique values in a 2,000,000 row table, there will be 500,000 entries but 2 segments.
Apollo is not available in the CTP1. I have no information on its licensing, except that it probably won't be available in the lower-tier SQL Server licenses. As for which columns to store this way, it's going to be per the recommendations from this post.
Get ready as Microsoft begins its foray into columnar and taking the pressure off the core DBMS to do all workloads and do it all fast.