Blog: William McKnight

William McKnight

Hello and welcome to my blog!

I will periodically be sharing my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope the entries can be of some modest benefit to that goal. Please share your thoughts and input to the topics.

About the author

William is the president of McKnight Consulting Group, a firm focused on delivering business value and solving business challenges utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures. William functions as strategist, information architect and program manager for complex, high-volume, full life-cycle implementations worldwide. William is a Southwest Entrepreneur of the Year finalist and a frequent best-practices judge. He has authored hundreds of articles and white papers and given hundreds of international keynotes and public seminars. His team's implementations from both IT and consultant positions have won Best Practices awards. He is a former IT Vice President of a Fortune company, a former software engineer, and holds an MBA. William is author of the book 90 Days to Success in Consulting. Contact William at wmcknight@mcknightcg.com.

Editor's Note: More articles and resources are available in William's BeyeNETWORK Expert Channel. Be sure to visit today!

September 2011 Archives

What may be Teradata's most significant enhancement in a decade will be on display next week at the Teradata Partners conference: Teradata Columnar.  Few leading database players have altered the fundamental structure of storing all of a table's columns consecutively on disk for each record.  The innovations and practical use cases of "columnar databases" have come from the independent vendor world, where the approach has proven quite effective for an increasingly important class of analytic query.  Here is the first in a series of blogs where I discussed columnar databases.

Teradata is obviously not a "columnar database" but would now be considered a hybrid, exhibiting columnar features on those columns that are chosen to participate.  Teradata combines columnar capabilities with a feature-rich and requirements-matching DBMS already deployed by many large clients for their enterprise data warehouse.  Columnar is available on all Teradata platforms: Teradata Active Enterprise Data Warehouse, Teradata Data Warehouse Appliance, Teradata Extreme Data Appliance and Teradata Extreme Performance Appliance.

Teradata's approach allows for the mixing of row structures, column structures and multi-column structures directly in the DBMS in "containers."  Each container can be physically stored either in "row storage format" (with extensive page metadata, including a map to offsets) or in columnar format (where the row "number" is implied by the value's relative position).  All rows of the table are treated the same way; for example, a table cannot use a column structure in columnar format for its first 1 million rows and a row structure for the rest.  However, (row) partition elimination is still very much alive and, when combined with column structures, creates I/O that can retrieve a very focused set of data for the price of a few metadata reads to facilitate the eliminations.
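As a rough illustration of the two formats, here is a toy sketch in Python. It is not Teradata's on-disk layout; the table, the column names and both helper functions are made up for this example. The same data is laid out record by record and column by column, and in the columnar layout the row "number" is simply a value's position in its list.

```python
# Toy table (hypothetical data, not a Teradata structure).
column_names = ("order_id", "region", "amount")
rows = [
    (1, "East", 120.0),
    (2, "West", 75.5),
    (3, "East", 42.0),
]

# Row format: all columns of a record are stored together.
row_store = list(rows)

# Columnar format: each column is stored on its own; the row "number"
# is implied by a value's position in its list.
column_store = {
    name: [row[i] for row in rows] for i, name in enumerate(column_names)
}


def sum_amount_row_format(store):
    """Sum the amount column and count how many stored values were read."""
    total, values_read = 0.0, 0
    for record in store:
        values_read += len(record)  # the whole record comes along
        total += record[2]
    return total, values_read


def sum_amount_column_format(store):
    """Same sum, but only the amount column is read."""
    values = store["amount"]
    return sum(values), len(values)
```

Both functions return the same total, but the row version reads every value in the table while the columnar version reads only the one column the query needs.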

Each column goes in one container.  A container can have one or multiple columns.  Columns that are frequently accessed together should be put into the same container.  Physically, multiple container structures are possible for columns with a large number of rows.
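One way to picture the container rules is as a mapping from container to columns. The Python sketch below is only an illustration: the layout, the column names and the `validate_layout` and `containers_read` helpers are hypothetical, and it says nothing about Teradata's actual DDL. It shows why columns that are accessed together belong in the same container: a query then touches fewer containers.

```python
# Hypothetical layout: container name -> columns stored in it.
layout = {
    "c1": ["customer_id"],
    "c2": ["first_name", "last_name"],  # usually read together
    "c3": ["balance"],
}


def validate_layout(layout):
    """Check the rule that each column goes in exactly one container."""
    seen = set()
    for columns in layout.values():
        for column in columns:
            if column in seen:
                raise ValueError(f"{column} appears in more than one container")
            seen.add(column)
    return True


def containers_read(layout, query_columns):
    """Return the containers a query must touch for the given columns."""
    needed = set(query_columns)
    return sorted(name for name, cols in layout.items() if needed & set(cols))
```

With this layout, a query on `first_name` and `last_name` reads a single container, while a query on `customer_id` and `balance` reads two.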

Teradata Columnar utilizes several compression methods that take advantage of the columnar orientation of the data.  Methods include run-length encoding, dictionary encoding, delta compression, null compression, trim compression and the previously-available columnar-agnostic UTF8.  Multiple methods can be used with each column.
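To make three of these methods concrete, here is a minimal Python sketch of run-length, dictionary and delta encoding. These are textbook versions of the techniques, not Teradata's implementation, and the function names are my own.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer code from a dictionary."""
    dictionary = {}
    codes = []
    for value in values:
        # A new value gets the next unused code; a known value reuses its code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


def delta_encode(values):
    """Keep the first value, then store each difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]
```

All three benefit from the columnar orientation: values in a single column tend to repeat, cluster or increase steadily far more than the mixed values of a full row do.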

 

The dictionary representations are fixed length, which allows the data pages to remain free of internal maps to where records begin.  This small fact saves page-navigation calculations at run time, another benefit of columnar.  Variable-length records are handled similarly.  Dictionaries are container-specific, which is advantageous in the usual case where column values are fairly unique to the column.
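The benefit of fixed-length codes can be shown with simple offset arithmetic. In the Python sketch below, the 2-byte code width, the sample dictionary and the `pack_codes` and `read_code` helpers are all assumptions for illustration, not Teradata's page format. Because every code has the same width, the position of any row's value is just the row number times the width, so no per-record offset map is needed.

```python
import struct

CODE_SIZE = 2  # bytes per dictionary code (an assumed width)

# Container-specific dictionary: code -> value (hypothetical data).
dictionary = ["East", "West", "North"]


def pack_codes(codes):
    """Pack dictionary codes back to back as fixed-width unsigned shorts."""
    return struct.pack(f"<{len(codes)}H", *codes)


def read_code(page, row_number):
    """Locate a row's code by arithmetic alone, with no offset map."""
    offset = row_number * CODE_SIZE
    return struct.unpack_from("<H", page, offset)[0]


page = pack_codes([0, 1, 0, 2])
value_of_row_3 = dictionary[read_code(page, 3)]
```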

Design starts with analyzing the workloads to be used with the data, focusing on column-specific workloads, and then grouping the columns that are accessed together.  That lays the foundation for table creation, with its automatic compression.  Advantages will be seen in reduced storage needs and in improved performance for I/O-bound queries and scan operations.


Posted September 30, 2011 3:17 PM

NoSQL solutions are those that do not accept the SQL language against their data stores.  Ancillary to this is the fact that most do not store data in the structure SQL was built for: tables.  Though the name says "no SQL", the idea is that "not only SQL" solutions are needed to solve information needs today.  The Wikipedia article states, "Carlo Strozzi first used the term NoSQL in 1998 as a name for his open source relational database that did not offer a SQL interface".  Some of these NoSQL solutions are already coming perilously close to accepting broad parts of the SQL language.  Soon, NoSQL may be an inappropriate label, but I suppose that's what happens when a label refers to what something is NOT.


So what is it?  It must be worth being part of, since there are currently at least 122 products claiming the space.  As fine-grained as my information management assessments have had to be in the past year, routing workloads across relational databases, cubes, stream processing, data warehouse appliances, columnar databases, master data management and Hadoop (one of the NoSQL solutions), there are many more viable categories and products in NoSQL that actually do meet real business needs for data storage and retrieval.

 

Commonalities across NoSQL solutions include high-volume data, which lends itself to a distributed architecture.  The data stored is typically not the usual alphanumeric data, hence the synonymous nature of NoSQL with "Big Data".  Lacking full SQL generally corresponds to a decreased need for real-time query, and many of the solutions use HDFS for data storage.  Technically, though columnar databases such as Vertica, InfiniDB, ParAccel and Infobright and the extensions by Teradata 14, Oracle (Exadata), SQL Server (Denali) and Informix Warehouse Accelerator deviate from the "norm" of full-row-together storage, they are not NoSQL by most definitions (since they accept SQL and the data is still stored in tables).

 

NoSQL solutions all require specialized skill sets quite dissimilar to those of traditional business intelligence.  This dichotomy between the people who perform SQL and NoSQL work within an organization has already led to high walls between the two classes of projects and to an influx of software connectors between "traditional" product data and NoSQL data.  At the least, a partnership with Cloudera and a connector to Hadoop seems to be the ticket to claiming Hadoop integration.

NoSQL solutions fall into categories.  These labels may (I dare say should) replace "NoSQL" as the operative term since, despite the similarities, the divergences are many and are growing.  Whereas once all this data was excluded from management (or force-fit into relational databases), NoSQL solutions access this data better, cost less and do not carry a per-CPU cost model.  Naturally, many of the solutions are open source and embraced by various vendors with value-added code, training, support, etc.

 

The categories (and future industries) are:

 

Key-Value Stores

 

Key-value stores (KVS) like Redis store each piece of data paired with its key, accessible through a navigable tree structure or a hash table.  KVS support dynamic online activity with unstructured data.
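The key-value idea can be sketched in a few lines of Python. This is a toy in-memory store backed by a hash table (a Python `dict`); the `TinyKeyValueStore` class and its methods are hypothetical and are not the Redis API.

```python
class TinyKeyValueStore:
    """Toy in-memory key-value store backed by a hash table (a dict)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value is opaque to the store; any object can be stored.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Return True if the key existed and was removed.
        return self._data.pop(key, None) is not None


store = TinyKeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["book", "pen"]})
store.put("page:home:hits", 1017)
```

The store never inspects the values, which is what makes the model a fit for unstructured data: retrieval is always by key.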

 

Document Stores

 

Document Stores like MongoDB and CouchDB support schema-less sharding for guaranteed availability.

 

Column Stores

 

While sharing the concept of column-by-column storage with columnar databases and columnar extensions to row-based databases, column stores like HBase and Cassandra do not store data in tables but store the data in massively distributed architectures.

 

Graph Stores

 

Graph Stores like Bigdata represent connections across nodes and are useful for relationships among associative data sets.
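As a small illustration of the kind of question a graph store answers, the Python sketch below keeps connections in an adjacency structure and finds everything within a given number of hops of a node. The sample edges and the `build_graph` and `within_hops` helpers are made up for this example and do not reflect any product's API.

```python
from collections import deque

# Hypothetical connections between nodes.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]


def build_graph(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def within_hops(graph, start, max_hops):
    """Breadth-first search for all nodes reachable within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {node for node, hops in distance.items() if hops > 0}


graph = build_graph(edges)
```

Here `within_hops(graph, "alice", 2)` follows connections directly rather than joining tables, which is the access pattern graph stores are built around.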

 


Posted September 14, 2011 6:22 PM

