The Green Data Warehouse, Part 2: Top 10 Things You Can Do to Improve Energy Efficiency

Originally published February 19, 2009

My previous article, The Green Data Warehouse, Part 1, described how to plan a green computing strategy for your data warehouse initiative and then measure actual results once it is implemented. That article presented a holistic view of the key infrastructure components required to support such an initiative, along with an approach for measuring consumption at the component level. More importantly, it discussed the linkage between reducing server size and the associated cost savings at the data center level. As a refresher, the measurement framework for an individual server or other component consists of the following measures:

  • Power consumption measures how much power the entire system draws over a given period of time. A minimum benchmark period is required (say, 24 hours) to get an average consumption measurement. Savings are calculated as the difference in kilowatt-hours (kWh) multiplied by your cost per kWh ($/kWh); a worked example follows this list.

  • Heat generation measures the average heat generated during the benchmark period. Savings come from no longer having to cool the incremental heat that the baseline system would have generated.

  • Footprint measures the volume of the entire system (height x width x length). Since rack height is essentially fixed, a reduction in volume translates into reclaimed floor space; savings are calculated by multiplying that reduction by the average cost per square foot of data center space.

  • Materials covers the weight of the key materials in the system (focus on the top 5-10 materials used in the production of a typical platform). Savings can be calculated from the incremental cost to both produce and dispose of those materials, although the most significant savings accrue to the environment as a whole.
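
To make the power consumption arithmetic concrete, here is a minimal worked example in Python. Every figure in it (baseline and optimized draw, electricity rate, cooling overhead) is a hypothetical assumption for illustration, not a measured value.

```python
# Hypothetical worked example of the power-consumption savings calculation.
# All figures are illustrative assumptions, not measurements.

BASELINE_KW = 4.0        # average draw of the baseline system, in kilowatts
OPTIMIZED_KW = 2.5       # average draw after consolidation, in kilowatts
COST_PER_KWH = 0.10      # electricity rate, dollars per kilowatt-hour
COOLING_OVERHEAD = 0.5   # assumed extra kWh of cooling per kWh of server load

HOURS_PER_YEAR = 24 * 365

kwh_saved = (BASELINE_KW - OPTIMIZED_KW) * HOURS_PER_YEAR
power_savings = kwh_saved * COST_PER_KWH
cooling_savings = kwh_saved * COOLING_OVERHEAD * COST_PER_KWH

print(f"kWh saved per year: {kwh_saved:,.0f}")        # 13,140
print(f"Power savings:      ${power_savings:,.2f}")   # $1,314.00
print(f"Cooling savings:    ${cooling_savings:,.2f}") # $657.00
```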

This article takes a more tactical approach, presenting 10 things you can do immediately to reduce consumption at the system level. For each method, I’ll explain how consumption is reduced in terms of the measurement framework, provide examples, and suggest pragmatic ways to introduce the approach into your current environment or bake it into the design of a new initiative. The list is in no particular order.

Top 10 List

  1. Data Compression: Reducing the space required to store data shrinks the footprint of the data storage devices, lowers power consumption and decreases heat generation. Data compression can occur within the application, at the database or operating system level, or be built into the hardware. Examples of this include:

    • Columnar databases that store data sequentially by column, allowing for significant compression, especially in low-cardinality situations (see the sketch after this list)

    • Compression applications that work in conjunction with the operating system to compress data before it is stored and then uncompress upon retrieval

    • Specialized hardware that compresses data as it is being stored and then uncompresses upon retrieval
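
As a rough illustration of why columnar storage compresses low-cardinality data so well, here is a minimal Python sketch of run-length encoding, one of the simpler schemes such databases use. The column contents are made up for the example.

```python
# Minimal sketch of columnar compression: a sorted, low-cardinality column
# collapses into a handful of (value, run_length) pairs.
from itertools import groupby

def rle_encode(column):
    """Run-length encode a column into [(value, run_length), ...]."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# A hypothetical low-cardinality column (e.g., a state code), stored sorted
# the way a columnar database might lay it out on disk.
states = ["CA"] * 500_000 + ["NY"] * 300_000 + ["TX"] * 200_000

encoded = rle_encode(states)
print(encoded)  # [('CA', 500000), ('NY', 300000), ('TX', 200000)]
print(f"{len(states):,} values stored as {len(encoded)} runs")
```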

  2. Accelerated Query Processing: Increasing the number of queries that can be processed in a given time period reduces server footprint, leading to a reduction in power consumption and heat generation. The traditional focus of query processing improvements has been within the database, since that is where queries are compiled and executed and where the active data sets are stored. Recent technological advances present opportunities to improve query performance at the hardware level as well, including specialized processors built to execute SQL and solid-state memory devices used to hold hot spots, or heavily used tables, within the data warehouse.
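
There is a simple software analogue to pinning a hot table to faster storage. The sketch below, using SQLite purely as a stand-in for a warehouse database (file, table and column names are hypothetical), copies a heavily queried table into an in-memory database so that repeated reporting queries avoid disk I/O.

```python
import sqlite3

# Stand-in for the on-disk warehouse, loaded with a hypothetical fact table.
disk = sqlite3.connect("warehouse.db")
disk.execute("CREATE TABLE IF NOT EXISTS daily_sales (order_id INTEGER, amount REAL)")
disk.execute("DELETE FROM daily_sales")
disk.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(100_000)])
disk.commit()

# Copy the heavily used table into memory, a software analogue of moving a
# hot spot onto solid-state or RAM-backed storage.
hot = sqlite3.connect(":memory:")
hot.execute("CREATE TABLE daily_sales (order_id INTEGER, amount REAL)")
hot.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                disk.execute("SELECT order_id, amount FROM daily_sales"))

# Repeated queries against the hot copy never touch the disk tier.
total = hot.execute("SELECT SUM(amount) FROM daily_sales").fetchone()[0]
print(f"Total sales: {total:,.2f}")
```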

  3. Rationalized Backups: This wasn’t on my radar until a colleague mentioned in passing that he had been able to reduce backups by 75%, resulting in a $750,000 annual cost savings. The majority of that savings came from reduced hardware purchases and didn’t include lower energy consumption, since they outsourced their data center; even so, they surely received a benefit from their outsourcing provider based on the reduced footprint and associated consumption levels. The tactic they used was straightforward: they inventoried their backups and eliminated any overlaps.
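
A backup inventory doesn’t need sophisticated tooling to surface overlaps. Here is a toy Python sketch of the idea; the job names and datasets are invented for illustration.

```python
# Toy backup inventory: flag datasets covered by more than one backup job.
backup_jobs = {
    "nightly_full":   {"sales_fact", "customer_dim", "product_dim"},
    "weekly_archive": {"sales_fact", "audit_log"},
    "dr_replication": {"sales_fact", "customer_dim"},
}

coverage = {}
for job, datasets in backup_jobs.items():
    for dataset in datasets:
        coverage.setdefault(dataset, []).append(job)

for dataset, jobs in sorted(coverage.items()):
    if len(jobs) > 1:
        print(f"{dataset} is backed up by {len(jobs)} jobs: {', '.join(sorted(jobs))}")
```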

  4. Improved Database Access: This is a close cousin of accelerated query processing, but I felt it was worth listing separately because the focus is more on the application level. The benefits are the same, namely reduced footprint and lower power consumption. Examples include using stored procedures to capitalize on pre-compiled code, adding appropriate indexes to heavily used tables and creating aggregate tables to serve up data at “hot spots” in the dimensional cube. All of these techniques make code more efficient and reduce the need to add costly and power-hungry components such as servers and memory. Of course, the skeleton in the closet is poorly written code that produces inefficient queries. Early in my career, I was a database lead on a large project and spent approximately 50% of my time rewriting poorly written SQL that caused table scans on multi-million-row datasets.
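
The table-scan problem is easy to demonstrate. The sketch below uses SQLite (standing in for a warehouse database, with a made-up table) to show how adding an index changes the query plan from a full scan to an index search.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(i, i % 1000, i * 0.25) for i in range(50_000)])

query = "SELECT SUM(amount) FROM orders WHERE customer_id = 42"

# Without an index, the plan reports a full scan of the orders table.
print(db.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# Add an index on the filtered column and the plan switches to an index search.
db.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(db.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```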

  5. Optimized Reporting Tool: An extension of improving database access, this approach focuses on tuning your reporting tool(s) to allow the most efficient access to data within the architecture and capabilities of the tool. Examples include utilizing caching, query optimization, aggregate awareness and overnight batch report generation. The benefits are reduced processing on a per-query basis and, more importantly, the ability to spread the query load out over time. The associated reduction in peak load can result in significant cost savings by eliminating the need to purchase additional hardware and by running heavy queries at off-peak hours, when energy costs are usually lower.
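
Caching is the easiest of these to picture. Here is a minimal sketch of a report-level result cache in Python; the TTL and the shape of the execute callback are assumptions, not a real reporting tool’s API.

```python
import time

CACHE_TTL_SECONDS = 15 * 60   # assumed freshness window for cached reports
_cache = {}

def run_report(sql, execute_fn):
    """Serve identical report queries from memory while they are still fresh;
    otherwise run them against the warehouse and cache the result."""
    now = time.time()
    hit = _cache.get(sql)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]              # cache hit: no database work at all
    rows = execute_fn(sql)         # cache miss: delegate to the real call
    _cache[sql] = (now, rows)
    return rows
```

With a scheme like this, ten analysts refreshing the same dashboard within the freshness window generate one warehouse query instead of ten.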

  6. Improved Requirements and Design: This technique reaches into the early stages of the project life cycle. Most corporate technology systems are launched to support a specific set of business needs, starting with requirements definition and continuing with system design. This not only lays the foundation for the system, but can also plant the seeds for future inefficiencies. These inefficiencies usually result in increased data storage and additional ETL processing requirements. Examples include the following:

    • Redundant or overlapping requirements that are not identified and consolidated in the design phase

    • “Nice-to-have” data elements that aren’t currently required by the business but are included anyway

    • The dreaded bottom-up approach to building a data warehouse, grabbing every data element in the source systems and hoping someone will utilize the data sometime down the road

    • Heavily denormalized tables that replicate data elements in order to reduce joins

  7. Extract, Transform and Load (ETL) Efficiencies: In addition to the benefits attributed to improved requirements and design, ETL efficiencies can result from optimizing the data model, reducing temp storage and streamlining the transformation logic. These can lead to significant reductions in data storage and server processing requirements. ETL processing can consume a significant portion of overall system capacity (both processing and data storage), so reductions in this area can yield tremendous reductions in the energy consumption associated with your data warehouse. Some techniques include implementing delta processing, indexing temporary tables and processing incoming data sets more frequently (e.g., trickle feeds).
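
Delta processing is worth a quick sketch. The Python fragment below uses a high-water mark to extract only rows changed since the last run, rather than re-pulling the full table; the source table, columns and dates are all invented for the example.

```python
import sqlite3

# Stand-in source system with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO source_orders VALUES (?, ?, ?)",
                 [(1, 10.0, "2009-02-17"),
                  (2, 20.0, "2009-02-19"),
                  (3,  5.0, "2009-02-19")])

def load_delta(last_watermark):
    """Extract only rows changed since the last run; return them and the new mark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM source_orders WHERE updated_at > ?",
        (last_watermark,),
    ).fetchall()
    new_mark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_mark

delta, watermark = load_delta("2009-02-18")
print(f"{len(delta)} changed rows; next watermark = {watermark}")  # 2 rows
```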

  8. Virtualization: This can be difficult to manage in a data warehouse environment since it is generally not a good idea to mix analytical with transactional systems. And not many organizations are large enough to benefit from abstracting physical resources for their analytical environment alone. The benefit can come from combining transient data warehouse processing with other environments that are not used at the same time of day, thus maximizing the processing and storage capability of existing infrastructure. For example, data warehouses require significant temporary disk space to support ETL processing, which can be virtualized into a pool of other transient applications that run at different times of the day.

  9. Cloud Computing: This IT buzzword du jour has yet to gain a foothold within the data warehouse community, and may never represent a significant portion of installed applications. Concerns about security, the performance of queries and data loads, and connectivity are all valid. The benefits revolve primarily around centralizing servers, disk and other components in a (hopefully) highly optimized data center.

  10. Networked Data: This includes storage area networks (SANs), network attached storage (NAS) and other shared network devices. The consumption reduction results primarily from consolidating disk into a cabinet or set of cabinets that can be centrally managed in a data center equipped with efficient cooling and power. This happens under the covers and can be accomplished independent of any changes to the application or even the database management system.

Conclusion

While the focus of this article has been on reducing energy consumption, it is no coincidence that most of these techniques also improve the overall health and efficiency of the data warehouse, and should result in increased end-user efficiency as well. Overall system efficiency is the common denominator of reduced energy consumption.

On a related note, I just finished reading Tom Friedman’s Hot, Flat, and Crowded, and learned quite a bit about the power generation industry in this country. I feel that everyone should understand how turning on a light or plugging in a server impacts the central power plant. Therefore, my next article will trace the linkage from computer components to data center to power plant, and explore options for building a smart data center that is in tune with centralized power generation.

  • Rick Abbott

    Rick Abbott is President of 360DegreeView, LLC. He has over 20 years of information management and technology experience, including private and public sector work. He has significant experience in both the telecommunications and financial services industries, and has over 8 years of "Big 5" experience, including an associate partnership position with Deloitte Consulting. He has direct experience in all aspects of business intelligence and data warehouse projects, including business case development, strategic planning and business alignment, business requirements, and technical architecture and design. He also has significant experience in assisting clients in negotiating large technology product, service, and outsourcing contracts. Rick can be contacted at rick@360degreeview.com.


Comments


Posted February 19, 2009 by Joe Foley

Rick, the greenest EDW will generally be the one with the least hardware. I just posted an entry on my blog about shared everything vs. shared nothing vs. shared something. In a sentence, a database system that overloads processors rather than I/O can be greener than any combination of features in a system that overloads I/O channels and forces massive parallelization.
