Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for master's students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best, Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMI Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

This is by no means meant to be an exhaustive list, but some of you may find it helpful. It's a set of information about Very Large Data Warehousing that I've gleaned over the years. I hope you enjoy it, and please add your own knowledge by commenting on this entry.

Things to do when growing your project from 1TB to 50TB, or 50TB to 500TB, or above.

* Establish governance across the IT staff supporting and maintaining the VLDW

* Get training on performance and tuning for the database you are using

* Ask the vendor (<insert your favorite vendor here>) to show you live, working examples of their database/ETL at customers working with data sets in the size range you are considering

* Read VLDW reports from the web. Some cost money and others are free, and the vendor will happily provide some of them to you

* Ensure your throughput rates for ETL/ELT + loaders average 80,000 rows per second with 1k row sizes (without partitioning/parallelism). Anything slower is seriously detrimental to your ability to grow the environment

* Check the data model. If you are coming from a column-based database (perhaps because you outgrew it), then you must map, create, and manage the data model for the relational database that you are moving to. Even if you are using a column-based appliance, you should have a solid data model foundation for logical data representation. Governance matters here too: the bigger the system gets, the harder it becomes to manage without strict rules, standards, and good data models.

* Be flexible. Learn to align IT and to turn it into a lean machine that can execute rapidly and adapt to business needs.

* Don't be afraid to scale out or, in some cases, scale up (mostly with Big-Iron, and no, Big-Iron is not dead, far from it).

* Learn the terminology: MPP (massively parallel processing), SMP (symmetric multiprocessing), NUMA (non-uniform memory access), Clustered, Grid, Cloud

* Establish a mitigation plan, and a risk analysis plan for "what if this node fails?"
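
To put the 80,000 rows per second figure in perspective, here is a small back-of-the-envelope sketch. It assumes "1k" means 1,024 bytes per row and that a terabyte is 1,024^4 bytes; both are assumptions made for illustration, and the function names are hypothetical. It converts the row rate into bytes per second and estimates how long a single, non-parallel load stream would need to move a given volume.

```python
KB = 1024            # assumption: "1k row size" means 1,024 bytes
TB = 1024 ** 4       # assumption: binary terabyte


def bytes_per_second(rows_per_second, row_bytes):
    """Raw data rate implied by a row rate and an average row size."""
    return rows_per_second * row_bytes


def hours_to_load(total_bytes, rows_per_second, row_bytes):
    """Hours one load stream needs to move total_bytes at the given rate."""
    seconds = total_bytes / bytes_per_second(rows_per_second, row_bytes)
    return seconds / 3600


if __name__ == "__main__":
    rate = bytes_per_second(80_000, 1 * KB)
    print(f"Data rate: {rate / 1_000_000:.1f} MB/s")
    for size_tb in (1, 50, 500):
        hours = hours_to_load(size_tb * TB, 80_000, 1 * KB)
        print(f"{size_tb:>4} TB: {hours:,.1f} hours ({hours / 24:,.1f} days)")
```

Under these assumptions a single stream needs several hours per terabyte, which is one way to see why anything slower than this baseline makes growth toward tens or hundreds of terabytes painful before partitioning and parallelism are even considered.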

Technically what you can do:

* Test the limits of your machines, networks, disk, CPU, RAM, and so on. Understand their maximum throughput and their average throughput for multiple parallel processes.

* Test the database: how does it perform with queries running at the same time as a large load? The bigger the system gets, the harder it will be to "manage this". Testing with 100,000 rows of data won't cut it when one feed might deliver 1.5 billion rows every night.

* Increase I/O disk speed: aim for 300 to 400 mbits per second of throughput from the server to the disk and back again. Watch the BUFFERING that occurs on the disk. Ensure that the test clears the buffer on the disk before you run it again, otherwise your results will be invalid

* Increase the I/O channels. The number of I/O channels can have a huge impact on performance of very large systems
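
As a starting point for the disk tests above, here is a minimal sketch of a sequential write-throughput check. It is an illustration under stated assumptions, not a replacement for a proper I/O benchmark tool: the function name and the sizes are hypothetical, and it measures writes only. It calls `os.fsync` so the timing includes the flush from operating system buffers to the device, which addresses part of the buffering caveat. It does not clear read caches or a storage controller's own cache, so a read test would need additional, platform-specific steps.

```python
import os
import tempfile
import time


def measure_write_throughput(path, total_bytes, chunk_bytes=1024 * 1024):
    """Write total_bytes to path, flush to disk, and return MB/s (10^6 bytes)."""
    chunk = os.urandom(chunk_bytes)  # random data avoids trivially compressible input
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()               # push Python's buffer to the OS
        os.fsync(f.fileno())    # ask the OS to push its buffers to the device
    elapsed = time.perf_counter() - start
    return written / elapsed / 1_000_000


if __name__ == "__main__":
    # Small size for illustration only; a real test should write far more
    # than the available RAM and cache so buffering cannot hide the disk.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "throughput_test.bin")
        mb_per_s = measure_write_throughput(target, 64 * 1024 * 1024)
        print(f"Sequential write: {mb_per_s:.1f} MB/s ({mb_per_s * 8:.0f} Mbit/s)")
```

Running several copies of this at once against the same volume gives a rough feel for how throughput behaves under multiple parallel processes.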

Again, these are some of the things you can do today. I'd love to hear about your environment, along with your results.

 

Cheers,

Dan L  danL@GeneseeAcademy.com


Posted May 6, 2009 1:41 AM