Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for Masters students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

I hope you've enjoyed this series; I've not received any comments either way. I'll be finishing it up soon. In this entry I'll address "ETL" processing in general; in another entry I'll discuss "ELT" in general; then I'll begin to discuss BI queries and query engines; and finally, at the end of this series, I'll bring in a couple of "appliances" or "appliance-like" performance enhancers.

In this entry, I'm going to focus on super-fast, high-speed ETL. This entry is less about the tools and more about the architectures that work. I hope you enjoy it.

ETL is frequently used to move huge batches of data around; in fact, it's now an "aging" technology, but it has proven itself time and time again as a necessity for the following reasons:

a) Managing metadata: technical metadata, process flows, process scheduling, structured data sets, and more recently "semi-structured" data sets like XML and so on
b) GUI-driven partitioning and parallelism
c) GUI-driven grid setup, administration, and maintenance
d) Data quality
e) Access to unstructured data, and conversion to a structured world
f) GUI-driven connections to heterogeneous sources and targets

When we look at this list, there's not only a lot of power available, there are also tremendous benefits to using the ETL tools on the market. I'm sure there are more, but some of the ones I'm familiar with are (hopefully in alphabetical order):
* Ab-Initio
* Business Objects - Data Integrator
* Hummingbird - Genio
* IBM - DataStage
* Informatica - PowerCenter
* Microsoft - SSIS
* Oracle - Sunopsis (whom we haven't heard from since the acquisition)
* Pentaho - Kettle
* Talend
* Teradata - Warehousing Utilities

Within these tool sets, there are a variety of options to choose from, but what is most important in VLDW or large data sets is the architecture. The _manner_ in which we construct our loads greatly impacts performance.

Too often, the industry or the vendors suggest that performance can be solved simply by throwing hardware at the problem (oh, if only you would buy grid, or buy this, or buy that...). That helps for a little while, but it ultimately doesn't solve the problem.

If your ETL routines (during peak operation) are not pushing the existing hardware beyond an average load of 60%, or are not sustaining a hardware load of 80%, then most likely the architectures of the individual "data flows" are not correct, especially for big data.

Now wait a minute! You're telling me that just because I can, doesn't mean I should?
Yep... Too many tools make it too easy to "add another source, add another target, mix inserts, with updates, with deletes" all in the same data flow. This adds unwanted and unneeded complexity. Any time complexity enters, performance leaves.

I've blogged about this before... If you're headed down a highway towards a brick wall, and you refuse to change direction, what will happen? You'll hit the brick wall.

If you've got performance problems, and you refuse to change the architecture to try new things, you'll still have performance problems. There's only so much tweaking of knobs that can help performance, then it's all up to the architecture.

With Big Data & ETL, the first thing to realize is the mathematics behind the arrival of the data sets. The mistakes that ETL designers make are as follows (these get more painful the larger the data set, and/or the lower the latency of arrival of the data):

1) Incorporating inserts, updates, and deletes into the _same_ data flow / same process.
2) Sourcing multiple systems at the same time, depending on heterogeneous systems for data.
3) Targeting more than 1 or 2 target tables.
4) Moving rows that are too wide through a single process.
5) Loading very large data sets to targets WITH INDEXES ON.
6) Not running a cost-based optimizer in the database.
7) Not keeping statistics up to date in the database.
8) Not producing the correct indexes on the sources / lookups that need to be accessed.
9) Not purchasing enough RAM for the ETL server to house the RAM caches in memory.
10) Running on a 32 bit environment, which causes significant OS swapping to occur.
11) Running on a 32 bit environment, which causes significant OS swapping to occur.
12) Running on a 32 bit environment, which causes significant OS swapping to occur.
13) Trying to do "too much" inside of a single data flow, increasing complexity and dropping performance.
14) Believing that "I need to process all the data in one pass because it's the fastest way to do it." This is completely false; multi-passing the data can actually improve performance by orders of magnitude IF parallelism can be increased (see the sketch after this list).
15) Letting the database "bounce" errors back to the ETL tool, dropping flow rates and throughput rates by factors of 4x to 10x.
16) "THINKING" in a transactional mode rather than a batch mode, and processing each row one at a time (like they would code a cursor in a database language).
17) LOOPING inside an ETL process, because they think it's necessary (transactional processing again).
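
To make items 1 and 14 concrete, here's a minimal sketch of the idea, written as Python pseudo-flow rather than in any particular ETL tool; the table, key, and helper names are hypothetical. The point is simply that splitting a mixed change set into an insert-only pass and an update-only pass keeps each pass to one statement shape, which the database can handle with block-style bind arrays, and each pass can then be parallelized on its own.

# Hypothetical sketch: split a mixed change set into separate insert-only
# and update-only passes instead of one mixed data flow.
from typing import Dict, Iterable, List, Tuple

def split_changes(rows: Iterable[Dict], existing_keys: set) -> Tuple[List[Dict], List[Dict]]:
    """Separate incoming rows into insert and update sets by natural key."""
    inserts: List[Dict] = []
    updates: List[Dict] = []
    for row in rows:
        (updates if row["customer_id"] in existing_keys else inserts).append(row)
    return inserts, updates

def bulk_insert(table: str, rows: List[Dict]) -> None:
    print(f"INSERT pass: {len(rows)} rows into {table}")    # stand-in for the real loader

def bulk_update(table: str, rows: List[Dict]) -> None:
    print(f"UPDATE pass: {len(rows)} rows against {table}")  # stand-in for the real loader

def run_multi_pass(rows: List[Dict], existing_keys: set) -> None:
    inserts, updates = split_changes(rows, existing_keys)
    bulk_insert("stage_customer", inserts)   # pass 1: inserts only
    bulk_update("stage_customer", updates)   # pass 2: updates only

if __name__ == "__main__":
    run_multi_pass([{"customer_id": 1, "name": "A"}, {"customer_id": 3, "name": "B"}], {1, 2})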

Oh, and did I mention running on a 32 bit environment, which causes significant OS swapping to occur?
Let me explain this one. Any single process started within a 32 bit environment is limited (immediately) to 2GB of addressable RAM. Now, take away the RAM needed for the threads or the executable engine code, usually around 300MB or more, leaving us with 1.5GB to 1.7GB of RAM to work with. Then, take away any "caching" objects that might use this RAM, and multiply that "drop" in RAM by the number of caching objects that you have....

Now you are left with possibly 200MB, maybe 100MB, of RAM to allocate for "data reading / data writing"... and it drops from there.
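
As a back-of-the-envelope illustration (the overhead and cache figures below are assumptions in the same spirit as the numbers above, not measurements from any specific engine):

# Rough RAM budget for one 32 bit ETL process; all figures are illustrative.
ADDRESSABLE_RAM_MB = 2048      # 2GB per-process limit on 32 bit
ENGINE_OVERHEAD_MB = 300       # threads + executable engine code (approximate)
CACHE_OBJECTS = 6              # e.g. lookup caches in the data flow (assumed)
CACHE_SIZE_MB = 250            # per cached object (assumed)

usable_mb = ADDRESSABLE_RAM_MB - ENGINE_OVERHEAD_MB - CACHE_OBJECTS * CACHE_SIZE_MB
print(f"RAM left for reader/writer buffers: {usable_mb} MB")   # 248 MB in this example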

Let me explain this other piece as well: the Windows 32 bit OS MUST RUN CODE UNDER THE 640k BOUNDARY, so there are additional limitations there. The Windows 32 bit PAGEFILE.SYS (swap/temp) is SINGLE THREADED, and it BLOCKS I/O OPERATIONS when swapping pieces to disk. Oh yes, one more: Windows 32 bit RAM allocation will ONLY give the application 1/2 of the requested RAM, and AUTOMATICALLY put the other 1/2 in the pagefile.sys swap area.

By the way, here's a tidbit for you that you have to try to believe:
If you are running Windows 32 bit and SQLServer (either version)....
1. Change your ETL "block size" to 8k to match the database (4k, I think, in SQLServer2000).
2. Change your row size to fit as many rows as possible into an 8k block; the more rows per block, the faster the performance.
3. Change your commit point to 8,500 rows (this is the sweet spot) - a rough sketch of this block-and-commit approach follows the list.
4. Check on NETWORK PACKET SIZES between the ETL engine and the database, and increase them from 2k/4k to 8k - again to match the disk.
5. USE ODBC/OLE DB connectivity; do NOT use NATIVE libraries (except in SQLServer2005 direct CT-LIB). If your tool uses DB-LIB as "native" connectivity, it will be slow, slow, slow - DB-LIB is 12 year old technology and is NOT multi-threaded, nor parallel, whereas CT-LIB is.
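
Here is a minimal sketch of what block-oriented loading with an 8,500-row commit interval might look like from the outside, using pyodbc purely as a stand-in for whatever connectivity your ETL engine provides; the DSN, table, and column names are made up, and the block size / packet size settings live in the driver and server configuration rather than in this code. The point is the shape of the loop: one statement shape, arrays of rows, and one commit per block rather than per row.

# Sketch only: load rows in blocks and commit every 8,500 rows, not row by row.
import pyodbc

COMMIT_INTERVAL = 8_500  # the "sweet spot" suggested above

def load_in_batches(rows):
    """rows: iterable of (order_id, amount) tuples -- hypothetical shape."""
    conn = pyodbc.connect("DSN=etl_target;UID=loader;PWD=secret")  # assumed DSN
    cur = conn.cursor()
    cur.fast_executemany = True   # block-style binding in recent pyodbc versions
    sql = "INSERT INTO stage_orders (order_id, amount) VALUES (?, ?)"
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= COMMIT_INTERVAL:
            cur.executemany(sql, batch)
            conn.commit()         # one commit per block, not per row
            batch.clear()
    if batch:                     # flush the final partial block
        cur.executemany(sql, batch)
        conn.commit()
    conn.close()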

Windows 64 bit does NOT have these limitations. It requires roughly 30% more hardware to run "cool" and efficiently, but it is FAST when installed and configured correctly on the right hardware.

Linux and Unix do not have these limitations either.

So, architecture is everything. If you take the list I've published of what NOT to do, and change your architecture accordingly, you should see significant performance gains when running "the tuned data flow" by itself (as a test).

There's one more piece I'd like to share: why letting the DB handle errors slows down the processing.
Most ETL engines handle "blocks" of rows, which is why, if you think transactionally or architect transactionally (row by row), performance will NEVER reach its maximum. Let's say the ETL engine puts 150 rows in a block, and sends this block to the database. The database then caches this block until it reaches a commit point (in general).

The database is then "asked" to commit the rows it has cached (in TEMP, mind you). The database then has to iterate all the rows in each block (using an optimized bind array) and bounce the "rows in error" ONE AT A TIME back to the ETL engine. Now I ask you: is this transactional or "batch" oriented thinking? Right! Transactional... Of course it's going to be slow. The other piece that happens is that the ETL engine has to STOP PROCESSING NEW ROWS, pick up the "rows in error" one at a time, and flow them out to disk.

We've just quadrupled the I/O for that row and decreased our performance by a factor of 4x (at least). Why? The row traveled from the ETL buffer to the network, over the network to the database, was cached in TEMP in the database, then was read from TEMP (possibly disk), checked against the real table (again disk), and bounced back over the network. The ETL engine at this point STOPS its parallel processing of all kinds, just to handle the rows in error, one at a time, and write them to local disk.

This same process is repeated whether we load directly from ETL or we load from a database loader. The process is worse when we mix inserts with updates with deletes. Why? Because the database can no longer use block-style bind arrays; it now must process each row in its own individual statement (transactionally based).

So already you've lost 4x to 10x the performance you COULD be achieving, which means that a "data flow" that COULD execute at 80,000 rows per second now executes at 8,000 rows per second. A significant price to pay when dealing with large data sets.
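
One common way around the bounce-back penalty (a general pattern, not tied to any particular tool; the validation rules, column names, and reject file below are hypothetical) is to validate rows in the ETL layer and divert suspect rows to a reject file before the bulk load ever starts, so the database only sees clean, block-bindable inserts:

# Hypothetical sketch: pre-validate rows in the ETL layer so the database
# never has to bounce errors back one row at a time.
import csv

def is_valid(row):
    # Illustrative checks only; real rules come from the target's constraints.
    return row.get("order_id", "").isdigit() and row.get("amount") not in ("", None)

def split_clean_and_rejects(rows, reject_path="rejects.csv"):
    clean = []
    with open(reject_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["order_id", "amount", "reason"])
        writer.writeheader()
        for row in rows:
            if is_valid(row):
                clean.append(row)
            else:
                writer.writerow({**row, "reason": "failed pre-load validation"})
    return clean  # only clean rows go on to the bulk loader

if __name__ == "__main__":
    sample = [{"order_id": "101", "amount": "9.99"}, {"order_id": "x", "amount": ""}]
    print(len(split_clean_and_rejects(sample)), "clean rows ready for bulk load")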

The math:
800 million rows / 80,000 rows per second = 2.77 hours (about)
800 million rows / 8,000 rows per second = 27.77 hours (about)

A huge performance drain. Again, performance and tuning at these volumes usually means going contrary to the grain of what you've typically learned in building ETL load routines.
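
If you want to plug your own volumes and measured rates into the same arithmetic, it's a one-line calculation (the 800 million rows and the two rates below are just the figures from the example above):

# Re-create the example above with your own numbers.
def load_hours(row_count: int, rows_per_second: float) -> float:
    return row_count / rows_per_second / 3600

print(round(load_hours(800_000_000, 80_000), 2))  # ~2.78 hours
print(round(load_hours(800_000_000, 8_000), 2))   # ~27.78 hours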

Thoughts? Comments? I'd love to hear from you.

Cheers,
Dan L


Posted September 21, 2007 6:14 AM

5 Comments

Great series of posts on large data warehouses. I blogged about this post but I don't know if you have a trackback available.

What do you mean by point 14 multi-pass?

Hi Vincent,

Good to see your comment here. Trackbacks have been shut off due to spam attacks here on the network. But if you'd like a trackback, please send your request to me, and I'll be happy to forward it to the right person at the B-Eye Network and get it listed.

By the way, on point 14, multi-pass: I mean the ability to "break up" a single flow into multiple flows that might execute in serial, and to change the width of the data sets, and possibly the parallelism, along the way; that would help the performance. Maybe next year I'll blog on an example of this.

Thanks,
Dan L

While searching for information on high-volume ETL in SSIS, I came across your blogs. I was interested to note your 80,000 rows/sec number - is this a number you've had from the real world? I am trying to predict volumes and durations, and we have some big numbers from the analyst (6 billion in 1 table, for example), so I am trying to get a real feel for the scales achievable by SSIS. Our initial take would be running on a 64 bit SSIS box with 64 GB RAM, pulling from a replicated environment, and pushing into a 'star schema' (not really star, but let's settle on that term for now).

Naturally, we will be trying to achieve parallelism where possible, but I'm trying for 'finger in the wind' speeds, so I can estimate the order-of-magnitude execution times....

Hi Wanderer,

Yes, these are real-world numbers. But the speeds are not from SSIS; they are from another ETL tool known as Informatica. The top speeds I've measured with SSIS are around 30,000 to 50,000 rows per second using 32 bit systems.

Now, that said, using the BULK INSERT command on the SQL command line in SQLServer, setting the DB to bulkcopy true, and logging to BULK, I've received much higher speeds than that; somewhere around 80,000 rps was what I saw.

Parallelism should only increase these speeds. But before you publish any numbers, you need to performance test your own system. This is vital; I've been doing performance and tuning on systems for over 15 years. On the other hand, if you'd like to contact me about assistance in this area, that would be fine too.

Cheers,
Dan Linstedt
DanL@DanLinstedt.com

Hi Dan,

Thanks for the prompt response.

Will take on board the tips around BULK INSERT - our intention is to build 'rows per second' tracking, and then start testing a few different designs in terms of the ETL, to determine our throughput from single components and what presents the most efficient overall solution.

In terms of parallelizing items, there will surely be an overhead on the actual managing of the parallel streams, but naturally we would expect to achieve an overall efficiency (2 x 50,000 rows/sec in parallel may be more efficient than 1 x 80,000 and 1 x 10,000, but that will also be impacted by the number of rows to process). Also, in terms of the parallelizing of extracts, I would expect that at a certain point we might run into memory pressure (where we would then start considering changes to DefaultBufferMaxRows and DefaultBufferMaxSize to see if that speeds things up).

We are also considering creating some parallelism at a 'module' level - since we have 2 discrete systems we will be extracting from, and potentially up to 7/8 discrete databases (one of the systems we need to extract from is still in the design phase, so details are unclear). Our current intention was to design robust and, where possible, independent SSIS packages (or modules) that would then be run in a higher-level SSIS ETL 'orchestration' package (stole that term from a Biztalk colleague). Again, where possible, some modules may run in parallel, while others may need to run sequentially.

Also, it is felt that by designing and establishing boundaries for independent 'modules', we could port some modules to other servers and distribute the processing load if we need to scale out (we don't see us scaling up from a 16 core, 64 GB, 64 bit machine unless someone pays for 128 GB of RAM :D).

What is your take on the above?
