Business Intelligence Network business intelligence resources

Blog: Dan E. Linstedt

« System of Entry, System of Record - System of Shifting Sands | Main | Thoughts about Dynamic Data Warehousing & Context »

Indexing: VLDW and Data Sets

We are nearing the end of the entries I will be making (for now) on the VLDW world. I will discuss indexing going forward in a traditional RDBMS engine point of view. "Appliances" are changing some of this as they move into the field. But for now, indexing of large data sets requires some consideration.

When people think of large data sets, they often forget to consider the indexing. They will state to me: I have to move 1.5 Billion rows in to the RDBMS within 1 hour and 20 minutes. But then they make statements like: I can't get the performance over 30,000 rows per second, or 8,000 rows per second, or otherwise. They don't understand why the table "causes deadlocks", or why they should be concerned about TEMP I/O on load.

Well, when you load a table using a bulk loader, you can load it only two ways: FAST or SLOW, these are the only two modes available. You can request FAST load from a number of loaders, but they a) either will stop, or b) silently switch to slow mode if they run in to specific conditions. One of those conditions is indexing, another is clustering, another is constraints, and in some databases DEFAULT VALUES are considered constraints, also causing a slow down.

Indexes play a huge role in the performance of the queries, and by all means are necessary to make systems perform. In some database engines, indexing has been tremendously improved (within the last year or so) over the previous release. Here's what matters:

1. Don't have indexes on the table / partition during load, disable them, remove them, rebuild them after the load is complete; and if you're not familiar with indexing partitions, read about it and learn about it. People sometimes forget - they partition the table, but not the index.
2. Partition your indexes
3. Use Raid 0+1, move the indexes to different parts of the disk, even across separate I/O channels if you can - unless your Disk device has HUGE RAM Buffers, and your server has a HUGE RAM cache for your Database Indexing efforts.
4. Setup the indexes to work in parallel. Often times DBA's will tune the table for parallelism, but forget to tune the indexes for parallelism.
5. Change to COST BASED OPTIMIZERS, KEEP YOUR STATISTICS UP TO DATE (as I've mentioned in a previous posting).
6. Always, keep your indexes small - no more than 6 fields, no more than 250 bytes wide. Some databases state they won't even index past 250 bytes, EXCEPT in the Text fields.
7. Use SEPARATE indexes: UNIQUE for instance with (TextField, Primary Key) just for text fields
8. Don't try to index more than 1 text field at a time
9. SPLIT the text field out into it's own table, and join to it (normalize it), probably your biggest performance gain all around.
10. Increase the parallelism ON THE INDEX
11. set the FillFactor to 90%, PCTFREE to 10%, especially in data warehouses - and if you PARTITION your indexes, bump those even higher / lower to match. Data should always be inserted into a NEW partition by time, causing new pieces of the index to be built.
12. DO NOT CLUSTER, EXCEPT for maybe the Primary Key (Surrogate Sequence Number). Why? Because it multiplies the I/O of the incoming data by a factor of 2x, 3x, 4x, and more - especially as the data set grows.
13. Use Multiple SMALLER indexes to gain coverage across your data, reduce the size of each index - but have a few more of them, you'll find even the queries are faster.
14. Increase the amount of RAM that holds the indexes in buffers within the Database, INCLUDING BITMAP type indexes.
15. Increase the Read-Ahead buffering of the Indexes.
16. Increase the BLOCK SIZE of the database, it affects # of rows per index chain as well as the data.

IF YOU ARE SWITCHING FROM ONE DBMS VENDOR TO ANOTHER:
**** RE-EVALUATE YOUR INDEXING STRATEGIES ****
Do Not assume that the same indexes on the new RDBMS will perform the same as the indexes on the old RDBMS, it simply is NOT true (in most cases).

Hope this helps,
Daniel Linstedt
http://www.DanLinstedt.com - Data Vault Modeling

  Posted by Dan Linstedt on October 16, 2007 3:52 AM |

Post a comment