Performance Trade-Offs – Joins vs. Storage

Originally published March 11, 2010

Just like every other decision that needs to be made when looking at performance, we need to look at trade-offs to speed up the process of building data warehouses and data marts. One such trade-off is to look at the performance of large data sets. My hypothesis is that landing data on disk (and hence requiring more space) may be worth the added cost based on being able to build the data warehouses and data marts faster.

The Question

What is the best performing way to load a fact table that has 20 different dimensions? As I see it, there are 3 options:
  1. Create the table from a single FROM … WHERE clause

  2. Create a view that does the FROM … WHERE clause and join that to the input for the fact table.

  3. Create the table with multiple staging steps joining

The Test

20 dimension tables (named dim_a through dim_t)

1 input table (load_detail)

1 fact table (name TheBigFact)

SQL Server 2008

The dimension tables and the single load table were populated with a script and then I used WhereScape RED to generate the stored procedure(s) that will build the fact. The data model is shown in Figure 1.
 


Figure 1: The Star Schema


The dimensions were generated with a variable number of rows. The counts are in the table.



The input table is called load_bigfact and contains 8,176,989.

In order to get all rows from my input table, I am going to LEFT OUTER JOIN all of the dimensions and if they do not match, I will put a zero value in the corresponding KEY value in the table. Below is the generated code for building this table.

 INSERT INTO TheBigFact WITH ( TABLOCK )
 (
 amount,
 unitprice,
 exendedprice,
 dim_a_id,
 dim_b_id,
 dim_c_id,
 dim_d_id,
 dim_e_id,
 dim_f_id,
 dim_g_id,
 dim_h_id,
 dim_i_id,
 dim_j_id,
 dim_k_id,
 dim_l_id,
 dim_m_id,
 dim_n_id,
 dim_o_id,
 dim_p_id,
 dim_q_id,
 dim_r_id,
 dim_s_id,
 dim_t_id,
 dim_a_key,
 dim_b_key,
 dim_c_key,
 dim_d_key,
 dim_e_key,
 dim_f_key,
 dim_g_key,
 dim_h_key,
 dim_i_key,
 dim_j_key,
 dim_k_key,
 dim_l_key,
 dim_m_key,
 dim_n_key,
 dim_o_key,
 dim_p_key,
 dim_q_key,
 dim_r_key,
 dim_s_key,
 dim_t_key,
 dss_update_time
 )
 SELECT
 load_bigfact.amount,
 load_bigfact.unitprice,
 load_bigfact.exendedprice,
 load_bigfact.dim_a_id,
 load_bigfact.dim_b_id,
 load_bigfact.dim_c_id,
 load_bigfact.dim_d_id,
 load_bigfact.dim_e_id,
 load_bigfact.dim_f_id,
 load_bigfact.dim_g_id,
 load_bigfact.dim_h_id,
 load_bigfact.dim_i_id,
 load_bigfact.dim_j_id,
 load_bigfact.dim_k_id,
 load_bigfact.dim_l_id,
 load_bigfact.dim_m_id,
 load_bigfact.dim_n_id,
 load_bigfact.dim_o_id,
 load_bigfact.dim_p_id,
 load_bigfact.dim_q_id,
 load_bigfact.dim_r_id,
 load_bigfact.dim_s_id,
 load_bigfact.dim_t_id,
 ISNULL(dim_a.dim_a_key,0) dim_a_key,
 ISNULL(dim_b.dim_b_key,0) dim_b_key,
 ISNULL(dim_c.dim_c_key,0) dim_c_key,
 ISNULL(dim_d.dim_d_key,0) dim_d_key,
 ISNULL(dim_e.dim_e_key,0) dim_e_key,
 ISNULL(dim_f.dim_f_key,0) dim_f_key,
 ISNULL(dim_g.dim_g_key,0) dim_g_key,
 ISNULL(dim_h.dim_h_key,0) dim_h_key,
 ISNULL(dim_i.dim_i_key,0) dim_i_key,
 ISNULL(dim_j.dim_j_key,0) dim_j_key,
 ISNULL(dim_k.dim_k_key,0) dim_k_key,
 ISNULL(dim_l.dim_l_key,0) dim_l_key,
 ISNULL(dim_m.dim_m_key,0) dim_m_key,
 ISNULL(dim_n.dim_n_key,0) dim_n_key,
 ISNULL(dim_o.dim_o_key,0) dim_o_key,
 ISNULL(dim_p.dim_p_key,0) dim_p_key,
 ISNULL(dim_q.dim_q_key,0) dim_q_key,
 ISNULL(dim_r.dim_r_key,0) dim_r_key,
 ISNULL(dim_s.dim_s_key,0) dim_s_key,
 ISNULL(dim_t.dim_t_key,0) dim_t_key,
 Getdate() --@v_dss_update_time
 FROM
 load_bigfact
 LEFT OUTER JOIN dim_a
 ON load_bigfact.dim_a_id = dim_a.a_code
 LEFT OUTER JOIN dim_b
 ON load_bigfact.dim_b_id = dim_b.a_code
 LEFT OUTER JOIN dim_c
 ON load_bigfact.dim_c_id = dim_c.a_code
 LEFT OUTER JOIN dim_d
 ON load_bigfact.dim_d_id = dim_d.a_code
 LEFT OUTER JOIN dim_e
 ON load_bigfact.dim_e_id = dim_e.a_code
 LEFT OUTER JOIN dim_f
 ON load_bigfact.dim_f_id = dim_f.a_code
 LEFT OUTER JOIN dim_g
 ON load_bigfact.dim_g_id = dim_g.a_code
 LEFT OUTER JOIN dim_h
 ON CONVERT(DATETIME,(CONVERT(VARCHAR,load_bigfact.dim_h_id,112))) =
dim_h.a_code
 LEFT OUTER JOIN dim_i
 ON load_bigfact.dim_i_id = dim_i.a_code
 LEFT OUTER JOIN dim_j
 ON CONVERT(DATETIME,(CONVERT(VARCHAR,load_bigfact.dim_j_id,112))) =
dim_j.a_code
 LEFT OUTER JOIN dim_k
 ON load_bigfact.dim_k_id = dim_k.a_code
 LEFT OUTER JOIN dim_l
 ON load_bigfact.dim_l_id = dim_l.a_code
 LEFT OUTER JOIN dim_m
 ON load_bigfact.dim_m_id = dim_m.a_code
 LEFT OUTER JOIN dim_n
 ON load_bigfact.dim_n_id = dim_n.a_code
 LEFT OUTER JOIN dim_o
 ON CONVERT(DATETIME,(CONVERT(VARCHAR,load_bigfact.dim_o_id,112))) =
dim_o.a_code
 LEFT OUTER JOIN dim_p
 ON load_bigfact.dim_p_id = dim_p.a_code
 LEFT OUTER JOIN dim_q
 ON load_bigfact.dim_r_id = dim_q.a_code
 LEFT OUTER JOIN dim_r
 ON load_bigfact.dim_r_id = dim_r.a_code
 LEFT OUTER JOIN dim_s
 ON load_bigfact.dim_s_id = dim_s.a_code
 LEFT OUTER JOIN dim_t
 ON load_bigfact.dim_t_id = dim_t.a_code

The run started at 14:42:58 and completed at 14:58:27 – for a total of 15 minutes and 29 seconds.

Using a Gigantic View

So, although I would have never done this, I have heard of some folks creating views that bring together a lot of tables and then use that view in their INSERT statement. So I have done this as well. However, our test case may be an extreme case since there is nothing that you can join together in the view itself. I accept that, but wanted to show you the results. Basically, what I did was create a view that brought all of the dimensions together in a single view and joined that to the load table to produce the output to the fact table.

The run started at 15:59:50 and completed at 16:25:59 – for a total of 26 minutes and 9 seconds.

Next Month

Next month, I will perform the test with a different number of joins in separate steps (or stages) to produce the final fact table. The test will be to determine if there are a reasonable number of joins for 20 dimensions to run in a single step that, when added together, would run faster than a single statement. Then we will be able to draw some conclusions on whether the cost of landing more data is worth the time of processing – of course, assuming that there will be time savings, which is my hypothesis.
  • Chuck KelleyChuck Kelley
    Chuck is an internationally known expert in database technology. He has more than 30 years of experience in the design and implementation of operational/production systems, operational data stores and data warehouses (data marts). Chuck teaches seminars on a variety of database and data warehousing topics. He has co-authored or contributed to four books on data warehousing and has been published in numerous trade magazines and written columns on database technology, data warehousing, metadata, master data management, data governance and enterprise data strategies. He may be contacted at chuckkelley@usa.net.

    Editor's Note: More articles, news and resources are available in Chuck Kelley's BeyeNETWORK Expert Channel on Database Design Techniques. Be sure to visit today!

Recent articles by Chuck Kelley


Related TechTarget Editorial Content


 

Comments

Want to post a comment? Login or become a member today!

Posted October 24, 2010 by Chuck Kelley

In the particular case here, I have a "zero" key in my dimension so that if I lookup and do not find a row, then I put a zero in the key ( see the "ISNULL( key, 0)" in the SELECT statement).  That way the referential integrity is intact.

If you have late arriving dimension keys, then you can write a specific UPDATE for that key when the data arrives.

Is this comment inappropriate? Click here to flag this comment.

Posted October 24, 2010 by Mohanraj CP cpmohanraj@gmail.com

When you load a fact table by left joining to the dimensions, how do you ensure the referential integrity for the early arriving facts?

 

Is this comment inappropriate? Click here to flag this comment.