Just like every other decision that needs to be made when looking at performance, we need to look at trade-offs to speed up the process of building data warehouses and data marts. One such trade-off is to look at the performance of large data sets. My hypothesis is that landing data on disk (and hence requiring more space) may be worth the added cost based on being able to build the data warehouses and data marts faster.
The Question
What is the best performing way to load a fact table that has 20 different dimensions? As I see it, there are 3 options:
- Create the table from a single FROM … WHERE clause
- Create a view that does the FROM … WHERE clause and join that to the input for the fact table.
- Create the table with multiple staging steps joining
The Test
20 dimension tables (named dim_a through dim_t)
1 input table (load_detail)
1 fact table (name TheBigFact)
SQL Server 2008
The dimension tables and the single load table were populated with a script and then I used WhereScape RED to generate the stored procedure(s) that will build the fact. The data model is shown in Figure 1.

Figure 1: The Star SchemaThe dimensions were generated with a variable number of rows. The counts are in the table.
The input table is called load_bigfact and contains 8,176,989.
In order to get all rows from my input table, I am going to LEFT OUTER JOIN all of the dimensions and if they do not match, I will put a zero value in the corresponding KEY value in the table. Below is the generated code for building this table.
INSERT INTO TheBigFact WITH ( TABLOCK )
(
amount,
unitprice,
exendedprice,
dim_a_id,
dim_b_id,
dim_c_id,
dim_d_id,
dim_e_id,
dim_f_id,
dim_g_id,
dim_h_id,
dim_i_id,
dim_j_id,
dim_k_id,
dim_l_id,
dim_m_id,
dim_n_id,
dim_o_id,
dim_p_id,
dim_q_id,
dim_r_id,
dim_s_id,
dim_t_id,
dim_a_key,
dim_b_key,
dim_c_key,
dim_d_key,
dim_e_key,
dim_f_key,
dim_g_key,
dim_h_key,
dim_i_key,
dim_j_key,
dim_k_key,
dim_l_key,
dim_m_key,
dim_n_key,
dim_o_key,
dim_p_key,
dim_q_key,
dim_r_key,
dim_s_key,
dim_t_key,
dss_update_time
)
SELECT
load_bigfact.amount,
load_bigfact.unitprice,
load_bigfact.exendedprice,
load_bigfact.dim_a_id,
load_bigfact.dim_b_id,
load_bigfact.dim_c_id,
load_bigfact.dim_d_id,
load_bigfact.dim_e_id,
load_bigfact.dim_f_id,
load_bigfact.dim_g_id,
load_bigfact.dim_h_id,
load_bigfact.dim_i_id,
load_bigfact.dim_j_id,
load_bigfact.dim_k_id,
load_bigfact.dim_l_id,
load_bigfact.dim_m_id,
load_bigfact.dim_n_id,
load_bigfact.dim_o_id,
load_bigfact.dim_p_id,
load_bigfact.dim_q_id,
load_bigfact.dim_r_id,
load_bigfact.dim_s_id,
load_bigfact.dim_t_id,
ISNULL(dim_a.dim_a_key,0) dim_a_key,
ISNULL(dim_b.dim_b_key,0) dim_b_key,
ISNULL(dim_c.dim_c_key,0) dim_c_key,
ISNULL(dim_d.dim_d_key,0) dim_d_key,
ISNULL(dim_e.dim_e_key,0) dim_e_key,
ISNULL(dim_f.dim_f_key,0) dim_f_key,
ISNULL(dim_g.dim_g_key,0) dim_g_key,
ISNULL(dim_h.dim_h_key,0) dim_h_key,
ISNULL(dim_i.dim_i_key,0) dim_i_key,
ISNULL(dim_j.dim_j_key,0) dim_j_key,
ISNULL(dim_k.dim_k_key,0) dim_k_key,
ISNULL(dim_l.dim_l_key,0) dim_l_key,
ISNULL(dim_m.dim_m_key,0) dim_m_key,
ISNULL(dim_n.dim_n_key,0) dim_n_key,
ISNULL(dim_o.dim_o_key,0) dim_o_key,
ISNULL(dim_p.dim_p_key,0) dim_p_key,
ISNULL(dim_q.dim_q_key,0) dim_q_key,
ISNULL(dim_r.dim_r_key,0) dim_r_key,
ISNULL(dim_s.dim_s_key,0) dim_s_key,
ISNULL(dim_t.dim_t_key,0) dim_t_key,
Getdate() --@v_dss_update_time
FROM
load_bigfact
LEFT OUTER JOIN dim_a
ON load_bigfact.dim_a_id = dim_a.a_code
LEFT OUTER JOIN dim_b
ON load_bigfact.dim_b_id = dim_b.a_code
LEFT OUTER JOIN dim_c
ON load_bigfact.dim_c_id = dim_c.a_code
LEFT OUTER JOIN dim_d
ON load_bigfact.dim_d_id = dim_d.a_code
LEFT OUTER JOIN dim_e
ON load_bigfact.dim_e_id = dim_e.a_code
LEFT OUTER JOIN dim_f
ON load_bigfact.dim_f_id = dim_f.a_code
LEFT OUTER JOIN dim_g
ON load_bigfact.dim_g_id = dim_g.a_code
LEFT OUTER JOIN dim_h
ON CONVERT(DATETIME,(CONVERT(VARCHAR,load_bigfact.dim_h_id,112))) =
dim_h.a_code
LEFT OUTER JOIN dim_i
ON load_bigfact.dim_i_id = dim_i.a_code
LEFT OUTER JOIN dim_j
ON CONVERT(DATETIME,(CONVERT(VARCHAR,load_bigfact.dim_j_id,112))) =
dim_j.a_code
LEFT OUTER JOIN dim_k
ON load_bigfact.dim_k_id = dim_k.a_code
LEFT OUTER JOIN dim_l
ON load_bigfact.dim_l_id = dim_l.a_code
LEFT OUTER JOIN dim_m
ON load_bigfact.dim_m_id = dim_m.a_code
LEFT OUTER JOIN dim_n
ON load_bigfact.dim_n_id = dim_n.a_code
LEFT OUTER JOIN dim_o
ON CONVERT(DATETIME,(CONVERT(VARCHAR,load_bigfact.dim_o_id,112))) =
dim_o.a_code
LEFT OUTER JOIN dim_p
ON load_bigfact.dim_p_id = dim_p.a_code
LEFT OUTER JOIN dim_q
ON load_bigfact.dim_r_id = dim_q.a_code
LEFT OUTER JOIN dim_r
ON load_bigfact.dim_r_id = dim_r.a_code
LEFT OUTER JOIN dim_s
ON load_bigfact.dim_s_id = dim_s.a_code
LEFT OUTER JOIN dim_t
ON load_bigfact.dim_t_id = dim_t.a_code
The run started at 14:42:58 and completed at 14:58:27 – for a total of 15 minutes and 29 seconds.
Using a Gigantic View
So, although I would have never done this, I have heard of some folks creating views that bring together a lot of tables and then use that view in their INSERT statement. So I have done this as well. However, our test case may be an extreme case since there is nothing that you can join together in the view itself. I accept that, but wanted to show you the results. Basically, what I did was create a view that brought all of the dimensions together in a single view and joined that to the load table to produce the output to the fact table.
The run started at 15:59:50 and completed at 16:25:59 – for a total of 26 minutes and 9 seconds.
Next Month
Next month, I will perform the test with a different number of joins in separate steps (or stages) to produce the final fact table. The test will be to determine if there are a reasonable number of joins for 20 dimensions to run in a single step that, when added together, would run faster than a single statement. Then we will be able to draw some conclusions on whether the cost of landing more data is worth the time of processing – of course, assuming that there will be time savings, which is my hypothesis.
-
Chuck Kelley
Chuck is an internationally known expert in database technology. He has more than 30 years of experience in the design and implementation of operational/production systems, operational data stores
and data warehouses (data marts). Chuck teaches seminars on a variety of database and data warehousing topics. He has co-authored or contributed to four books on data warehousing and has been
published in numerous trade magazines and written columns on database technology, data warehousing, metadata, master data management, data governance and enterprise data strategies. He may be
contacted at chuckkelley@usa.net.
Editor's Note: More articles, news and resources are available in Chuck Kelley's BeyeNETWORK
Expert Channel on Database Design Techniques. Be sure to visit today!
Recent articles by Chuck Kelley
Comments
Want to post a comment? Login or become a member today!
Posted October 24, 2010 by Chuck Kelley
In the particular case here, I have a "zero" key in my dimension so that if I lookup and do not find a row, then I put a zero in the key ( see the "ISNULL( key, 0)" in the SELECT statement). That way the referential integrity is intact.
If you have late arriving dimension keys, then you can write a specific UPDATE for that key when the data arrives.
Is this comment inappropriate? Click here to flag this comment.
Posted October 24, 2010 by Mohanraj CP cpmohanraj@gmail.com
When you load a fact table by left joining to the dimensions, how do you ensure the referential integrity for the early arriving facts?
Is this comment inappropriate? Click here to flag this comment.