Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University where I participate on an academic advisory board for Masters Students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank-you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor for The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMI Level 5, and is the inventor of the Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

March 2007 Archives

As it turns out, there are a lot of applications of the Data Vault architecture I've built over the past 15 years. As testimony to those efforts, a few companies have built concepts and data models on the Data Vault architecture, then proceeded to patent the data models and the processes around them in order to gain a competitive edge. I think the Data Vault architecture has finally grown up. If investment bankers can see the value of the Data Vault architecture and are willing to fund a patent effort on models built from its standards, then there must be a correlation between the standards in the architecture and the understanding of business users.

I've recently been involved with many different efforts to build Data Vault (common foundational integrated data model) architectures. Nearly every one of these efforts is resulting in the CIO and investment bankers putting money forward to patent the underlying architecture, along with the methods to get the data in and out of the data models. It seems to show a growing trend: the Data Vault architecture is achieving its design goals of providing value to the business and being a model the business can utilize. The architecture is designed to be repeatable, flexible, scalable, and auditable, paving the way for incredible business flexibility. In other words: when the business changes, the model changes on the fly with little to no impact to the business.

The Data Vault is an open, public, and freely available data modeling architecture. If you're stuck between a rock and a hard place, or your Star Schemas are "beginning to fall apart" because of volume or real-time demands, then the Data Vault might be a fit for you.

The Data Vault modeling architecture is a hybrid consisting of the best-of-breed data modeling techniques from both 3rd normal form and Star Schema, except that it is a foundationally based architecture with standards which, if adhered to, can steer your enterprise common data model in the right direction. Below are a few links which discuss the nature of the Data Vault and describe, from a business perspective, the value that it brings:

CIO in India
US Military, pitched by an Oracle executive (follow the quotes about the model and assimilation techniques) - very powerful!
Kent Graziano discusses Data Vault and Agile Modeling Techniques
Dr. Claudia Imhoff is quoted in CRM at the Speed of Light...
Information Technology Series Book on applying the Data Vault
Article on SAP and pDM using the Data Vault

There are other documents detailing the use and application of the Data Vault, but I won't go into those here. The point of this entry is that by utilizing the Data Vault architecture to design your common data model (be it for services, operational needs, or data warehousing), you can in effect generate intellectual property and additional value for your firm, along with a consistent, standard, and repeatable approach that can be built out by an automated tool.
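For readers who haven't seen the modeling constructs before, here is a minimal sketch of the three core Data Vault structures: a hub for the business key, a satellite for descriptive history, and a link for relationships. The table and column names below are purely illustrative (a hypothetical customer/order example), not taken from any particular model:

-- Hub: one row per unique business key
CREATE TABLE dbo.hub_customer (
    customer_sqn    INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate sequence
    customer_number VARCHAR(20)       NOT NULL UNIQUE,      -- the business key
    load_dts        DATETIME          NOT NULL,             -- load date/time
    record_source   VARCHAR(50)       NOT NULL              -- system of record
);

-- Satellite: descriptive attributes, historized by load date
CREATE TABLE dbo.sat_customer_name (
    customer_sqn    INT          NOT NULL REFERENCES dbo.hub_customer (customer_sqn),
    load_dts        DATETIME     NOT NULL,
    customer_name   VARCHAR(100) NULL,
    record_source   VARCHAR(50)  NOT NULL,
    PRIMARY KEY (customer_sqn, load_dts)
);

-- Link: relationship between hubs, e.g. a customer placing an order
CREATE TABLE dbo.lnk_customer_order (
    link_sqn        INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    customer_sqn    INT         NOT NULL REFERENCES dbo.hub_customer (customer_sqn),
    order_sqn       INT         NOT NULL, -- would reference a hub_order table
    load_dts        DATETIME    NOT NULL,
    record_source   VARCHAR(50) NOT NULL
);

Because every hub, link, and satellite follows the same pattern, the loading routines can be generated rather than hand-written, which is exactly the repeatability argument above.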

The thing that strikes me is the number of resulting data models (built on Data Vault principles) that are being patented as company-owned IP.

Do you have thoughts on patenting internal data models? Will it help or hurt the cause of invention and moving forward? Do you see a need to model your data after your business?

Please respond.

As always, Thank-you very much,
Dan Linstedt
Get your Masters of Science in Business Intelligence, http://www.COBICC.org


Posted March 27, 2007 8:09 AM

I've received some good feedback and comments from readers in the field regarding an entry I made recently about I.T. costs and profitability. One comment discussed the notion that I.T. chargeback really isn't profitability, but rather just a shifting of sands, since the business's money is simply re-allocated. In this entry we'll explore some other notions of profitability for I.T., along with why chargeback works in certain industries but not in others.

We'll also discuss the notion of standardization and its correlation to profitability. As always, I'd like to hear from you. What do you have questions on or disagree with?

First, I'd like to point out the Gartner report "Profitable Business Models in I.T. Professional Services." It discusses the nature of profitability related to alignment. What it doesn't address is the cost aspect of the I.T. services provided in relation to business profitability.


Here's another report (albeit from a vendor of profitability systems) which discusses a valuation framework and profitability alignment. However, I am of the opinion that this vendor's approach appears to be based on the ITIL and COBIT frameworks (which are based on SEI / CMMI) as standardizations. A quote from their paper: "The increased transparency and financial integrity resulting from implementing this process contributed to Schwab’s ability to increase their operating margin by $600 Million."

Why should I care?
The ability of I.T. to be profitable and actually make money for the business is dependent on standards, BPM (business process management), alignment, and operating margins. I.T. is a slice of the business that typically has to sell its projects internally to other parts of the organization, and it also receives much of its funding from the business sides of the organization. I.T.'s ability to produce, while keeping costs low, quality high, and delivery on time, feeds directly into the costs that are incorporated into the bids the business side makes to external customers. Why? Because the cost has to be paid somewhere.

By initiating competitive cost management, alignment, and standardization within I.T., the business itself can become more competitive in its bids to customers, and is thus enabled to win more bids at lower cost and with higher quality. This is only one angle.

But what about Chargeback?
Chargeback to other lines of business is an interesting proposition. One of the readers commented that chargeback is nothing more than moving money around inside the total company, and that this does not reflect profitability within I.T. at all. I would state that while this may be true for some organizations, it is not true for those organizations where different lines of business earn profit through their own contracting, have their own customer base, and furthermore write their own contracts for "their" teams within I.T. When I.T. is segmented in this fashion (where you can't talk to the person next to you without giving them a charge number, even though they are in I.T.), profitability of chargeback to that line of business becomes a reality. It is no longer simply "shifting the sands."

In another report, "IT Spending, Staffing, and Technology Trends," they discuss where the money is being spent and where the high costs are being accumulated. While most I.T. departments (and people) fight standardization and consistency, they are eventually pressured into moving in those directions (or they face extinction).

A number of "new" consulting companies in the past 7 years have all hammered away at the notions that they are "compliant" and standardized, and have passed audits - so that their costs are lower, their quality output is better, and that they can deliver on-time. This has (in some cases) turned out to be completely false - and I.T. project costs have risen (not fallen) as a result of their mis-steps.

Let me give you an idea of what happens to costs, time, and overhead once standardization and quality BPM have been achieved:

PRODUCTIVITY METRICS - Performance (Averaged)
Unit | Performance Measurement | Pre vs. Post CMMI / L3
1. Requirements Management | Person days | 50% Reduction
2. Impact Assessment | Days per work object | 43% Reduction
3. Design Reverse Engineering | Person days | 33% Reduction
4. Test Coverage | Cases per drop | 300% Increase
5. Regression Testing Efforts | Days per drop | 80% Reduction
6. Unit Level Testing | Days per work object | 50% Reduction
7. Code Review Effort | Days per work object | 33% Reduction
8. Code Freeze Duration | Days prior to a drop | 80% Reduction
9. Work Objects per Drop | Number | 67% Increase
10. Drop Frequency | Drops per month | No Restriction

This chart came from: http://www.cmminews.co.uk/Pres2006/24th/Keynotes/The%20Key%20Why%20And%20How%20Of%20CMMI%20-%20final.pdf

OK, so CMMI, ITIL, ISACA, and COBIT are all part of improving profitability, standardizing I.T., and getting a handle on how to audit and produce better I.T. products. What about the chargeback portion of making money? How does that fit in?

Well, think about it this way: as I explained earlier, some businesses don't "win contracts" through different sectors or lines of business. In these cases chargeback doesn't do much except begin to hold the business accountable for its demands: how much data it wants accessible, and how fast it wants it accessible. So in a way chargeback works to the advantage of I.T. as a cost-control measure and a reality check on the expectations and requirements the business is putting forward.

In other situations, different lines of business win their own contracts, and each "segment" or sector of I.T. competes for funding. If a particular segment of I.T. can draw funding away from those that are inefficient, deliver poor quality, or have bad delivery records (over budget, beyond scope, etc.), then I.T. overall improves its efficiency, because the underperforming (high-cost) I.T. unit must either change its ways, conform to standards, and start managing better, or go out of business (be replaced by the better I.T. team). This draws funding across lines of business while providing lower-cost bids to the end customer, regardless of which business unit makes the bid.

Chargebacks are very effective when applied within the correct context, and they can be helpful within single businesses as well.

How can I.T. make money as a result of standardization?
Well, I'm not the first one to discuss this. As I pointed out before, I.T. is a business that should be run like a business. Like any good business, it needs to be capable of selling projects to other companies, and it also needs to be competitive for the entire business so that customers don't leave and go to the competition. Through standardization, I.T. can optimize its processes so well that it can "slice off" pieces of its business process flow and "sell" them to external customers. It can act as a data virtualization or hosting company for data, or act as a consultancy for a variety of tasks.

Take Qwest for example: a telecom company which also operates server hosting (virtualization) farms, as do Lockheed Martin, SAIC, and a few others. In this arena, I.T. is mitigating internal costs by acquiring and developing external sources of income; hopefully this becomes a profitable stream.

If you have thoughts or comments, I'd love to hear them.

Thank-you,
Dan Linstedt
Get your Masters Of Science in Business Intelligence at: http://www.COBICC.com


Posted March 25, 2007 6:05 AM

An interesting thing happened on the way to the data bank... Well, I just couldn't help myself. This is a discussion on performance, notions of derivations, product improvements and directions in SQLServer2005, and some annotations on SSIS: where it has gone, where it needs to go, and so forth. Please remember, this is for 35 million row tables on a 1 CPU, 2 GB RAM, 1 disk I/O channel laptop, so it's the bare minimum. I'd hope to see exponential (or at least linear) performance gains by going to a larger machine (but that doesn't happen either).

I've recently been working with the Daniels College of Business at Denver University on a state tax data initiative, laying out and integrating information by year, by county, and so on. We have a very small amount of information: some 35 million assessed properties spanning 8 years. The point is that I'm working with this information in SQLServer2005 and the SSIS data integration components.

When running this information on my single-CPU, 64-bit AMD Athlon with 2 GB of RAM, it runs fairly slowly. A query over all 35 million rows (table-scan based, no aggregation) runs for about 10 to 20 minutes, depending on how many rows it returns (between 100,000 and 535,000 rows). But when I run the same query on a brand-new laptop with 2 GB of RAM and an Intel Centrino Duo (dual-core), it returns data in 1 to 3 minutes.

That's a huge difference in performance and disk I/O. SQLServer2005 with table scans and no joins works extremely well, and I'm very happy with that aspect of the performance on a laptop. However, queries which require joins (even after adding the appropriate indexes) slow down dramatically, due to the significant increase in disk I/O, especially when I set them up to insert into a secondary target table.

Now here's an interesting difference I noticed. When I run the query as a view through the Import/Export data wizard, it runs very quickly until it gets to the final commit. The final commit takes forever, especially if the target table has foreign keys and primary keys and is set to auto-shrink and auto-update statistics, with traditional logging and so on. If I turn off auto-shrink and auto-stats, set the database options for select into/bulk-copy and trunc. log on chkpt., and then drop the foreign keys (leaving only the default constraints and a sequence identifier), the Import/Export wizard screams. I can move 35 million rows from one table to the next (on the new laptop) inside of 5 minutes. Not bad for a single-disk, single-table copy.
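For what it's worth, here is a rough sketch of the settings changes described above, expressed in SQL Server 2005 terms (the older select into/bulk-copy and trunc. log on chkpt. options map roughly onto the BULK_LOGGED and SIMPLE recovery models). The database, table, and constraint names are made up for illustration:

-- Turn off the options that slow the final commit (illustrative database name)
ALTER DATABASE TaxData SET AUTO_SHRINK OFF;
ALTER DATABASE TaxData SET AUTO_UPDATE_STATISTICS OFF;
ALTER DATABASE TaxData SET RECOVERY BULK_LOGGED;  -- minimally logged bulk operations

-- Drop foreign keys on the target so the commit doesn't validate 35 million rows
ALTER TABLE dbo.assessed_property_target DROP CONSTRAINT fk_target_county;

-- A table-level lock keeps logging minimal on the copy itself
INSERT INTO dbo.assessed_property_target WITH (TABLOCK)
SELECT * FROM dbo.assessed_property_source;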

However, I noticed something else: if I use a view (a join across no more than two tables) and then load the data using SSIS, all of a sudden performance jumps. Trying to execute the same join/view inside the Import/Export wizard gives me a different (slower) result.
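The view itself is nothing exotic; something like the following (hypothetical table and column names) is what I mean by a join across no more than two tables, which SSIS then treats as a simple source:

CREATE VIEW dbo.v_property_by_county
AS
SELECT p.property_id,
       p.tax_year,
       p.assessed_value,
       c.county_name
FROM   dbo.assessed_property AS p
JOIN   dbo.county            AS c ON c.county_id = p.county_id;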

Anyhow, SSIS is proving to be extremely powerful when you have Access databases to load, OLE DB sources, ODBC sources, or are all-SQLServer based. I'm impressed with the technology that Microsoft has put together. SSIS is NOT the answer (in my opinion) for non-Microsoft shops, as it has a hard time connecting to non-Microsoft operating systems and non-Microsoft systems (like SAP, Oracle Financials, PeopleSoft, etc.).

The transformations within SSIS are powerful and include fuzzy-logic grouping, fuzzy-logic matching, and neural-net learning capacities (very helpful indeed). Finally, A.I. has made it to the transformation mainstream. Yes, I'm aware that certain RDBMS engines have had this functionality for a couple of years now, buried in the SQL layers, but to see it inside an ETL-type engine is great, particularly as part of a drag-and-drop GUI.

Where do SSIS and SQLServer need to go?
These are my thoughts. I would like to see the following focus areas improved upon in future products; in order to compete, it may be wise for Microsoft to implement some of these ideas going forward:

SSIS:
1. Expandable Data Flow Views (so that columns can be seen in a standard view format, and so that I don't have to double click to see the column views exploded).
2. Additional connectors / adapters for other sources of data, maybe even unstructured data, word docs, and the like.
3. Separated notion of "connection" from "table definition"; today I have to import the table definition every time I set up a new data flow, even when selecting an existing connector to use.
4. Conditional aggregation functionality (conditional aggregate functions for each column)
5. Performance numbers (rows-per-second throughput)
6. Row width information
7. CPU/DISK and I/O performance information (easily accessible from SSIS)
8. Better "stop job" capabilities, I recently noticed SSIS outputting messages (warnings about low virtual memory). I had to watch for these, then physically kill the job, change the virtual mem, reboot, then restart the job. This is fine, except I'd rather have a job setting that says: STOP ANY JOB when virtual or physical RAM is running low.

SQLServer2005:
1. Easier partitioning. Partitioning is there, but it can be difficult to set up and maintain (see the sketch after this list).
2. Improved query optimizer. The query optimizer begins to execute some things in parallel (which is nice), but it has a hard time optimizing more than 2 or 3 table joins even with the proper indexes in place, particularly when the data set is 35 million rows or more in each table being joined. Keep in mind we have 1 disk, 1 CPU, and 1 I/O channel, so this is a huge limitation, but nonetheless the query optimizer should be able to make more efficient use of block reads and data organization.
3. Better data modeling capabilities. I find the "data model diagramming" engine to be severely lagging behind the times. Let's get some serious data modeling capabilities built into SQLServer. Now that Microsoft owns Visio, I'm surprised they haven't built the Visio data modeling engine into SQLServer (it's a little better in some areas, but worse in others). We need physical and logical data modeling capabilities, metadata definition capabilities, and so on.
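On point 1, here is a minimal sketch of what range partitioning looks like in SQL Server 2005 today: a partition function, a partition scheme, and then the table created on the scheme. The names and year boundaries are illustrative, not from the actual tax project:

-- Split the data by tax year (boundary values are just examples)
CREATE PARTITION FUNCTION pf_tax_year (INT)
    AS RANGE RIGHT FOR VALUES (2001, 2002, 2003, 2004, 2005, 2006);

-- Map every partition to the PRIMARY filegroup for simplicity
CREATE PARTITION SCHEME ps_tax_year
    AS PARTITION pf_tax_year ALL TO ([PRIMARY]);

-- Create the table on the partition scheme, keyed by the partitioning column
CREATE TABLE dbo.assessed_property_part (
    property_id    BIGINT      NOT NULL,
    county_name    VARCHAR(50) NOT NULL,
    tax_year       INT         NOT NULL,
    assessed_value MONEY       NULL
) ON ps_tax_year (tax_year);

It works, but it is three separate objects and a fair amount of ceremony for what feels like it should be a table option, which is the point of the complaint.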

Whatever you do, don't cluster the SQLServers together unless you're willing to pay the maintenance and upgrade prices. Maintenance of clustered SQLServers is tough at best, though you can see some significant performance improvements. Me? I prefer big-iron boxes to clustering. In other words: take a mainframe-class box (p-Series/z-Series, etc.), run a logical/virtual partition with a Windows OS and SQLServer2005, and it should scream...

Do you have Suggestions for SQLServer? Are you running SQLServer2005? Questions? Comments?

Thanks,
Dan Linstedt
Want to recruit students with an MS-BI? Talk to us: http://www.COBICC.com


Posted March 21, 2007 5:14 AM

My last post discussed the notion that unstructured data may be as much as 80% of the data that we in IT will (or should) begin to deal with. One of the readers asked me to expand on what I'm including in unstructured data. This entry discusses the types of structured, semi-structured, and unstructured data as I see them. As usual, this pertains to business knowledge and is a huge part of DW2.0. As it turns out, it also is (or will become) a huge part of changing IT from a cost center into a profit center. Why? Because if we can integrate unstructured information and glean the knowledge from it (determine contextual linkages), we can better understand where our business gaps are.

There are three terms being bandied about in our DW2.0 world: structured, semi-structured, and unstructured data. Let's take a look at defining these terms and what they mean to us going forward.

Structured Data
Data sitting in a data store, defined by a catalog (table definitions) and accessible via SQL, data models, COBOL copybooks, or object definitions: data in rows and columns. Furthermore, this data is contextualized by its heading (field name) and possibly defined in relation to other "fields." It can also be processed in a simple manner: summed, aggregated, and so on. What this data is NOT: images, blobs, binary fields, free-form documents, and so on.

Semi-Structured Data
Semi-structured data seems to be that which houses structure alongside free-form elements. E-mails, for instance, have structure and context in specific header elements, but are free-form text documents in the body. Semi-structured data comes in many forms, and it depends on what you are looking at as to whether the data is semi-structured or unstructured. For instance, semi-structured data for a firewall might be TCP/IP packets, where you care about the contents of each individual packet along with a string of packets from the same IP to establish a pattern, and so on.
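To make the e-mail example concrete, a semi-structured record might land in a table like this hypothetical one, where the header columns are fully structured and the body is free-form content that still has to be mined for context:

-- Header fields are structured (contextualized by their column names);
-- the body is free-form text whose meaning is not captured by the schema.
CREATE TABLE dbo.inbound_email (
    email_id   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    sender     VARCHAR(255) NOT NULL,
    recipient  VARCHAR(255) NOT NULL,
    sent_dts   DATETIME     NOT NULL,
    subject    VARCHAR(500) NULL,
    body_text  VARCHAR(MAX) NULL   -- the semi-structured/unstructured payload
);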

Unstructured Data
Unstructured data is typically all that which is not semi-structured or structured: for instance, images, this blog entry, the content of web documents, standard documents, movies, audio, and so on.

What's the big difference? Why the hoopla? I thought Word Docs were structured!
Well, it all depends on your perspective. If the application is MS Word, then the document itself is in fact structured; however, the CONTENT is not. The same goes for a web page: the tags are structured, as are CSS elements, XML, HTML, and so on, but the CONTENT is not.

Free-form text (content) is NOT structured until you look at a document that has sentences and punctuation; then, from a grammatical standpoint, it is structured at a lower level of grain. But do you care? That is the big question. Just as we care about the grain of structured data, we should care about the grain of unstructured data.

We need to separate the terms. In an unstructured or semi-structured world, we need to make a choice: do we care about the "encapsulating structure," the content, or both? The knowledge is buried in the content, and the value lies in doing something meaningful with that content.

Why?
Because unstructured, semi-structured, and structured data are "one and the same" when we talk about the encapsulating structure. All Word docs, for instance, have markers, metadata, and processing instructions for Word to follow (layout, borders, size, color, font, etc.). All e-mails have a standardized "structure," all images have specific processing instructions for standardized rendering engines, all audio the same, all blobs, and so on.

But when we talk about CONTENT, the playing field changes. Not all content is "the same." When you process a series of images, detecting that one is a face, one is a human, one is a tree, one is an ocean, and so on, determining WHAT each image is, and how it relates to other data based on what it is, is where the knowledge lies. That's where unstructured data processing lives.

Content derivation, assimilation, and integration are part of the story. Once the content can be parsed, then hopefully basic outliers of context (important points) can be derived. In other words, think of a search engine looking for key terms, but take it further: key terms that make sense or have relevance. Then take it one more step: terms that not only have relevance but actually tie together what's duplicate and what's not, and learn from the "elimination" of search results that the context is not relevant for those particular search terms.

This is just one example. Anyhow, all of this relates to DW2.0 and the stack within it. Unstructured, semi-structured, and structured data are NOT the same in a contextual sense, but they are the same in a structural-encapsulation sense. In DW2.0, we must integrate the contextual information (mine the meaning and link it together) in order to increase our awareness of what's going on in both the external and internal worlds of the corporation.

In order to make money, increase profits in IT, and actually provide more value back to the business, we as IT professionals MUST undertake automation and data mining of unstructured information, along with contextual integration, as a step forward, or we will lose sight of valuable (particularly competitive) information.

As always, in the next blog I'll talk a little more about approaching IT automation, and how to integrate unstructured information into your enterprise from a DW2.0 perspective.

Please don't hesitate to comment, or ask questions.

Thank-you,
Dan Linstedt
Get your Masters of Science in Business Intelligence at: http://www.COBICC.com


Posted March 21, 2007 4:38 AM

I just read an interesting article in CIO Decisions that talks about how IT's costs are continually rising, how companies' profitability is becoming razor thin, and how IT (if it continues at this rate) will eventually put business out of business. There are several points to this article, and I applaud gentlemen like Greg Hackett for stepping out and stating things that most of us (including me) don't often see. It's like taking time to stop and smell the roses: you know you should do it, but when you have a split-second decision to make, do you really stop?

The article: "How IT is putting you out of Business", is written by Christopher Koch, and is a bold interview with Greg Hackett. It get's down to brass tacks. There are a couple of points that I really like about the article, and then there's a few things I'd like to think about differently.

One of the major points is that IT's costs are rising faster than profit margins; in fact, profit margins are shrinking even though quality, delivery time, and overall customer satisfaction have improved. He mentions that some of this is due to the lack of attention to external information and external competitive analysis. Some of it is attributed to the cost of the infrastructure that IT brings to bear on the company for managing internal information systems only. He also talks about IT as a cost center, always treated as overhead - which in my opinion is true, until you can turn IT from a cost center into a profit center.

I found it fascinating that he suggested an increase in profitability can actually drive stock price down, due to a variety of factors. This has negative implications for managing the marketplace from a pricing perspective. He lists some sobering numbers about companies that have crashed, been eaten up (and subsequently gutted), or been completely wiped out by the competition.

So what are we facing here? What's going on in the marketplace? And why are you writing about this in a BI blog?
Well, I'm writing about it here because IT has been "my life" for a long time. I've been around the block with business users and understand how to communicate with the business sides of the house. In fact, my minor is Business Administration, and now I'm involved in entrepreneurial activities. All that aside, I've gotten to the point where I can work with business users and work with IT to turn IT into a profit center (away from a cost center). However, this requires standards, procedures, and common business practices. IT must be managed like a business, because it IS a business. I blogged about turning IT from a cost center into a profit center a while back, and in the future I'll add more to this type of discussion. I do, however, believe that IT is raising costs and cutting into the already razor-thin profit margins available; this needs to stop!

One of the problems is the sheer size of the infrastructure and architecture that IT sets up for enterprise operations. Another is that IT has become so focused on maintaining the tasks at hand that it may have lost sight, lost budget, lost people (or all of the above), and can't put forward new projects. Another is the lack of mature infrastructure management tools, mature architecture management tools, and automation at these high levels. The next "big technology swing" will meld low-level infrastructure tools together through metadata, process, and execution, with a pluggable front-end GUI that is easy to use and allows management of all of these underlying infrastructure tools and devices.

IT needs to charge back to internal and external customers for disk space, project hours, estimations, and so on. IT also needs to get good enough to center its employees on standardization and automation of tasks.

But more than that, a great point that Hackett made at the end of the article is that IT is so focused on internal data management that it is no longer enabling the business for competitive intelligence and advancement. Unstructured information and text mining are out there; IT needs to shift its focus (and fast) to including external data sources and figuring out how to contextualize that data so it makes sense to the business. In other words, IT needs to enable infrastructure that can assist in or automate the gathering of outside information that may have relevance or bearing on the decisions the executive offices should be making.

Think about it - let's take two examples. Google is literally "one big IT business," and quite possibly Yahoo is the same. In the search engine world, they each try to outdo the other. How? By internalizing as much external information as possible, and then putting it together so it makes "more sense" to their customers than the other search engines do (competition using external data). The one thing these companies cannot count on is "customer loyalty." Most web users these days will use any search engine that finds interesting items that fit their needs, and they'll switch search engines without thinking twice, just to get a different look at results.

These companies SURVIVE by internalizing external information; it is their way of doing business. We in the IT departments of the world could learn a few lessons from this, and hopefully begin to understand that a) 80% of our data is unstructured, b) IT has been focused on the other 20% of the data set for the past decade or more, and c) of the unstructured data, I would argue that at least 60% is external - and would have relevance, bearing, and direct impact on our corporate bottom lines (profitability), probably more relevance than the internal data mountain (to some degree).

As was pointed out in the article, the CIO who drills down to slush machines in Texas isn't doing their job. The CIO who is mining and integrating EXTERNAL information with the INTERNAL information (as checks and balances) has the superior dashboard, or superior content, from which to make accurate decisions (possibly avoiding disaster).

Do you have any thoughts on this? I'd love to hear them.
Daniel Linstedt
Check out our MS-BI and Entrepreneur program at COBICC.com


Posted March 7, 2007 10:48 PM