
Blog: Dan E. Linstedt

Dan Linstedt

Bill Inmon has given me this wonderful opportunity to blog on his behalf. I like to cover everything from DW2.0 to integration to data modeling, including ETL/ELT, SOA, Master Data Management, Unstructured Data, DW and BI. Currently I am working on ways to create dynamic data warehouses, push-button architectures, and automated generation of common data models. You can find me at Denver University, where I participate on an academic advisory board for Masters students in I.T. I can't wait to hear from you in the comments of my blog entries. Thank you, and all the best; Dan Linstedt http://www.COBICC.com, danL@danLinstedt.com

About the author >

Cofounder of Genesee Academy, RapidACE, and BetterDataModel.com, Daniel Linstedt is an internationally known expert in data warehousing, business intelligence, analytics, very large data warehousing (VLDW), OLTP, and performance and tuning. He has been the lead technical architect on enterprise-wide data warehouse projects and refinements for many Fortune 500 companies. Linstedt is an instructor of The Data Warehousing Institute and a featured speaker at industry events. He is a Certified DW2.0 Architect. He has worked with companies including IBM, Informatica, Ipedo, X-Aware, Netezza, Microsoft, Oracle, Silver Creek Systems, and Teradata. He is trained in SEI / CMMi Level 5, and is the inventor of The Matrix Methodology and the Data Vault data modeling architecture. He has built expert training courses, trained hundreds of industry professionals, and is the voice of Bill Inmon's blog at http://www.b-eye-network.com/blogs/linstedt/.

January 2007 Archives

My friend Bill recently wrote an article on the Time-Value of Information, in which he declared that the value decreases exponentially over time. I have no argument there when it comes to data sets that are non-metadata. However, where metadata is involved, I believe that the value of some metadata actually increases in conjunction with utilization. Conversely - metadata that is not utilized drops in value like a stone in water. In this entry we dive into specific attribute-based valuation, and begin exploring a hard method for finding / assigning value.

If you examine the trends of SOA and Master Data Management, you might quickly discover that master metadata is fairly useless without definition: things like how to use it, when to use it, where to use it, what it means, how it was aggregated, if it was aggregated, and where it came from - each of which helps to define the element from a corporate perspective. You may also quickly discover that while corporate perspectives are nice and necessary, they don't necessarily fit the bill for your division / line of business / subject area of expertise.

At that point, you're left wondering about the data itself and its relevance to the piece of the business that you manage. Now, if there's no metadata at your level - even if the master data is "fresh or current and quality cleansed" - will it be of value to you and your decision making? It might, depending on how intuitive the data set is, and how widely utilized and defined the data is as a standard. But if the metadata is missing, and/or your level of understanding of the data set is missing, then the value of the data could very well be zero - because you can't apply data to aggregations, predictions, and answer sets unless you understand how to use it and what it is defined as.

For instance, take a dollar figure defined as gross sales (nothing more). Gross sales for your line of business or your unit may be what you're looking for, but if there's no additional metadata, you may not know if this is gross sales for the corporation or gross sales from a financial perspective, or gross sales from a "prospect only" perspective. Without additional metadata, the gross-sales figure cannot be utilized in your report, therefore it is useless (zero value) to you.
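The gross-sales example can be sketched in a few lines of Python. This is purely illustrative - the field names ("scope", "lineage", etc.) and values are my own invention, not any metadata standard - but it shows why a bare number is unusable until its metadata matches the question being asked:

```python
# A hypothetical sketch: a measure carrying its own master metadata.
# All field names and values here are illustrative, not a standard.
gross_sales = {
    "value": 1_250_000.00,
    "metadata": {
        "definition": "Total invoiced sales before returns and discounts",
        "scope": "corporate",  # vs. "line-of-business" or "prospect-only"
        "aggregation": "sum of invoice line amounts, monthly",
        "lineage": "billing system -> staging -> finance mart",
    },
}

def usable_for(report_scope, field):
    """A measure is only usable when its metadata matches the question asked."""
    md = field.get("metadata")
    return md is not None and md.get("scope") == report_scope

print(usable_for("corporate", gross_sales))         # True
print(usable_for("line-of-business", gross_sales))  # False
```

Strip the metadata block away and `usable_for` can never return True - which is the "zero value" condition described above.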

I would argue that at this point, metadata is the most valuable piece of information to you, and that metadata (which is kept current with the business) is the single most valuable asset that the business has - it keeps the "straight line" of valuation, rather than the decaying curve Bill has written about. I would even argue that the metadata valuation is 2x what the data is, because with centralization, common services for data distribution, and web services (SOA), metadata really becomes the keys to the kingdom for everyone who touches/accesses that information.

I believe that Master Metadata (enterprise metadata for those of you out there) has a straight line time-valuation (again as long as it is kept current with the enterprise definitions). Master Data is mostly valueless without metadata definitions and understanding - even if the metadata is implied within someone's head - and they understand how the business runs.

The next set of questions around Master Metadata and Master Data includes: how do you actually perform a valuation of your data set?

Here are a couple of items that discuss opinions on the matter; then I'll share mine.
Oil and Gas Exploration - Value of Information. In one of their slides they say that the value of their information is the difference between the project value with the information and the project value without it - well, that doesn't go a long way toward telling you HOW to calculate the value of information.

The Value of Information
In this article, written in 2002, they discuss how information gets a valuation in the first place. I would say that there are some manners in which particular pieces of information can be assigned an estimated and assessed value based on importance to the organization, which of course is driven by utilization and metadata definition.

Surprisingly, quality of the data plays a tertiary role in the value of the asset. In my mind, poor quality data only means a poor quality decision, or the wrong decision - which could in theory cost you your job, or land you in court. Quality of the data is a goal, an intangible value that can only be measured after the outcome of the decision is known.

Here, a well-written article discusses a lot about why you want to measure ROI and the value of information - but they don't provide the how details except at a 50,000-foot level.

Now, here's one way to "measure value" of the data, I've blogged on this before (here). But even then I didn't go into detail.

Here's ONE way of looking at the hard value of information, but again - it relies on knowing and understanding the use of the data (master metadata). We cannot overlook the importance of master metadata, and the fact that without it, this case would make NO SENSE ($$) at all.

Suppose you are a credit card company, and your goal right now is to stimulate existing customers into spending more money. Your method of choice is direct, personalized marketing: a direct mail campaign. Now, what data is important (of value) to you in reaching your prospective customers? Let's say, at a minimum, address, city, state, and ZIP are most valuable - knowing where to send the direct mail.

Next on the list might be their name and gender - especially if you are personalizing. Followed by how much money they spent on their cards last year - while you might not print this information for direct mail, it might be utilized to decide which personalized package is sent to the customer. Ok, so we have the basic elements (or so we think).

What if, by accident, you don't consider AGE of an individual, what if you don't know the age of an individual? What happens if you accidentally send a huge incentive package to your best customer’s daughter who's only 11 years old? What if your best customer becomes irate, and switches credit card companies - how does that affect your revenue stream? What's the value of that single piece of information in your decision process?

Ok - so what I'm getting at is this: Information valuation is only applicable based on the TIME at which the question is being asked, and the QUESTION that in fact is being answered by the data. In this case, a direct marketing campaign with highly specialized packages of data being sent out. Two key elements / attributes (among all that are mentioned) are the association of the child to the parent, and the age. Another key element is the address.

There could be a direct cost of not having this information. Let's say you don't have the address, and you send an offer to your best customer's best friend across the street, but not to your best customer - what happens when your best customer finds out they didn't get the offer? Worse yet, what if they wanted the offer, and were just about to spend a couple million dollars on their (your) credit card? What's the COST of doing business without that address? You've just found the value of the CELL of information.

I think valuation fluctuates by row, and by cell (column) - based on the surrounding data set. I would say that you could compute an overall average cost for a missing element of information, and compound that cost through multipliers by putting your customers into market-basket analysis. The market baskets would change depending on the question. Again, an overall value (average) exists to all information, and then there is a specific valuation based on segmentation of data sets (even beyond customer).
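The average-cost-plus-multiplier idea above can be sketched as a few lines of Python. Every number and segment name here is hypothetical - the point is only the shape of the computation: an overall average cost per missing element, compounded by a segment-specific multiplier from a market-basket style analysis.

```python
# Hypothetical figures: average revenue at risk per missing cell, per element.
avg_cost_of_missing = {
    "address": 50.00,
    "age": 20.00,
    "parent_child_link": 35.00,
}

# Hypothetical multipliers from a market-basket style customer segmentation.
segment_multiplier = {
    "best_customers": 40.0,
    "occasional_buyers": 2.0,
}

def cell_value(element, segment):
    """Value of one cell = cost of NOT having it, for this customer segment."""
    return avg_cost_of_missing[element] * segment_multiplier[segment]

print(cell_value("address", "best_customers"))     # 2000.0
print(cell_value("address", "occasional_buyers"))  # 100.0
```

The same missing address cell is worth twenty times more in one segment than in the other - which is exactly the "valuation fluctuates by row and by cell" claim.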

More to come on this, if you're interested, drop your questions into the comments below - I'll do my best to answer them going forward.

Dan L
CTO, Myers-Holum, Inc

Posted January 14, 2007 4:54 AM
Permalink | No Comments |

Warning: this is a rant! (my apologies to my readers)
RFIDs are causing quite a stir; they have a multitude of problems, none of which seem to matter to government officials. At least that's a part of what this report says. I'm a believer in using technology for the right task, and I do see value in RFID for specific things, but please - don't invade my personal space with RFID tags, and please - don't force it on me. Unfortunately, whether we like it or not, governments around the world are heading this way, dictating the use of RFID in passports, drivers licenses, and medical ID cards. I fear that in the future we may be subjected to RFID implants (as I blogged before) in order to receive service, shop for groceries, go through the airport, and so on. It's a sad day to see that the ethical and privacy problems with RFID are so well documented, and so well ignored by governments.

RFID Chip in Passports - Hacked into by a security expert. Shows the flaws of the information and discusses the serious nature of releasing private information; one of the surprising things they wrote is that the RFID has no "stop-gap" measures to shut down, self-destruct, or ward off attacks. I vote for hitting it with a blunt object so as to smash the chip.

Here's another one that raises questions about the privacy and protection of top-secret personnel, top-secret locations, and so on...
RFID Spy chip implanted in "hollow coin" appears in Canada

I don't know how you feel about this, but I'm certainly upset.
1. As people in a free country where the government is elected by vote, shouldn't the government be asking us, rather than telling us, before implementing something this invasive - without a vote, and all in the name of "security"?
2. What exactly does it mean to compromise "ethics and privacy" in the name of "security"?
3. By having an RFID tag in my drivers license or passport, how much more "secure" am I really?

Here's a great report from a University on the Privacy Enhancing Technology claims for RFID, and what some of the ethical problems are:

I hope someone comes up with a device called "RFID Jammer" that can be embedded into your own clothing, placed into your wallet or stuck to your cell phone, a device that silences the radio waves or burns out the chip electronically.

Anyone can buy an RFID reader on-line - no background checks, no security, no questions asked - for about $921.00.

These are questions I have, but alas, no answers. If you have articles on RFID that you'd like to share, I'd like to hear about them.

Dan L
CTO, Myers-Holum, Inc

Posted January 12, 2007 3:52 PM
Permalink | 1 Comment |

It seems these days that many people have similar problems with performance and tuning of their ETL routines (in another blog entry I'll discuss performance and tuning of ELT). ETL may be the "old horse" in the stables, but it will exist for a very long time to come, as it serves many different purposes, such as sharing or balancing the workload between the Transformation Engine and the Database Engine - particularly since ELT is 100% database-engine based, and puts serious strain on the RDBMS (especially at huge volumes). So where does that leave ETL? What are some of the top suggestions for getting ETL to perform?

I've been teaching performance and tuning for the past 7 years, and working on systems analysis, design, and performance and tuning architectures for over 10 years. I started life as an 8080 CP/M assembly-level programmer, where I re-wrote the Digital BIOS to read MS-DOS disks, and then proceeded to re-write the compiler and linker because I only had 64KB of RAM on the machine - the compiler wouldn't compile the BIOS, and then the linker couldn't link it (too many modules, not enough RAM). So if there's one thing I understand, it's the speed of a machine and its execution cycles.

I frequently visit clients whose ETL routines (DataStage, Informatica, and Java ETL) run at 800 to 12,500 rows per second. With an average row size of 1500 bytes per row, do the math: (800 rps)(1500 bytes) = 1,200,000 bytes per second = 1.2MB per second.
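That arithmetic is worth wrapping in a tiny helper so you can check your own numbers (a sketch; the function name is mine):

```python
# Convert ETL row throughput into MB/s, as in the example above.
def throughput_mb_per_sec(rows_per_sec, bytes_per_row):
    return rows_per_sec * bytes_per_row / 1_000_000

print(throughput_mb_per_sec(800, 1500))    # 1.2  -- the "slow" end
print(throughput_mb_per_sec(12500, 1500))  # 18.75 -- the "fast" end
```

On modern disks even the upper figure is nowhere near hardware limits, which is the point of the paragraphs that follow.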

Usually the IT staff considers this "fast." This couldn't be further from the truth. In this blog I will disclose some of the things you need to look at to get higher performance, but if you want to know how to accomplish those tasks - well that involves consulting, and you'll have to contact me. Typically my customers see anywhere from 400% to 4000% performance improvements by implementing my recommendations.

If that's not "fast," then what is "fast"? Consider this: on my HP Pavilion - AMD 64-bit CPU, 2 GB RAM, single internal 80GB disk @ 7200 rpm, with the ETL engine and database co-located - I'm reading from a flat file of 2M rows and inserting into the database (a non-empty table with a single primary key index), and receiving between 40,000 rps and 60,000 rps (best case: 60,000 rps x 1500 bytes per row = 90MB per second). For updates I receive 12,500 to 20,000 rps (best case: 20,000 rps x 1500 bytes per row = 30MB per second); for deletes it varies by key selection (range or singular).

Hint 1:
These are the numbers you should be shooting for without using parallel objects, and without partitioning the data set. This way, when parallelism and partitioning are applied you gain a multiplier of these numbers.

Hint 2:
If you're running too many instances of an RDBMS engine on a single machine, you can easily over-run your available hardware. CONSOLIDATE ALL DIFFERENT INSTANCES to a SINGLE instance of the engine, tune that instance, and you'll see better performance - almost guaranteed. For instance, an 8-CPU machine with 8 GB RAM can handle, at most, 2 instances of a DBMS engine, IF each one is tuned and limited to use only 4 CPUs and 4 GB of RAM MAX.

Hint 3:
Rule of thumb with ETL: always, always, always separate your inserts from your updates from your deletes. Running mixed-mode (inserts and updates) within the same stream causes performance slowdowns by orders of magnitude. Trust me on this one. A data flow (mapping) that contains inserts and updates may run at 12,500 rows per second (1500-byte rows), whereas, when split apart, it sees the performance gain mentioned above.
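The splitting step can be sketched in plain Python - this is illustrative pseudologic, not any ETL vendor's API, and the key/flag fields are invented for the example. The idea is simply to classify incoming rows into three separate streams before loading, instead of running mixed-mode DML in one flow:

```python
# Classify incoming rows into insert / update / delete streams before
# loading. "id" and the "deleted" flag are hypothetical field names.
def split_streams(incoming, existing_keys):
    inserts = [r for r in incoming if r["id"] not in existing_keys]
    updates = [r for r in incoming
               if r["id"] in existing_keys and not r.get("deleted")]
    deletes = [r for r in incoming
               if r["id"] in existing_keys and r.get("deleted")]
    return inserts, updates, deletes

rows = [
    {"id": 1, "amt": 10},                   # existing key -> update
    {"id": 2, "amt": 20, "deleted": True},  # existing key -> delete
    {"id": 3, "amt": 30},                   # new key      -> insert
]
ins, upd, dele = split_streams(rows, existing_keys={1, 2})
print(len(ins), len(upd), len(dele))  # 1 1 1
```

Each stream can then be bulk-loaded with its own optimized DML path (bulk insert, batched update, range delete) rather than forcing the engine to decide row by row.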

Hint 4:
What is the shortest distance between two points? A straight line right? Well, the same goes for ETL data flows - the more splits across transformation objects, the less performance is usually seen. * Note this is NOT true for Ab-Initio, because Ab-Initio runs optimization algorithms (highly sophisticated mathematics) to remove and eliminate bottlenecks in the mapping/graph. What you design in Ab-Initio is not always what is run under the covers.

Keeping the data flowing through the transformation objects rather than branching around them is always preferable for performance. But watch out!! The more you tune, the more standards and best practices you break, and (oftentimes) the more metadata is lost. ONLY TUNE WHAT IS TRULY BROKEN AND SIGNED OFF AS SUCH WITH AN SLA AND THE BUSINESS USERS.

Hint 5:
I run into too many underpowered hardware engines - or the converse: too much parallelism. People try to do too much all at the same time. Balancing the load cycle is much better than overloading the hardware, and will almost always yield faster performance across the board. For example: I know of places that run 400 ETL jobs in parallel; the longest job in that parallel group runs at 800 rps - and runs for about 2 hours (5.76M rows); on average, they all run slowly. When we split them into two parallel groups of 200 jobs each, with the groups run sequentially, we saw job speeds increase to 20,000 rps, and the running time of the longest job dropped to 4.8 minutes!! Both groups, end to end, ran in under 15 minutes. That's a two-hour run time reduced to 15 minutes total!!
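The back-of-the-envelope arithmetic behind that example checks out, using only the figures quoted above:

```python
# Figures from the example: the longest job moves 5.76M rows.
rows = 5_760_000  # 800 rps * 2 hours

# 400 jobs fully in parallel: contention drags the longest job to 800 rps.
all_at_once = rows / 800          # seconds

# Two sequential groups of 200: less contention, 20,000 rps per job.
two_groups = 2 * (rows / 20_000)  # seconds, worst case across both groups

print(all_at_once / 3600)  # 2.0  -> hours for the overloaded run
print(two_groups / 60)     # 9.6  -> minutes for the longest jobs, both groups
```

Even paying the "run the groups one after the other" penalty twice, the balanced schedule finishes an order of magnitude sooner - consistent with the under-15-minutes result.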

Beware of overloading; don't believe the hype that always adding parallelism will give you performance boosts.

Hope this helps, Feel free to contact me directly with your performance issues.
Dan L

Posted January 12, 2007 5:52 AM
Permalink | 3 Comments |

I've posted and written many different things over the years about what technology vendors (specifically BI tool vendors, RDBMS vendors, ELT/ETL vendors, and EII and EAI vendors) need to deliver in the future. This is another look at an updated wish-list, along with market expectations and what I'm seeing as faults in the industry today. Don't get me wrong, some vendors are scrambling to put new technology in place - such as temperature-based data, high-speed interconnectivity, and massive parallelism to handle volumes - it's just that they aren't quite there yet. So here's a look at what I'm hoping to see in 2007 and beyond.

Convergence, convergence, convergence - I've written about it, spoken about it, and conferred with colleagues about it. It's happening, like it or not. The once well-defined "niches" and edges that software and hardware technology vendors used to have are fading away. Customers want single consolidated instances of data, single points of management, and consistent (common) models, common architectures, common services, common metadata.... and they are (as always) attempting to reduce tool sets they use to move data around the organization.

So what does all this mean anyway? ETL/ELT and RDBMS lines are blurring; EAI, EII, and Web Services lines are blurring; BI (reporting/analytic) and RDBMS lines are blurring. There's cross-over, cross-out, buy-up, snatch-up, use-up, cross-company convergence happening. Look at HP - it buys Knightsbridge, and brings Tandem Non-Stop SQL back to life as MPP on the HP SuperDome to compete with appliances. Appliances are coupling RDBMS, fast-access and fast-load paradigms, MPP and parallelism, and so on.

ELT and ETL are morphing into a balancing act between the RDBMS and "transform in stream" - but wait, there's more!! EAI has morphed (somewhat) into web services, new vendors for web services have sprung up to handle metadata, and EII is playing a middleware integration role. Each is vying for its space, but each is beginning to rely on the others for accessibility, transformation, on-demand information delivery, metadata management, and so on.

So what kinds of things should we expect to see in 2007?
I predict (which is hard to do) that ETL and ELT will become singular through tool utilization:
* the last remaining independent tool vendors may be purchased by large hardware companies and the functionality morphed into RDBMS feature sets, and the rest of the functionality embedded in a web-services tool set built with SOA in mind.
* ETL and ELT will be around for a while as a legacy integration, but as people move more towards real-time or active, the need will dissipate somewhat - as systems finally consolidate, and OSS sets up single sources, common data models, and operational layers for SOA and the enterprise - legacy systems may finally sunset. This will trigger sunset on "moving large batches of history around".
* New systems that are written will be architected on a common data model that will act as BOTH an operational system, AND a data warehouse - these will no longer be separated.
* EAI will complete its transition into the SOA space, and those vendors that don't will eventually die out, as SaaS takes over applications and web services take over the integration components.
* MDaaS - metadata as a service - will begin to take shape. There will be new entries in the market place claiming common metadata for sale, linked to common data models - these models will shape the SOA market place and OSS / DSS systems of the future.
* EII will make its mark in the metadata management world, and begin to truly allow business users point-and-click management of ontologies and taxonomies of both DATA and METADATA, along with access paths, security, and push-button web-services management.
* RDBMS vendors will push the barriers on volumes of data, adding compression of data sets, encryptions, searches of both encrypted and compressed data; new indexing mechanisms will arise as a result of DW2.0, volumes, and Real-Time or Active warehousing. RDBMS vendors will begin incorporating common data models as a part of their delivery to customers. RDBMS vendors will add in-memory aggregations, and temperature based data sets. RDBMS engines will perfect their self-tuning, and MPP operations, they will also perfect query re-writes, and provide dynamic aggregation capabilities for constantly accessed / grouped data.
* BI tool vendors will STOP having to deal with middle-server aggregations, and start having to address serious volume, serious performance. Those BI tools that don't run queries in parallel for the same report will have to re-write their core architecture to support "every query parallel - every time / all the time". BI Vendors will begin to push the envelope on Data Visualization, and exploration (walk-throughs of visual data). BI Vendors may even begin to experiment with visualizing data in new "models" that we haven't thought of, such as 3D (showing data as chemistry models for instance). BI Vendors will have to deal with scalability on single servers, they will have to make reports available via WEB-SERVICE REQUESTS, and at the selection of the web-service requestor will have to produce the report in any format requested.
* BI Vendors will finally have to deal with security at a cell level for display of sensitive data on the reports. Thresholds for individual fields will be set to hide data, and show data based on who, what and when.

There's more; if you're interested in hearing more of my thoughts, please don't hesitate to ask. However, I'd like to hear from all of you - what is it you need most in 2007 from these vendors? What do you want them to produce for you? What isn't working for you right now?

Hope to hear from you,
Dan Linstedt
CTO, Myers-Holum, Inc

Posted January 8, 2007 9:13 PM
Permalink | 2 Comments |

RFID (Radio Frequency Identification) tags have been scaled back in terms of production, usage, and implementation mandates from companies like Wal-Mart and others. Of course, you'll still see RFID on store shelves, particularly for larger and more expensive products - but the technology has been cited as having tons of problems, ranging from ethical questions to simple data-gathering questions. In case you're a follower of the RFID channel, you might be interested in some of these findings.

Quite a while back I wrote on RFID and what a database manufacturer would have to do to support RFID. See my article here. Then there is the notion of RFID as it pertains to the privacy and security context (within VLDW). I wrote about that here. But alas, RFID brings with it tons of problems and issues that haven't been resolved - and may not be. Wal-Mart has quietly pulled back on mandating RFID across all its suppliers. GM and Ford have also pulled back, and Congress has raised all kinds of issues surrounding the privacy of RFID de-activation.

Here is a simple discussion of these issues: (this is a fictitious example to illustrate a point)

Wal-Mart wanted every item tagged from inception through completion. Suppose these items are "M" earrings. M earrings are tagged as a pair, and the pairs are put into a carton; there are 24 pairs to a carton, and each carton is tagged. There are 48 cartons put on a single shrink-wrapped unit, and the unit is then tagged. There are 15 units per palette. Each palette is tagged. Then, finally, there are thousands of palettes on the warehouse floor.
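It's worth doing the arithmetic on that (fictitious) hierarchy, because it shows how fast the tag count - and therefore the signal count - explodes per palette:

```python
# Tag counts implied by the fictitious packaging hierarchy above.
pairs_per_carton = 24
cartons_per_unit = 48
units_per_palette = 15

tags_per_palette = (
    1                                                          # the palette
    + units_per_palette                                        # each unit
    + units_per_palette * cartons_per_unit                     # each carton
    + units_per_palette * cartons_per_unit * pairs_per_carton  # each pair
)
print(tags_per_palette)  # 18016
```

Over 18,000 active tags on a single palette - multiply that by thousands of palettes on the warehouse floor, and the interference and filtering questions below become obvious.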

Now come the questions:
1. What if one of the tags on the earring boxes "dies", how do you locate the dead signal to replace the RFID tag? Furthermore, there are machines for packaging, but no machines for unpackaging. If you do manage to find the dead signal, you have to unwrap the entire palette and all subsequently wrapped sub-components to get to the tag.
2. What if some of the tags interfere with each other? Their signals get crossed, and you can no longer tell which product is which.
3. That many radio signals all require their own frequency - with thousands of palettes on the shipping floor, you have millions of signals - resulting in interference with cell phones, wireless networking, car radios, and other items not linked directly to copper wire. Bleed-over into other frequencies quickly becomes an issue.
4. How do you know (electronically) that you want to track or activate only the signal on a palette once all items have been wrapped into a unit? How do you shut off or filter out all sub-signals within a palette? RFID transponders cannot do this; they send radio frequencies across the board, and all the RFIDs in range respond - resulting in huge signal overload.

Now on to the ethics side of the questions:
1. As a consumer you probably don't want someone tracking you (the pants / jacket / shirt you're wearing) as you move around in the mall, or your home or car as you pass an RFID transponder sitting on top of a stop-light at major intersections. That is pure invasion of privacy, very similar to the invasion of privacy that the cameras on top of major intersections today also create.
2. Once you leave a store, how do you know that the store has in fact shut-down the RFID or removed the tag? Some of the tags were supposed to be sewn into the material directly - and it's not just clothing - it's coffee, tea, food items, toys, cars, bicycles, and so on.
3. What would happen if you accidentally drank an RFID? You can't see it, and if it gets into a food item you're making and you ingest it, then what?

Ok, I'm not the only one bringing to light major concerns. Congress is asking tons of questions, as are the retailers. Below are some interesting press releases about RFID and concerns:

RFID Software a “Pandora’s Box”
Fake Products Can Bypass Quality, Safety
Item-Level RFID Tags Cost More than Expected
Report: Major RFID Hurdles Ahead
IPOs in RFID: If Not Alien, Then Who?
RFID & Individual Privacy
Ethical Problems and RFID
Doctor Tagged with RFID worries about privacy.

One problem: I searched and searched for RFID problems, ethics, issues, privacy, and so on - I found many voices speaking of these issues, but it seems as though the big dogs are not publicly stating what they've found to be issues, nor are they openly discussing why they are backing down. I'll continue to look for this information, and as I find it, I'll post it here. If you can find quality articles from well-known journals that discuss the ethical implications of RFID, I'd love to hear from you.

RFID is not dead; it will still be utilized (for good or bad), because it is a technological advancement and has been proven effective at some levels of tracking. And as always with new technology like this, implementation leads the way long before the impacts are known and legislation can take place.

Hope this was interesting for you,
Dan L
CTO, Myers-Holum, Inc

Posted January 8, 2007 4:14 AM
Permalink | No Comments |