Blog: Dan E. Linstedt

January 14, 2007
Time Value of Metadata

My friend Bill recently wrote an article on the Time-Value of Information, in which he declared that the value of information decreases exponentially over time. I have no argument there when it comes to data sets that are non-metadata. However, where metadata is involved, I believe that the value of some metadata actually increases in conjunction with utilization. Conversely, metadata that is not utilized drops in value like a stone in water. In this entry we dive into specific attribute-based valuation, and begin exploring a hard method for finding / assigning value.

If you examine the trends of SOA and Master Data Management, you might quickly discover that master metadata is fairly useless without definition: things like how to use it, when to use it, where to use it, what it means, how it was aggregated, if it was aggregated, and where it came from - each of which helps to define the element from a corporate perspective. You may also quickly discover that while corporate perspectives are nice and necessary, they don't necessarily fit the bill for your division / line of business / subject area of expertise. At that point, you're left wondering about the data itself and its relevance to the piece of the business that you manage.

Now, if there's no metadata at your level - even if the master data is "fresh or current and quality cleansed" - will it be of value to you and your decision making? It might, depending on how intuitive the data set is, and how widely utilized and defined the data is as a standard. But if the metadata is missing, and / or your understanding of the data set is missing, then the value of the data could very well be zero - because you can't apply data to aggregations, predictions and answer sets unless you understand how to use it and what it is defined as. For instance, take a dollar figure defined as gross sales (nothing more).
Gross sales for your line of business or your unit may be what you're looking for, but if there's no additional metadata, you may not know if this is gross sales for the corporation, gross sales from a financial perspective, or gross sales from a "prospect only" perspective. Without additional metadata, the gross-sales figure cannot be utilized in your report, therefore it is useless (zero value) to you.

I would argue that at this point, metadata is the most valuable piece of information to you, and that metadata (which is kept current with the business) is the single most valuable asset that the business has, and keeps the "straight line" of valuation that Bill has written about. I would even argue that the metadata valuation is 2x that of the data, because with centralization, common services for data distribution, and web services (SOA), metadata really becomes the keys to the kingdom for everyone who touches / accesses that information.

I believe that Master Metadata (enterprise metadata for those of you out there) has a straight-line time-valuation (again, as long as it is kept current with the enterprise definitions). Master Data is mostly valueless without metadata definitions and understanding - even if the metadata is only implied within someone's head, and they understand how the business runs.

The next set of questions around Master Metadata and Master Data includes: how do you actually perform a valuation of your data set? Here are a couple of items that discuss their opinions on the matter; then I'll share mine with you.
Surprisingly, quality of the data plays a tertiary role in the value of the asset. In my mind, poor quality data only means a poor quality decision, or the wrong decision - which could in theory cost you your job, or land you in court. Quality of the data is a goal, an intangible value that can only be measured after the outcome of the decision is known.

Here, a well written article discusses a lot about why you want to measure ROI and the value of information - but they don't provide the "how" details except at a 50k-foot level. Now, here's one way to "measure value" of the data; I've blogged on this before (here). But even then I didn't go into detail.

Here's ONE way of looking at the hard value of information, but again - it relies on knowing and understanding the use of the data (master metadata). We cannot overlook the importance of master metadata, and the fact that without it, this case would make NO SENSE ($$) at all.

Suppose you are a credit card company, and your goal right now is to stimulate existing customers into spending more money. Your method of choice is direct, personalized marketing: a direct mail campaign. Now, what data is important (of value) to you in reaching your prospective customers? Let's say, at a minimum, address, city, state and zip are most valuable - knowing where to send the direct mail. Next on the list might be their name and gender - especially if you are personalizing. That is followed by how much money they spent on their cards last year - while you might not print this information in the direct mail, it might be utilized to decide which personalized package is sent to the customer.

Ok, so we have the basic elements (or so we think). What if, by accident, you don't consider the AGE of an individual? What if you don't know the age of an individual? What happens if you accidentally send a huge incentive package to your best customer’s daughter who's only 11 years old?
What if your best customer becomes irate, and switches credit card companies - how does that affect your revenue stream? What's the value of that single piece of information in your decision process?

Ok - so what I'm getting at is this: information valuation is only applicable based on the TIME at which the question is being asked, and the QUESTION that is in fact being answered by the data. In this case, the question concerns a direct marketing campaign with highly specialized packages of data being sent out. Two key elements / attributes (among all that are mentioned) are the association of the child to the parent, and the age. Another key element is the address.

There could be a direct cost of not having this information. Let's say you don't have the address, and you send an offer to your best customer's best friend across the street, but not to your best customer - what happens when your best customer finds out they didn't get the offer? Worse yet, what if they wanted the offer, and were just about to spend a couple million dollars on their (your) credit card? What's the COST of doing business without that address? You've just found the value of the CELL of information.

I think valuation fluctuates by row, and by cell (column) - based on the surrounding data set. I would say that you could compute an overall average cost for a missing element of information, and compound that cost through multipliers by putting your customers into market-basket analysis. The market baskets would change depending on the question. Again, an overall (average) value exists for all information, and then there is a specific valuation based on segmentation of data sets (even beyond customer).

More to come on this. If you're interested, drop your questions into the comments below - I'll do my best to answer them going forward.

Thanks,

January 12, 2007
RFID tracking for Individuals needs to go away

Warning: this is a rant!
(my apologies to my readers)

RFID Chip in Passports - Hacked into by Security Expert. It shows flaws of information and discusses the serious nature of the release of private information, and one of the surprising things they wrote about is that the RFID has no "stop-gap" measures to shut down, self-destruct, or ward off attacks. I vote for hitting it with a blunt object so as to smash the chip.

Here's another one that raises questions about the privacy and protection of top-secret personnel, top-secret locations, and so on... I don't know how you feel about this, but I'm certainly upset.

Here's a great report from a University on the Privacy Enhancing Technology claims for RFID, and what some of the ethical problems are:

I hope someone comes up with a device called "RFID Jammer" that can be embedded into your own clothing, placed into your wallet or stuck to your cell phone: a device that silences the radio waves or burns out the chip electronically. Anyone can buy an RFID reader on-line, no background checks, no security, no questions asked, for about $921.00.

These are questions I have, but alas, no answers. If you have articles on RFID that you'd like to share, I'd like to hear about them.

Thanks,

Performance of ETL From an Architecture Perspective

It seems these days that many people have similar problems with performance and tuning of their ETL routines (in another blog entry I'll discuss performance and tuning of ELT). ETL may be the "old horse" in the stables, but it will exist for a very long time to come, as it serves many different purposes (such as sharing or balancing the workload) between the Transformation Engine and the Database Engine. That matters particularly where ELT is 100% database-engine based, and puts some serious strain on the RDBMS (especially at huge volumes).

So where does that leave ETL? What are some of the top suggestions for getting ETL to perform?
I've been teaching performance and tuning for the past 7 years, and working on systems analysis, design, performance and tuning architectures for over 10 years. I started life as an 8080 CP/M assembly-level programmer where I re-wrote the Digital BIOS to read MS-DOS disks, and then proceeded to re-write the compiler and linker because I only had 64kb of RAM on the machine, and the compiler wouldn't compile the BIOS, and then the linker couldn't link it (too many modules, not enough RAM). So if there's one thing I understand, it's the speed of a machine and execution cycles.

I frequent clients where the performance of their ETL routines (DataStage, Informatica, and Java ETL) starts at 800 to 12,500 rows per second, with an average row size of 1500 bytes per row. Do the math: (800 rps)(1500 bytes) = 1,200,000 bytes per second = 1.2MB per second. Usually the IT staff considers this "fast." This couldn't be further from the truth.

In this blog I will disclose some of the things you need to look at to get higher performance, but if you want to know how to accomplish those tasks - well, that involves consulting, and you'll have to contact me. Typically my customers see anywhere from 400% to 4000% performance improvements by implementing my recommendations.

If that's not "fast" then what is "fast"? Consider this: on my HP Pavilion (AMD 64-bit CPU, 2 GB RAM, single internal 80GB disk @ 7200 rpm, with ETL engine and database co-located), I'm reading from a flat file of 2M rows and inserting to the database (non-empty table) with a single primary key index, and receiving between 40,000 rps and 60,000 rps (best case: 60,000 rps x 1500 bytes per row = 90MB per second). For updates I receive 12,500 to 20,000 rps (best case: 20,000 rps x 1500 bytes per row = 30MB per second); for deletes it varies by key selection (range or singular).

Hint 1:
Hint 2:
Hint 3:
Hint 4: Keeping the data flowing through the transformation objects rather than branching around them is always preferable for performance.
But watch out!! The more you tune, the more standards and best practices you break, and the more metadata is lost (oftentimes). ONLY TUNE WHAT IS TRULY BROKEN AND SIGNED OFF AS SUCH WITH AN SLA AND THE BUSINESS USERS.
Hint 5: Beware of overloading; don't believe the hype that always adding parallelism will give you performance boosts.

Hope this helps,
Feel free to contact me directly with your performance issues.

January 8, 2007
My Holiday Wish List for BI of Tomorrow

I've posted and written many different things over the years about what technology vendors (specifically BI tool vendors, RDBMS vendors, ELT / ETL vendors, and EII and EAI vendors) need to have in the future. This is another look at an updated wish-list, along with market expectations and what I'm seeing as faults in the industry today. Don't get me wrong, the vendors (some) are scrambling to put new technology in place such as temperature-based data, high-speed interconnectivity, and massive parallelism to handle volumes - it's just that they aren't quite there yet. So here's a look at what I'm hoping to see in 2007 and beyond.

Convergence, convergence, convergence - I've written about it, spoken about it, and conferred with colleagues about it. It's happening, like it or not. The once well-defined "niches" and edges that software and hardware technology vendors used to have are fading away. Customers want single consolidated instances of data, single points of management, and consistent (common) models, common architectures, common services, common metadata.... and they are (as always) attempting to reduce the tool sets they use to move data around the organization.

So what does all this mean anyway? ETL / ELT and RDBMS lines are blurring; EAI, EII, and Web Services lines are blurring; BI (reporting / analytic) and RDBMS lines are blurring. There's cross-over, cross-out, buy-up, snatch-up, use-up, cross-company convergence happening.
Look at HP - it buys Knightsbridge, and brings Tandem Non-Stop SQL back to life as MPP on HP SuperDome to compete with appliances. Appliances are coupling RDBMS, fast access, fast load paradigms, MPP and parallelism, and so on. ELT and ETL are morphing into a balancing act between the RDBMS and "transform" in stream, but wait - there's more!! EAI has morphed (somewhat) into web services, new vendors for web services have sprung up to handle metadata, and EII is playing a middleware integration role. Each is vying for its space, but each is beginning to rely on the others for accessibility, transformation, on-demand information delivery, metadata management and so on.

So what kinds of things should we expect to see in 2007?

There's more. If you're interested in hearing more of my thoughts, please don't hesitate to ask; however, I'd like to hear from all of you - what is it you need most in 2007 from these vendors? What do you want them to produce for you? What isn't working for you right now?

Hope to hear from you,

RFID Is Dead! Or Is It?

RFID (Radio Frequency Identification) tags have been stopped in terms of production, usage, and mandates to be implemented from companies like Wal-Mart and others. Of course, you'll still see RFID on store shelves, particularly for larger and more expensive products - but this is a technology that has been described as carrying tons of problems, ranging from ethical questions to simple data gathering questions. In case you're a follower of the RFID channel, you might be interested in some of these findings.

Quite a while back I wrote on RFID and what a database manufacturer would have to do to support RFID. See my article here. Then, there is the notion of RFID as it pertains to the privacy and security context (within VLDW). I wrote about that here.

But alas, RFID brings with it tons of problems and issues that haven't been resolved - and may not be. Wal-Mart has quietly pulled back on implementing RFID across all its suppliers.
GM and Ford have also pulled back, and Congress has raised all kinds of issues surrounding the privacy of RFID de-activation.

Here is a simple discussion of these issues (this is a fictitious example to illustrate a point): Wal-Mart wanted every item tagged from inception through completion. Suppose these items are "M" earrings. M earrings are tagged as a pair, the pair is put into a carton, there are 24 pairs to a carton, and then each carton is tagged. There are 48 cartons put on a single shrink-wrapped unit, and the unit is then tagged. There are 15 units per pallet. Each pallet is tagged. Then finally there are thousands of pallets on the warehouse floor.

Now come the questions:

Now on to the ethics side of the questions:

Ok, I'm not the only one bringing major concerns to light. Congress is asking tons of questions, as are the retailers. Below are some interesting press releases about RFID and concerns:

RFID Software a “Pandora’s Box”

One problem? I searched and searched for RFID problems, ethics, issues, privacy, and so on - I found many voices speaking of these issues, but it seems as though the big dogs are not publicly stating what they've found to be issues, nor are they openly discussing why they are backing down. I'll continue to look for this information, and as I find it, I'll post it here. If you can find quality articles from well-known journals that discuss the ethical implications of RFID, I'd love to hear from you.

RFID is not dead. It will still be utilized (good or bad), because it is a technological advancement, and has been proven to be effective at some levels of tracking. And as always with new technology like this, implementation leads the way long before the impacts are known and legislation can take place.

Hope this was interesting for you,
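
To put rough numbers on the fictitious earring example in the RFID post above, here is a minimal sketch in Python that tallies how many tags the hierarchy implies. The packaging figures (24 pairs per carton, 48 cartons per unit, 15 units per pallet) come from the example; the function names and the pallet count of 1,000 are assumptions made purely for illustration, standing in for "thousands of pallets".

```python
# Tag counts for the fictitious "M" earrings example: every pair, carton,
# shrink-wrapped unit, and pallet carries its own RFID tag.
PAIRS_PER_CARTON = 24
CARTONS_PER_UNIT = 48
UNITS_PER_PALLET = 15


def tags_per_pallet():
    """Return the number of tags at each packaging level for one pallet."""
    units = UNITS_PER_PALLET
    cartons = units * CARTONS_PER_UNIT
    pairs = cartons * PAIRS_PER_CARTON
    counts = {"pallet": 1, "unit": units, "carton": cartons, "pair": pairs}
    # Total is computed before the "total" key is added to the dict.
    counts["total"] = sum(counts.values())
    return counts


def tags_on_floor(pallet_count):
    """Total tags on a warehouse floor holding `pallet_count` pallets."""
    return pallet_count * tags_per_pallet()["total"]


if __name__ == "__main__":
    for level, count in tags_per_pallet().items():
        print(f"{level:>7}: {count:,}")
    # 1,000 pallets is an assumed figure, not one from the example.
    print(f"tags on a 1,000-pallet floor: {tags_on_floor(1000):,}")
```

Under these assumptions a single pallet carries 18,016 tags (17,280 pairs, 720 cartons, 15 units, and the pallet itself), so even a modest warehouse floor means millions of tags to read, store, and reconcile.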