Business Intelligence Network business intelligence resources

Blog: Dan E. Linstedt


October 26, 2005

Got Dirty SOX? EII & ADW & IQ

Recently my discussions in the field have centered on Information Quality (or the lack thereof) and the EII tool set as well as the Active Data Warehouse (right-time data warehouse). We will explore this exceedingly dry (hopefully interesting) aspect in this blog entry, particularly in relation to Compliance and Integration - but I felt that it fits under SOA as well - so here goes.

Information Quality (according to Larry English) includes business processes, data, reporting, and people involved in interpretation of the information. But Information Quality both helps and hurts compliance efforts, particularly when the corporation is audited.

One of the over-simplified definitions of SOX is (at least at the data level): Can your system show the "before it was changed, after it was changed, and when this change occurred" audit trail? Without being able to answer these questions, any "software product" that claims it is SOX compliant is flat out wrong.

What does this have to do with EII?
EII can be quality driven too, but in doing so it can, and will, break compliance IF it is told to transform data in the middle without producing an audit trail of what it did, what it used, and when. This is where the Write-Back capability of EII comes in handy: we almost need an EII-SOX warehouse to record the information flowing THROUGH EII tool sets in order to meet compliance and auditability.

I've written an EAB (executive action brief) on this site that talks about "making your data integration processes compliant" - click on B-Eye-Network, go to the HOME page, and look for "education" link in the lower left corner.

Now, if EII is _not_ transforming data, then the data set it pulls from should be "sox compliant" - it shifts the onus back on to the source systems to maintain audit trails of any information they change. For source systems, this is a no-brainer, they are capture systems and are supposed to be "systems of record" for the business, which means the business is already supposed to "trust" these systems - even though the data quality may or may not be there.

Time out - this doesn't make a lot of sense, where's the quality in all of this?
Ok - compliance is one thing, but the reason I talk about it is this: Information Quality tools CHANGE DATA under the covers, therefore in order to meet compliance initiatives and be auditable, we must surround these tools with a before and after process. At the DATA level - this means that if we introduce "quality processes" in-stream with EII, we could be in serious trouble with Compliance - again, unless we record the effects (before/after and when).

Quality Tools are nothing more than transformation engines (ok - they do a LOT more than that), but when it comes to bare-bones they are CHANGING DATA sets. Therefore: everything that applies to ETL/EAI and data mining (in accordance with compliance) also applies to EII, and the processes that load active warehouses.

Wait a minute! Active Warehouses have a refresh cycle that's too fast to put a quality trigger in play, right?
Right and wrong - remember active warehouses are "right-time" warehouses, it's all about latency. However, there are active warehouses that cannot use quality initiatives in-stream, because the data decays too fast.

Now what we will say is this: even ADWs still have "strategic" initiatives to them, which means that only the tactical sides forgo the quality settings (until the strategic based quality engine cleanses the historical data, and sometimes that historical data is returned to the source during transactional/tactical processing).

Remember this: Information Quality is SUBJECTIVE, it is one version/one flavor of the truth - truth is subjective, and will change depending on the eye of the beholder (the end user). Therefore, quality engines MUST be held accountable and auditable by surrounding them with processes that capture before-after-when (BAW).
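To make the BAW idea concrete, here is a minimal Python sketch of "surrounding" a quality rule with a before-after-when capture. Everything in it is an assumption made for illustration: the names `BawRecord`, `audited` and `standardize_state` are hypothetical, the audit log is just an in-memory list, and a real EII or ETL tool would write these records to a durable store instead.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class BawRecord:
    """One before-after-when entry for a single changed field (hypothetical layout)."""
    row_key: str
    field: str
    before: Any
    after: Any
    changed_at: datetime
    changed_by: str


def audited(transform: Callable[[Row], Row], tool_name: str,
            audit_log: List[BawRecord], key_field: str = "id") -> Callable[[Row], Row]:
    """Wrap a quality transform so that every field it changes is logged."""
    def wrapper(row: Row) -> Row:
        before = dict(row)             # keep the original values untouched
        after = transform(dict(row))   # the transform works on its own copy
        when = datetime.now(timezone.utc)
        for name in sorted(set(before) | set(after)):
            if before.get(name) != after.get(name):
                audit_log.append(BawRecord(
                    row_key=str(before.get(key_field)),
                    field=name,
                    before=before.get(name),
                    after=after.get(name),
                    changed_at=when,
                    changed_by=tool_name,
                ))
        return after
    return wrapper


def standardize_state(row: Row) -> Row:
    """Hypothetical cleansing rule: trim and upper-case the state code."""
    row["state"] = row["state"].strip().upper()
    return row


audit_log: List[BawRecord] = []
cleanse = audited(standardize_state, "state-standardizer", audit_log)
cleaned = cleanse({"id": 42, "name": "Acme", "state": " co "})
```

After the call, `cleaned` holds the altered row and `audit_log` holds one record with the old value, the new value, the timestamp and the rule that made the change. The point of the wrapper is that the quality rule itself stays unaware of compliance; the capture sits around it.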

Can EII use IQ tools or Data mining processes in-stream?
Sure, and they probably should - especially when sourcing external or freely available data. I'm just saying that EII will have to take the extra hit, and write-back the BAW somewhere to be compliant. The challenge here is: when to initiate a quality process, and keep it so that it doesn't impact the query timing too significantly. Now if EII is pulling from the strategic side of the warehouse - wonderful, it should be pulling quality data (already altered/cleansed/patched).

Can ADW use IQ tools or data mining processes in stream?
Yes and no - depending on the latency requirements this may vary. Most ADWs at 5 minutes or less latency don't run IQ processes in-stream; companies at this level use IQ tools plugged directly into their source/capture systems, which raises other questions, like how do I find my broken business processes? But that's for another day, another entry.

Quality should come "after" the load of the raw data to the data warehouse, or "after" the load of the raw data into the EII engine, it should be secondary, and applied only if there is an audit trail mechanism in place to trace back to the original data.

Thoughts and comments are welcome; I'll blog more on the subject if there's an interest.

Cheers,
Dan L

  Posted by Dan Linstedt at 6:33 AM | | Comments (2)


October 25, 2005

RNA and RNAi in Nanohousing

There has been renewed interest in RNAi and RNA lately in the biotech world (don't forget, biotech is a part of nanotech - or the other way around). RNA (or ribonucleic acid) apparently has encoding and decoding instructions for gene sequences, RNAi apparently has the ability to block or inhibit specific gene sequences. See an introductory article here.

In this blog we will explore (theoretically anyhow) what this might mean to the nanohouse and DNA computing.

There are some neat pictures (simulated/generated) showing the DNA structure here. If you don't think nanohousing is being worked on, think again. Here's an IEEE link to a conference that occurred in 2004.

Ok let's get started..
For quite a while I've blogged and written about convergence of form and function, along with convergence of industries: Bio, chemistry, technology, physics, etc. Back in an early paper I wrote for B-Eye I predicted that the future technologist would have to have skills well beyond mere "technology" in order to survive (or face the threat of outsourcing). Well, form and function in Bio-tech are a BIG part of what makes it work.

In the Nanohouse, we need to learn from this. The future nanohouse won't be JUST a data warehouse, or JUST an ODS, or JUST an OLTP system - no, it will be an "integrated data store" where the molecules collect "data" as history, when it pertains to the context in which it lives - assigned by "key" components of information that only it recognizes. Different parts of the DNA structure will represent different and distinct chemical keys - for storing different types of information.

Well that's all well and good, but we need functionality in the form of RNA and RNAi to act on the DNA strands that we "build". We also need catalyst type events to trigger interaction across the DNA sequences. Here's a quote from a Vienna RNA project that discusses this:

Biomolecules exhibit a close interplay between structure and function. Therefore the growing number of RNA molecules with complex functions, beyond that of encoding proteins, has brought increased demand for RNA structure prediction methods. While prediction of tertiary structure is usually infeasible, the area of RNA secondary structures is an example where computational methods have been highly successful.
http://nar.oxfordjournals.org/cgi/content/full/31/13/3429

Wow! So this means a Nanohouse is definitely feasible?
Yes - but it's still at least 5 to 10 years off before we understand enough to create one. However, the study of RNA and RNAi sequences along with the DNA strands is important, and will help build a foundation of knowledge from which the Nanohouse can be built.

Where does this impact my business today?
Quite frankly it doesn’t yet. How soon it does will depend on the advances in both Biotech and Nanotech sectors. I would speculate that if your top Information Technologists / Scientists and researchers are not yet involved in this field - they should be. The paradigm is already beginning to shift as we see applications of this technology being created in labs around the world. Like any paradigm shift, this one will take time - and lots of it.

This is interesting, how does modeling take place in this element?
In order to answer this question, we must take a look at not just data visualization, but model visualization. Model visualization consists of putting data models into a 3D landscape, and combining them with hub-spoke like structures that resemble molecular connections (a poor man's neural network); see the Data Vault data modeling references on this site.

How does this play with RNA and RNAi?
RNA can help with the interaction of the molecules, while RNAi can specifically block or inhibit interaction. But more than that, the dynamics of this interaction/blocking need to be scored and measured.

The first practical dynamic programming algorithms to predict the optimal secondary structure of an RNA sequence date back over 20 years (1). Since then they have been extended to allow prediction of suboptimal structures (2,3) and thermodynamic ensembles (4), which allow to assign a confidence level or ‘well definedness’ to the predictions (5).
http://nar.oxfordjournals.org/cgi/content/full/31/13/3429

So does this mean the Nanohouse is a "dynamic structure" model?
Interesting, the answer is it depends; dynamic structure in the sense of adding new DNA components, extracting, and connecting the DNA to other molecules, yes; but changing the core-structure underneath - no. RNA itself also has a structure, and the structure is rigid.

Recently, several methods have addressed the problem of predicting a consensus structure for a group of related RNA sequences (6–11). Such conserved structures are of particular interest, since conservation of structure in spite of sequence variation implies that the structure must be functionally important. By enhancing energy rules with sequence covariation these methods also obtain much better prediction accuracies.

In other words, the structure itself of the RNA stays the same - much the same as the structure of a neuron. Even though the memories change, the connections in the brain change, and the thought patterns change, the basic structure of the neurons in the brain stays the same.

What does this mean to Nanohousing?
It means that the architecture of our structure must be consistent, repeatable, and redundant - but that the inter-relations, the functions, and the sequences can change (leading to a dynamic set of rules for inter-relationships, but a static structural based foundation from which to scale infinitely).

A stretch of the imagination might be to say:
The equivalent of ‘data mining activities’ has been found within the RNA and RNAi operations.

Can we beg, borrow and steal some of these concepts today?
Yes - we should be utilizing what we learn in these fields and applying it to our current modeling techniques and data warehouses.

* a close interplay between structure and function (the data model MUST be closely related to the functions in business)
* structure must be functionally important
* assign a confidence level or ‘well definedness’ to the relationship (dynamic relationships can be created, weighed, tested, and destroyed depending on viability to associative information)
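
As a thought experiment only, the last bullet (relationships that are created, weighed, tested, and destroyed) could look something like the Python sketch below. This is not the Data Vault technique and not anything from the RNA papers; `LinkStore`, its threshold, and the simple "move toward the evidence" update rule are all assumptions invented for the example.

```python
from typing import Dict, List, Tuple

LinkKey = Tuple[str, str]


class LinkStore:
    """Hypothetical store of scored relationships between two business keys."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self._links: Dict[LinkKey, float] = {}   # (left key, right key) -> confidence

    def create(self, left: str, right: str, confidence: float) -> None:
        """Create a relationship with an initial confidence between 0.0 and 1.0."""
        self._links[(left, right)] = confidence

    def reweigh(self, left: str, right: str, evidence: float, rate: float = 0.5) -> float:
        """Test the link against new evidence and move its confidence toward it."""
        old = self._links[(left, right)]
        new = old + rate * (evidence - old)
        self._links[(left, right)] = new
        return new

    def prune(self) -> List[LinkKey]:
        """Destroy links whose confidence fell below the threshold; return their keys."""
        dead = [key for key, conf in self._links.items() if conf < self.threshold]
        for key in dead:
            del self._links[key]
        return dead

    def links(self) -> Dict[LinkKey, float]:
        return dict(self._links)


store = LinkStore(threshold=0.5)
store.create("CUST-1", "PROD-9", 0.8)
store.create("CUST-2", "PROD-9", 0.6)
store.reweigh("CUST-2", "PROD-9", evidence=0.0)   # contradicting evidence lowers the score
removed = store.prune()
```

Note what stays fixed and what moves: the structure (a link is always two keys plus a score) never changes, while the links themselves come and go. That is the same split between static structure and dynamic relationships argued for above.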

You can find more on the Data Vault modeling technique (for free) here.

What do you think will happen in your Nanohouse?
Dan L

  Posted by Dan Linstedt at 7:16 AM | | Comments (0)


October 20, 2005

Push-Pull Pros and Cons

I've been asked about the pros and cons of ETL push-pull, so I thought I'd generalize the issue a little more into the pros and cons of Push Pull technology in general. I'm including EII and EAI in this posting. It's not that push or pull is necessarily bad by itself; it's more about using the right notion for the right data access at the right time.


Push-Pull, a direction I find myself pulled in, many different times during the day. (Seriously folks... :)

Ok - down to brass tacks, the nature of PUSH technology is basically the realm of EAI and Message Queuing. In this realm we deal with the publish/subscribe model, or maintaining a broadcast message to anyone listening.

Really "easy" technology until you get to the engineering underneath. The real work is deciding WHICH transactions are important, and WHICH are not. Then there's the decision on how often, how fast, and how to write the drivers to "plug" in to each of the applications, or legacy apps that service transactions to begin with. Ok - enough of the engineering talk, let's get back to the business aspects.

Push technology is GREAT when wanting to distribute transactions as-they-happen. Stock tickers, and other types of financial institution transactions are very important when it comes to push technology. How about disasters and notification? Again, important.

What about the different components?
For EAI: Push technology is its lifeblood, this is what it's built on, making the applications "talk" when the transactions are available.
For ETL/ELT: Not so important, even in an "Active Data Warehouse" it's not so important - ok, the PUSH of the transaction is important, but the ETL component? Gets in the way of getting the data in the right time to the warehouse for analysis.

Now wait just a minute - Aren't ETL/ELT engines getting stronger and faster? Yes - they are. But they still aren't "architected" for real-time dynamic data integration. The world's BEST ETL/ELT engine will focus on transforming as many transactions as possible (in batch) in the shortest amount of time, that's their strength - and they should STICK to it (Stick to your ticket Harry, very important that you STICK to your ticket... - Harry Potter). We could learn a few things from this line; no really!

ETL/ELT is GREAT at PULL technology - go get the data on a scheduled timing interval, not just the data - but ALL the data, en masse. Bring me everything that meets criteria X, across ALL disparate systems, then integrate it all en masse (batch style) - and do it as fast as possible so that I can replicate the system with new information, and transformed information.

Ok - well, ETL/ELT engines will HAVE to process near real time in the near future in order to survive, while batch will not go away any time soon, the windows are shrinking, and the data sets are growing, and the timeliness of critical data is becoming more important. ETL/ELT are GREAT at static rules, parallelism, partitioning, and performance - they require huge amounts of processing power to get the job done right (with very large data sets). This is the nature of PULL. I guess one could speculate that PULL technologies require a place to "land" the data once it's been transformed.

Not something that PUSH technology needs, nor wants. PUSH technology wants to ACT on the transaction as it stands, once it reaches its destination. This is a primary difference between PUSH and PULL.
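
For those who like to see the difference rather than read about it, here is a toy Python sketch of the two styles. It is an illustration under assumptions, not how any EAI or ETL product works: `Broker` is a hypothetical in-memory publish/subscribe stand-in for a message queue, and `pull_batch` is a hypothetical stand-in for a scheduled extract.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, Iterable, List

Txn = Dict[str, Any]


class Broker:
    """Toy in-memory publish/subscribe broker (the PUSH side)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Txn], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Txn], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, txn: Txn) -> int:
        """Hand the transaction to every listener as it happens; nothing is stored."""
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(txn)
        return len(handlers)


def pull_batch(source: Iterable[Txn], predicate: Callable[[Txn], bool]) -> List[Txn]:
    """The PULL side: fetch everything that meets the criteria, en masse."""
    return [row for row in source if predicate(row)]


def on_trade(txn: Txn) -> None:
    """PUSH handler: acts on each transaction the moment it arrives."""
    if txn["amount"] > 1000:
        alerts.append(txn)


# PUSH: react transaction by transaction.
alerts: List[Txn] = []
broker = Broker()
broker.subscribe("trades", on_trade)
broker.publish("trades", {"id": 1, "amount": 500})
broker.publish("trades", {"id": 2, "amount": 5000})

# PULL: on a schedule, sweep the whole source and land the result somewhere.
source_table = [
    {"id": 1, "amount": 500},
    {"id": 2, "amount": 5000},
    {"id": 3, "amount": 1200},
]
landing_area = pull_batch(source_table, lambda row: row["amount"] > 1000)
```

The broker keeps no history once a message is delivered, while the pull produces a data set that needs a place to land.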

Now let's not get confused! There's such a thing as IMMEDIATE PULL, or PULL ON DEMAND, this is new - it's called EII (as a paradigm).

EII in this nature offers many different things and is a _complementary_ technology to EAI and ETL/ELT. Pull on demand isn't (usually) interested in massive history sets, nor is it interested in "doing" something with the transaction, such as applying it to another system based on business process workflow (although this could change in the near future). It is more interested in managing the metadata layers in between the business and data set, it is more interested in immediate access, immediate integration of CURRENT state than it is in history.

Now hold on! Don't get me wrong - EII can be used to access warehouses just the same as it can be used to access current OLTP/ODS, Staging areas, and Stock Tickers. It's the FOCUS of what EII does that makes PULL ON DEMAND different than PULL on batch schedule. The focus is much different. That same focus makes it a complementary technology to the EAI and ETL/ELT world.

Using the right tool for the right job makes all the difference. EII also can transform/conform, and write-back. Something that EAI does (write-back), but ETL frequently is not "architected" for. Mostly because the "work" that ETL does must be checked before it is re-integrated with the source systems.

Now take Active or Right-Time Data Warehousing, there's a combination of technologies being utilized to get the data into the warehouse at the right-time, and there's a combination (including data mining, and scoring analysis) to re-deliver the transactions back to the source systems at the right time. Of course this is neither push nor pull, but rather "closed loop processing." Ok - it uses push to get the transaction to the warehouse, and push to get it back from the warehouse to the OLTP system.

So at the bottom of this blog entry, we are still left with the question, what are the pros and cons of push and pull? Let's see if we can sum it up (forgive me, I may forget a few):
Push Pros:
1. Instant transaction communication
2. Feedback on the transaction after the business processes are invoked.
3. Transaction by Transaction / Guaranteed delivery mechanisms
4. Mass Distribution, or publish subscribe to those that want it.
5. Visual Business Rule Processing Engines (are usually in place).
6. TACTICAL in nature (for solving business problems)
7. New sources can come on-line and push out new transactions (integrating with ease into existing layers).

Push Cons:
1. Independent transactions - meaning can't rely on "history", can't rely on "trends", and can’t rely on an understanding.
2. Difficult to establish context
3. Can't transform "massive sets of data" at once - technology just isn't fast enough yet - this may change with Nanotech and DNA computing.
4. Once a transaction is sent - it's gone. No "recorded history", although some EAI engines actually have mitigated this point over the years.
5. Sometimes tends to be a highly code-driven environment under the covers.
6. The number of crisscrossing attachments to transactions means it's harder to "unhook" legacy systems that are providing the information...

Pull Pros:
1. Massive sets of transactions in parallel/partitioned can be handled in ever smaller execution windows.
2. Increase in processing power means increase in data set that can be dealt with.
3. We can get what we want when we want it via scheduling.
4. Predictive support, predictive failures, predictive model - leading to standardization, and automation.
5. STRATEGIC IN NATURE.

Pull Cons:
1. Requires a Landing Area for the transformed data sets.
2. Requires massive sets of processing power (for large data)
3. Batch Windows are continually shrinking while data sets are ever growing.
4. No "NOW" data available, in other words, little to no visibility into the transactions occurring RIGHT NOW.
5. Once a source, always a source (static SOURCING, static TARGETING)

PULL ON DEMAND Pros:
1. Focus on the metadata integration layer
2. Focus on the business rules of integration
3. Utilized by services to conform NOW transactions, WHEN requested (as opposed to WHEN they happen)
4. Provides access to previously inaccessible systems (like word docs, emails, power points, and so on).
5. Dynamic and Distributed query sets mean the queries and their plans can change in accordance with the data set changes (straight PULL is STATIC QUERY BASED - unless the RDBMS engine tunes the query under the covers).
6. Dynamic Sourcing, Dynamic Targeting - if one source isn't available, the metadata layer and engine can determine the "next source in line" and fire the query just the same.
7. TACTICAL IN NATURE!!

PULL ON DEMAND Cons:
1. Requires STRICT adherence and agreement by the enterprise to metadata management, and development.
2. Requires (or forces the hand of) data quality initiatives ON THE SOURCE SYSTEMS.
3. Increases management costs, and required processing power. BUT DECREASES Long-Term costs of implementation of "Services", be-it B2B, B2C and so on.
4. Requires sources be defined and setup ahead of time (before accessing), but PULL strategic has the same requirement.
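
Pull-on-demand pro #6 (dynamic sourcing) is easy to picture in code. The following Python sketch is only an assumption of how a metadata layer might pick the "next source in line"; `pull_on_demand`, `SourceUnavailable` and the two fake sources are hypothetical names, not part of any EII product.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Query = Dict[str, Any]
Rows = List[Dict[str, Any]]


class SourceUnavailable(Exception):
    """Raised by a source that cannot answer right now."""


def pull_on_demand(sources: Sequence[Tuple[str, Callable[[Query], Rows]]],
                   query: Query) -> Tuple[str, Rows]:
    """Try each registered source in priority order; return (source name, rows)."""
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch(query)
        except SourceUnavailable as exc:
            errors.append(f"{name}: {exc}")   # remember why, then try the next one
    raise SourceUnavailable("; ".join(errors) or "no sources registered")


def oltp(query: Query) -> Rows:
    """Hypothetical primary source that happens to be down."""
    raise SourceUnavailable("connection refused")


def ods(query: Query) -> Rows:
    """Hypothetical secondary source holding the same current-state data."""
    return [{"customer": query["customer"], "balance": 125.0}]


# Sources still have to be defined ahead of time (con #4); only the choice is dynamic.
used, rows = pull_on_demand([("oltp", oltp), ("ods", ods)], {"customer": "C-1"})
```

The caller asks one question and gets one answer plus the name of whichever source supplied it, which is worth keeping for the audit trail.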

Ok, none of these are complete lists by any stretch of the imagination (some might say I have none :) But hopefully they give a peek into what might be some of the top differentiators across these technologies.

Thoughts? Comments? Have some pros/cons you'd like to add? Please, feel free.

Thanks,
Dan L

  Posted by Dan Linstedt at 2:42 PM | | Comments (1)


Standards, Compliance, and Successes

I've been asked about standards, and what they contribute to the success of a project within business. Particularly from the entry on Architecture, Standards, and Business. Standards contribute quite a bit actually. But standards can also be overkill. There are some neat comments on Agile Modeling forum regarding the use of standards, and I've spoken with Scott Ambler about some of these things (but not yet in detail). Grady Booch and I have discussed the nature of useful standards in brief conversations, of which we still have to draw some conclusions - with that let me continue my entry.

What kinds of standards do we have in industry?
There are hundreds, if not thousands of standards all over mature industries. Some of the ones I can think of right now include: ISO, HIPAA, Basel II, ANSI, SEI/CMM, PMBOK, ASCII, RS-232C, FireWire, Encryption, Security, and so on.

When an industry or business is NEW or yet-undefined, there are no real standards or accepted methodologies for build out. Take Data Warehousing for example: when it was first discussed (in the early 70’s) there were no best practices, no standards, no suggested ways of completing projects. However as time went on and practitioners built data warehouses, they discovered that when best practices and standards were applied, the businesses reaped significantly more benefits (lower cost, reduced risk, easier implementation, faster build-out) and so on.

Now I can tell you some hairy stories about standards and over-kill. When we first introduced SEI/CMM standards (lock stock and barrel) to our manufacturing organization, we had severe trouble implementing CMM Level 3 – too many standards for tiny little projects which had small impacts on the state of business. In other words, the standards were too thick, too heavy to implement. Then we applied “standards thinners” (like paint thinners) which didn’t destroy the quality of the standards, but rather reduced them to a working plan. The project still followed standards and best practices, only less of them.

Of course SEI was first proposing CMM as a level of software engineering, as they still do. What we did was apply SEI and CMM best practices to a blend of data warehousing best practices and standards. We wanted a system that was repeatable (in architecture and design), easy to build, consistent, with reduced risk and rapid build out. We quickly reached CMM Level 5 with our organization and our data warehouse as we followed this hybrid paradigm.

We combined Spiral methodologies with Waterfall checkpoints at major steps to reduce risk, we trained individuals before engaging them on projects, we undertook versioning and a centralized store of documentation, and we also put together risk analysis spreadsheets and project size estimates by using FUNCTION POINTS. We also continued to label and number ALL requirements, refocusing the requirements as needed to be specific, reachable, and measurable goals (using RUPP processes). Finally we attached the project plan numbered items to each specific requirement, so we could produce business process metrics and answer user questions on progress and risk at any given point. We had a team of 3 people working on this project at any given time.

So you see, even with small projects, a certain level of standards helps the team achieve what they need to do – with quality. We helped befriend the business users, upon completion they threw MORE work at us than we could handle. We helped turn the business belief from: IT can’t deliver; we’ll build it ourselves, TO: IT has done a tremendous job AND saved us tons of money and time.

By the way, here’s something I want you to walk away with (I teach this in my VLDW class at TDWI): The larger the data set, OR the larger the project, the less likely you are to produce a success WITHOUT standards! In other words: The larger the project, the larger the data set – the more likely you are to succeed with standards and best practices. Without standards and best practices – your project will fall into chaos and disarray and quickly succumb to unforeseen / unmitigated risks.

I created something called The Matrix Methodology for data warehousing, now I’m creating a new methodology much more advanced and incorporating ideas like Agile Data Modeling (process wise), Data Vault Data Modeling (physical), Master Data Management, and SEI/CMM components for the market place. My current company puts these principles to work in the projects and RFIs that we assist with.

By the way, while I didn’t focus on it very much, standards certainly help those in need of compliant projects and compliant data stores get to where they need to go, but that’s an entry for another time.

Hope this helps,
Dan Linstedt

  Posted by Dan Linstedt at 5:51 AM | | Comments (0)


October 18, 2005

Information Valuation - Part 2

I've been asked if there was a way to quantify information as an asset on the books, and since then have been asked what % of companies may be doing this lately. It's hard to quantify something that companies have long considered an intangible asset. However in this blog we will explore the base possibilities and present a single scenario which may begin the process of accounting for data (quantifying data). This is an experimental topic, any thoughts or feedback is welcome.

What is data valuation as an asset?
To understand this, we have to first accept the fact that data in and of itself is valuable to the organization. Then we must take steps to measure parts of the value of the organization. In my mind there are two major methods by which valuation of the data set can begin.

The first method: Top down valuation
Top down valuation (from my perspective) means lumping the entire data set together as a single asset, or maybe each of the OLTP systems, and data integration (ok - data stores) as their own assets. The questions we have to ask with top down valuation might include the following:
1. Would I lose business if I lost this data store?
2. How much money per minute would I owe to partners/customers based on SLA's if this data set were UNAVAILABLE?
3. What tangible profits have been received partly as a result of this data set being built?

We've been doing this for years in quantifying the cost of hardware on which these data sets run, but we just haven't carried this into the data itself - some have assumed that the data is too intangible an element to be measured. Notice we didn't ask any questions about the quality or compliance of the data set... We only asked about tangible and measurable loss of the data set. One might say that Top Down measurement is akin to itemizing the entire set of data with the hardware for disaster recovery classification. True. Now would be a good time to insure against these failures and periods of unavailability.

Well, a big part of that is getting insurance for the LOSS of an entire data store. Maybe rates for insurance of data are LOWER for those with a proven and tested (audited by the Insurance adjusters) disaster recovery program. Maybe the rates are lower for those businesses that have business definitions and attached understanding for the uses of their data - in relation to the actual SLA's signed by the business. Maybe the lower rates indicate a better handle on understanding and quantifiable results of the data and what it means to the business.

Once the data is insured - it should be attributed as an asset on the books for the amount it is insured for. Maybe insurance rates go up for those companies that DON'T meet their SLA's, and aren't accountable for failure recovery. Hmmm, talk about Information Quality Improvement efforts.

Ok - so we've quantified (somewhat) what the BLOB of data represents to the business, we can justify its loss, and account for its recovery. What about depreciation?

That's a VERY interesting question. Think of data in a Data Warehouse or Enterprise Data Integration store as different from that in an ODS, or OLTP system (which it is). Even an Active Data Warehouse falls into this category. Depreciation (hypothesis) can only be measured by the loss of historical data, and what that translates to in a quantifiable manner to the business.

In other words: does the SLA cover "old" data; if so, how old? Where is the cut-off point at which the "old" information is not valuable to the organization anymore? If the "old" data is always valuable, just less valuable, then ask this question: If my systems were to lose data that is X months/years old - what would it cost me to recover?

What SLA's are in place to force me to be back up and running? Are my data mining engines relying on the history to produce active responses today? If so, then the "old" data is just as valuable as the current or new data. If not, then the "old" data depreciates in value as it ages - it's up to you to negotiate that value point with the data insurance adjusters.

The Second Method: Bottom Up Valuation
What does this mean? Bottom up valuation (again my hypothesis) is the manner in which EACH data element is weighed in on the grand scale of value to the business. Here, data quality IS measured on a row by row, cell by cell level. It is tedious and much of the work can be done by utilizing a data profiling tool, or a data mining tool engaged to profile for trends of missing information or bad information.

Suppose you had a customer table, well - forgive me - but if you don't have customers, you don't have revenue, and if you don't have revenue, you aren't in business to make money. In this customer table you had 3 million customers. Each customer row has a specific value. Assuming you're a direct marketing firm, or even a banking firm - each customer has a specific investment or makes a specific amount of "money" for you by being in your customer table.

As a marketing firm, you want to offer free subscriptions - and the gender of the customer becomes important as ONE of 32 elements that make a difference in determining WHAT to offer them. If you send the wrong type of magazine to the wrong individual because you're missing the gender field, and they unsubscribe from ALL your services - you've just lost a good customer. What does that equate to in revenue dollars?

The bottom up method requires attaching a "row-score" to each row, weighting the importance of each row - of course this weighting or scoring mechanism must change every time the customer information changes (apply this concept to EVERY table within your data store). Then there is a general or average dollar amount, a mean, and median, max and min dollar amount for each row.

Some customers are outliers and we DON'T want to adjust their outlying investments to the average numbers. Now, based on the weighting and the statistical dollar amounts calculate an overall value for each table - please take into account the EMPTY fields and the importance of having GOOD data, this should affect the weighting factors. Finally, add up the weighted dollar amounts for each row, and ask yourself the business questions: is the value of this "data set" really the total value to the business? Add an overall adjustment factor to the final dollar amounts, and test it with top-down recovery costs for an approximate range.
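
Here is a rough Python sketch of the bottom-up steps just described, under loudly stated assumptions: the field weights, the `annual_revenue` column, the "fraction of weighted fields populated" row-score and the 0.9 adjustment factor are all invented for illustration. The real weighting scheme is a business decision, not something code can hand you.

```python
from statistics import mean, median
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical importance weights per field; they do not need to sum to 1.
FIELD_WEIGHTS = {"name": 1.0, "email": 2.0, "gender": 1.0}


def row_score(row: Row, weights: Dict[str, float]) -> float:
    """Row-score: weighted share of important fields that are populated (0.0 to 1.0)."""
    total = sum(weights.values())
    filled = sum(w for name, w in weights.items() if row.get(name) not in (None, ""))
    return filled / total


def table_value(rows: List[Row], weights: Dict[str, float],
                amount_field: str = "annual_revenue",
                adjustment: float = 1.0) -> Dict[str, float]:
    """Statistical dollar amounts plus a quality-weighted total for one table."""
    amounts = [row[amount_field] for row in rows]
    # Each row keeps its own dollar amount, so outliers are not pulled to the average.
    weighted_total = sum(row_score(row, weights) * row[amount_field] for row in rows)
    return {
        "mean": mean(amounts),
        "median": median(amounts),
        "min": min(amounts),
        "max": max(amounts),
        "weighted_total": weighted_total,
        "adjusted_value": weighted_total * adjustment,   # overall adjustment factor
    }


customers = [
    {"name": "A", "email": "a@example.com", "gender": "F", "annual_revenue": 100.0},
    {"name": "B", "email": "", "gender": "M", "annual_revenue": 200.0},
    {"name": "C", "email": "c@example.com", "gender": None, "annual_revenue": 400.0},
]
valuation = table_value(customers, FIELD_WEIGHTS, adjustment=0.9)
```

In this made-up table the three rows score 1.0, 0.5 and 0.75, so the missing email and the missing gender pull the weighted total down to 500 against a raw total of 700. That gap is one way to put a dollar figure on the EMPTY fields. The scores have to be recomputed whenever the customer information changes.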

Now you should have a fairly decent idea what your data sets are worth. Now it's up to you to negotiate with the data insurance investors for actual valuation.

Ok, there's a lot of talk here about valuation of data, and insuring the data - some of which we may already do. But what about listing it as an asset for tax purposes? Well - that requires a change in the laws around the world, unfortunately that's wide-open speculation that I do not engage in. If you can insure your data, maybe just maybe, you can convince your tax auditors and your local government to at least look at the issues seriously.

With regard to what I see in the field (pure speculation - not backed by any scientific studies of any kind): I see on average 10% of the Fortune 500 engaging in these activities. However, I see 60% to 80% of the businesses today working hard to have fault-tolerance and disaster recovery programs in place. What I don't see is the follow through with tangible valuation of data as an asset.

Good luck, hope this helps – thoughts or comments? Love to hear from you.
Dan L

  Posted by Dan Linstedt at 6:34 AM | | Comments (0)


Data to Information, Architectural Roles for Business

Dave Wells, Director of Education at TDWI, and I have had several discussions on this topic: Turning your Data Into Business Information

In these discussions we covered the Business Dimensions and the business pivot points that come into play when layering the data for presentation. Data is often overlaid with additional business dimensions to make it usable. I'm not talking about the technical dimensions that we produce within the data marts, I'm talking about individual columns labeled as dimensional aspects of the data.

This isn't to say that parts of these ideas aren't available today; it is merely to say that some level of automation and underlying base data architecture are missing from the scene today.

For instance, there are the common and major dimensions: Sales, Finance, HR, Manufacturing, etc. There are the other common dimensions such as: hours worked, revenue, taxes paid, cost of goods, etc. But hidden within these are additional layers of business dimensions which we frequently ignore.

These dimensions are the most powerful - allowing the business user to slice and dice the data by column to reach a single cell of information. It's N-DIMENSIONAL information, something that could be utilized by data visualization engines. In this N-DIMENSIONAL space, we have all the other columns or data elements - but they are arranged in the manner in which we USE the data within business.

Wait a minute! Just hold on there partner, are you telling me that one of these dimensions could just be equated to a Satellite?
YEP! You got it: in the Data Vault modeling architecture, each Satellite could be created to become a (business) dimensional breakdown of the data itself. This might just be the business definition of a Satellite that we've all been waiting for.

Keep in mind that Satellites will still split by Type of Data and Rate of Change - this will help define the Business Dimensional aspect of the information housed within. Each column within the Satellite (at that point) becomes a pivot objective if desired.

So where's the challenge?
The challenge, in turning data into information, is in the nature of the utilization of this data and how it's organized for the business when presented. These multiple layers / independent stacks of data housed within satellites must be formulated to be extremely flat and wide, as if in an N-DIMENSIONAL CUBE, associated by snapshot in time. In other words, RDBMS engines with Cube-Views are ahead of the curve, BI vendors and middle-ware such as EII which can build cubes are also ahead of the curve.

The challenge is to connect EACH of these dimensional definitions in an X, Y, or Z axis for viewing when desired - allow the end-user to pivot on each of these dimensions, allow each of the dimensions to move in the hierarchy (up or down), and define them as full and complete metadata for the business users. In other words, a metadata repository for all elements in the Data Vault model, then make it accessible through cube-views or some such delivery mechanism.

The ultimate goal would be to have database technology powerful enough to collapse the business delivery into a virtually defined layer or layers, driven by metadata and virtual definition of the web of connectivity across the metadata. Then to have this layer be the single point of access by all BI and delivery mechanisms.

More about the Data Vault data modeling architecture can be found here, on free forum discussions.

Thoughts?
Dan L

  Posted by Dan Linstedt at 6:30 AM | | Comments (0)


October 16, 2005

A funny idea: Slower Melting Snow

I've been thinking, with all the advancements that are being made in nanotech, why can't we create a molecule that melts more slowly, and lasts longer in warmer temperatures? This blog is a hypothetical look at an idea I would love to see discussed...

Imagine, slower melting snow - made from nanotech. Snow that still melts, so it doesn't harm the environment and nature still can experience the seasons - but something that might be able to be created on the ski-slopes on the first of September every year, and doesn't melt until late May or June.

Maybe it's a silly idea, but maybe just maybe there might be something to it. Imagine if we could keep a water molecule crystallized for just a little bit longer than usual - get it to release heat less quickly, or get it to absorb heat slower. What kinds of applications could this lead to?

Let's speculate for a moment - I know nothing (other than what I've seen on Nova) about avalanches, and how they are caused by melting sheets of snow - turning top layers of snow to water on warm days, freezing at night into ice sheets - then new snow fall on the ice sheets; eventually the weight causing the layer of snow to "slide" off the ice, starting an avalanche.

Suppose this type of extreme crystallization could be stopped or prolonged - in other words, suppose the snow melts more slowly, less water, less ice at night. Do you think that the sheets could be "thinned" out enough to be crushed under the weight of new snow rather than cause a slide? Maybe.

Or how about slow-melting ice in drinks (that still melts)? Let’s just say I'm into the old-fashioned ice cubes, rather than the plastic re-freezable ones. Well, back to snow. If we could construct slow-melting snow molecules we might have longer-lasting ski seasons.

What are some of the dangers?
If the clouds are seeded, we'd have one heck of a time getting this snow off the roads, or it would require the fast-melting molecule to be introduced on the roads. In this case, we'd want to make sure that the chemical reaction caused by fast-melting applied to slow-melting doesn't cause a toxic reaction, and that it doesn't rust metal.

Another possible danger is the slow-melting snow, if applied to bare-ground, might actually trap heat in the ground - because it doesn't absorb the heat as fast as regular snow. I'm not sure of all the impacts, but at first glance, this doesn't sound good.

Well, here's one more possibility: Slower melting snow may actually hold a colder internal temperature than regular snow, so if you got it on your hands or down your back - it would be colder to the touch. However - that requires a heat absorption rate within the snow molecule itself. Now that I think about it, to the touch - this slower melting snow may not feel as cold (not sure about this one).

Here's an interesting (possibly dangerous) use: applying slower melting snow (or some offshoot) to warmer ocean waters that have traditionally been "cold". What if it could be used to slowly lower the temperature of what are supposed to be cold regions of water?

Of course this is a silly idea, and one made from a fictional thought - but I just thought maybe, someone was daydreaming (like me) about a longer ski-season.

Thoughts? What do you see as the dangers, or possibilities of this type of idea? What makes it infeasible/feasible?

  Posted by Dan Linstedt at 6:52 AM | | Comments (0)


October 14, 2005

What is the TRUTH anyway?

We've all heard of "Single Version of the Truth", and we've all seen shows, presentations, and even read a paper or two on this topic. In fact, searching Google for this phrase returns 36,900 hits! But this raises the question: what the heck is "TRUTH"? How can you hold "TRUTH" accountable? Is a "Single version of the Truth" compliant?

I stand here sure as the sun will rise today to tell you that I believe TRUTH is subjective in nature. Of course I think it also depends on how you define "Truth" in your enterprise integration efforts.

My friend and mentor Bill Inmon wrote about this here. I would tend to say after reading and re-reading this article that indeed, SVT (single version of the truth) is in fact a GOAL, nothing more. I would also tend to say that truth itself is in the eye of the beholder (or money-holder as the case may be), because as we all know - one person's truth is not necessarily coherent with another’s.

If all truth were equal we would not have discussions about metadata or common metadata definitions. Nor would we fight over what the master system is for customer information, nor would we re-state or alter data in the warehouse when one major money holder leaves, and another takes their place.

The operational systems' data tells one story (which for the most part matches the business expectations), while the metadata and business requirements usually tell a different story. The "gold" or real profitability within the organization lies in resolving the discrepancies: find and fix the business issues causing them - this will also put you on the road to full compliance, auditability, and end-user accountability.

I would tend to say that SVT is a wonderful goal - but ONLY achievable within a specific point in time as viewed and presented to the current money holder who agrees with and accepts the definitions set before them.

What then really makes a truly robust, scalable integration platform, be it for metadata, data, or business rules?
The real answer can only be found by a close approximation. For lack of a better term: "single integrated statement of data facts (SISDF)". It's certainly not as clean and "saleable" as SVT, but it works nonetheless. In order to define this, let me clarify for a moment the differences between SVT and SISDF.

SVT says: we have truth when data is munged and agrees with whatever the current business rules state it should match.

SISDF says: we have truth when data is integrated by common semantically defined sets of business keys, and is the raw data that arrived in accordance with the source system feed - both on the same level of detail, and with exacting traceability.

SVT goes on to say: cleansed, merged, mixed and matched data should be put in the integration store.

SISDF takes a different approach: data will be loaded to the integration store AS IT STANDS 1 for 1 match with what arrived on the source feeds.

This pushes the notion of SVT downstream from the SISDF Integration store - and puts the burden of proof (to be labeled an SVT) in the delivery mechanisms - which are usually data marts. In other words: Sales and Finance don't agree with each other on what "revenue" means, so one data mart has a sales SVT, and the other data mart contains finance SVT.

Here we have a case where both SVT's are correct at the same time, but depending on who's using the data - it could be wrong, meaning that "truth" falls apart because of interpretation. Typically though, this case is overcome through hard-work, data stewards, metadata management, and SLA's with the end-user base to agree on two definitions: SALES REVENUE and FINANCE REVENUE, each calculated differently.
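As a rough illustration of the split between SISDF and SVT, here is a minimal Python sketch. It assumes an in-memory list standing in for the integration store, a made-up order feed, and two invented revenue rules (Sales counts every booked order, Finance counts only invoiced ones). None of these names or rules come from a real system; they only show where the business rules live.

```python
from datetime import datetime, timezone

# Stand-in for the SISDF integration store: append-only, never cleansed in place.
integration_store = []


def load_raw(record, source, loaded_at=None):
    """Store the record 1-for-1 as it arrived, with its source and load time for traceability."""
    integration_store.append({
        "source": source,
        "loaded_at": loaded_at or datetime.now(timezone.utc),
        "data": dict(record),  # a copy of the feed row, unchanged
    })


def sales_revenue(store):
    """Hypothetical Sales rule: every booked order counts as revenue."""
    return sum(r["data"]["amount"] for r in store)


def finance_revenue(store):
    """Hypothetical Finance rule: only invoiced orders count as revenue."""
    return sum(r["data"]["amount"] for r in store if r["data"]["invoiced"])


# Made-up feed rows, loaded as they stand.
load_raw({"order_id": 1, "amount": 100, "invoiced": True}, source="orders_feed")
load_raw({"order_id": 2, "amount": 250, "invoiced": False}, source="orders_feed")

print("SALES REVENUE:", sales_revenue(integration_store))
print("FINANCE REVENUE:", finance_revenue(integration_store))
```

Both numbers come from the same untouched rows. If a new money-holder redefines "revenue", only the downstream function changes; the integration store, and with it the audit trail, stays exactly as it was loaded.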

On the other hand I've seen this happen first hand: SVT is built "on the way in" to the enterprise data warehouse. All is fine and dandy until the current money-holder is replaced with a new money-holder. The new money-holder says: hey, my sky isn't blue - it's white, and the SVT you think you have is WRONG until you change it. They are henceforth left with "restating" the data within the enterprise integration store (data warehouse).

They've not only broken the SVT, they've broken compliance, traceability, auditability, and strangely enough - the SVT that WAS in place? It WAS correct for the time it WAS in place. A conundrum I think is what they call this.

Well, with the onslaught of SOA, enterprise integration efforts are hotter than ever (as is metadata definition and management), right along with data quality. It's high time we fed ourselves with an architecture that presents SISDF rather than SVT, and moved our SVT's downstream (in the place of loading data delivery platforms such as data marts).

In other words - it's high time we STOPPED all this transformation, cleansing and data alteration on the way in to our enterprise warehouses, and STARTED all the transformation, cleansing, and data alteration on the way OUT of our EDW and in to our marts.

I'm not saying "don't deliver SVT", I'm not saying "don't cleanse or quality check your data", I'm not saying "it's bad to integrate or complete your metadata efforts."

I AM saying: if you want a real, compliant, single/consistent integrated version of the DATA within your enterprise - you need to move the SVT notions down stream - and produce the "truth" as you produce the delivery mechanisms. I am also saying: take a hard look at producing an ERROR MART or two, or more - move the dirty data through the system, and put it in the hands of savvy business users who will begin seeing "who can clean up their data fastest" as a game.
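For what it's worth, an error mart can start out very simple. The Python sketch below is one possible shape, assuming that "dirty" just means a required field is empty; the function name, the field names, and the sample rows are all hypothetical, and real quality rules would be much richer.

```python
def split_for_delivery(rows, required_fields):
    """On the way OUT of the warehouse: complete rows go to the data mart, the rest to an error mart."""
    mart, error_mart = [], []
    for row in rows:
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            # Keep the dirty row as-is and record why it was rejected.
            error_mart.append({"row": row, "missing": missing})
        else:
            mart.append(row)
    return mart, error_mart


# Made-up rows as they sit in the integration store.
raw_rows = [
    {"customer_id": 1, "gender": "F", "zip": "80301"},
    {"customer_id": 2, "gender": "", "zip": "80302"},
    {"customer_id": 3, "gender": "M", "zip": None},
]

mart, error_mart = split_for_delivery(raw_rows, required_fields=["gender", "zip"])
print(len(mart), "clean rows;", len(error_mart), "rows for the business to fix")
```

The dirty rows are not dropped and not "fixed" on the way in; they are handed to the business users along with the reason they failed.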

Your enterprise might just be surprised what they uncover when this type of effort is implemented. I know of several companies that found and saved $15M to $45M in the first six months of operation of a SISDF.

You can read more about the data architecture behind these notions on my web site. Thoughts?

  Posted by Dan Linstedt at 1:39 AM | | Comments (0)


October 13, 2005

Oracle - Fully Loaded? Or Dried up like dust?

I suppose it's all in how you look at it, but take a look at these two new E-Week stories: (This is an extremely opinionated entry, would love your feedback).

Oracle Scores Open-Source InnoDB Storage Engine
Oracle Lets Fly Zippy TimesTen Database

I like Oracle Database (for specific projects), but let's take a walk on the wild side... What do you suppose is happening? Did they (Oracle) decide their core-engine is too big, too cumbersome and can't take the heat anymore? Or do you think it's a Microsoft-like move to squash extremely great technology never to be heard of again?

For the sake of discussion, let's talk about both sides of the coin. In the first situation, we have to make some assumptions: (these assumptions are only 1/2 based in reality - the rest are what I've personally experienced at customer sites).

1. Oracle's engine has been "added on and added on and added on" over the years, it's grown up like a big huge ball of band-aids. They've done some serious modification to parts of the core engine and in their latest releases (10g and on), they've finally added some LONG overdue functionality.
2. Oracle's been losing market share (at least in the data warehousing / BI space), until it bought up competition in the analytics sectors to make up for it.
3. IBM has been gunning for Oracle market share in the DW/BI space for years, its 8.2 release and price-points really make it attractive, along with the features.
4. Microsoft SQL Server 2005 is also gunning for this space, and will make a huge splash - it's a tremendous advance.
5. MySQL has rapidly become a database of choice among the OLTP transactional community (Oracle's biggest market share $$ is OLTP - that's what they're REALLY GOOD at).
6. InnoDB has been a mainstay for MySQL.
7. Oracle has taken a beating at the hands of Teradata, Netezza, and DatAllegro lately.

Well that leads me to say:
a) Oracle MUST back-up, tear-down, and completely re-build the entire core architecture of their engine (well most of it anyhow) if they are to compete going forward
OR
b) purchase technology that will "gut" their own engine, and automatically replace the "bad parts" under the covers, slowly dissolving the old "Oracle core engine" and replacing it with the "new" oracle engines - nimble, FAST, and scalable, all at cheaper price points.

Think about it, Oracle has to charge huge fees in order to pay its army of core engineers. With new smaller, leaner, and faster core engines - away with some or more of this massive expense!!

Now it's not wise (and I'm not suggesting) that the ENTIRE Oracle core engine be tossed, although Hmmmm.... I am saying that re-engineering, cost reduction, and smaller faster/leaner meaner engineering needs to take its place if Oracle is to compete in the new market place.

That still leaves the second question: Is this just another attempt by a large corporation (like Oracle) to squash upcoming technology?
This doesn't make very much sense - Oracle has millions of man-hours in applications technology on their stack, MySQL had a lot of development work in this area - but most of it was outside the MySQL/InnoDB range.

I think honestly, Oracle needs to breathe new life into their old technology engines, and simply bought the expertise - now if they're really smart, they'll learn from the existing company at InnoDB, instead of squashing it into the Oracle Culture.

What's your opinion?

  Posted by Dan Linstedt at 5:01 PM | | Comments (0)


October 11, 2005

Data Visualization, just a flash in the pan?

A couple of comments on an entry I made a while ago: "What does your Dashboard look like?" led me to continue this dive into visualization. Some of the comments are interesting, and ask questions like: what's the business value of visualization? Is it really needed in our industry? It might be nice, but it might be a niche product too... In this entry we'll explore a few of these things, and see what kinds of answers (if any) we can dig up.

Is data visualization just another fad?
Well, I have to go back to quite a few articles over the past several years which mention that there is convergence afoot. There is convergence in Data Integration, Warehousing, BI, there is convergence across customers, finance, and sales. There is convergence with manufacturing, service brokers, and software. Most everywhere we look - things are converging.

There's even convergence across scientific areas and business in the form of nanotech. New standards are being born, and as one paradigm slowly tails off, new ones spring up. It's the ever changing nature of change. "What doesn't change - dies. What evolves - grows and adapts. The only thing constant is change."

In order to understand if Visualization is just a fad, we should look at the existing technologies and delivery mechanisms and ask, how have they changed in the past 10 years?
Bar charts, pie charts, and spreadsheets are indeed valuable tools - if you want details and to drill into specifics. It is important not to belittle their significance or their contribution as a part of the "visualization" of data.

However, they haven't changed much since their inception. Businesses change, paradigms shift - albeit slowly. The way businesses view and review their data should also be changing; a natural extension of the current graphics is 3D landscapes, and interactive scenarios laid out in new ways. In other words, if I want to view my business in a new way, or force myself to think differently, I should be looking for different ways in which to experience my data. The real world is not made up of just 2D surfaces, or numbers (addresses on a mail box), it is made up of interactive experiences.

There have been many experiments conducted across many well respected institutions that show: experiential settings appear to be one of the better ways of learning new things and thinking in new manners. From this we arrive at 3D interactive graphics for new and different ways to visualize data. Again, I make the point: the BI vendors have made tremendous strides to make their engines solid, to bring the value proposition of their engines to the fore-front, to include data mining and other mechanisms of retrieving data, all I'm suggesting is that "visualization" include the next step, the next layer, the experiential 3D learning environment.

I don't think visualization is just a fad, I think it's a broad range of delivery mechanisms that include bar, pie, line-graphs, and Excel spreadsheets - but I think the next "change" is again, to include the new interactive means of examining data.

"If you don't change what you're doing, you're going to end-up right where you're headed."

Is this new visualization really needed in our industry?
Ok - so scientists, and biologists, and chemists really have a use for this technology, but what about CxO's, executive staff, senior management? Is the business really in tune with having a need to drive this technology?

I'm not so sure how to answer this question, but I do know that sometimes showing new ways of seeing and interacting with information can spur new ways of application. I also know that sharp business minds are already experimenting with this technology as a competitive advantage. So I hypothesize that it's a symbiotic need: both parties (technology and business) need to come to the table to really spur the movement. Resting on our laurels and waiting (on either side of the fence) won't get us anywhere.

I suppose I could go back to the Oil & Gas exploration industry: they use land-maps, geological studies, and earth core samples to figure out where the best place is to house tank systems, drill for new oil, and run pipes on solid ground. In the financial sector, what if a representative model could be developed? Different levels of transactions representing different levels and strength of ground layers, different levels of revenue and aggregation points representing pipes and flow valves, different divisions representing different tanks. Then map this to "find the leak", or "see which tank isn't producing enough", or ask: what's the optimal or maximum flow capacity of our financials? It could raise some eyebrows.

It's funny, we say business needs to drive technology, true. But sometimes business doesn't know what's possible until technology says: Hey, look at me, this is a new way of thinking - do you have a use?
It's a symbiotic relationship - and a delicate dance, how does technology justify new creations and investment without a potential customer? How do potential customers know what they might be able to accomplish without knowing what technology can create? Hence market studies, what-if with trusted customers, beta production of ideas, seed money, and investment capital from VC's and angels.

Personally I believe interactive landscapes and 3D modeling are just an evolution of data visualization, because in a way - bar charts, pie graphs and even spreadsheets are visualization of information too.

Thoughts and comments are welcome.

  Posted by Dan Linstedt at 8:23 AM | | Comments (2)


October 6, 2005

US govt spends $3.7 Billion on Nanotech

Nanotech is coming, and the government is spending billions of dollars a year - but it's not just the US government, it's happening all over the world! We think compliance is big, security is big money, well you haven't seen anything yet. Nanotech spending tops 'em all, and the spending is only due to increase.

The following is compliments of Lux Research:
Attention tends to focus on widely-publicized efforts like the U.S.'s $3.7 billion, four-year National Nanotechnology Initiative. But competitive nanotech efforts also appear in unexpected countries. Consider that:

* China has moved from also-ran to power player when it comes to nanoscience. China's share of academic publications on nanoscale science and engineering topics rose from 7.5% in 1995 to 18.3% in 2004, taking the country from fifth to second in the world.
* Iran's NanoTechnology Initiative was ordered by none other than former President Mohammed Khatami. Applications focus on fields ranging from textiles to agriculture; the country's Agriculture-Jihad ministry recently launched a nanotech web site.
* Thailand's Ministry of Science and Technology devised a nanotechnology plan that went before the country's cabinet in June, recognizing that the nation has missed past tech waves but can get in early on this one. It calls for training 2,500 researchers, registering 300 patents, and spending $294 million over ten years.

What would you do with $3.7 Billion budget? I might go skiing... on NanoSnow that never melts ;-)

  Posted by Dan Linstedt at 8:56 AM | | Comments (0)


October 5, 2005

EII and Unstructured Data - Blowout Party of the year!

Ok, so maybe a piece of software can't really party - but we can! :) Claudia just posted a blog on the need for garnering semi-structured and unstructured data within the enterprise warehouse. Bill Inmon has got an unstructured/semi structured data retrieval and visualization tool, we see more information being pushed under the compliance umbrella.

That leaves us asking many questions, like: Do I need to monitor all e-mails? How do I decide what's important and what's not in my "sea of word-docs?" How and what impact does it have on my EDW?

It will take a long time to answer all those questions, but one thing in the EII world that has been overlooked is its ability to access, reference, and integrate semi-structured and unstructured data.

Someone somewhere once said: "only 20% of the world's data lives in the structured realm, the other 80% lives in semi-structured and unstructured content." Well, if this is really true, and we've seen ROI's for EDW's as high as 400%, then what do you think the ROI could be when integrating the other 80% of our business? It certainly should raise some eyebrows.

Now I'm not suggesting that EII replace ETL, and in fact there are some misunderstandings out there about ETL - one of which says: ETL handles only Batch, and is used for only historical data - this simply is not true. Alright, 80% of the time this may be true, but there are times when an Active Data Warehouse has been built and ETL is utilized on a 5 minute or 3 minute refresh increment. I've also seen ETL utilized with Queuing mechanisms for real-time transformation (by no means an easy task). There's another customer using ETL to synchronize all their source systems across the enterprise and they don't even have a warehouse.

But: ETL also works with only STRUCTURED data. Making ETL "fit" a real-time integration paradigm is like forcing a round peg into a square hole: challenging, costly, and it increases complexity.

Now this is where EII really begins to shine, EII can make it much easier to integrate real-time data - not to mention unstructured and semi-structured data. Let's focus on the following two components: e-mail and documents. What if the metadata for my warehouse was stored in an "appendix" or glossary of terms in a word-doc? What if I had answered 4 or 5 key questions about how certain elements are computed through emails?

Would it be helpful to a) know that this information exists, b) have it catalogued in the warehouse, and c) be able to integrate these elements within my BI reporting solution as "pop-overs" or pop-ups? This is all fine and dandy, but by now the old-timer ETL jockeys say: I can write Perl to conform this stuff to structured data, and load it in - why do I need EII?

Well, here's the case: What if over the following two minutes I answer two more questions (and the class is training) - EII can easily detect the new emails and provide the information in real-time to the training class. If I then add a word-doc to the central library that has FAQ's, then the class can make use of that information as well (immediately).

Granted, this is just one small case of solving a very specific problem - EII can solve many more problems like this, and much larger in scope, but it demonstrates a differentiator between EII and ETL.

Utilizing EII to access unstructured data will drive up ROI on integration projects at a much faster rate. Besides which, the ETL jockeys could use EII to help "discover" information about their integration projects - it may even help speed up the build-out process for EDW efforts.

Thoughts?
Dan L

  Posted by Dan Linstedt at 10:40 AM | | Comments (1)


October 3, 2005

Personal Security and your information

I've blogged about this recently: the judge in SF who basically ruled that credit card companies don't have to be accountable for telling you if your information is stolen, right? Well, here's the flip side to this story. Turns out CardSystems is having stock trouble, and on-line card processing merchants have seen sales fall a couple of percentage points since the breach.

Maybe they'll begin paying attention?

Check out these stories on e-week:

Visa USA Delays Plan to Cut Ties with CardSystems
CardSystems to Congress: We Face 'Imminent Extinction'
Major Card Vendors Stay Mum on Data Breach
Lawsuit Seeks Payback for Major Credit Card Breach

And on and on. The government can't agree on how to solve these problems, yet the justice system seems quite content with "letting these breaches slip on through". At least the credit card companies are stepping up to the plate, but is it too little too late?

Let's look at this another way: a small vendor (mom & pop shop) is breached, their credit card storage is stolen, and all the cards are erroneously charged. The owners of the cards report these bogus charges, and the credit card company says: Due to the number of chargebacks that the small vendor experiences, their account will be "immediately discontinued."

I don't see any waiting period or grace period for the small companies, why then does such a large company like "CardSystems" get a break of several months after the breach? Can you say double standard? This is absurd. They'll punish the little guys at the first sign of trouble, but the big-boys get a break??

Ok, so the mom & pop shops are always told: never keep the credit card numbers on file anyhow. Most of the shops abide by this rule, so what makes CardSystems any different? One word: Money.

The problem is: we've got issues when we can't even control our own personal information, nor hold the vendors liable for breaches that they and their sub-contractors are responsible for.

It's just a sad story.

Cheers,
Dan L