Business Intelligence Network

Blog: Dan E. Linstedt


March 21, 2006

Demystifying the MDM definition

MDM stands for Master Data Management. Why should you care what it means? Many vendors have defined the part of MDM they implement and, unfortunately, called that part MDM, when it's just one piece of MDM that they are tackling. MDM, or true Master Data Management, is a much larger umbrella than just "master conformed dimensions" or "master lists of quality/cleansed information." MDM includes the term Data Management, and we all agree (at least for the most part) that Data Management is all-encompassing, right? So why the fuss over defining such narrow implementations and then titling them MDM?

What is Master Data Management anyway?
Let's break it down into two pieces: Master Data, and Data Management. I think "Master Data" invokes quite a number of elements that need to be considered. Below is a checklist which my firm uses during implementation of Data Governance and best practices:

Master Data Questions
1. Does the data need to be compliant and auditable?
If yes, then there are two copies of Master Data in the integration arena. The first copy of the data is compliant and auditable, and merged only according to the business key. "Merged" is really the wrong word here; by merged I mean attached to the same key, one defined at the same semantic level across the business, so that it essentially means the same thing to the business. This is the only level of integration that takes place, without a loss of grain. Each target zone contains a record source that allows maximum traceability and compliance of the data. The data set is housed _exactly_ as it stood on the source system. Read about Data Vault data modeling for more information.

The second copy of the Master Data is on the data mart side of the house. Remember: any time the data is truly merged, quality cleansed, altered, or prepared for user utilization, I call that a data mart (it can be a single table, a conformed dimension, or a master data table with "quality" data). It's all a data mart, where end-users can get their information quickly, and the model is tuned for speed and quality of data. In this case, the second copy of Master Data might be a "Master Dimension".
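As a rough illustration of the two copies described above, here is a minimal Python sketch. Everything named in it is hypothetical: the `RawRecord` fields, the `crm` and `billing` source names, and the cleansing rule (latest load wins, then trim and title-case) are assumptions made for the example, not part of the Data Vault model or any product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RawRecord:
    business_key: str   # key the business agrees identifies the same thing
    record_source: str  # which source system supplied the row (traceability)
    load_order: int     # stand-in for a load timestamp
    name: str           # stored exactly as it stood on the source system


def load_raw(rows):
    """First copy: attach rows to their business key, unaltered, no loss of grain."""
    store = defaultdict(list)
    for row in rows:
        store[row.business_key].append(row)
    return store


def build_master_dimension(store):
    """Second copy (a data mart): one cleansed row per business key."""
    dimension = {}
    for key, rows in store.items():
        latest = max(rows, key=lambda r: r.load_order)  # assumed rule: latest wins
        dimension[key] = latest.name.strip().title()    # assumed cleansing rule
    return dimension


rows = [
    RawRecord("CUST-001", "crm", 1, "  acme corp "),
    RawRecord("CUST-001", "billing", 2, "ACME CORPORATION"),
    RawRecord("CUST-002", "crm", 1, "globex"),
]
raw_store = load_raw(rows)
master_dim = build_master_dimension(raw_store)
```

The raw store keeps both source rows for `CUST-001` untouched, so an auditor can trace each one back to its record source, while the mart holds the single cleansed value the end-user sees.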

2. Does the data need to exist as a System of Record?
If yes, then the data set must be a "master" single consolidated instance of the data at the lowest level of grain; again, see my response above about compliance and auditability.

If no, then maybe all you have is the notion of "Master Data Sets" embedded in conformed dimensions. Maybe you've defined your warehouse as a Conformed Dimension quality-cleansed warehouse. Again, this is OK if you don't want or need an SoR in your warehouse, or don't have compliance and accountability to deal with.

3. Is your Master Data housed on your source system?
If so, then you might have a single copy of the data set. If not, then the data set might be spread across multiple source systems.

Data Management Questions
Note: this area is huge, and due to time and space constraints I will not address the entire segment of Data Management. Hopefully I will convey that MDM is a much larger umbrella than the "vendor hype" would lead you to believe.

1. What is Data Management?
That is a huge question with a huge answer. Data Management encompasses the act of understanding, monitoring, managing, maintaining, moving, integrating, synchronizing, dis-integrating, and delivering your data. It also includes items like security and access (Data Governance), growth (metrics), value (KPAs & KPIs), and mining (asking unknown questions). Master Data Management combines the "master data" notions with the Data Management notions - it is bigger (much bigger) than just data warehousing or data integration.

2. How does Data Management affect Master Data?
Data Management is an initiative, a journey - not a destination. It is an ongoing effort to track, manage, and unravel the information across the enterprise. Data Management stretches across all data sets, but the Master Data in particular shows high value and high return (when done properly). Managing that data set over time will keep the high-value proposition intact. Without data management strategies and best practices, the "master data" sets decay over time; their usefulness slowly fades as the enterprise changes business rules and business processes. What's a master data set today may not constitute the "right" master data set tomorrow. Be careful not to fall into the trap of letting your Master Data Set be set in stone from this day forward.

Master data sets need to be managed. Ahh, but don't forget semi-structured and unstructured data sets. Those too must be managed, and believe it or not, there are Master Data sets lurking in these areas as well.

Do you have an MDM initiative? What's worked for you and your organization? What differs from the vendor definitions?

Thanks,
Dan Linstedt

  Posted by Dan Linstedt at 5:49 AM | Comments (0)


March 20, 2006

Golden Rules of Performance and Tuning

I've been applying performance and tuning techniques to systems world-wide for the past 15 years. I grew up as an assembly level programmer on a CPM-Z-80/Z-8080 Digital VT-180 computer, along with Unix System 5, and a few other machines. It used to be that many would say Performance and Tuning is an art more than a science. Well, these days the science part of it is what really makes this work. In this entry I'm going to introduce the top golden rules for performance and tuning of your systems and architectures. This is just a peek at what I'm going to be teaching at the TDWI conference in San Diego in August. In my assessments I cover everything from hardware, to architecture, to systems, to overloading, and platform sizing.

Have you ever wondered how to reduce ETL/ELT processing times from 72 hours to 23 hours? Or from 18 hours to 2 hours? Have you wondered how to gain 400% to 4000% performance improvements from the systems you have? How do you know if your RAM/CPU, Disk, IP/Network, Applications and RDBMS are in balanced and peak performance modes? Have you questioned when to buy new hardware, and what platform to move to?

The following golden rules are the top tips of performance across systems architecture - these are all part of a workshop course, and assessment that I offer on-site - which tailors the responses to your organization.

The top rules to performance and tuning any system are as follows:
1. Reduce the amount of data you're dealing with
2. Increase Parallelism

The rest of the rules are assigned to meet different categories and can include:
3. Balance load across the applications
4. Re-Architect the processes
5. Limit the RDBMS engines to their use of hardware
6. Partition, partition, partition
7. Manage restartability, fault-tolerance, fail-over
8. Classify Error categories
9. Do not overload hardware with too much parallelism
10. Tune disk access

There are about 250 such recommendations which go into tuning any sort of system, ranging from midrange to client/server based. Mainframes work slightly differently and (sometimes) require a lifting of the CU limitation put on the login. But let me talk about the first two rules: 1 & 2.

Decreasing the data set:
There are a multitude of ways in which to decrease the data set:
1. The first is to identify which data you will actually be using during processing, and then ensure that only that data actually passes through the process. Sometimes this requires re-architecture in order to see the performance gains or to be able to reduce the data set.
2. The second is to partition the data, and apply the second rule - increase parallelism. Once partitioned, each partition within the parallel set of processes deals with "less data", therefore if the hardware can handle it, performance will increase.
3. Vertical and horizontal partitioning are the two kinds of partitioning available. Vertical is a split by number of "columns" or precision of the data set; horizontal is what we are used to with RDBMS table partitioning. These two are NOT mutually exclusive. Unfortunately, most RDBMS engines today do NOT do a good job of vertical partitioning.
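The ideas in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `orders` rows, the column names, and the modulo rule used to assign rows to partitions are all made up for the example, and a real RDBMS would do this work with its own partitioning features.

```python
def prune_columns(rows, needed):
    """Vertical reduction: keep only the columns the process will actually use."""
    return [{col: row[col] for col in needed} for row in rows]


def partition_rows(rows, key, n_partitions):
    """Horizontal partitioning: split rows into n buckets by an integer key."""
    buckets = [[] for _ in range(n_partitions)]
    for row in rows:
        buckets[row[key] % n_partitions].append(row)  # assumed rule: key modulo n
    return buckets


# Hypothetical input: eight wide rows, of which only two columns are needed.
orders = [
    {"order_id": i, "amount": 10 * i, "notes": "free text", "audit_blob": "x" * 50}
    for i in range(1, 9)
]
slim = prune_columns(orders, ["order_id", "amount"])
parts = partition_rows(slim, "order_id", 4)
```

Each of the four partitions now carries two narrow rows instead of eight wide ones, which is the "less data" per process that rule 1 is after.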

Increasing Parallelism:
1. Remember this: DBAs often make the mistake of setting only one switch in the RDBMS engine to engage parallelism; then, if the performance gain isn't seen, they change the switch back. This approach will NOT work. Most RDBMS engines require 10 to 16 switches to be set just to engage parallelism the proper way, and to allow the engine to rewrite queries, perform inserts and updates in parallel, and so on.
2. By simplifying (re-architecting) the processes, most processes can be created to run in parallel. Large complex processes are bound (in most cases) to serialize unless they are constructed to execute block style SQL. There are some RDBMS vendors that don't allow any other kind of processing because they execute everything in parallel for you.
3. WATCH YOUR PARALLELISM - too much of a good thing can overload your system, again, balance must be achieved. Watch your system resources - there are ways to baseline your system to gain the maximum performance for the minimum amount of changes. I can help you identify the quick changes to be made.
4. Remember: most RDBMS engines these days have parallel insert, update, and delete - but taking advantage of parallel updates and parallel deletes usually requires that a script be executed within the RDBMS (as opposed to a direct connect). This is because most of the RDBMS vendors don't offer PARALLELISM for updates/deletes in their APIs / SDKs for applications to use (this should change in the near future).
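The balance between "increase parallelism" and "watch your parallelism" can be pictured with a small Python sketch that processes partitions concurrently under a hard cap. It is an analogy only: `process_partition` is a hypothetical stand-in for real work, the data is made up, and threads in Python merely show the shape of the idea; inside an RDBMS the engine's own parallel settings do this job.

```python
from concurrent.futures import ThreadPoolExecutor


def process_partition(partition):
    """Stand-in for real work on one partition (here: sum the amounts)."""
    return sum(row["amount"] for row in partition)


def run_in_parallel(partitions, max_workers):
    """Process partitions concurrently, never running more than max_workers at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the input partitions.
        return list(pool.map(process_partition, partitions))


# Hypothetical partitions, e.g. the output of a horizontal split.
partitions = [
    [{"amount": 10}, {"amount": 20}],
    [{"amount": 5}],
    [{"amount": 7}, {"amount": 3}],
]
totals = run_in_parallel(partitions, max_workers=2)
grand_total = sum(totals)
```

The `max_workers` cap is the knob to baseline: raise it while the hardware keeps up, and stop before the system starts to thrash.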

I/O can kill performance; balancing I/O and caching activity can be a huge performance gain (or loss if done improperly). One day when we have nanotech storage devices, the "disk" I/O will disappear. Until then, we must live with it.

I'd love to hear what you've done to tune your environments. If I use your story at TDWI, I'll quote you as the source. Please let me know if you'd like to be quoted, and feel free to drop me an email in private as well. This entry is just a glimpse into the P&T world.

Thanks,
Dan L

  Posted by Dan Linstedt at 5:45 AM | Comments (0)


March 13, 2006

Does MDM include Data Visualization?

From where I stand (ok - sit)... I was on a plane this morning, and had the opportunity to view the captain's cockpit for a brief while as the crew ran through some of their pre-flight checks. As usual, my mind began to wander and ask the "what-if" questions: what if they didn't have a history of best practices? How would they know what to check for pre-flight? Are all the gauges real-time, or do some gauges offer "historical" data? How many of these gauges "manage data" for a single context? And then it hit me: all the gauges and knobs are really a "visualization" of the information they need to prepare for flight, fly, land, and do all the things a captain and co-captain need to do to move an airplane through the air safely.

This entry is more about unanswered questions than it is about speculation. I'd love to hear about your experiences as management, executive level, or otherwise - and what you might do in this situation.

Well, that got me thinking. I know cockpits are complicated, I can see that. It takes hundreds of hours (if not thousands) to learn to fly a commercial jet safely, to understand all the switches and knobs, and the "heads-up" displays that constantly stream information at the pilots. I started to reason: if getting a commercial pilot's license requires all this training, should CEOs, executives, and boards of directors also go through rigorous training? Where are the instructors for "running a company?"

I also began to wonder: what would happen if some of these fancy "real-time read-out displays" were not computerized, or visual? Maybe there's a pilot out there who can comment on what it's like to fly through a storm without visual aids, knowing what's up/down, or broken gauges that needed to be repaired.

I began to wonder - why isn't there a "cockpit" approach to running corporations? Would it or could it become that standardized? Is there a way to visualize all the information in a corporation? If you could visualize corporate business management in a cockpit manner, how would you describe the nature of the graphs, charts, landscape / horizon layouts? What kinds of knobs and dials would you have?

I began to think of the cockpit as Master Data Management (all data in the right place at the right time, attuned to the right purpose) for an airplane. Share with us how this might affect your visualization or MDM efforts.

Thanks,
Dan Linstedt

  Posted by Dan Linstedt at 8:58 PM | Comments (2)


March 9, 2006

DW2.0 - Introductory Thoughts

I've been granted permission by Bill to discuss DW2.0 on the blog, and in other articles that I write. This entry is an introductory look at DW2.0: the overall definition, sections, and components. If you wish to use the terms you will need to contact Bill directly. I've included Bill Inmon's stringent legal notice below:

"The definition of DW2.0 is intended for the non commercial use of anyone who wants to use the material. However, any commercial use of the material and the trademark is strictly forbidden and will be vigorously monitored and prosecuted. Commercial usage of DW2.0 specifically pertains to (but is not limited to) commercial usage in seminars, presentations, books, articles, speeches, web sites, white papers, panel discussions, reports, and other written and oral forms is forbidden. If you wish to use material about DW2.0 commercially, licensing can be arranged for a fee."

There are 4 sectors of DW 2.0 which comprise the "data warehouse" in a disciplined format: (note: all quoted material is from Bill Inmon’s site and description of DW2.0)

Interactive Sector - The place where high performance data warehouse processing occurs
Integrated Sector - The place where integrated data resides
Near Line Sector - The place where data with a lower probability of access resides
Archival Sector - The place where data with a truly low probability of access resides

From a 3000 ft perspective, each "sector" looks (at first) like a separate copy of the data, but this may not turn out to be the case. In fact, these can be made into logical divisions - particularly if the data model underneath supports the logical architecture in a physical format. I've created a public domain (freely available) data modeling architecture called the Data Vault which supports both the interactive and integrated sectors. The notion of Near Line and Archival Sectors appears (at first glance) to be more physically related to storage. I'll dive into these in future blog entries.

In my opinion, the RDBMS vendors should be the first to stand up and take notice (along with the appliance vendors). They should be rushing to the table to support DW2.0 from a mechanical standpoint - offering the developers "seamless" integration across each of the four sectors. That would bring the reality of a logical model and metadata management to the implementation cycles. I long for the day when I can "logically model" the data and no longer care (or know) how the physical implementation takes place - the only addition to the logical model might be data types and field lengths from the physical world.

Let's switch gears and discuss DW2.0 compliance, auditability, and SOR (system of record) for a moment. Below is Bill's definition of SOR and the best place to identify data as arriving from an SOR.

Because the data that enters DW2.0 has its first appearance in the operational environment, great care needs to be taken with the data. In a word, the data that eventually finds its way into DW2.0 needs to be as accurate, up to date, and complete as possible. There needs to be defined what can be determined the source data system of record. The source data system of record is the data that is the best source of data.

I often ponder the question: what does SOR truly mean? Hmmm - by that I wonder about the following case study (which actually happened to me 10 years ago on a government data warehouse).

We built a data warehouse that contained a master parts list and a few other master lists (hence my recent entries on Master Data Management). Our warehouse also contained integrated data organized by business key, but stored at the lowest level of grain. Furthermore, the information was not "transformed" except in raw data type, and defaults were assigned in specific cases documented by SLAs with the business.

Three things happened. Auditors were brought in because naysayers were stating that the warehouse was "wrong", and they wanted the project stopped. The first thing that happened was around data auditability. The auditors asked: why do the reports from the data marts not match the operational reports? Our team demonstrated the value of raw integrated data (both bad and good) stored within the warehouse, and that the warehouse reflected what was in the source system. The auditor passed the warehouse, and then proceeded to tell the business that the operational report (financial calculation) was wrong and needed to be corrected. The business would not have had "accountability", much less found or fixed the problem, if our data warehouse had not been deemed a "reliable and compliant" source of data.

The second thing that happened (at the same time): the auditor saw the parts list, employee list, work order list, and so on... and then asked: does this "vision of integrated data" exist in any one source system? The answer was clearly no. The auditor then checked the individual data elements for auditability and traced them back to their source systems. Once satisfied, he labeled the warehouse suitable to become a "system of record", as it was the only place that data existed.

The third thing that happened: the auditor then asked for a source system that was called "the master system" for bill of materials to be re-loaded with 5-year-old data. But the business had changed, the models in the source system had changed, and the restore could not take place - making it impossible for the "master system" to be a system of record for historical data. The only place that data could be loaded was in the warehouse.

As I read through the DW2.0 specification, I believe there is a place for accountability, SOR, and compliance within the warehouse. Again, it has a lot to do with the traceability of the data sets and creating audit trails where they didn't exist before. We'll dive into this more later.

For now, if you have thoughts or comments - I'd love to hear about them. What part of DW2.0 would you like to know about?

Thank-you,
Dan L

  Posted by Dan Linstedt at 6:59 PM | Comments (0)


March 8, 2006

DNA Computing - Control over DNA Molecules

DNA computing is rapidly making strides in the nanotech industry. There is an interesting evolution with absolutely profound implications: control over a single DNA molecule via nano crystal antennae. The presentation is available for a small fee, but shows just what is possible. Imagine a massively parallel computing engine at phenomenal speeds, controlling millions or billions of DNA molecules via radio signals. Wow! How about a thumb drive with 10^8 terabytes of computing power in a couple grams of DNA solution? Searching this solution in less than 3 seconds for answers, computing within the solution in 3 to 10 seconds...

The presentation is on the MIT web site.
The implications are profound. The notion of controlling a single DNA molecule from a radio wave is incredible. Let's step off the edge, and look into the future, over the horizon - let's see if we can think of applications and implications of this technology within the DW / BI space. Beyond the obvious applications in bio-tech, and medical science, let's see what we can come up with.

The web blurb talks about the following:

Anyone can imagine controlling a model car or airplane with radio signals, remotely guiding the machine along a prescribed pathway. In this Knowledge Update, readers learn that the same is being done with DNA and other molecules. This Update describes the tools behind this molecular control, which relies on nanotechnology. In addition, readers learn how this technique can control the binding of DNA, which governs biological processes from cell division to switching genes on and off. Consequently, controlling bimolecular operations opens many possibilities, such as using this nano-control for genetic testing, building molecule-size devices that move on command, and much more.

Now, let's dive into nano-computing for a moment: imagine a computing system containing a few grams of DNA - say within the size of a thumb drive for a USB port. Within that thumb drive are two things: modified DNA with nano crystal antennae, and a computing system that produces super short, very "weak" radio transmission waves; just enough of a wave to reach the localized DNA. Of course the frequency must be localized as well, and the radio wave must be too weak to travel outside the bounds of the thumb drive - maybe the inside of the thumb drive is coated with a shielding material that keeps the radio waves within the device.

Power consumption is low for this kind of thing. It would be very easy to "program" the DNA, especially since the radio waves cut, splice, and switch the molecules on and off. The challenge would be in reading the DNA results. Suppose there are two mechanisms available to "read results". One possibility might be based on a solution, encouraging and discouraging bonding based on ionization of the molecules; the reading mechanism might then be a segment of light that passes through the entire solution, where the shadow and/or the intensity of the shadow produces a read-out of the result. Or, instead of light and colors, maybe additional radio waves are passed through the solution - ones that don't interact with the antennae - and what bounces back is read into an "imaging" device, with the image then interpreted by standard programmatic methods.

It is possible then, by combining existing technology with nanotechnology into a single device, to see how "exponentially hard" computational problems can be solved through a simple USB plug and play, and that existing technology can be used to "read" the answers, and send the signals in parallel to the actual computation engine. However, now that I think of it, why not use this for simple solutions too? Solved in parallel, all the DNA strands and programmable DNA molecules should come up with the same answer, every time.

Radio waves deliver the same signal to each programmable element at the same time; using imaging and light/color/shadowing techniques, the solution could be "read". Localizing the radio waves and shielding the cover would minimize interference.

I'd love to hear from you, and see what you think of this future vision.

Thank-you,
Dan Linstedt

  Posted by Dan Linstedt at 8:01 AM | Comments (0)


March 7, 2006

Is it time to re-define your Data Warehouse?

I've commented in the past on my definition of the data warehouse, and recently, based on that definition, I've been commenting on Master Data Management. In this blog I take a step back, and post the pros and cons of constructing a compliant (active) data warehouse. I would love to have everyone weigh in, and tell us what kind of a data warehouse your organization is implementing and why. I'd like to clear the air and see if compliance within a data warehouse is really an issue for the enterprise.

What exactly does a COMPLIANT data warehouse mean to you? Please tell us, we'd love to hear about it.

I've grown up in the industry believing that constructing auditable historical data stores is the proper way to build "data warehouses." I've had huge successes in passing audits, proving the warehouse contains correct data according to the source, and producing data marts of all shapes and sizes. In the environment we were in, with this approach, we've shown time and time again the flaws in the operational systems (including operational reports) which were costing the company millions of dollars a year. Without auditable historical data stores (what I call a compliant data warehouse), the nay-sayers would've been right when they blamed the warehouse for being "wrong", and our team would have been put "out of business."

This approach to defining data warehouses and the process of data warehousing has led me to new architectures (like the Data Vault data model), new methods of loading and validating data utilizing ETL/ELT routines, and writing articles on compliance and the nature of the data loads. However, I understand from a number of sources that not all data warehouses need to be compliant - but is this really true? I'd like to hear from those who don't need the warehouse to be compliant or auditable within their organization. I'd like to know exactly what the enterprise is using the warehouse for, and how they justify the data within.

With that, let's take a look at the pros and cons (from my opinionated stance) of compliant versus non-compliant warehouses:

Compliant:
Pros:
* Provides accurate data, data that matches (good / bad or indifferent) the source systems
* Provides integrated (at the semantic business key level) data sets, with non-altered details, and the same grain as the source system. Again, the data is hooked together through defined relationships across business keys (see the Data Vault modeling concepts)
* By bringing both the good and the bad data into the warehouse, can show where business processes are truly broken, can often show when they broke (as long as history is available to demonstrate this)
* Allows extremely rapid build-out of any data mart desired for the organization. Once a standard data model for compliant / historical data store has been established, data mart build out can be done quickly. We had a process that allowed architecture and design (loaded with a percentage sample data set) within an hour of the request.
* Begins to shift the vision from "warehouse" to SoR (system of record)
* Places accountability into the hands of the business users with the use of "error marts"
* Increases visibility into broken source systems, broken business processes.
* Increases security of the data set in the warehouse
* Increases metadata and provides additional data lineage discovery points.
* Business value in the grain of the data supports data mining activities, along with value to produce "what's broken" and "what's working" across business functional units.
* Master Lists are clearly produced as a "mart" or a delivery mechanism directly from the compliant warehouse.
* Compliant warehouses offer an easier path to "near-real-time" and/or "active" data warehousing, because the complex business rules are applied downstream, from the warehouse TO the mart.

Cons:
* Introduces dirty data into the warehouse.
* Begins to shift the vision from "warehouse" to SoR (system of record)
* Moves the "logic" of cleansing, grain shift, and master data production down-stream.
* Requires data marts for delivery of "cleansed/merged/mixed" data according to the business.
* Raises questions about the warehouse being / acting similar to an operational system.
* Requires (at a minimum) an added layer of data storage before end-user utilization of the data set.
* Requires additional funding for development effort (nothing above what you would do for a normalized warehouse).

Now, let's take a look at a traditionally defined data warehouse, or a non-compliant data warehouse.
Pros:
* Single storage area for all "data warehousing activity"
* Data in the warehouse is cleansed, altered, and heavily integrated (loss of grain in many cases) - producing what we like to call Master Data sets.
* Inherent business value built into the transformation and integration layer on the way in to the warehouse.
* Doesn't necessarily require separate marts to deliver the data to the business user.
* No question that the System of Record is one or more source (operational) system.
* Master "lists" of cleansed and "fixed" or altered data become a source of business revenue, and often drive operational systems.

Cons:
* When the business changes their definition of "master data", all the transformation layers must change on the way in to the warehouse, usually resulting in data model changes, and huge impacts to accommodate the change.
* Changes often cause heartburn in IT and business - because the grain of the previously rolled up data may shift, therefore interpretation of old historical data may change on the way out. This can lead to confusing financial figures, and numerous questions about the "warehouse being right".
* It's easy for nay-sayers to prove the warehouse "is wrong", because it doesn't follow their interpretation of the business rules.
* Audits are difficult (if not impossible) to pass, WITHOUT extra production of Audit Trails along the way of the ETL / ELT routines, in other words, the data before, after, and when - of transformation - must be recorded in order to pass an audit.
* Master "lists" of cleansed and "fixed" or altered data become a source of business revenue, and often drive operational systems. (This is both a pro and a con - depending on your viewpoint).
* "Real-time" or "active" warehouses are more difficult and more costly to produce, because all the data arriving must go through complex data integration rules and cleansing before landing in the warehouse. Oftentimes the processing of these rules relies on alternate data sets which may not be available within a refresh cycle of 1 minute or less.
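The audit-trail point in the list above (recording the data before, after, and when of each transformation) can be sketched as follows. This is a minimal Python illustration, not a description of any ETL tool: the `audited` helper, the rule names, and the in-memory `audit_trail` list are all hypothetical, and a real load would write these entries to a durable table.

```python
from datetime import datetime, timezone

audit_trail = []  # hypothetical in-memory stand-in for an audit table


def audited(rule_name, transform, value):
    """Apply a transformation and record the value before, after, and when."""
    result = transform(value)
    audit_trail.append({
        "rule": rule_name,
        "before": value,
        "after": result,
        "when": datetime.now(timezone.utc),
    })
    return result


# Two assumed cleansing rules applied on the way in to the warehouse.
clean_name = audited("trim_and_upper", lambda s: s.strip().upper(), "  widget-a ")
clean_qty = audited("default_negative_to_zero", lambda n: max(n, 0), -3)
```

With entries like these, an auditor can reconstruct what the source value was even though the warehouse only stores the altered one.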

These are just my thoughts. I'd love to hear what you would add to the pros and cons of each of these lists - I want to know what you are experiencing in the marketplace. Many of the warehouses built with compliance in mind (as I've described it above) have had 10+ years of success and are in fact growing today, with buy-in from finance, HR, sales, and even the corporate board of directors.

Please let me know what you think, I'm also curious to know how many of you are seeing a request for a compliant data warehouse - and just what does that mean?

Hope to hear from you soon,
Dan Linstedt

  Posted by Dan Linstedt at 6:25 AM | Comments (1)


March 6, 2006

Hidden in the un-structured information...

Welcome again. Unstructured data is a hard thing to grasp, let alone to process; but if we (businesses) are going after it, then we MUST have a reason. That reason? There must be value in the information hidden in the unstructured layers. After all, what is "unstructured" data anyway? I think free-form text is still semi-structured, and so are images, emails, word-docs, and other such elements. They are all structured to some degree; otherwise programmatic approaches would not be able to display the documents, search the images, allow alterations, or perform matches. I think what we should be focusing on in the Data Warehousing / Data Integration industry is how to best leverage the "unstructured information" programs and algorithms already built.

Think about it: with images there are all kinds of image processing programs for image matching, alteration, consolidation, over-lay, resizing, colorization, and so on. For drawings, there are CAD programs, with element tags at the end or in the middle of the image that explain all the components. For chemical images there are sets of commands and tags that explain how to build a rotating 3D visual of the chemical elements and their associative parts. For word-docs and other docs there are "parsing and processing programs" like Microsoft Word, KDE KOffice (open source), Star Office, and so on. For e-mails, there are many different programs - but most of the email traffic can actually be "sniffed" off TCP/IP packets without much damage to the content (if any today).

Given all this, the question I truly have is: WHAT IS unstructured data? I'm not so sure it's a good term to use, but let's just accept (for the purposes of this entry) that unstructured data is everything that isn't (easily) defined by a standard RDBMS table structure - without BLOBs and CLOBs of course. Let's pretend for a minute that everything stored in a BLOB or CLOB is considered "unstructured", and then return to the question above.

Ten years ago (or more) I worked as an employee for a government manufacturing corporation, big money, big contracts, compliance, and unstructured data. Our manufacturing plan was filled with unstructured data. At that time we needed (as a part of our effort) to integrate parts drawings, and to look for text within the CAD drawings to figure out what impact it had on the plan; in other words, annotations for specific parts drawings. Now back then, the CAD images were just that, CAD images - and picking the text out wasn't as simple as "looking for the text attached to the image". We literally had to process vector graphic commands.

Why? What was hidden in our unstructured information?
In our case: instructions and plan estimations. The company was going through SEI/CMM, lean initiatives, an SAP implementation, business process re-engineering, compliance, and so on. They were trying to improve the efficiency of the planners and ensure the right image was attached to the right descriptive paragraphs explaining the build process. There was (and still is) inherent business value in processing the unstructured information.

Why is this important for us?
Because unstructured information processing is hot now for the commercial world. There's value hidden in these documents, and we need to understand (as a corporation) where that value exists, and how it can impact our business. Bill Inmon shows a wonderful demonstration of finding "gas-pipeline" problems by providing topographical maps or manufactured landscapes based on word-association and frequency, from scanning unstructured (semi-structured) documents across the organization. Improving communication and spotting problems before they occur is a huge benefit.
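To make "word-association and frequency" a little more concrete, here is a minimal sketch in Python. It is not Bill Inmon's method or any vendor's product; the stop-word list, the sample documents, and the rule that two words are "associated" when they appear in the same document are all assumptions made purely for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is", "was", "at", "near"}

def tokenize(text):
    """Lowercase the text and return its words, minus common stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def word_stats(documents):
    """Return (word frequency, pair co-occurrence) counters.

    Assumption: two words are 'associated' when they occur in the same document.
    """
    frequency = Counter()
    pairs = Counter()
    for doc in documents:
        words = tokenize(doc)
        frequency.update(words)
        # Count each unordered word pair once per document.
        for a, b in combinations(sorted(set(words)), 2):
            pairs[(a, b)] += 1
    return frequency, pairs

# Made-up sample documents, standing in for text pulled from emails or reports.
documents = [
    "Pressure drop reported on the pipeline near valve seven",
    "Valve seven inspected, pipeline pressure normal",
    "Quarterly budget review scheduled",
]
frequency, pairs = word_stats(documents)
print(frequency.most_common(3))
print(pairs.most_common(3))
```

Counts like these are the kind of raw material a "landscape" visualization could be drawn from; the hard part, as always, is deciding which business question the counts are supposed to answer.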

So how do we access this information easily today?
Well, if you're like me and you don't want to actually launch Word, Excel, or graphics editing programs in order to "scan the screen to capture content", then you'll want to investigate the use of an EII tool. EII tools bring with them the ability to process unstructured and semi-structured information through SQL queries, XQueries, and potentially other mechanisms.

What you do with the information after you have discovered it should be pre-determined by the business case, that is, the reason for purchasing and installing EII in the first place. As usual, the business needs to drive the need for IT to solve the problem of accessibility. Establish the value of "finding" and "using" the data in the unstructured world before you set out to implement.

What are some of EII's strengths today?
* Ability to access XML documents
* Ability to access Word docs, Excel, and PowerPoint
* Ability to access emails

What are some of the features that EII will need in the near future?
* Ability to parse, access, and pull text from various image formats
* Ability to use image match and compare algorithms (widely available on the market), say for matching thumbprints, and retina scan images.
* Ability to query CAD images, layered images, and process "statistics" about the images. Making use of the statistics about an image can be much more powerful than making use of the image itself. EII of the future will focus on providing high quality access to "summarization" of existing images in a standardized format.

Remember, summarization, and what is done with that summarization of unstructured and semi-structured information, can often shed light on "how" these documents are utilized or how they meet the business requirements set before them. EII is a tool that can and should help in these areas. Don't forget unstructured search tools as well - EII should partner with these vendors in order to have a wider grasp of "tagging" technology and summarization/scoring technology.

The best use of unstructured/semi-structured data is the one that has a predefined business question or business case to answer.

Are you accessing unstructured/semi-structured data? I'd love to hear from you - what are your challenges or successes with what you've done?

Thanks,
Dan L

  Posted by Dan Linstedt at 6:05 AM | Comments (0)


March 3, 2006

VLDW: What happens in a scaled cluster?

I wrote a blog entry on this a while back, about MPP vs Clustering. Now I'm going to discuss what happens in an Active cluster (to use an MS term) that usually causes problems. I'll also talk about clustering within a single node-group under an MPP option. While there are many issues surrounding clustering and volume, some are more prevalent than others. The golden rule is: Volume and Latency change everything! Come see me in August at TDWI, where I teach a class on VLDW and its technical aspects.

In my last entry on this topic, I stated that it is better to run with MPP than with Clustering and that the more volume the clusters contain, the costlier it is to keep them going. Here's what's happening under the covers.

This pertains to a completely clustered SMP system:
1. Active clustering means all nodes share the entire RAM and all the data, and run copies of the same process at the same time.
2. Each node "knows" about the others.
3. Assuming I have 5 clustered SMP nodes, and each node has 4 GB of RAM with 8 CPUs, the applications running on each node believe they have access to 20 GB of RAM and 40 CPUs.
4. In order to make "20 GB" of RAM addressable, each machine must share a master memory allocation table. Along with that allocation table, it shares semaphores - or locking mechanisms.

The network between the nodes MUST be dedicated to server-to-server communication. If the network between the servers is mixed with disk traffic, client traffic, or other server traffic, the communication layers begin to break down. Keeping "up with the nodes", in other words synchronizing each node in the cluster every millisecond for access to the shared master memory allocation table, becomes a bear. The more "nodes" that are added to the SMP cluster, the more network traffic there will be, and the harder it becomes (mathematically) to keep them all in sync - due to the limits on the speed of the network, the CPUs, the RAM, and the disk. These upper limits are constantly being raised as hardware gets faster; however, the effective limits drop again as RAM is added to the machines or as more data has to be managed on shared disk.
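A back-of-the-envelope sketch of that "mathematically harder" point, in Python. The big assumption here (mine, not a statement about any particular clustering product) is that every node has to stay synchronized with every other node, so the number of node-to-node relationships is the number of distinct pairs. The function names are hypothetical.

```python
def apparent_resources(nodes, ram_gb_per_node, cpus_per_node):
    """Total RAM (GB) and CPUs the clustered applications believe they have."""
    return nodes * ram_gb_per_node, nodes * cpus_per_node

def sync_pairs(nodes):
    """Distinct node pairs to keep in sync, assuming a full mesh: n * (n - 1) / 2."""
    return nodes * (nodes - 1) // 2

# The 5-node example from the list above: 4 GB of RAM and 8 CPUs per node.
ram_gb, cpus = apparent_resources(5, 4, 8)
print(ram_gb, cpus)  # 20 40

# Pairwise relationships grow much faster than the node count itself.
for n in (2, 5, 10, 20):
    print(n, sync_pairs(n))
```

Under this assumption, doubling the cluster from 5 to 10 nodes takes the pair count from 10 to 45, which is one simple way to see why adding a node adds more than one node's worth of coordination work.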

In a clustered environment such as this, everything is shared across all machines. The next thing that happens is the sharing of disk. The sharing of disk introduces I/O collisions across the I/O network (which should also be independent of every other kind of traffic). I/O contention must be managed, because all nodes have access to the same data at the same time. The trick is (again) to set up a master data access table, just like the master RAM access table, and then synchronize the master data access table across all nodes in the cluster. The problem comes (again) as the data set grows, and localization of the information becomes a burden. For example, suppose the database that runs on the cluster needs to scan 50 million rows and run computations.

The table is partitioned, but it becomes a shared job. The process starts on a single node in the cluster, and the database thinks it has 20 GB of RAM to access, so it begins to load the RAM in each of the clustered machines with different data sets. As it does, this exponentially increases the network traffic between the machines (in order to synchronize the RAM and CPU actions across the machines), and increases the network traffic between the machines and the disk device. Ultimately a second request for large data comes in while the first request hasn't finished yet. There's not much RAM left on each clustered node, so SWAPPING ensues. This is again an exponential increase in I/O (I/O includes everything from network to RAM to CPU to disk access). Again, the synchronization routines take over, and every single node in the cluster tries its best to balance the resources.

This of course leads to extremely slow response times for both the first and the second requests, and so forth. The synchronization routines slow this process down, way down to a crawl. Operations begin to take on a sequential nature as opposed to a parallel nature because they run out of RAM, run out of CPU computing power, and the network gets so bogged down that it cannot handle any further requests. Now, we think that by adding a new clustered node we'll solve the problem - but instead it only makes the problem worse.

I think by now it's evident that clustering for a very large data warehousing solution is NOT desirable. Can you put a small number (2 to 4) of clustered nodes on a single MPP solution? Yes, if you architect the nodes to operate independently. Clustered nodes in a single MPP solution are one way to handle this kind of volume growth, and adding another MPP node of clusters is OK because it maintains autonomy and scales linearly. BUT if you put too much data on a single "clustered" node within the MPP, you run into the same problems that large clusters present. Large "clustering" of machines (in my opinion) won't necessarily be feasible until we have speed-of-light communication between the clusters and we are using RAM-based or nanotech-based data storage rather than mechanical disk.

MPP, on the other hand, splits the load, and the trick with MPP is to avoid a "hot-node", which acts like a cluster in trouble. Balance of the data and the processing in the MPP world is EVERYTHING. But with balance and an appropriate "split" of the data sets, near-linear scalability can be achieved.
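Here is a small Python sketch of why the choice of "split" matters. It assumes rows are assigned to MPP nodes by hashing a distribution key, which is one common approach but not necessarily how any given MPP product does it. The function names, key examples, and node count are hypothetical.

```python
import hashlib
from collections import Counter

def node_for(key, node_count):
    """Assign a key to a node with a stable hash (assumed distribution scheme)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count

def skew(keys, node_count):
    """Busiest node's row count divided by the ideal even share.

    1.0 means perfectly balanced; larger values point to a hot node.
    """
    counts = Counter(node_for(k, node_count) for k in keys)
    ideal = len(keys) / node_count
    return max(counts.values()) / ideal

# High-cardinality key (think: order id) spreads rows evenly over 4 nodes.
balanced = skew(range(100_000), 4)

# Low-cardinality key (think: country code) piles most rows onto one node.
hot = skew(["US"] * 90_000 + ["CA"] * 10_000, 4)

print(round(balanced, 2), round(hot, 2))
```

With the low-cardinality key, at least 90% of the rows land on a single node no matter how many nodes are added, so that one node does most of the work while the rest sit idle. That is the "hot-node" situation in miniature.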

Today, I nearly always choose the MPP option for data warehousing. In another entry at another time, we will explore MPP versus Clustering from an operational standpoint.

If you have success stories about clustering, I'd like to hear about them - please also include the estimated size of the data set, number of nodes, amount of RAM on each clustered node, and number of CPUs on each clustered node. If you have horror stories, I welcome those too. By sharing your experiences we can begin to shed light on this subject.

Hope to hear from you,
Dan L

  Posted by Dan Linstedt at 5:19 AM | Comments (0)


March 1, 2006

Data Warehouse Appliance, another look

Appliance-based data warehousing is on the rise, and no wonder - the cost per terabyte is lower, and for specific applications of the warehouse these platforms are sometimes blazingly fast. They offer plug and play technology with HA (high availability) and Fail Over just by plugging in another appliance. They offer remote management and self-updates to the BIOS and firmware, and most of them run on open operating systems like Linux. In this blog entry I'll discuss both the pros and cons of appliance-based warehousing. I still believe that this will be a market segment to watch, and that appliances will eventually flood the market as the backbone for high availability data integration and warehouses.

There was a comment a while back that discussed an article in DMReview about appliances. It was written by Roger Gaskell of White-Cross systems; they build hardware for high-performance MPP and, lo and behold, produce a PROPRIETARY appliance.

What do appliances bring to the business?
They bring a number of wonderful features all pre-packaged in a single domain: (this is by no means a complete list)
* High Availability
* Fast loading capabilities
* Compression and Encryption (native in some cases)
* Plug and Play MPP units
* SQL Query interfaces
* Super Fast Data Access
* Low cost per terabyte options
* Plug and Play Fail-Over
* Automatic self-updating (in some cases)
* Remote Monitoring
* Compliance for data (in some cases, they include data versioning by date/time)

We saw it with the disk market in the 80's, and we saw it with other devices in the 90's, like the consolidation of the cell phone with podcasts, downloads and now music on demand - appliances are everywhere. I've written on this subject of "CONVERGENCE" on B-Eye before; convergence is everywhere. The disk manufacturers have now grown up - the disks are no longer "just simply storage". They contain CPUs, RAM, caching algorithms, load-balancing mechanisms, reformatting (under the covers), hot-swapping, fail-over, dynamic traffic re-routing, hot-spot contention resolution, self-monitoring, remote updates, and more. On top of that, they all adhere to common SAN or NASD standards, meaning we can plug in an IBM device next to an EMC device, and they don't care - they'll talk to each other over a standard disk I/O protocol.

What is missing from the Appliance today?
There are advances in the data warehouse appliance that must begin to take shape for this market to really grab market share. They include some of the following:
* Standards-based HA and Fail-Over. LARGE organizations (Fortune 50) will end up with more than "one" data warehouse appliance vendor over time; this is inevitable. They will require that plug and play be orchestrated across multiple vendors' devices - that they can plug and play them together in a grid fashion or over a WAN, and have them talk to each other.
* Development of a standard high-speed data exchange interface that can bridge multiple vendors together. The vendor today that "opens" the architecture to this sort of component will have a majority of the market share tomorrow.
* Partnerships with software vendors that do data integration. I've said it before, I'll say it again - establishing a low-cost option that is OEM'd inside the DW appliance to get people off the ground would be a huge boost to off-the-shelf productivity. It's also possible that partnerships with vendors of "registry solutions" and "web-based management portals" would be a huge boost to sales and market share. Such partnerships would further reduce the cost of getting data integration in the door and standardized, particularly if the appliance vendor can "standardize" basic integration or web-services efforts.

I do not believe that proprietary hardware will "stop" the flow of appliances, nor do I believe it's necessarily a bad thing. EMC has it, IBM has it, Fujitsu has it; just about every disk manufacturer out there has it in their appliance, and they are well received today. It's a matter of opening up the architecture to a STANDARDS BASED service exchange, one that is obviously high-speed. These standards do not exist today, but they will - particularly as companies purchase these solutions from multiple vendors. Just look at IBM's DB2 UDB - MPP option that sits on Data Blades; it shares similar concepts, although it is maybe not quite an "appliance" just yet.

Feel free to contact me for more information. I would also love to hear what you think, both positive and negative - add your bullets to the list of why or why not appliances in the future.

Thanks,
Dan L

  Posted by Dan Linstedt at 6:42 AM | Comments (3)