Business Intelligence Network

Blog: Dan E. Linstedt

High Volume, Low Performance - Can CDC Help?

In my last entry, I blogged on High Volume and Low Performance issues that you might run into. In this entry we'll talk a little about CDC (Change Data Capture) and why it is paramount to the success of your systems moving forward. If you've got high volume issues, and you don't have CDC in place today, you may be fighting an ever-growing data set that will at some point become unmanageable. If CDC and Real-Time / Right-Time processing (you know how I feel about the term Real-Time) are not implemented together, your right-time delivery system runs a high risk of pushing data that has not changed across the wire, over and over again.

Change Data Capture, or CDC, should be a vital part of any back-office BI solution that is put in place today. It may mean getting over the hurdles of signing SLAs with your data service providers, but believe me, it will be worth it. The question that is often missed is: what is my ever-growing cost of NOT implementing a CDC solution?

When we think about it this way, we end up at the right conclusion. Why? Because the data sets continue to grow, and when data sets grow, traffic on our network grows, and the logic to decipher changes and transform / remove / record duplicates becomes more complex. With complexity comes system slow-down; with more network traffic also comes system slow-down. All in all, not implementing CDC causes costs to rise - and the faster the business changes / moves, the quicker the costs of not having CDC at the source rise.

Wait a minute! CDC At the SOURCE? How in the world can I do that? I don't even own all my sources...
Ahh - ok, let's put it this way: even if you are paying for data from the outside world, you are the customer, and as a customer - aren't you entitled to "being right?" Aren't you in control of the negotiations with the data provider as to what they will provide and how they will provide it? Moreover, if you can get your data from another provider who is willing to go the extra mile and implement CDC, well then, what are you waiting for? There may be an initial cost outlay up front, but in the long run the overall costs will be much smaller than dealing with copies of data you already have, and eating up precious bandwidth (which has a cost as well).

CDC is required on ALL source systems, and by the way - if you are BUILDING an SOA or an MDM solution, or you're setting up a data governance initiative, putting CDC in place will sooner or later become a necessity. That applies not just to the source systems you own, but also to the data and service providers you don't. Let's take Salesforce, for example. If you outsource your sales management to Salesforce, then you'll want them to implement CDC on any of the "changes" that take place.

Change Data Capture systems become the "expert" logic for providing the traceability and auditability demanded by auditors and compliance initiatives around the world. They provide a safe and consistent means to extract every data set that changes, when it changed, and what it was versus what it changed to. The overhead on the source systems is often used as an excuse NOT to engage in CDC - this is the wrong way to look at the cost. A better question to ask is: what is the cost of all that extra traffic on my network, traveling through my transformation tools (be it EAI, EII, or ETLT)? I'm sure the cost of all that extra traffic is much, much higher than the overhead cost of CDC on source systems, especially when the data set grows again, or when the interval between deliveries is shortened again.
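To make "when it changed, and what it was versus what it changed to" concrete, here is a minimal Python sketch of what a captured change might carry. The `ChangeEvent` structure, its field names, and the `audit_lines` helper are all assumptions made up for illustration; they are not taken from any particular CDC product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical captured change: which row, when, and before/after images."""
    table: str
    key: Any
    changed_at: datetime
    before: Dict[str, Any]
    after: Dict[str, Any]


def audit_lines(event: ChangeEvent) -> List[str]:
    """List each column whose value differs, as 'what it was -> what it became'."""
    lines = []
    for column in sorted(set(event.before) | set(event.after)):
        old, new = event.before.get(column), event.after.get(column)
        if old != new:
            lines.append(
                f"{event.changed_at.isoformat()} {event.table}[{event.key}] "
                f"{column}: {old!r} -> {new!r}"
            )
    return lines


# Illustrative data only.
event = ChangeEvent(
    table="customer",
    key=42,
    changed_at=datetime(2006, 1, 1, 12, 0, tzinfo=timezone.utc),
    before={"name": "Acme", "city": "Denver"},
    after={"name": "Acme", "city": "Boulder"},
)
for line in audit_lines(event):
    print(line)
```

Only the `city` column shows up in the audit trail here, because `name` did not change. An auditor gets the timestamp, the row, and both values without anyone having to diff two full extracts.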

Now, what do you want CDC feeding?
I would suggest EAI, EII, ETL, ELT, JMS: any tool that MOVES data, transforms data, or extracts / loads data. There are different benefits to using CDC upstream for each of these.

What kind of features should I look for in my CDC offering?
Well - that all depends on the business purposes of CDC. Here are a few of the criteria I look for in a CDC engine (by the way, watch our web-site for a CDC Scorecard which will help you evaluate CDC solutions):
1. Does it have compression on the source side?
2. Does it keep a user-defined limit of the data on the source in a compressed format? In other words, can I as a user tell it to keep 3 months of one set, and 2 weeks of a different set of CDC?
3. If it crashes, or the system goes down, can it preserve transaction consistency and resume from where it left off?
4. Do I have a pure read only SQL interface to the CDC? Meaning, can I use ANSI standard SELECT statements to get the data out?
5. Can I set up the CDC to push changes into queues or EAI software?
6. Can I set up the CDC engine to cache the changes (as indicated above in #2), then provide them on demand?
7. Can I set up the CDC engine to "clear the cache" on commit? Particularly if I'm using EAI or EII to transmit the CDC to "the other side" / a data warehouse.
8. Will the CDC work on all my source system platforms?
9. Is the overhead of the CDC engine something consistent and measurable? Can I limit the overhead in accordance with the feature set I choose?
10. Does the CDC engine have the capacity to run a snapshot-backup / restore of the compressed changes?
11. Can the compressed change log be archived / uncompressed in total, and replicated entirely to a hot standby?
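Criteria #3 and #4 can be sketched together: a consumer that reads changes with nothing but a plain SELECT, and remembers a checkpoint so it can pick up where it left off after a crash. This is only a sketch under assumptions: the `customer_changes` table, its columns, and the `read_changes` function are hypothetical, and an in-memory SQLite database stands in for whatever SQL interface a real CDC engine would expose.

```python
import sqlite3

# Stand-in for a CDC engine's read-only change store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE customer_changes (
           change_seq  INTEGER PRIMARY KEY,  -- increases with every change
           customer_id INTEGER NOT NULL,
           operation   TEXT NOT NULL,
           changed_at  TEXT NOT NULL
       )"""
)
conn.executemany(
    "INSERT INTO customer_changes VALUES (?, ?, ?, ?)",
    [
        (1, 42, "INSERT", "2006-01-01T09:00:00"),
        (2, 42, "UPDATE", "2006-01-01T09:05:00"),
        (3, 77, "INSERT", "2006-01-01T09:07:00"),
        (4, 42, "DELETE", "2006-01-01T09:30:00"),
    ],
)
conn.commit()


def read_changes(conn, last_seq, batch_size=100):
    """Fetch up to batch_size changes after last_seq using a plain SELECT.

    Returns the rows and the new checkpoint. The caller is expected to
    persist the checkpoint only after the batch is safely delivered.
    """
    cursor = conn.execute(
        "SELECT change_seq, customer_id, operation, changed_at "
        "FROM customer_changes WHERE change_seq > ? ORDER BY change_seq",
        (last_seq,),
    )
    rows = cursor.fetchmany(batch_size)
    new_checkpoint = rows[-1][0] if rows else last_seq
    return rows, new_checkpoint


first_batch, checkpoint = read_changes(conn, last_seq=0, batch_size=2)
# Simulated restart: only the saved checkpoint survives, and reading resumes from it.
second_batch, checkpoint = read_changes(conn, last_seq=checkpoint, batch_size=2)
```

The design choice worth noting is that the consumer owns the checkpoint, not the source. If the consumer dies between batches, it re-reads from its last saved sequence number and nothing is skipped.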

Ultimately, if a record changes and has 360 fields (let's say it's a mainframe record or a COBOL-based structure), can I gear the CDC to issue an UPDATE transaction with JUST THE PRIMARY KEY, and JUST THE CHANGES? Maybe only 10%, or 36 fields, changed. I don't want all 360 fields running across my network...
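Here is a rough Python sketch of that idea, using the same numbers: a 360-field record in which 36 fields change. The `delta_update` function and the table and column names are hypothetical, made up for illustration; a real CDC engine would do this internally and in its own format.

```python
def delta_update(table, key_column, old, new):
    """Return (sql, params) for an UPDATE that sets only the changed columns.

    Returns None when nothing changed, so no traffic is generated at all.
    """
    changed = [c for c in new if c != key_column and old.get(c) != new[c]]
    if not changed:
        return None
    set_clause = ", ".join(f"{c} = ?" for c in changed)
    sql = f"UPDATE {table} SET {set_clause} WHERE {key_column} = ?"
    params = [new[c] for c in changed] + [new[key_column]]
    return sql, params


# Hypothetical 360-field record in which the first 36 fields change.
old = {"id": 1001, **{f"f{i:03d}": 0 for i in range(1, 361)}}
new = dict(old)
for i in range(1, 37):
    new[f"f{i:03d}"] = 1

sql, params = delta_update("wide_record", "id", old, new)
print(len(params) - 1, "of", len(old) - 1, "fields sent")  # 36 of 360 fields sent
```

The statement carries 36 values plus the primary key instead of all 360 fields, and an unchanged record produces no statement at all.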

These are just some of the questions I would ask of CDC vendors. There are others, but this is a start. If you have CDC installed, I'd love to hear your comments as to how it helped your business, and what your headaches may have been in putting CDC in place. If you don't have CDC, and your business is fighting the concepts, I'd love to hear the arguments used against CDC (post anonymously if you wish).

Thank you very much for your time,
Daniel Linstedt
CTO, Myers-Holum, Inc
http://www.MyersHolum.com

  Posted by Dan Linstedt on December 4, 2006 4:50 AM

Comments

I have never seen CDC in a source system done properly. Usually it is something thrown together around log scraping or database triggers (shudder). I have certainly never seen any implementation get close to the 11 functions you mentioned.

I don't think people know what CDC can do and I don't see it pushed by vendors. I find the ETL vendors, for example, being quite passive about their CDC add-ons and hardly promoting them at all.
