Data Catalogs: Connecting Data Across the Enterprise
by Ron Powell
Originally published July 16, 2019
Claudia, you gave a presentation on data catalogs at TDWI in Chicago, explaining that today’s organizations often are suffering from analytic chaos and are wasting a lot of time and money trying to find the data and answers they need, not to mention the resulting redundancy in data and analytics. Let’s start by having you describe what a data catalog is.
Claudia Imhoff: Before we get to the definition, I’m actually going to read a quote that led me down the path of why we need a data catalog. “Any enterprise CEO really ought to be able to ask a question that involves connecting data across the organization, be able to run a company effectively, and especially to be able to respond to unexpected events. Most organizations are missing this ability to connect all the data together.” That was a quote from Tim Berners-Lee. He was the co-inventor of the World Wide Web, and I think he has his finger on the pulse of why a data catalog is needed.
So let’s talk about what a data catalog is. A really good definition came from Dave Stodder. He wrote an article for TDWI entitled “Data Cataloging Comes of Age.” In that article, he gave a very good definition. He wrote, “It is kind of a Rosetta stone that enables users, developers, and administrators to find and learn about data – and for information professionals to properly organize, integrate and curate data for users.” That’s really good. The data catalog is your guide to the often mystifying world of data, analytics and reports. It gives you information on how to navigate this world.
The other person I’d like to bring into this definition is Stan Christiaens, co-founder and CTO at Collibra. He gave me a terrific analogy of what a data catalog is. He starts out talking about Amazon. Amazon started out selling just books. What they had is probably the world’s biggest book catalog. Any book you wanted, you could undoubtedly find on Amazon. Well, Amazon figured out pretty quickly that it wasn’t just books that people wanted. Today Amazon sells almost anything you can think of, and the Amazon catalog tells you so much about each of the products it’s selling – the cost, availability, a detailed description and even reviews of the offerings and the manufacturer.
That’s what a data catalog should be. It started out as just data – understanding what the data was. But a data catalog is so much more than just data lineage. It now contains analytic products, context, meanings, and who to contact for more information. It describes how to find a particular set of data, an analytic, a report or a visualization – and can even suggest similar data, analytics or visualizations, much like Amazon does.
So the data catalog is the Amazon catalog for analytics.
That is a great definition. What are the specific benefits of having a data catalog?
Claudia Imhoff: There are actually many. I’m only going to talk about five. The first one is better, more accurate and reliable analytic results coming from the ability to quickly search and easily find the data, analytic or report that you need, which results in better business insights like faster time to market and so forth. That leads to a much more empowered analytics team and organization.
The next one is cost savings – tremendous savings from the reduced time to provide analysis. If I can find the data faster, I can analyze it faster. It means faster decisions and, of course, reduced efforts for locating the data. Everyone is an analyst and needs to make decisions at some point in their job, so that means they have to be able to find the data, perform the analysis and make the decision.
The third one is quickly locating existing data, queries and analysis. As I mentioned, a significant amount of time goes into just finding something that already exists. Unfortunately, if a business user can’t find the data, then they begin to reinvent the wheel, over and over and over. And that is wasted time, effort and money. So a data catalog can highlight redundancy and inconsistency and support the streamlining of a very complex environment.
The next one is to enable better compliance – and that’s compliance with internal policies like security and privacy and certainly external regulations like GDPR. The ability to do usage tracking, for example, can determine potential access or usage problems such as somebody using data that perhaps they shouldn’t use or they’re using it in a way that they shouldn’t be using it. That certainly reduces the growing concerns around data privacy and security.
The last one – the data catalog supports collaboration. We can annotate things. We can speed up the analysis tremendously by giving people tips on how to use the data or whether or not a report or analytic already exists. They can quickly find the experts in their areas of interest and avoid past mistakes. That’s huge in terms of a benefit.
Those are just a few of the benefits that I see in the data catalog.
Those are quite a few benefits. Do data catalogs incorporate advanced capabilities to enhance the benefits received by the enterprise?
Claudia Imhoff: Yes, indeed! They are taking full advantage of all the wonderful new advances and innovations – things like crowd-sourced collaboration. As I mentioned, data catalogs allow notes and suggestions to be entered regarding the usage of the assets. Users can share context and leverage experiences. They can even write warnings or reviews about the accuracy and usefulness of data. That is most helpful for someone who is new to the environment or for future users of the environment.
In terms of advanced capabilities, like machine learning and artificial intelligence, the data catalogs can now use those capabilities to make educated guesses about the logical meaning of names. Obscure, oddball, abbreviated field titles can be interpreted as monthly revenue or daily revenue, for example. They can also recommend alternative or similar data, analytics and reports, in addition to the closest match or the highest ranked. What’s really cool about data catalogs these days is that they can identify sensitive data – for example, a person’s social security number. Especially if we’re in a compliance situation, a GDPR situation, knowing that data is sensitive is critical.
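As an illustration (not part of the interview), the two ideas above, guessing the meaning of abbreviated field names and flagging sensitive data, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the abbreviation dictionary, the function names and the 80 percent threshold are all hypothetical, and a commercial catalog would rely on trained models rather than a fixed lookup table and a single regular expression.

```python
import re

# Hypothetical abbreviation dictionary; a real catalog would learn these
# mappings from usage and curation rather than hard-code them.
ABBREVIATIONS = {
    "mth": "monthly",
    "dly": "daily",
    "rev": "revenue",
    "cust": "customer",
    "amt": "amount",
}

# US social security numbers written as 3-2-4 digits, e.g. 123-45-6789.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def guess_business_name(column_name):
    """Expand an abbreviated column name into a readable guess."""
    tokens = [t for t in re.split(r"[_\W]+", column_name.lower()) if t]
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def looks_like_ssn_column(sample_values, threshold=0.8):
    """Flag a column as sensitive when most sampled values match the SSN pattern."""
    values = [v for v in sample_values if v]
    if not values:
        return False
    matches = sum(1 for v in values if SSN_PATTERN.match(v))
    return matches / len(values) >= threshold
```

With these assumptions, a column named MTH_REV would be labeled "monthly revenue", and a column whose sampled values mostly look like 123-45-6789 would be flagged for review. Sampling with a threshold, rather than requiring every value to match, tolerates the occasional blank or malformed entry.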
The other capability that they’re starting to use is natural language processing. For most users, natural language processing is the best for search interfaces. They can write something in plain English – not in some obscure SQL or other language – and be able to get results back.
Data catalogs are indeed taking full advantage of all of the advanced capabilities that we see today.
What is the difference between a tool-specific data catalog like the one in Tableau and an enterprise-wide data catalog like Alation?
Claudia Imhoff: They have similar capabilities – obviously they’re trying to figure out what the data is and where it is and what kinds of analytics exist in these environments. But the difference with a tool-specific embedded catalog like Tableau’s is that it works only within that particular environment. Certainly that improves the data usability, trust and sharing of that environment, but only of that environment.
Alation is considered a general purpose or enterprise data catalog. Alation scales beyond tactical use cases. For example, they can connect to enterprise-level metadata across technologies – across Tableau, Qlik, Spotfire, Cognos and anybody else. They can bring the data and analytics from any analytic tool into their environment, and that gives them the ability to look across the enterprise, not just within a single toolset.
Is it just the IT department that benefits from having a data catalog, or does a data catalog benefit business analysts, data scientists and others in the enterprise?
Claudia Imhoff: Good question. I feel as if the IT people are probably the best off. They already know the data. They’re very technically oriented already, so they pretty much know where the data is and many times they know what reports exist and so forth. I think the data catalog is most beneficial for the business users. A business user in finance, for example, may not even know that somebody in sales has exactly the right data for their particular query or has already created the right analytic or report, and they don’t need to do it again. There are these walls that we’ve put up arbitrarily between departments, and many times the left hand doesn’t know what the right hand is doing in the business. So to a certain extent the business users will get much more benefit out of a data catalog. That’s not to say that the IT department doesn’t get benefit from it. Of course they do, especially if you consider line of business IT people versus the IT department. Now they do have an enterprise view of what’s available.
So I think both will benefit from a data catalog.
So how would a company get started with a data catalog and do you have any best practices?
Claudia Imhoff: Oh, I have a bunch of them! The first one is probably the most important and that is to know your organization. You need to determine the use cases within your organization and prioritize them. I can talk about many different use cases. For example, are you dealing with compliance issues or governance issues? Are you dealing with self-service analytic issues? Are you dealing with data lineage issues? If you can identify and prioritize the use cases, then you can map them to the requirements – the data catalog requirements that are mandatory or “nice to have” to support these use cases. Once you’ve made that determination, you can begin the team selection and you can begin the development and education. But you must have the requirements first, and then you can pick your approach. Do you want to go with a tool-specific data catalog? Do you want to use the more enterprise-focused data catalog? There’s even a virtual offering in the market today.
Then you start selecting the vendors – and I provide a list at the end of my class of about 40 different companies that have different variations of data catalogs. You need to start developing that short list and perform the proof of concept, as always. Then you begin to bring the metadata into your selected vendor. You build a business glossary. You can also start to enrich, curate and organize the data catalog, and then ultimately you put the catalog into production. And then you’re good.
That’s the whole idea. I will suggest as a best practice that once it’s in production, you still need to monitor its usage. You need to develop the statistics, the reports and the dashboards that provide visibility into the users and the trends in data usage, and don’t lose sight of the enterprise. Many times, we are not creating this data catalog for a single set of users, but for the entire enterprise. So it’s just the first of many projects to implement a fully functioning data catalog.
Claudia, do you have any real-life examples or use cases from companies that have implemented a data catalog?
Claudia Imhoff: I sure do. I’m going to talk to you about two of them; but if you go to any data catalog company, you’re going to find many use cases. They are just sprinkled all over their websites.
The first one is from Alation, and it’s a self-service analytics problem that they had to solve. This is a large grocery store chain. They have supermarkets all over, and technology was playing an important role with self-checkout, shop from home through their website, using purchasing history to customize offers and coupons and so forth. The business problem was the last one – being able to build a personal strategy for every channel, starting with email for customers. No more generic paper coupon booklets that we all get in our mailboxes and immediately throw away. The challenge was that the team was unable to find the data that they needed for these specific analyses, and they couldn’t respond to external events quickly enough. They couldn’t document their findings so they had to start over for each search – which is a tremendous waste of time.
So they turned to a data catalog, and they selected Alation. They implemented the catalog and could easily and quickly find the specific data they needed. It certainly improved their productivity tremendously by discovering existing queries. No more reinventing the wheel. Collaboration was also a significant benefit. Multiple people can work on the analytic process. They can change it. They can comment on it. And it all is documented within the catalog itself to clarify why something was done in a certain fashion.
The interesting thing about this case study was that they had a very good return on investment. The total ROI varied between 5 and 30 to one. On the high end, they had a $30 return for every dollar spent on technology. And new users, of course, were quickly trained on the catalog, and that made a huge difference.
For the future, the company is migrating to an even larger data storage environment. They’re giving the catalog access to as much data as can possibly be accessed. They’re giving the team the ability to be more creative and innovative.
They’re going to begin to use their data catalog for their Hadoop environment, driving more automation and balancing processing costs with data storage costs. The data catalog’s data usage monitoring will also help keep data in the right environment. In other words, if it is something that is commonly accessed, then it will reside in the best performing technology. But if they find that some of the data is rarely used, it will be put on a less expensive storage media. So those monitoring statistics are going to become very important to them.
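As an illustration (not part of the interview), the usage-based placement described above can be sketched as a simple rule: count how often each dataset is accessed and keep the frequently used ones on the faster technology. The function name, the "hot" and "cold" labels and the threshold of 10 accesses are hypothetical choices for this sketch, not details from the case study.

```python
from collections import Counter


def assign_storage_tiers(access_log, datasets, hot_threshold=10):
    """Map each dataset to 'hot' or 'cold' storage based on access counts.

    access_log is an iterable of dataset names, one entry per access.
    datasets lists every known dataset, including ones never accessed.
    """
    counts = Counter(access_log)
    return {
        name: "hot" if counts[name] >= hot_threshold else "cold"
        for name in datasets
    }
```

Passing the full list of datasets separately from the access log matters here: a dataset that never appears in the log is exactly the kind of rarely used data that belongs on less expensive storage.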
The second case study is a data discovery and a data lineage one, and it’s from Io-Tahoe. The company is Centrica. I’m not familiar with it, but it’s a British multi-national energy and services company. They supply energy and services to more than 25 million accounts. They have 15,000 engineers and technicians, so it’s a big company. The business problem was that they had huge amounts of customer data globally and a huge amount of confidential internal data as well. And, of course, being British, they are now in a GDPR-driven business environment, so it meant that they had to figure out precisely what they had and where it resided. Implementing a data discovery strategy became a key focus. They had to prove to the regulatory body that Centrica applied science to resolve their data challenge. They needed significant data discovery, especially given the amount of information that they had. So, they brought in a data catalog. They used the data catalog’s sensitive data discovery capability, and they were able to process 30 billion records and 1.7 million columns of data very quickly in 1,200 databases and 1,500 apps. That’s pretty impressive. Centrica knows what sensitive data resides where. They can rank it. They can classify the apps by potential risk, and they can triage any remediation activities. They know where to look for any data, establish who has access to it and for what purpose, and determine who has responsibility for keeping data valid and pertinent.
The last thing I want to talk about involves some common traits that I found between all of the different case studies that I’ve read. The first one was the goal: a self-service analytics environment that provides easy access to the right data to everyone with minimal help from IT. That was the goal of almost every case study that I’ve read and reported on.
The second trait was the business problems. Management directive was for the enterprise to become more data-driven. Well, how do you do that without a data catalog? The solution had to scale, make knowledge accessible, and be easily understood by everybody. They needed technology that would let them share the business knowledge. And the starting point for each company was that recognition that their technological environments were huge, were complex, and that they had many different technologies distributed throughout the organization. The bottom line was they needed a cohesive architecture and a central place to store all of that tribal knowledge for finding the data in all of the technologies.
And the last common trait was the solution. All of the companies did choose a data catalog to solve the need to bring data and analytics closer to the business users to gather that tribal knowledge, to simplify analytics and scale the solution across the organization. The data catalog also enabled the newcomer – the new person coming into the organization – to enter the analytics environment with little or no understanding of it and quickly be able to use it correctly. That’s a huge plus. And all users got the complete lineage of any data – even finding experts based on how often a specific data element is used in queries.
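As an illustration (not part of the interview), finding experts from query history can be sketched as a counting exercise: for each data element, rank users by how many of their queries touched it. The log format of (user, elements) pairs and the function name are assumptions made for this sketch; a real catalog would parse its own query logs to extract this information.

```python
from collections import Counter, defaultdict


def find_experts(query_log, top_n=3):
    """Rank users per data element by how many of their queries reference it.

    query_log is an iterable of (user, elements) pairs, where elements is
    the collection of data elements a single query touched.
    """
    usage = defaultdict(Counter)
    for user, elements in query_log:
        for element in set(elements):  # count each element once per query
            usage[element][user] += 1
    return {
        element: [user for user, _ in counts.most_common(top_n)]
        for element, counts in usage.items()
    }
```

Counting each element once per query keeps a single query that mentions the same column many times from inflating a user's apparent expertise.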
Claudia, you’ve really explained why enterprises need a data catalog today, and the ROI and the benefits you’ve described are just amazing. You know, I’d be remiss if, before we conclude, we didn’t talk about the BBBT, which is the Boulder BI Brain Trust. Could you give us a little background on the BBBT and how it’s going?
Claudia Imhoff: Sure – I’d be happy to. It is my pride and joy in the IT world. The Boulder BI Brain Trust (BBBT) is a consortium of independent analysts, consultants and experts with an interest in business intelligence, analytics and the advanced capabilities and innovations that support those initiatives. It was formed about 13 years ago. We now have 240 members representing 25 countries. We bring vendors to our members. That’s the main goal of the BBBT. What I have found in being an industry analyst is that a lot of people were ignored by vendors. If they weren’t famous enough or didn’t have a big enough name, they were ignored by vendors. They didn’t get the briefings. They didn’t get to see what was new. And I decided to create the BBBT to give those people – those independent analysts, consultants and experts – a voice that would bring the vendors to them and show how important it is to keep these critical members of the BI and analytics environment up to date on what technology is doing. So the vendors come in, and they give us about a three-hour deep dive, including a demo – we must have a demo – to tell us everything that we could possibly know about their technology. And it’s a two-way street. We ask a lot of questions. We give them advice. They show us what they have. They answer our questions. I love the BBBT, and if anyone is interested, they can go to www.bbbt.us and become a member, or become a subscriber if they’re not an independent analyst, consultant or expert. Both are free – so I hope there is a lot of interest in becoming a member or a subscriber. Thanks for asking me, Ron!
Thank you so much, Claudia, for sharing your expertise with us today.