Data Lineage: An Important First Step for Data Governance

Originally published August 1, 2013

Ninety percent (90%) of the world’s data has been created in the last two years alone. This explosion of data is the result of the ever-growing number of systems and automation at all levels in organizations of all sizes. While this data has made it easier to access information in the working world, it has also led to a new set of problems.

Picture this: A leading finance firm ran into trouble when its key measure, Net Proceeds from Sale, showed different numbers in different divisions. A closer look revealed that the definition of the measure differed across key divisions. Despite having an enterprise data warehouse in the organization, there were multiple reporting systems, and the numbers varied due to differences in business rules and filters. Lack of business taxonomy standardization was just the beginning. The problems went even deeper: despite the common data source, the data flows and destination reporting systems differed, and this led to serious issues with data integrity.

This is not an exceptional case. With a growing number of systems, and with the same systems acting as both source and target, information quality and governance are becoming a huge challenge. Figure 1 shows a typical data flow for the data warehouse system of a medium to large organization.




Figure 1: Typical Data Flow

Users need “clean” and conformed data to make informed decisions. When users do not trust the data, they move away from using information systems. The solution to problems of data integrity, uniformity and correctness is mature data governance, and the first step to achieving it is to get a visual on the existing data flow and data lineage.

A visual representation of data lineage helps to track data from its origin to its destination. It explains the different processes involved in the data flow and their dependencies. Metadata management is the key input to capturing enterprise data flow and presenting data lineage. It consists of metadata collection, integration, usage and repository maintenance, and it presents the data lineage through the metadata abstraction layer. The metadata captured here is not specific to a particular ETL flow; the metadata repository should cover all the data that flows from source to target.
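As a rough illustration of tracking data from its destination back to its origin, the sketch below treats a metadata repository as a list of source-to-target records, each with the transformation rule applied. The element names, the record layout and the functions `build_upstream_index`, `trace_to_origin` and `origins` are all hypothetical, chosen only for this example; a real repository would be populated from ETL and reporting tool metadata.

```python
from collections import defaultdict

# Hypothetical metadata repository: each record states that `target` is
# derived from `source` by applying `rule`. All names are illustrative.
LINEAGE_RECORDS = [
    ("crm.sales.gross_amount", "staging.sales.gross_amount", "copy"),
    ("crm.sales.fees", "staging.sales.fees", "copy"),
    ("staging.sales.gross_amount", "dw.fact_sales.net_proceeds", "gross_amount - fees"),
    ("staging.sales.fees", "dw.fact_sales.net_proceeds", "gross_amount - fees"),
    ("dw.fact_sales.net_proceeds", "report.finance.net_proceeds_from_sale", "sum by quarter"),
]


def build_upstream_index(records):
    """Map each target element to the (source, rule) pairs that feed it."""
    upstream = defaultdict(list)
    for source, target, rule in records:
        upstream[target].append((source, rule))
    return dict(upstream)


def trace_to_origin(element, upstream, seen=None):
    """Return every hop (source, target, rule) between `element` and its origins."""
    seen = set() if seen is None else seen
    hops = []
    for source, rule in upstream.get(element, []):
        hop = (source, element, rule)
        if hop in seen:  # guard against cycles and repeated paths
            continue
        seen.add(hop)
        hops.append(hop)
        hops.extend(trace_to_origin(source, upstream, seen))
    return hops


def origins(element, upstream):
    """Return the elements with no upstream feed, i.e. the original sources."""
    hops = trace_to_origin(element, upstream)
    if not hops:
        return [element]
    sources = {source for source, _, _ in hops}
    return sorted(s for s in sources if s not in upstream)


if __name__ == "__main__":
    index = build_upstream_index(LINEAGE_RECORDS)
    for source, target, rule in trace_to_origin("report.finance.net_proceeds_from_sale", index):
        print(f"{source} -> {target}  [{rule}]")
```

Because every hop carries its rule, the same records can answer both "where did this report field come from?" and "what was done to it along the way?".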

The metadata abstraction layer brings this information together so that complete lineage information is available and can be consumed by different user groups. In an enterprise, the user groups for data lineage consist of business users, senior IT management, the data governance team, the data modelers, the business analysts, the development team and the support team. The seamless presentation of data lineage information can help them all work more efficiently. Data lineage captured from metadata can provide a data flow view from origin to destination (see Figure 2) that:
  • Lists rules and transformations for each flow
  • Enables “what-if analysis” for a change in an ETL flow
  • Helps identify the right source and optimum data flow for any new requirement
  • Provides the meaning of a specific field in a report
  • Eliminates data redundancy and ensures completeness
  • Provides information on report usage
  • Identifies data quality index associated with a data element, thus increasing the trust factor
  • Provides operational metadata
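
The “what-if analysis” item can be pictured as a downstream walk over lineage metadata: given an element that is about to change, list everything that depends on it. The following is a minimal sketch under assumed inputs; the edge list, the element names and the function `impacted_by` are hypothetical and stand in for whatever a real metadata repository exposes.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (source element -> target element).
EDGES = [
    ("staging.sales.fees", "dw.fact_sales.net_proceeds"),
    ("staging.sales.gross_amount", "dw.fact_sales.net_proceeds"),
    ("dw.fact_sales.net_proceeds", "report.finance.net_proceeds_from_sale"),
    ("dw.fact_sales.net_proceeds", "report.sales.quarterly_summary"),
    ("staging.sales.region", "report.sales.quarterly_summary"),
]


def impacted_by(changed_element, edges):
    """Breadth-first walk downstream from `changed_element`.

    Returns the dependent elements in the order they are reached,
    nearest dependents first.
    """
    downstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)

    impacted = []
    visited = {changed_element}
    queue = deque([changed_element])
    while queue:
        current = queue.popleft()
        for target in downstream.get(current, []):
            if target not in visited:  # each element is reported once
                visited.add(target)
                impacted.append(target)
                queue.append(target)
    return impacted


if __name__ == "__main__":
    for element in impacted_by("staging.sales.fees", EDGES):
        print("would be affected:", element)
```

Run against the sample edges, a change to the hypothetical `staging.sales.fees` column reaches the warehouse measure first and then both reports that consume it, while a change to `staging.sales.region` touches only one report.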


Figure 2: Data Flow View from Origin to Destination

An effective data governance program continuously needs information about the existing data. A visual representation of data lineage provides this information and raises the maturity of the data governance program. It also helps data governance by providing the following benefits:
  • An end-to-end view eases identification of business rule discrepancies and data incompleteness
  • An end-to-end view improves data governance in response to regulations like Sarbanes-Oxley, HIPAA and Basel II
  • BI data lineage shows who accesses what information and can prevent possible security breaches and exposure of sensitive data
  • BI data lineage eliminates redundant data flows and the unnecessary introduction of new reporting systems
  • BI data lineage helps data stewards make decisions and react to issues before they become a problem
  • BI data lineage ensures that the introduction of a new system is controlled and enforces data governance
  • BI data lineage enables effective reuse of existing information
  • BI data lineage helps to define strategies to improve data quality
  • BI data lineage facilitates effortless root-cause analysis
  • BI data lineage improves business and IT collaboration to run the data management programs
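
The first benefit, spotting business rule discrepancies, can be sketched as a comparison of the rule each reporting system applies to the same measure. This is an illustration only: the rule records, system names and the functions `normalize` and `find_discrepancies` are hypothetical, and the comparison is purely textual, so two differently written but equivalent rules would still be flagged.

```python
from collections import defaultdict

# Hypothetical rule metadata: (reporting system, measure, business rule).
RULES = [
    ("finance_reports", "Net Proceeds from Sale", "gross_amount - fees"),
    ("sales_reports", "Net Proceeds from Sale", "gross_amount - fees - taxes"),
    ("finance_reports", "Gross Sales", "sum(gross_amount)"),
    ("sales_reports", "Gross Sales", "SUM(gross_amount)"),
]


def normalize(rule):
    """Ignore case and whitespace so cosmetic differences are not flagged."""
    return "".join(rule.lower().split())


def find_discrepancies(rules):
    """Return measures that are defined by more than one distinct rule.

    The result maps each such measure to {normalized rule: [systems using it]}.
    """
    by_measure = defaultdict(lambda: defaultdict(list))
    for system, measure, rule in rules:
        by_measure[measure][normalize(rule)].append(system)
    return {
        measure: {rule: sorted(systems) for rule, systems in variants.items()}
        for measure, variants in by_measure.items()
        if len(variants) > 1
    }


if __name__ == "__main__":
    for measure, variants in find_discrepancies(RULES).items():
        print(measure)
        for rule, systems in variants.items():
            print(f"  {rule}: {', '.join(systems)}")
```

In the sample data only Net Proceeds from Sale is reported, since the two Gross Sales rules differ only in letter case.
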

As the volumes of data multiply, information about data becomes even more critical. Data lineage methodology works like an x-ray for data flow in an organization. It captures information from source to destination, along with the various processes and rules involved, and shows how the data is used. Knowing what data is available, along with its quality, correctness and completeness, leads to a mature data governance process.

  • Saurabh Jain
    Saurabh is a Senior Director with Mindtree's Data and Analytics Solutions practice in New Jersey. He has more than 13 years of industry experience and has worked in a wide range of roles in the business intelligence (BI) and data warehousing space, including end-to-end BI rollouts, master data management implementation, BI consulting, and requirements elicitation and analysis. In his current role, Saurabh is focusing on large BI implementations, emerging technologies and analytics.
  • Binu Thomson
    Binu is a Principal Consultant at Mindtree's Data and Analytics Solutions practice in New Jersey. He has more than 10 years of industry experience and has worked in a wide range of roles in the business intelligence (BI) and data warehousing space, such as end-to-end BI rollouts, data architecture, source system analysis, ETL architecture and data modeling. In his current role, Binu is working with the enterprise data architect team in a leading bank holding company, focusing on building enterprise artifacts for data integration and management.
