Data Selection for Business Analytics

Originally published August 25, 2011

When the key stakeholders in the organization have agreed to pursue a business analytics strategy, the strongest urge is to immediately embark upon a quest to evaluate and acquire tools. The business intelligence (BI) tools acquisition process is a well-defined one, with clear goals, tasks, and measurable outcomes. The problem, though, is that a successful acquisition process gives only the appearance of progress: when you are done, you have a set of tools, but you still have no analysis.

As we have examined in previous articles, the alternative approach concentrates first on the business results, then considers the analyses to be performed, and then identifies the data sets that are needed to perform those analyses. This will largely center on the information that reflects the way the business is being run, namely the data sets that capture the operational/transactional aspects of the business process. While some organizations already use these data sets to establish baseline measures of performance, additional inputs may be needed to gain visibility into opportunities for improvement, such as geographic data, demographic data, and a multitude of other data sets and streams (drawn from internal as well as external sources) that may add value.

Identifying the data requirements for your business analytics needs will guide the BI process beyond the selection of end-user tools. Since data acquisition, transformation, alignment, and delivery all factor into the ability to provide actionable insight, one must treat data selection and acquisition as being just as important as tools acquisition in developing the business analytics capability. Here are some concepts to keep in mind when identifying data requirements and assessing the suitability of data sources:

  • Measured variables: Providing performance reports and scorecards or dashboards with specific performance measures implies the collection of information needed for the computation of reported variables; enumerating those calculated variables is the first task, as the dependencies will be tracked backwards to identify which data sources can satisfy the needs of the downstream consumers.

  • Qualifiers and hierarchies: Having identified the variables to be scrutinized, the next step is to determine the different ways they could be sliced and diced. For example, if we’d like to monitor customer complaints as our main measured variable, there might be an interest in understanding customer complaints by time period, by geographic location, by customer type, by customer income, etc. These criteria are not only distinct from each other; there are also relationships and hierarchies within each qualification facet. Geographic regions, for instance, can be mapped at a coarse level of precision (“continent”) or a fine level of precision (“ZIP+4”), and the finer levels nest within the coarser ones.

  • Computations: Knowing the desired measures also drives the determination of how those measures are calculated: what are the inputs, how many inputs are there, and are the calculations direct, or are aggregations or reductions (such as averages or sums) involved? What are the further dependent variables? For example, “total corporate sales” may be accumulated from the totals for each area of the business, each of which in turn relates to specific product family sales. Not only does this require understanding how the measures are rolled up and computed, but it also exposes the chain of dependent variables looking back through the information production flow.

  • Business process dependencies: The business processes that create or update the dependent variables are of interest in order to determine whether there are data subsystems in which dependent variables live. By identifying which business processes touch the dependent variables, one can begin to identify candidate source systems for the analytical environment.

  • Data accessibility and availability: Just because dependent variables are managed within siloed business applications does not mean that the data instances themselves are available for use. One must not only find the candidate data sources, but also determine any limitations on their availability and accessibility. For example, the data sets may be classified as “protected personal information,” in which case additional security techniques may need to be applied to ensure that private data is not inadvertently exposed, that those seeing the data have the appropriate access rights, and that in the event of an exposure, the right encryption has been applied to reduce the usability of the exposed data.

These concepts provide a good starting point for the data selection process. By the end of these exercises (which may require multiple iterations), one may be able to identify source applications whose data subsystems contain instances that are suitable for integration into a business analytics environment. Yet there are still other considerations: Just because the data sets are available and accessible does not mean they can satisfy the analytics consumers’ needs, especially if the data sets are not of a high enough level of quality. Therefore, the next step, which will be the topic of an upcoming article, is to assess the data quality expectations and apply a validation process to determine whether the quality levels of candidate data sources can meet the collected downstream user needs.
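
As an illustration only, the backward trace from measured variables to candidate source systems might be sketched in Python as follows. Every name here is a hypothetical stand-in rather than something taken from a real system: the variable catalog, the source systems ("order_entry", "call_center", "crm"), and the restriction note are all assumptions. The sketch walks the chain of dependent variables from a reported measure down to the directly captured variables, groups those by the system that owns them, and then flags any known accessibility limits.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Variable:
    """A measure or input; `inputs` names the variables it is computed from."""
    name: str
    inputs: list[str] = field(default_factory=list)
    source_system: str | None = None  # set only for directly captured variables


# Hypothetical catalog, loosely following the article's examples.
CATALOG = {
    v.name: v
    for v in [
        Variable("total_corporate_sales", inputs=["area_sales"]),
        Variable("area_sales", inputs=["product_family_sales"]),
        Variable("product_family_sales", inputs=["order_line_amount"]),
        Variable("order_line_amount", source_system="order_entry"),
        Variable("complaints_by_region",
                 inputs=["complaint_record", "customer_address"]),
        Variable("complaint_record", source_system="call_center"),
        Variable("customer_address", source_system="crm"),
    ]
}

# Hypothetical accessibility limits, keyed by source system.
RESTRICTED = {"crm": "protected personal information"}


def candidate_sources(measure: str,
                      catalog: dict[str, Variable]) -> dict[str, set[str]]:
    """Trace dependencies backwards; map each source system to its raw variables."""
    sources: dict[str, set[str]] = {}
    seen: set[str] = set()
    stack = [measure]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already visited via another dependency path
        seen.add(name)
        var = catalog[name]
        if var.source_system is not None:
            sources.setdefault(var.source_system, set()).add(name)
        stack.extend(var.inputs)
    return sources


def accessibility_report(sources: dict[str, set[str]]) -> dict[str, str]:
    """Attach any known access limitation to each candidate source system."""
    return {s: RESTRICTED.get(s, "no known restriction") for s in sources}


if __name__ == "__main__":
    found = candidate_sources("complaints_by_region", CATALOG)
    for system, note in sorted(accessibility_report(found).items()):
        print(system, sorted(found[system]), note)
```

In practice the catalog would be assembled from interviews and metadata rather than written by hand, and the restriction notes would come from whoever governs each source. The point of the sketch is only that enumerating the measures first makes the later steps mechanical.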

Recent articles by David Loshin


