Blog: William McKnight http://www.b-eye-network.com/blogs/mcknight/

Hello and welcome to my blog! I will periodically be sharing my thoughts and observations on information management here in the blog. I am passionate about the effective creation, management and distribution of information for the benefit of company goals, and I'm thrilled to be a part of my clients' growth plans and connect what the industry provides to those goals. I have played many roles, but the perspective I come from is benefit to the end client. I hope the entries can be of some modest benefit to that goal. Please share your thoughts and input to the topics.

Copyright 2012 Thu, 10 Nov 2011 02:23:28 -0700

Thoughts from the Master Data Management Course, Days 2 & 3

I have completed teaching the Master Data Management Course in Sydney.  Thank you to my wonderful students.  Some of the memorable learning over the last 2 days centered on these points:

  • Master data, with MDM, can be left where it is or, more commonly, placed in a separate hub
  • Product MDM tends to be more Governance-heavy than Customer
  • In a ragged hierarchy, a node can belong to multiple parents
  • Be selective about the fields you apply change management to
  • Customer lifetime value should ideally look forward, not behind, and should use profit instead of spend
  • Customer analytics can be calculated in MDM or CRM, the debate continues
  • Complex subject areas require multiple group input
  • Critical elements in MDM data security include confidentiality, integrity, non-repudiation, authentication and authorization
  • Syndicated data is becoming increasingly important and MDM is the most leverageable place to put that data
  • The web is also a source of syndicated data
  • Data quality is a value proposition
  • Do you have a data problem or a customer data problem or a product data problem?  It affects your tool selection
  • Care about what matters to your shop when you evaluate vendors
  • The program methodology should be balanced between rigor and creativity
  • In the design phase, you develop your test strategy, data migration plan, non-functional requirements, functional design, interface specifications, workflow design and logical data model
  • Don't mess up by staffing the team with only technicians
  • The purpose of the data conversion maps is to document the requirements for transforming source data into target data
  • Organizational change management is highly correlated to project success
  • Stakeholder management is not a one-time activity
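The lifetime value point above can be made concrete with a small sketch. It assumes a hypothetical list of expected yearly profits for a customer, plus a made-up retention probability and discount rate; none of these figures or names come from the course, and real models are usually richer than this.

```python
def forward_looking_clv(expected_annual_profit, retention_rate, discount_rate):
    """Sum expected future profit, weighted by survival and discounted to today.

    All inputs are illustrative assumptions, not course material.
    """
    clv = 0.0
    survival = 1.0  # probability the customer is still active in a given year
    for year, profit in enumerate(expected_annual_profit, start=1):
        survival *= retention_rate
        clv += survival * profit / (1 + discount_rate) ** year
    return clv


# Hypothetical customer: profit (not spend) expected over the next three years.
value = forward_looking_clv([100.0, 100.0, 100.0], retention_rate=0.5, discount_rate=0.0)
print(value)
```

The inputs are projections rather than history, and they are profit rather than spend, which is the distinction the bullet draws.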
If you're interested in hosting the class in 2012, please contact me.

 

http://www.b-eye-network.com/blogs/mcknight/archives/2011/11/thoughts_from_t.php Thu, 10 Nov 2011 02:23:28 -0700
Lessons from the Master Data Management Course

Day 1 of the 3-day Master Data Management course is in the books here in beautiful Sydney, Australia.  It's been an outstanding day of learning and sharing about the emerging, important discipline of master data management.

Here are my most vivid recollections from today:

  • MDM is highly misunderstood due to the wide range of benefits provided
  • MDM is part of major changes in how we handle data and respond to information chaos, which will get more complex before it gets less complex
  • MDM can and should support Hadoop data and all manner of data marts
  • Lack of a subject-area orientation in the culture is a challenge for MDM
  • Some MDM is analytical, most is operational
  • MDM subject areas can mix or hybrid across factors of analytical/operational, physical/virtual and the degree of governance needed
  • Often many systems build components of a master record, few work on the same attributes
  • MDM returns are in the improved efficacy of projects targeting business objectives
  • To do a return on investment justification, all project benefits must be converted to cash flow
  • MDM should be tightly aligned with successful projects, creating benefits for the MDM program
  • Personal motivators must be understood and are important in building an MDM roadmap
  • Vendor solutions may be subject area-focused or support multiple subject areas
  • Tactical MDM supports an individual project, enterprise MDM supports the organization for the subject area
  • Strong project management discipline can be more important in that role than MDM domain knowledge
  • The data warehouse will remain relevant in organizations, but many of its functions are moving operational, such as those to MDM
  • You can mix a subject area, with the hub persisting frequently used data elements and pointing to source systems for the rest of the data
  • Do not count on the data warehouse for what MDM provides
  • Governance workflows provide the ability to escalate if actions are not taken in a timely manner
  • External sources like EPCID are becoming relevant in the product subject area

More to come on days 2 and 3.


http://www.b-eye-network.com/blogs/mcknight/archives/2011/11/lessons_from_th.php Master Data Management Sat, 05 Nov 2011 23:57:02 -0700
Microsoft and Hippy-Made Hadoop: A Marriage Made with Windows

This week, at the PASS Summit, Microsoft unveiled its inevitable "big data" strategy.  The world of big data is the new uncharted land in information management and the big vendors are jumping on board.  "New economy" giants like eBay, Twitter, Facebook and Google are the early adopters - and many even built the big data tools that everything is based on. 

 

It would be too easy to dismiss big data as a Valley-only phenomenon, and you shouldn't.  Microsoft's information management tools serve perhaps the widest ranging set of clients anywhere.  They've either made their move to "keep up with the Joneses" (Oracle had some big data announcements last week) or there must be some Global 2000 budgets in it.  The industry will not thrive without some of the latter and that's what I'm betting on.

 

There's vast utility in unstructured and machine-generated data (somehow tweets count in this category) and many reasons, starting with monetary, why, once a company finds some use for it, they will choose a big data tool like Hadoop rather than a relational database management system to store the data.  Yes, and even live with the tradeoffs of lack of ACID compliance, lack of transactions, lack of SQL (although this is eroding by the day), lack of schema sharing, the need to user-assemble (although this is also eroding) and node failures being a way of life.  Indeed, the "secret sauce" of Hadoop is the distribution of data and recovery from node failure - RAID-like, but less costly.

 

It's better to play with this "hippy developed" (as one skeptic referred to it) Hadoop than ignore it at this point.  That's what Microsoft has done.  Microsoft is working to deploy Hadoop on Windows and cloud-based Azure.  This could really work in Microsoft's big data land grab.  It's a hedge against going too hard-core into the open-source world.  It's comfortable Windows combined with Hadoop.  For the many, many fence-sitters out there, this is good timing.  Many want to trace movements of physical objects, trace web clicks and other Web 2.0 activity.  They want to do this without sacrificing enterprise standards they are used to with products like Windows and its management toolset.

 

http://www.b-eye-network.com/blogs/mcknight/archives/2011/10/microsoft_and_h.php Microsoft Sat, 15 Oct 2011 14:12:39 -0700
Introducing Teradata Columnar

Potentially Teradata's most significant enhancement in a decade will be on display next week at the Teradata Partners conference.  And that is Teradata Columnar.  Few leading database players have altered the fundamental structure of having all of the columns of the table stored consecutively on disk for each record.  The innovations and practical use cases of "columnar databases" have come from the independent vendor world, where it has proven to be quite effective in the performance of an increasingly important class of analytic query.  Here is the first in a series of blogs where I discussed columnar databases. 

Teradata obviously is not a "columnar database" but would now be considered a hybrid, exhibiting columnar features upon those columns that are chosen to participate.  Teradata combines columnar capabilities with a feature-rich and requirements-matching DBMS already deployed by many large clients for their enterprise data warehouse.  Columnar is available in all Teradata platforms - Teradata Active Enterprise Data Warehouse, Teradata Data Warehouse Appliance, Teradata Extreme Data Appliance and Teradata Extreme Performance Appliance.

Teradata's approach allows for the mixing of row structure, column structures and multi-column structures directly in the DBMS in "containers."  The physical structure of each container can also be in row- (extensive page metadata including a map to offsets) which is referred to as "row storage format" or columnar- (the row "number" is implied by the value's relative position) format.  All rows of the table will be treated the same way, i.e., there is no column structure/columnar-format for the first 1 million rows and row structure for the rest.  However, (row) partition elimination is still very alive and, when combined with column structures, creates I/O that can now retrieve a very focused set of data for the price of a few metadata reads to facilitate the eliminations.

Each column goes in one container.  A container can have one or multiple columns.  Columns that are frequently accessed together should be put into the same container.  Physically, multiple container structures are possible for columns with a large number of rows.
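To make the container idea concrete, here is a minimal sketch. It is not Teradata's implementation or syntax: the table, column names, container names and helper functions are all hypothetical, and the only things carried over from the description above are that each column lives in exactly one container, a container can hold several columns, and a value's position implies its row number.

```python
# Hypothetical three-row table; column names are made up for illustration.
rows = [
    {"id": 1, "state": "TX", "balance": 10.0},
    {"id": 2, "state": "CA", "balance": 20.0},
    {"id": 3, "state": "TX", "balance": 30.0},
]

# Each column is assigned to exactly one container; a container may hold several.
containers = {
    "c1": ["id"],
    "c2": ["state", "balance"],  # assumed to be frequently accessed together
}


def build_containers(rows, containers):
    """Store each container's values in row order, so position implies row number."""
    return {
        name: [tuple(row[col] for col in cols) for row in rows]
        for name, cols in containers.items()
    }


def read_columns(stored, containers, wanted):
    """Touch only the containers that hold at least one wanted column."""
    touched = [name for name, cols in containers.items() if set(cols) & set(wanted)]
    return touched, {name: stored[name] for name in touched}


stored = build_containers(rows, containers)
touched, data = read_columns(stored, containers, wanted=["state"])
print(touched)  # only the container holding "state" is read
```

A query that needs only state never reads the container holding id, which is the I/O saving that column structures are after.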
http://www.b-eye-network.com/blogs/mcknight/archives/2011/09/introducing_ter.php Business Intelligence/Data Warehousing Fri, 30 Sep 2011 15:17:11 -0700
NoSQL is Yes Key-Value, Document, Column and Graph Stores

NoSQL solutions are solutions that do not accept the SQL language against their data stores.   Ancillary to this is the fact that most do not store data in the structure SQL was built for - tables.  Though the solutions are "no SQL", the idea is that "not only" SQL solutions are needed to solve information needs today.  The Wikipedia article states "Carlo Strozzi first used the term NoSQL in 1998 as a name for his open source relational database that did not offer a SQL interface".  Some of these NoSQL solutions are already becoming perilously close to accepting broad parts of the SQL language.  Soon, NoSQL may be an inappropriate label, but I suppose that's what happens when a label refers to something that it is NOT.


So what is it?  It must be worth being part of.  There are currently at least 122 products claiming the space.  As fine-grained as my information management assessments have had to be in the past year routing workloads across relational databases, cubes, stream processing, data warehouse appliances, columnar databases, master data management and Hadoop (one of the NoSQL solutions), there are many more viable categories and products in NoSQL that actually do meet real business needs for data storage and retrieval.

 

Commonalities across NoSQL solutions include high volume data which lends itself to a distributed architecture.  The typical data stored is not the typical alphanumeric data.  Hence the synonymous nature of NoSQL with "Big Data".  Lacking full SQL generally corresponds to a decreased need for real-time query.  And many use HDFS for data storage.  Technically, though columnar databases such as Vertica, InfiniDB, ParAccel, InfoBright and the extensions by Teradata 14, Oracle (Exadata), SQL Server (Denali) and Informix Warehouse Accelerator deviate from the "norm" of full-row-together storage, they are not NoSQL by most definitions (since they accept SQL and the data is still stored in tables).

 

They all require specialized skill sets quite dissimilar to traditional business intelligence.  This dichotomy in the people who perform SQL and NoSQL within an organization has already led to high walls between the two classes of projects and an influx of software connectors between "traditional" product data and NoSQL data.  At the least, a partnership with Cloudera and a connector to Hadoop seems to be the ticket to claiming Hadoop integration.

http://www.b-eye-network.com/blogs/mcknight/archives/2011/09/nosql_is_yes_ke.php NoSQL Wed, 14 Sep 2011 18:22:15 -0700
Self-Service Business Intelligence vs. Outsourced BI

In business intelligence, we all know and espouse the fact that data integration is the most time-consuming part of the build process.  This is undeniably true.  However, if one were to look at the long-term (me: not a full-time analyst, but observant of the implementations I've been in for a full lifecycle over the past few years), I believe most long-term costs clearly fall into the data access layer.   This is where the reports, dashboards, alerts, etc. are built.


This is true for a variety of reasons, not the least of which is a short-cutting of the data modeling process, which, when done well, minimizes the gap between design and usage.  This aspect of BI is receiving only modest recognition.  The focus instead is on a new breed of disruptive data access tools that are architecturally doing side-runs around the legacy tools in how they use memory and advanced visualization.  Specifically, these tools are Tableau, QlikTech, and Spotfire.  These tools attack a very important component of the long-term cost of BI - the cost of IT having to continue to do everything post-production.


There are a few areas where these tools are getting recognition:


  1. They perform faster - this allows a user, in the 30 minutes of time he has to do an analysis, to get to a deeper level of root cause analysis
  2. They are seen as more intuitive - this empowers the end user so they can do more, versus getting IT involved, which stalls a thought stream and introduces delay which can obliterate the relevancy
  3. They visualize data differently - I won't expound on it here and I don't think it's necessarily due to the tool architecture, but many claim it's better

So why do I bring it up in opposition to outsourced business intelligence?  Because to truly set up business intelligence to work in a self-service capacity, you would give extra weight to working closely with users in the build process, which is a lever that gets deemphasized in outsourced BI.  You would see business intelligence as less a technical exercise and more as an empowerment exercise.   You would keep the build closer to home, where the support would be.  And you would not gear up an offshore group to handle a laborious process of maintaining the data layer over the years in the way users desire.  You would invest in users - culture, education, information use - instead of outsourced groups.  And this is just what many are doing now. 

http://www.b-eye-network.com/blogs/mcknight/archives/2011/08/self-service_bu_1.php Business Intelligence/Data Warehousing Sun, 14 Aug 2011 10:52:32 -0700
Perception Change Follows Product Line Updates at Teradata

I was at Teradata Influencer's Days this week, an annual 3-day invitation-only event where Teradata catches us up on the latest offerings and company strategy.  We were in Las Vegas this year and we had a fascinating visit to the Switch data center where eBay stores their Teradata EDW, Hadoop clusters and another large system where the thousands of jobs run daily to keep eBay on top of their game.

Teradata is undoubtedly a long-standing leader in information management.  They have been preparing for the heterogeneous future (or is it the heterogeneous present?) and diversifying their offerings for several years.  Teradata's moves should have everyone reconsidering any notion of Teradata as a high-hurdle company that wants you to put everything online in a single data warehouse.  And it seems to be working.  Teradata released earnings Wednesday showing revenue growth of 24 percent in 2Q11.

Aster Data - A "big data" acquisition for the management of the multi-structured data with patented SQL/MapReduce

Active Data Warehousing - Abilities built into the Teradata 5000 EDW series that support and promote fast, active, intra-day loading of the data warehouse as opposed to a batch-loaded warehouse

Aprimo - Marketing applications that put the information to work and a software-as-a-service model to build some of their future on

Master Data Management - The "system of record" for subject areas that need governance and need to be integrated in real-time, operationally

Hot-Cold Data Placement - Less-used data placed into lower-cost storage, with accompanying degraded performance

Appliance Family - Pre-loaded machines of varying specification according to workload that can get your data access up and running quickly; some are using the appliance for their data warehouse

I noted that something could still be done where many analytics are heading: the operational world.  Something in complex event processing would further an information ecosystem.  

We discussed Teradata 14 and it will continue this theme of providing the range of platform options necessary today.

Now that some of these acquisitions are assimilated, we are seeing a reflection in the marketing.  With "Teradata Everywhere" as the imperative, the reference architecture is now the "Analytic Ecosystem" which is an environment that includes, but is not all-consumed by, the Enterprise Data Warehouse.  Consider the market sizes of the markets Teradata is going after, as shared by Teradata: Data Warehousing ($27B), Business Applications ($15B) and Big Data Analytics ($2B).  Teradata is embracing the heterogeneous future as a focused leader in information management.

http://www.b-eye-network.com/blogs/mcknight/archives/2011/08/perception_chan.php Business Intelligence/Data Warehousing Sat, 06 Aug 2011 08:45:57 -0700
Self-Service Business Intelligence: Come and Get It

What do you think about when you hear the term "self-service"?  To some, it's a positive term connoting the removal of barriers to a goal.  I can, for example, go through the self-service checkout line at the grocery store and I'm limited only by my own scanning (and re-scanning) speed to getting out the door.  However, as we've seen with some chains eliminating self-service lines recently, self-service is not always desired by either party.  To some, "self-service" is a negative term, euphemistically meaning "no service" or "you're on your own."

In Claudia Imhoff and Colin White's excellent report, "Self-Service Business Intelligence: Empowering Users to Generate Insights", self-service BI is defined as "the facilities within the BI environment that enable BI users to become more self-reliant and less dependent on the IT organization."

If you put up a poor data warehouse, it is a copy of operational data, only lightly remodeled from source and usually carrying many of the same data quality flaws from the source.  It solves a big problem - making the data available - but after this copy of data, the fun begins with each new query being a new adventure into data sources, tools, models, etc.  What has inevitably happened in some environments is that users take what they need, like it's raw data, and do the further processing required for the business department or function. 

This post-warehouse processing is frequently very valuable to the rest of the organization, if the organization could only get access to it.  However, data that is generated and calculated post-data warehouse has little hope of reaching any kind of shared state.  This data warehouse is not ready for self-service BI.

According to Imhoff and White, the BI environment needs to achieve four main objectives for self-service BI:

1.       Make BI tools easy to use

2.       Make BI results easy to consume and enhance

3.       Make DW solutions fast to deploy and easy to manage

4.       Make it easy to access source data

To achieve these goals, you need a solid foundation and solid processes.  Take account of your BI environment.  While IT and consultancy practices have coined "self-service business intelligence" to put some discipline to the idea of user empowerment, some of it is mere re-labeling of "no service" BI and does not attain and maintain a healthy relationship with the user community and healthy exploitation of the data produced in the systems.  We all know that IT budgets are under pressure, but this is not the time to cut vital services of support that maintain multi-million dollar investments.

http://www.b-eye-network.com/blogs/mcknight/archives/2011/07/self-service_bu.php Business Intelligence/Data Warehousing Thu, 28 Jul 2011 19:07:19 -0700
Reducing Credit Card Fraud

I was part of one of the pioneer credit card fraud detection projects.  It was at Visa and, together with all the similar projects taking advantage of early-stage data mining that were going on about the same time throughout the financial industry, drove credit card fraud down dramatically to all-time lows.  In recent years, as the technology changes, fraud has increased once again.  The financial industry has the online problem to deal with in addition to the ramifications from identity theft and the card skimming that was once falling.  Employees are compromising the data they come into contact with as well. 

Mass compromises occur routinely since thieves can divide and conquer - some can focus on getting the card numbers and others commit the fraud.  There is a robust, efficient black market for card numbers.  Consider the huge breach at Heartland Payment Systems in 2009.  Committing fraud is done with the detection systems in mind.  They often occur in "blitz" mode to overwhelm the system before it has a chance to react and stop transactions.

A recent Ovum study of 120 banks found that counterfeit card fraud is the top issue, with wire fraud second.  Card readers can be purchased much more easily (e.g., on the iPhone) and the number of cards has proliferated, increasing potential for fraud.  While the UK has adopted "chip and pin" technology on the card, the US has not.  Adopting it may one day make it more difficult for criminals to cash in on credit card fraud in the US.

Personally, I just count on having to change my credit card numbers at least yearly either on account of outright fraud, the bank (I'll use "bank", but am referring to all financial companies in this article) being compromised or me making legitimate charges where the bank panics and decides to cancel the card.   All that good fraud detection comes with a price to the card holder.

I've worked on the fraud issue since then.  Other than the fact that it's working on the prevention of a negative to the company, these actually are fun, detective-work projects.  For those who have not had the opportunity, today I decided to share some of the architecture behind fraud prevention utilizing the approach of one of the leading international providers of payment systems, ACI Worldwide (Nasdaq: ACIW) and their product, ACI Proactive Risk Manager™ 8.0 (PRM). 

As the last step in the authorization process, PRM shares a score with the bank and, based on the tolerance the bank has set for the customer (balancing potential fraud with false positives), the bank's system decides whether to authorize or not.
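The score-versus-tolerance decision can be sketched in a few lines. This is not PRM's API: the function name, the 0 to 1000 score scale, and the rule that a higher score means higher risk are all assumptions made for illustration.

```python
def authorize(score, tolerance):
    """Approve when the fraud score does not exceed the bank's tolerance.

    Assumes a higher score means higher risk; the scale is hypothetical.
    """
    return score <= tolerance


# Three hypothetical transactions scored against one customer's tolerance.
decisions = [authorize(score, tolerance=700) for score in (120, 700, 950)]
print(decisions)
```

Raising the tolerance lets more fraud through; lowering it declines more legitimate charges, which is the false-positive balance described above.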

Although the bank may have a data warehouse, all customer transaction sources feed PRM.  Some customers extend PRM's capabilities to make it their data warehouse.  One year's worth of backlogged transactions is recommended to start with - even though most are legally required to store seven years of data.

http://www.b-eye-network.com/blogs/mcknight/archives/2011/07/reducing_credit.php Other Wed, 13 Jul 2011 12:19:47 -0700
Index Data Page Layouts and Index Compression

One of my favorite blog entries was the one about the relational data page.  In that entry, I talked about how so much of the data allocated to a database is formatted.  Some people agreed but pointed out that much storage is also dedicated to index pages.  And they are correct.  It depends on your index strategy.  If you add up all the index sizes on some tables, it can exceed the row size itself.  Then, likewise the index pages would outnumber the data pages in the database.  What about them and what do they look like?

There are two basic page formats for any index page, which, like the data page, has size options for the user.  I'll repeat my earlier admonition that knowing what goes on at the page level will help you understand better how your decisions affect your performance.   

I'll start with the most prominent format - that of the leaf page.  The leaf page contains a series of key/RID(s) combinations (called entries).  The key is the actual value of the column.  RID stands for row/record ID and is composed of the data page number and the record number within the page.  The RID was explained in this post.  The RID is how the index is connected to the data pages.  The RID for the 3rd record on the 123rd data page would be "123-3".  If the record there was a customer record for Brad Smith who lives in Texas and there was an index on state, the key/RID combination would be "Texas-(123-3)".

Naturally, you would have multiple customers who live in Texas so there would be multiple RIDs in the state index associated with Texas.  It might look like "Texas-(123-3),(123-4),(125-6),(125-19),(127-10), etc.".  Any index key that shows up multiple times in the table would have multiple RIDs.  A unique index would only have one RID associated to each key.

The RIDs of successive entries in an index would not be in order except for the one clustered index on the table.  For example, an index on last name could have entries of:

Chamberlin-(234-2),(234-5),(336-3)

Chambers-(67-9)

Chambless-(900-33)

 

This is NOT for a clustered index. If it were, the RIDs would be in numerical order across entries.  Most indexes are non-clustered and it is normal for the RIDs to jump all around the table.  If you navigate quickly to the Chambers entry, data page 67, record 9 is where you would find "the rest of the record".  This is excellent for a query like "Select * from table where lastname = 'Chambers'".
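As a rough sketch of that lookup, the entries above can be modeled as a mapping from key to a list of RIDs, each RID being a (page number, record number) pair. The dictionaries and the lookup function are illustrative only; a real DBMS stores these as packed bytes on fixed-size pages, and the row contents here are invented.

```python
# Leaf entries from the last-name example: key -> list of (page, record) RIDs.
leaf_entries = {
    "Chamberlin": [(234, 2), (234, 5), (336, 3)],
    "Chambers": [(67, 9)],
    "Chambless": [(900, 33)],
}

# Hypothetical data pages, keyed by page number and then record number.
data_pages = {67: {9: {"lastname": "Chambers", "state": "Texas"}}}


def lookup(key):
    """Follow each RID for the key to its data page and return the rows found.

    Pages that are not present in this toy dictionary are simply skipped.
    """
    return [
        data_pages[page][record]
        for page, record in leaf_entries.get(key, [])
        if page in data_pages and record in data_pages[page]
    ]


print(lookup("Chambers"))
```

The index entry itself only holds the key and the RID; everything else about the customer comes from the data page the RID points to.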

But what about that navigation?  That comes about from the other index page format - called creatively the non-leaf page.  The non-leaf pages contain key ranges of the leaf pages so that an index navigation, which always begins at the root node, can quickly navigate to the correct leaf page.  That is the function of the non-leaf pages. 
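One way to picture a non-leaf page is as a sorted list of the highest key on each leaf page, searched to pick the leaf to visit. The sketch below assumes that layout and a single non-leaf level; the keys, page numbers and function name are hypothetical, and actual page formats vary by DBMS.

```python
from bisect import bisect_left

# Assumed non-leaf layout: the highest key stored on each leaf page, in key order.
high_keys = ["Chambless", "Martinez", "Zuniga"]
leaf_page_ids = [501, 502, 503]  # hypothetical leaf page numbers


def find_leaf_page(key):
    """Return the leaf page whose key range covers `key`, or None if past the end."""
    position = bisect_left(high_keys, key)
    return leaf_page_ids[position] if position < len(leaf_page_ids) else None


print(find_leaf_page("Chambers"))
```

With more levels, the same search repeats from the root down, each step narrowing the range until a leaf page is reached.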

In practice, this quickly fills up a single index page (of a few "K" bytes) and then the entries are split into two non-leaf pages and a "root" page now points to the ranges on those pages.  Eventually the root
http://www.b-eye-network.com/blogs/mcknight/archives/2011/06/index_data_page.php DBMS Selection Wed, 15 Jun 2011 07:31:41 -0700
Applying different standards to cloud data warehouses

Increasingly, data warehouse components as well as many operational systems are moving to the cloud.  By the cloud, I mean systems that conform to the NIST definition of on-demand with self-service, have broad network access, resource pooling, rapid elasticity and measured service.  The cloud has lowered barriers to entry in terms of IT competencies that need to be employed as well as hardware, software, power, floor space, storage, network, procurement and accounting. In addition, cloud providers provide more professional chargeback capabilities.  Obviously, much personal software has gone to the cloud - e.g., Microsoft 365, Dropbox, Google Docs, MobileMe and the impending Google Chrome device.  But what about core enterprise systems like data warehouses?

 

The cloud can be resisted there due to the loss of control. However, we must lose the fear and facilitate the right cloud strategy for data warehousing.  Some of my clients are eagerly moving various systems components to the cloud and in so doing are going to apply different standards for cloud data warehouses than they would for in-house data warehouses.  For example, the usual 99.99% availability mark gets compromised in Amazon's public cloud which offers 99.95% availability. There have also been many public cloud relationships derailed such as the one between Eli Lilly and Amazon.  However, even with Amazon's recent outage, I find that companies, despite the media FUD about it, are responding not by moving away from the cloud but by doubling down with high availability systems.  Upon further inspection, much availability, security and performance can be considered better than in-house systems.
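To put the two availability figures in perspective, the small calculation below converts each percentage into allowed downtime per year. It assumes a 365-day year and counts all downtime equally; the function name is made up for this sketch.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a system may be down while still meeting the target."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


for target in (99.99, 99.95):
    print(target, round(allowed_downtime_minutes(target), 1))
```

That works out to roughly 53 minutes a year at 99.99% versus about 4.4 hours a year at 99.95%, a fivefold difference that is worth weighing against the workload's actual needs.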

 

So which type of cloud and what services for data warehousing? As would be expected, initially there is a preference for the private cloud as it gives clients more comfort and more control over security compliance and integration - even though some accountants would prefer the public option so that expenses are fully capitalized without question.

 

http://www.b-eye-network.com/blogs/mcknight/archives/2011/05/applying_differ.php Business Intelligence/Data Warehousing Thu, 26 May 2011 09:50:05 -0700
Break Free Tour: Build Your Information Strategy

Due to increasing data volume and data's high utility, there has been an explosion of capabilities brought into the early mainstream in the past few years. While stalwarts of our information, like the relational row-based enterprise data warehouse, remain highly supported, it is widely acknowledged that no single solution will satisfy all enterprise data management needs. 

Many are confused by the value of Hadoop, data warehouse appliances and stream processing. Their value propositions seemingly conflict with current information management infrastructure.

Costs for keeping "all data for all time" in an EDW are still escalating, even though storage remains historically inexpensive. That is driving some heterogeneity as well.  

The key to making the correct data storage selection is an understanding of your workloads - current, projected and envisioned.

Join me for the Break Free Tour session in a city near you. This practical session will organize and explore the major categories of information stores available and help you make the best choices to ensure information remains an unparalleled corporate asset.

You will learn:

  • The place for Relational Row-Oriented Data Warehouse and Data Marts
  • Efficient operation of RDBMS with I/O Bottleneck alleviation
  • How multidimensional databases fit into an organization
  • When data streams make an information store
  • Hadoop basics for Big Data, webscale and unstructured workloads
  • Cloud considerations for information storage and interaction

I'll be joined by IBM leaders drilling in on these topics and others relevant to the perspective of the IT decision maker. Register for one of these sessions, and bring your questions. We're looking forward to a fascinating series.

Tuesday, June 7 - Bellevue, WA
Thursday, June 9 - Boston, MA
Tuesday, June 14 - Los Angeles, CA
Wednesday, June 15 - Denver, CO
Thursday, June 16 - Atlanta, GA

Additional analysts and colleagues will be hosting Break Free sessions in Toronto and New York.

See full details on any of these sessions and register now.  I hope you can join us! 

And if you're near Bellevue, Boston, Los Angeles, Denver or Atlanta and you want to find some time together (end user organizations, software companies) while I'm in town, let me know.

]]>
http://www.b-eye-network.com/blogs/mcknight/archives/2011/05/break_free_tour.php http://www.b-eye-network.com/blogs/mcknight/archives/2011/05/break_free_tour.php Business Intelligence/Data Warehousing Wed, 25 May 2011 12:30:29 -0700
Netezza: Pioneer Appliance for Large Data Management Discovery Days kicked off last week in Indianapolis and part of the focus was on Netezza.  I gave a talk on the origins of appliances, based in part on the linear progression from uniprocessing to SMP to Clusters to MPP and made a point that I see appliances in that lineage.  However, it was with the caveat that it's no longer linear and each appliance is putting different nuance to MPP.   Appliances do represent something different than just bundled MPP systems. 

 

Alan Edwards of IBM filled in some of the details of the Netezza story and value proposition.  Netezza is very much a strong part of the IBM information management story today.  Below are some points from Alan's talk.  Netezza-aware professionals will already understand most, but keep in mind the lack of familiarity of the audience.  Everyone can at least be reminded of these high points for Netezza.

 

Nearly 70% of data warehouses experience performance constraints

Traditional systems are just too complex and take too long to deliver answers

Netezza means results in Urdu.

Twinfin 3rd gen is a line of surfboards (and the latest line of Netezza appliances)

Netezza is purpose built for analytics

There are no "hints", etc. (none necessary)

They have an analytics package that runs in hardware (I believe this is a reference to the FPGA)

Appliances start at 1 TB and go up to 1 PB+

  nz.jpg

]]>
http://www.b-eye-network.com/blogs/mcknight/archives/2011/05/netezza_pioneer.php http://www.b-eye-network.com/blogs/mcknight/archives/2011/05/netezza_pioneer.php Business Intelligence/Data Warehousing Sun, 08 May 2011 18:02:27 -0700
Dallas-based Open Source Columnar vendor Calpont latest release

Open Source.  Check.  Columnar.   Check.  Dallas' answer to a couple of others checking these boxes out there is InfiniDB from Calpont.  I had a chance to catch up with them this week as they announced their latest release, 2.1, about 15 months after their commercial launch.

 

InfiniDB does not have indexes, does late materialization (or "just in time" as they call it) and multi-table hash joins.
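
To make late materialization concrete, here is a minimal Python sketch of the general technique. It is not InfiniDB's implementation; the column layout, the sample data and the `late_materialize` helper are all assumptions for illustration. The idea is to evaluate the filter against a single column first, and only then assemble rows from the columns the query actually asks for.

```python
# Illustrative column store: each column is a separate list, aligned by row position.
# The table and its values are hypothetical sample data.
columns = {
    "order_id": [1, 2, 3, 4, 5],
    "region": ["TX", "CA", "TX", "NY", "TX"],
    "amount": [120.0, 75.5, 310.0, 42.0, 99.9],
}


def late_materialize(columns, filter_col, predicate, output_cols):
    """Scan only the filter column, then fetch output values for surviving rows."""
    # Step 1: evaluate the predicate against one column, keeping row positions.
    positions = [i for i, value in enumerate(columns[filter_col]) if predicate(value)]
    # Step 2: build result rows only now, touching only the requested columns.
    return [tuple(columns[col][i] for col in output_cols) for i in positions]


rows = late_materialize(columns, "region", lambda r: r == "TX", ["order_id", "amount"])
print(rows)  # [(1, 120.0), (3, 310.0), (5, 99.9)]
```

Columns that are neither filtered on nor selected are never read, which is where a columnar engine saves I/O compared with assembling full rows up front.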

 

Some of the items they were stressing were scalability, SQL extension, compression and performance.

 

The modules that perform user functions and performance functions can be scaled individually to accommodate need in either area.  This is part of their plan to provide linear scalability.

 

Partitioning can be vertical or horizontal and comes by default with CREATE TABLE.
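
For readers less familiar with the two terms, the following Python sketch shows the difference in the abstract. It assumes nothing about how InfiniDB lays out data on disk; the sample table and both helper functions are hypothetical. Horizontal partitioning splits a table into groups of rows, while vertical partitioning splits it by column.

```python
# Hypothetical sample table as a list of row dictionaries.
table = [
    {"id": 1, "region": "TX", "amount": 120.0},
    {"id": 2, "region": "CA", "amount": 75.5},
    {"id": 3, "region": "TX", "amount": 310.0},
    {"id": 4, "region": "NY", "amount": 42.0},
]


def horizontal_partition(rows, rows_per_partition):
    """Split the table into consecutive groups of rows."""
    return [
        rows[i:i + rows_per_partition]
        for i in range(0, len(rows), rows_per_partition)
    ]


def vertical_partition(rows):
    """Split a group of rows into one list per column."""
    return {col: [row[col] for row in rows] for col in rows[0]}


# Apply both: each row group is stored column by column.
partitions = [vertical_partition(chunk) for chunk in horizontal_partition(table, 2)]
print(partitions[0]["amount"])  # [120.0, 75.5]
```

With both applied, a query can skip entire row groups and, within the groups it does read, touch only the columns it needs.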

 

There was a lot of stressing of the predictable, linear performance.

 

  calpont1.jpg 

Here are some of the SQL extensions added in the last few releases that were stressed:

 

Subqueries

Limit keyword

User Defined Functions

STDDEV and related functions

Views

Auto incrementing

Partition drop - to take individual partitions offline

INSERT INTO table SELECT FROM... where the source and target tables can be in either InfiniDB or MySQL

Vertical and horizontal partitioning

 

Most of these are available in the enterprise edition.  If you want their syntax guide, drop me an email.

 

Compression is also new and improved.   Using the Piwik.org data set, they achieved 3x to 9x compression.  Compression can be set on individual columns.
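
Why per-column compression settings matter can be seen with a small Python sketch. This uses `zlib` from the standard library as a stand-in codec, not InfiniDB's compression, and the two sample columns are made up rather than taken from the Piwik data set. The point is only that columns with different value distributions compress very differently, so a single table-wide setting is a blunt instrument.

```python
import zlib

# Hypothetical columns: one highly repetitive, one with all-distinct values.
columns = {
    "country": ["US"] * 500 + ["DE"] * 500,
    "visit_id": list(range(1000)),
}


def compression_ratio(values):
    """Return uncompressed size divided by compressed size for one column."""
    raw = "\n".join(str(v) for v in values).encode("utf-8")
    packed = zlib.compress(raw)
    return len(raw) / len(packed)


ratios = {name: compression_ratio(vals) for name, vals in columns.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

The repetitive `country` column compresses far better than the distinct `visit_id` column, which is the kind of spread that makes a per-column choice useful.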

 

Performance with compression was a key takeaway: 

calpont2.jpg

Several recommendations for effectively using InfiniDB were given:

 

]]>
http://www.b-eye-network.com/blogs/mcknight/archives/2011/05/dallas-based_op.php http://www.b-eye-network.com/blogs/mcknight/archives/2011/05/dallas-based_op.php Business Intelligence/Data Warehousing Fri, 06 May 2011 07:35:57 -0700
Indianapolis, St. Louis, Phoenix, New York: Come out to Discovery Days I'm about to keynote a fun series of talks with IBM called Discovery Days.  We'll be covering topics around optimizing systems for speed, efficiency and analytics.  If you're in any of the cities named below, I hope to see you there!  And if you're in St. Louis, Phoenix or New York City and you want me to consult or speak to you or your team (end user organizations, software companies) while I'm in town, let me know, as I can plan some time before or after the conference.

 

The known days so far are:

 

May 5

Indianapolis, IN

JW Marriott Indianapolis

 

May 10

St. Louis, MO

Renaissance St. Louis Grand Hotel

 

 

4/29 update: St. Louis event rescheduled due to tornado at airport; new date is 5/24

 

May 12

Phoenix, AZ

Ritz Carlton Phoenix 

 

June 9

Columbus, OH

Westin Columbus

 

5/22 update: I'm going to Boston June 9 to give a keynote for Break Free 2011 instead.  I'll post about those events soon. 

 

October 4

New York, NY

TBD

 

There will be a few more.  I will post as they become known.

 

Here is the link to the first event in Indianapolis.

 

 


 

]]>
http://www.b-eye-network.com/blogs/mcknight/archives/2011/04/indianapolis_st.php http://www.b-eye-network.com/blogs/mcknight/archives/2011/04/indianapolis_st.php Business Intelligence/Data Warehousing Tue, 19 Apr 2011 12:54:45 -0700