A Star Schema Model for Narrative Data

Originally published July 21, 2005

This article discusses the integration of iterative data (commonly known as “structured data”) and narrative data (commonly referred to as “unstructured data”). In short, iterative data repeats, whereas narrative data tells a story. In a collection of iterative data such as a relational database table, the meaning of the data repeats from one instance to the next. But in narrative data (books, web pages, emails, astute articles such as this one) the meaning across a set of instances is not repetitive; at least an author hopes not! Unified management of iterative and narrative data is somewhat like the Six Million Dollar Man: “we have the technology.” The challenge we now face is applying it.

Taxonomies for Data Modelers

A taxonomy is a grouping of similar things (an entity class) into a hierarchical structure. It’s therefore not too much of a stretch to think of a taxonomy as a dimension. If one thinks of taxonomies as dimensions, then a set of taxonomies and the facts they classify can be perceived as a star schema. The representation of narrative facts in a star schema configuration has the potential to allow its users to “read less and learn more” by supporting very targeted search capabilities on data that, in its native form, might need to be processed sequentially (i.e., read), a very time-consuming access method. Indexed or random access can enable a user to get to points of fact quickly, even across multiple sources.

Wait a minute, the reader may protest: you’re really stretching it here. Granted, there are differences between this Narrative Star and the conventional star schema built for analysis of iterative data.
For example, the fact table in the Narrative Star would be a “factless fact table,” per Kimball, because it would have no additive numeric measures, with the possible exception of a value of 1 in each row, enabling a tally of the references available for any combination of dimension members. Also, in the Narrative Star schema, any fact can occur at any level of detail for any taxonomy/dimension, in contrast to the more familiar iterative-data star schema, where all facts are typically stored at the same level of detail for each dimension. Whereas the power of the dimension hierarchies in a typical star schema lies in their capacity for numeric aggregation, the power of taxonomic dimensions in a narrative star schema lies in search: specifically, horizontal and vertical navigation of an information space, horizontally in terms of cross-references and vertically in terms of inheritance of “aboutness.” For example, a reference to diseases in Pennsylvania is also a reference to disease in the United States; the United States, as a superclass of states, may “inherit” this reference. In a narrative-data star schema, each fact is attributed to a narrative artifact (e.g., a document); each narrative is a reference attesting to one or more facts.

Here is an example of how a narrative-data star schema can work in practice. Narratives often record things that happen to people or groups within a certain time and place; they speak of “who, what, where, when, why and how.” This star schema can be populated with fragments or facts mined from a collection of narratives. Depending on the particular domain, these facts do not necessarily need to be “factual” or true, particularly if the schema is populated from fictional narratives or even plans or predictions. Of course, if fact and fiction are to be combined in the same domain, a clear delineation between the two would be recommended!
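The vertical “inheritance of aboutness” described above can be sketched in a few lines. What follows is only an illustration under stated assumptions, not a schema taken from this article: the table names `location` and `narrative_fact`, the columns, the sample rows and the helper functions are all hypothetical, and SQLite (through Python's standard `sqlite3` module) is used purely for convenience. The sketch shows a factless fact table carrying a tally of 1 per row, facts attached at different levels of one taxonomy, and a recursive query that lets a parent member pick up references recorded against its descendants.

```python
import sqlite3

# Hypothetical, minimal schema: one taxonomy (location) and a factless fact table.
SCHEMA = """
CREATE TABLE location (
    location_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    parent_id   INTEGER REFERENCES location(location_id)  -- NULL at the root
);
CREATE TABLE narrative_fact (
    fact_id        INTEGER PRIMARY KEY,
    location_id    INTEGER REFERENCES location(location_id),  -- any level; may be NULL
    topic          TEXT,
    media_artifact TEXT NOT NULL,               -- the narrative attesting to the fact
    ref_count      INTEGER NOT NULL DEFAULT 1   -- the only "measure": a tally of 1
);
"""

# Recursive CTE: gather the named location and everything beneath it,
# then tally the facts attached at any of those levels.
ROLLUP = """
WITH RECURSIVE subtree(location_id) AS (
    SELECT location_id FROM location WHERE name = ?
    UNION ALL
    SELECT l.location_id
    FROM location AS l
    JOIN subtree AS s ON l.parent_id = s.location_id
)
SELECT COALESCE(SUM(f.ref_count), 0)
FROM narrative_fact AS f
WHERE f.location_id IN (SELECT location_id FROM subtree)
  AND f.topic = ?
"""


def build_demo():
    """Create an in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO location VALUES (?, ?, ?)",
        [
            (1, "United States", None),
            (2, "Pennsylvania", 1),
            (3, "Ohio", 1),
            (4, "Philadelphia", 2),
        ],
    )
    conn.executemany(
        "INSERT INTO narrative_fact (location_id, topic, media_artifact) "
        "VALUES (?, ?, ?)",
        [
            (2, "disease", "report_a.txt"),   # fact recorded at state level
            (4, "disease", "email_b.txt"),    # fact recorded at city level
            (1, "disease", "book_c.txt"),     # fact recorded at country level
            (3, "weather", "article_d.txt"),
        ],
    )
    return conn


def count_references(conn, location_name, topic):
    """Tally references about a topic at a location or anywhere beneath it."""
    return conn.execute(ROLLUP, (location_name, topic)).fetchone()[0]
```

With the sample rows above, `count_references(build_demo(), "United States", "disease")` tallies three references, while the same call for "Pennsylvania" tallies two: the state-level fact plus the one recorded against Philadelphia. Walking the hierarchy at query time is only one design choice; a precomputed ancestor (bridge) table would serve the same purpose.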
Figure 1. A “Universal” Star Schema for Narrative Data.

All these dimensions are, as is typical, hierarchical. Each dimension can also contain multiple alternate hierarchies, each based on a different classification scheme (“Have it Your Way”), which could be thought of as a “snowflake” configuration. To clarify, here are entity definitions for the fact and the dimensions that make up this schema.

Narrative Fact

A Narrative Fact is something that is known about the world that has been reported at least once in narrative form. It is the intersection of the people, places, times, happenings and reasons that have been captured in the narrative record. Recording each of these details (dimensions) is not possible or relevant for every fact; most facts will have only a subset of their dimension foreign keys populated (non-null).

Event

The Event dimension is the What. An Event is anything that has happened, or is projected or imagined, that has been recorded. Events can be classified hierarchically according to an organization’s requirements; for example, a law enforcement organization could group events into criminal or suspicious activities, and divide the criminal activities into types of crimes.

Location

The Location dimension is the Where. Location can be described in geospatial or geographical coordinates. Location can also include addresses, geocodes, municipalities, states, counties, regions, countries and continents.

Media Artifact

The Media Artifact dimension includes instances and groupings of narrative “documents.” The leaf level represents actual instances of narratives, for example addressable, retrievable physical files.

Party

The Party dimension is the Who that the narrative fact is about, closely analogous to the well-known Party entity. The most granular level of the Party dimension is the individual person. The dimension also represents hierarchical generalizations of individuals and groups of individuals.
Time Interval

This dimension is, of course, the When: any duration of time. It corresponds to the widely known Time dimension.

Topic

The Topic dimension is the catch-all, including any classifications by which Narrative Facts need to be sliced and diced that are not covered by the other, more “generic” dimensions. This is where the “universality” of this model requires the most work to be customized to a particular organization. Topics are unique to the market that an organization competes in. The definition of an organization’s Topics is critical for an effective analysis of its performance and competitive position. Case in point: for a for-profit organization, foremost among its Topics should be the Products/Services that it and its competitors offer to the marketplace. Because of the breadth of this dimension, in most if not all cases it will comprise multiple dimensions.

Fact Linkage

A Fact Linkage represents a relationship between two facts and, as a many-to-many fact relationship, is another departure from the familiar iterative-data star schema. A Fact Linkage can indicate, for example, a non-hierarchical relationship between two parties, e.g., Jeb Bush is-the-brother-of George W. Bush.

Dimension Analysis and Identification

A large proportion of the “reference entities” in an enterprise data model (actual or virtual) of a given organization may logically fall into one of the “pre-defined” dimensions of Party, Event, Location or Time Interval. The distinguishing reference entities that are left can be thought of as subtypes of Topic. Media Artifact is a special case, and not too difficult; it is nothing more or less than an inventory of the narrative information resources of the enterprise.

Conformed Marts, and Integrating Narrative and Iterative Data

In The Data Administration Newsletter (TDAN), Bob Seiner has proposed a model for narrative artifacts that is related and similar to the model described here.
However, whereas the facts in Bob’s model are at the level of the narrative artifacts themselves, the facts in the model described here are at a lower level of granularity: the level of the actual content of narrative artifacts. It would be quite practical to conform Bob’s model to the model in Figure 1 by intersecting the Media Artifact dimension in the Narrative Star with the Artifact entity in Bob’s model. The resulting combined model would resemble a snowflake or constellation schema. If the taxonomies typically used to manage narrative data are integrated (or “conformed”) with the dimensions and reference data used to manage iterative data, the resulting environment can allow users to easily access and combine facts from both domains.

What’s Next? Stay Tuned

So if this is starting to look like data warehousing for narrative data (which is a good thing), how then is this schema to be populated? In my next article, we’ll see that, just like your garden-variety star schema for iterative data, populating this schema requires ETL. To better understand ETL specifically for narrative data, we will start with the search engine.

Recent articles by William Lewis
Copyright 2004 — 2012. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC