Simple Semi-Structured Data

Originally published October 17, 2005

We are mostly familiar with structured data: data that has been neatly modeled, organized, and formatted in ways that are easy for us to manipulate and manage. The most frequent examples are databases, along with more mundane formats such as spreadsheets, fixed-format files, and log files. Fortunately, structured data is relatively easy to work with, because we can write programs that organize, analyze, and display it.

You are probably familiar with unstructured data as well. In fact, you are reading it right now. Unstructured data incorporates the mass of information that does not fit easily into a set of database tables. The most recognizable form of unstructured data is text in documents, such as articles, slide presentations or the message components of emails.

There is an intermediate classification of content called “semi-structured data.” This refers to sets of data in which there is some implicit structure that is generally followed, but not enough of a regular structure to “qualify” for the kinds of management and automation usually applied to structured data. We are bombarded by semi-structured data on a daily basis, in both technical and non-technical environments. For example, web pages follow certain typical forms, and content embedded within HTML often has some degree of metadata within the tags, which implies certain details about the data being presented. A non-technical example would be traffic signs posted along highways. While different areas use their own local protocols, you will probably figure out which exit is yours after reviewing a few highway signs.
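As a rough illustration of the web page case, the sketch below pulls the metadata carried in a page's tags using Python's standard `html.parser` module. The sample page, its author name, and the `MetaCollector` class are hypothetical and exist only for this example; real pages are far less tidy.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the names and values are invented for illustration.
PAGE = """
<html><head>
<title>Simple Semi-Structured Data</title>
<meta name="author" content="Jane Doe">
<meta name="keywords" content="data quality, metadata">
</head><body><p>Article text goes here.</p></body></html>
"""

class MetaCollector(HTMLParser):
    """Collects the <title> text and any <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears between <title> and </title> is kept.
        if self._in_title:
            self.title += data

collector = MetaCollector()
collector.feed(PAGE)
print(collector.title)
print(collector.meta)
```

The body text stays unstructured, but the tags around it supply a small, predictable layer of structure that a program can rely on.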

This is what makes semi-structured data interesting—while there is no strict formatting rule, there is enough regularity that some interesting information can be extracted. Often, the interesting knowledge involves entity identification and entity relationships. For example, consider this piece of semi-structured text (adapted from a real example):

“John A. Smith of Salem, MA died Friday at Deaconess Medical Center in Boston after a bout with cancer. He was 67.

Born in Revere, he was raised and educated in Salem, MA. He was a member of St. Mary’s Church in Salem, MA, and is survived by his wife, Jane N., and two children, John A., Jr., and Lily C., both of Winchester, MA.

A memorial service will be held at 10:00 AM at St. Mary’s Church in Salem.”

This death notice contains a great deal of information: names of people, names of places, relationships between people, affiliations between people and places, affiliations between people and organizations, and the timing of events related to those people. Note that this death notice is not only much like others from the same newspaper, but also reasonably similar to death notices in any newspaper in the US. In other words, the clipping has the characteristics of a semi-structured form. For the most part, we could figure out who the individuals are, what the organizations are, and the relationships between them. Instead of using sophisticated data and text mining techniques, we can use straightforward keyword-in-context parsing.
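A minimal sketch of what keyword-in-context parsing might look like for the notice above, written in Python with regular expressions. The cue phrases, the labels, and the `extract` function are assumptions chosen to fit this one example; a usable parser would need many more patterns and would still miss unusual wordings.

```python
import re

# The sample notice from above, with a plain apostrophe for simplicity.
NOTICE = (
    "John A. Smith of Salem, MA died Friday at Deaconess Medical Center in "
    "Boston after a bout with cancer. He was 67. Born in Revere, he was raised "
    "and educated in Salem, MA. He was a member of St. Mary's Church in Salem, "
    "MA, and is survived by his wife, Jane N., and two children, John A., Jr., "
    "and Lily C., both of Winchester, MA."
)

# Hypothetical cue patterns: the keywords around each group signal what kind
# of entity or relationship the captured text represents.
PATTERNS = {
    "deceased": r"^(?P<v>[A-Z][\w.]*(?: [A-Z][\w.]*)*) of ",
    "residence": r"of (?P<v>[A-Z]\w+, [A-Z]{2}) died",
    "place_of_death": r"died \w+ at (?P<v>.+?) in ",
    "age": r"He was (?P<v>\d+)\.",
    "birthplace": r"Born in (?P<v>[A-Z]\w+)",
    "member_of": r"member of (?P<v>.+?) in ",
    "spouse": r"survived by his wife, (?P<v>[A-Z]\w+ [A-Z]\.)",
}

def extract(text):
    """Return a dict of label -> first matching value for each cue pattern."""
    facts = {}
    for label, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            facts[label] = match.group("v")
    return facts

for label, value in extract(NOTICE).items():
    print(f"{label}: {value}")
```

The patterns work only because notices of this kind reuse the same small set of phrases ("died ... at", "Born in", "survived by"), which is exactly the regularity that makes the form semi-structured.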

I consider these forms “simple” semi-structured data, since the formatting has slowly evolved into a generally accepted protocol with reasonably definable characteristics. In fact, you might say that this simple semi-structured data has its own “semi-structured metadata.” This describes which entity objects are discussed within the text, as well as relationships that can be inferred. Other examples include engagement and wedding announcements, job postings, real estate listings, legal notices and bank account names.

In the world of the “360-degree view,” “Customer Relationship Management” and “Customer Data Integration,” we often forget that much of the knowledge we want to capture may not be explicit in our structured data archives. Yet a large part of that information might be extractable from data sources not normally accessed. Furthermore, the sheer volume of usable information in those sources might provide valuable insight.

Here is a brief example: If you are an executive recruiter, you have probably collected reams of resumes over time. In each resume, the candidate describes previous experience: where they worked, when they worked there and what they did. Generally, when you have an open position, you browse through your resumes for a good match based on skills. Instead, it would be clever to build an “intelligence” network that lets you find both good candidates and key influencers within your network. Such a system would help you better qualify those candidates. One approach is to connect individuals based on degrees of relationships, determining which candidates worked for the same company on the same project at the same time. By combining this network with the information about individuals listed as references, you will begin to see patterns. These patterns could allow you to classify candidates in relation to one another and to judge whether some are better suited for particular roles.
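The connection step described above can be sketched in a few lines of Python. The records in `STINTS`, the field layout, and the function names are all hypothetical; the sketch assumes the employment history has already been extracted from the resumes into (candidate, company, project, start year, end year) tuples.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical records: (candidate, company, project, start_year, end_year)
STINTS = [
    ("Ana", "Acme", "Billing", 2001, 2003),
    ("Ben", "Acme", "Billing", 2002, 2004),
    ("Cal", "Acme", "Billing", 2004, 2005),
    ("Dee", "Initech", "Portal", 2002, 2003),
]

def build_network(stints):
    """Link candidates who shared a company and project in overlapping years."""
    by_project = defaultdict(list)
    for name, company, project, start, end in stints:
        by_project[(company, project)].append((name, start, end))

    links = defaultdict(set)
    for members in by_project.values():
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(members, 2):
            # Two inclusive year ranges overlap when each starts before the other ends.
            if a != b and a_start <= b_end and b_start <= a_end:
                links[a].add(b)
                links[b].add(a)
    return dict(links)

def degrees_from(links, start):
    """Breadth-first search: degrees of separation from one candidate."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for other in links.get(person, ()):
            if other not in seen:
                seen[other] = seen[person] + 1
                queue.append(other)
    return seen

network = build_network(STINTS)
print(network)
print(degrees_from(network, "Ana"))
```

In this toy data, Ana and Cal never overlapped, but both worked alongside Ben, so Cal sits two degrees away from Ana. Reference information from the resumes could be added as further links in the same structure.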

Some organizations are already working on this type of model. One particularly successful approach is collecting and linking information about key government personnel involved in acquisitions and procurement. Since government contracting is a huge business, any information that reduces the complexity of a contractor’s sales process is valuable, whether it helps find the right person to talk to or the right project to bid on. Such knowledge can significantly shorten the sales cycle and help close more sales faster.

There is a tremendous amount of value in semi-structured data. In a future article, I will examine ways to both capture and manipulate value. As always, I am interested in hearing your questions and comments. Please feel free to e-mail me at

Recent articles by David Loshin


