I was attending a conference session on entity resolution in which the speaker discussed the different algorithms used for similarity scoring as part of the process for determining whether two records refer to the same entity. The talk centered on the use of identity attributes (such as name, identification numbers, eye color, etc.) whose values could be compared, weighted and scored accordingly. And while the concentration was on the use of relatively inherent attributes for comparison, an interesting idea popped up regarding the use of calculated or analytical data elements as attributes that could contribute to the similarity score.
As an example, there are data aggregators who collect information regarding behavioral aspects of individuals, such as their educational achievement, their salaries, the types of cars they like to drive, their most frequently visited hotel chain, or the types of magazines to which they subscribe. Appending this information to an existing set of identity attributes can only help the differentiation process. If we have two records that we suspect refer to the same entity, and additional behavioral characteristics have been appended to each of the records (even if they were provided from different sources), then agreement among those characteristics increases the likelihood of a match.
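To make the idea concrete, here is a minimal sketch of weighted similarity scoring in which appended behavioral attributes contribute alongside the identity attributes. The attribute names, the weights, and the exact-match comparison are all assumptions made for illustration; they are not the algorithms presented in the talk, and a real system would likely use fuzzier comparisons.

```python
def weighted_similarity(rec_a, rec_b, weights):
    """Return a score in [0, 1]: the weighted fraction of attributes that agree.

    Attributes missing from either record are skipped, so they neither
    raise nor lower the score.
    """
    total = 0.0
    agreed = 0.0
    for attr, weight in weights.items():
        a, b = rec_a.get(attr), rec_b.get(attr)
        if a is None or b is None:
            continue
        total += weight
        # Simplistic comparison: case-insensitive exact match.
        if str(a).strip().lower() == str(b).strip().lower():
            agreed += weight
    return agreed / total if total else 0.0


# Hypothetical weights: identity attributes count more than behavioral ones.
IDENTITY_WEIGHTS = {"name": 0.5, "birth_year": 0.3, "eye_color": 0.2}
BEHAVIOR_WEIGHTS = {"hotel_chain": 0.2, "car_type": 0.2, "magazine": 0.1}

# Two made-up records, as if enriched from different sources.
rec_a = {"name": "Pat Smith", "birth_year": 1970, "eye_color": "brown",
         "hotel_chain": "Acme Inns", "car_type": "hybrid",
         "magazine": "Cycling Weekly"}
rec_b = {"name": "Pat Smith", "birth_year": 1970, "eye_color": "blue",
         "hotel_chain": "acme inns", "car_type": "hybrid",
         "magazine": "Cycling Weekly"}

identity_only = weighted_similarity(rec_a, rec_b, IDENTITY_WEIGHTS)
with_behavior = weighted_similarity(
    rec_a, rec_b, {**IDENTITY_WEIGHTS, **BEHAVIOR_WEIGHTS})

print(f"identity only: {identity_only:.2f}")
print(f"with behavior: {with_behavior:.2f}")
```

With these made-up records the score rises from about 0.80 to about 0.87 once the behavioral attributes are included, because all three of them agree. Had they disagreed, the score would have dropped, which is just as useful for telling two entities apart.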
To some extent, this suggests a conundrum. The data attributes we usually use for comparison are intrinsic or inherent to the objects we are trying to compare, and these attributes provide some grounding in the belief that our comparisons and similarity scoring processes are sound. Yet when the typical identity attributes are insufficient for differentiation, additional non-intrinsic attributes can add value. That being said, it is worth exploring the types of computed or calculated data elements associated with the data subject areas that are subjected to entity resolution.
In the context of one application of entity resolution, householding, relationships are an interesting set of attributes to look at. There are many types of relationships that are discovered as a by-product of entity resolution, such as households or families. These terms take on different meanings depending on the subject area and the business situation. For example, we can examine parent-child and sibling relationships associated with individuals, we can look at components such as paper clips or screws that are in the same “family,” or we can look at corporate ownership relationships that reflect families of companies. Alternatively, we can look at other types of relationships – individuals belonging to the same health club, components manufactured from the same type of metal, or companies that share the same board members.
The explosive growth of social networking communities provides a fertile area for assembling these auxiliary attributes reflecting households and relationships. Virtual community environments such as LinkedIn, Facebook, MySpace, etc. thrive on connectivity – direct connections established by the participants, potential connections discovered by scanning through a participant's emails, or implied relationships based on existing links. In turn, you begin to see hierarchies of connectivity. There are those that are directly linked, then a greater set of individuals connected indirectly through a single point, then through two points, etc.
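These levels of connectivity can be computed with a breadth-first search over the connection graph. The sketch below is only an illustration: the names and the connection lists are invented, and the function name `degrees_of_separation` is hypothetical rather than part of any social network's API.

```python
from collections import deque


def degrees_of_separation(connections, start):
    """Breadth-first search returning {person: number of hops from start}."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for neighbor in connections.get(person, ()):
            if neighbor not in distance:  # first visit is the shortest path
                distance[neighbor] = distance[person] + 1
                queue.append(neighbor)
    return distance


# Invented, symmetric connection lists.
connections = {
    "ann": ["bob", "cara"],
    "bob": ["ann", "dev"],
    "cara": ["ann"],
    "dev": ["bob", "eli"],
    "eli": ["dev"],
}

levels = degrees_of_separation(connections, "ann")
print(levels)
```

Here bob and cara are direct links of ann, dev is connected through a single intermediate point, and eli through two. The hop count itself becomes a computed attribute that can be attached to a pair of records.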
Alternatively, there are the self-organizing communities – those individuals who have identified themselves as belonging together. Examples include college alumni groups, those who share interest in their local communities, alumni of particular companies, or those sharing extramural or technology interests. Again, we see the organization of communities of interest within hierarchies. An example might be university alumni of a particular year, then of a particular college at the university, then the university as a whole. Each of these associations becomes a declared identity attribute, and similarity for the purposes of entity resolution or for householding or relationship analysis can exploit these characteristics as demographic or psychographic criteria that similar folks will share.
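One possible way to turn such a hierarchy into a similarity measure is to record each declared affiliation as a path from the broadest group to the narrowest and count how many levels two individuals share. This is a sketch under that assumption only; the path layout, the group names, and both function names are hypothetical.

```python
def shared_affiliation_depth(path_a, path_b):
    """Count the leading levels (broadest first) that two paths share."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


def affiliation_similarity(path_a, path_b):
    """Shared depth divided by the longer path length, giving a value in [0, 1]."""
    longest = max(len(path_a), len(path_b))
    return shared_affiliation_depth(path_a, path_b) / longest if longest else 0.0


# Invented affiliation paths: university, then college, then class year.
alum_1 = ("State University", "College of Engineering", "Class of 1995")
alum_2 = ("State University", "College of Engineering", "Class of 1998")
alum_3 = ("State University", "College of Arts", "Class of 1995")

print(affiliation_similarity(alum_1, alum_2))  # same university and college
print(affiliation_similarity(alum_1, alum_3))  # same university only
```

Two alumni of the same college score higher than two who share only the university, which mirrors the hierarchy described above. Note that matching class years under different colleges do not count here, because the comparison stops at the first level that differs.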
Yet another set of connectivity attributes can be computed as a result of actions taken by individuals participating within the social network. This time, environments such as Twitter provide the model – especially in terms of the amplification or network traversal effects characterized by the frequency, tagging, and “retweeting” patterns that evolve in yet another self-organizing framework. Each inclusion of a specific tag within a message is equivalent to a declared affiliation with (yet another!) psychographic attribute. The volume and frequency of original or repeated messages, the histogram of tags, the set of followers and the set of those being followed are non-intrinsic characteristics that can help shape an entity profile that can be used for the same set of purposes – entity resolution, householding, relationship analysis, etc. And yet again, we can configure hierarchies of connectivity – those who relay others’ messages, followed by those who re-relay those messages, etc.
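As a rough illustration of the histogram of tags as a computed attribute, the sketch below builds a tag histogram from a handful of messages and compares two histograms with cosine similarity. The messages are invented, the `#tag` pattern is a simplification of real tagging rules, and cosine similarity is just one reasonable choice of measure, not one prescribed here.

```python
import math
import re
from collections import Counter


def tag_histogram(messages):
    """Count each #tag (case-insensitive) across a list of messages."""
    return Counter(
        tag.lower()
        for message in messages
        for tag in re.findall(r"#(\w+)", message)
    )


def cosine_similarity(hist_a, hist_b):
    """Cosine of the angle between two tag-count vectors (0.0 if either is empty)."""
    dot = sum(count * hist_b[tag] for tag, count in hist_a.items())
    norm_a = math.sqrt(sum(c * c for c in hist_a.values()))
    norm_b = math.sqrt(sum(c * c for c in hist_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented message streams for three accounts.
account_a = ["Great talk on #MDM and #dataquality", "More on #dataquality today"]
account_b = ["#DataQuality matters", "Reading about #mdm"]
account_c = ["Go #Orioles!"]

hist_a = tag_histogram(account_a)
hist_b = tag_histogram(account_b)
hist_c = tag_histogram(account_c)

print(f"a vs b: {cosine_similarity(hist_a, hist_b):.2f}")
print(f"a vs c: {cosine_similarity(hist_a, hist_c):.2f}")
```

Accounts a and b use overlapping tags and score close to 1, while a and c share no tags and score 0. The resulting number could be fed into a similarity score as one more weighted attribute.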
One last thought regarding connectivity hierarchies and entity resolution: there are individuals who, for some reason, take on multiple virtual identities, such as those who have both professional and personal email addresses. At the same time, there are sets of individuals who share a single identity – a good example is a married couple who share a grocery store convenience card. So with the growing volume of created, computed, or materialized non-intrinsic attributes, we should be able to come up with an approach to determine when the same individual has multiple virtual identities or when a set of individuals is sharing a single identity. Thoughts, anyone?
SOURCE: Computed Attributes, Entity Resolution and Connectivity Hierarchies
Recent articles by David Loshin