Originally published February 27, 2014
The recent revelation of the degree to which the National Security Agency (NSA) has embedded its eavesdropping capabilities within the communications ecosystem seems to have shocked many people. However, as a data practitioner, I am not surprised at the NSA’s intrusion; rather, I am amazed at the absence of shock at the degree to which commercial businesses are easily siphoning off what seems to be private and sensitive information about the majority of the population. Many online businesses actively capture interaction histories that can be subjected to behavior analysis, and we can speculate about the different ways that these companies capture and analyze data and extrapolate a variety of demographic or psychographic details.
A straightforward example is “search analysis.” I am not talking about the analysis done for the purposes of search engine optimization. Rather, I am referring to a process that search engine companies can perform to classify users at finer levels of granularity over extended periods of time, based on the concepts they look for using the search tool.
Of course, using a person’s search phrases as a guide for integrating analytical results is not a new idea. Each time a person executes a search, the terms are parsed and related to an existing map of concepts for advertising purchases. The underlying advertising networks exploit this and couple that information with the assortment of cookies resulting from a browsing history to determine a set of advertisements to post, both with the search results and on other pages whose advertising is fed by the same network.
However, one theory about long-term search analysis is that you can evolve a person’s characteristic profile based on that person’s interests. Each search that a person performs adds a little bit of information to that characteristic profile. From a macro standpoint, people can be classified in relation to their general search “signatures.”
Over time, that profile can be expanded to reflect a persona that represents what the individual’s interests are likely to be based on the concepts being searched, the frequency of the searches, the times of day and how those interactions map to defined persona profiles. For example, one person might search for information about baseball players, baseball teams, historical statistics, online prices for men’s athletic shoes, bats and baseballs. One could conclude that this is a person interested in baseball. Couple that with searches for help in high school geometry and chemistry, as well as information about books targeted at high school males and you can draw the conclusion that this person is a high school-aged male who really likes baseball.
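The incremental profile described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how any search engine actually works: the `PERSONAS` table, the concept labels and the scoring rule (the share of a person's searches that a persona's concepts account for) are all hypothetical.

```python
from collections import Counter
from typing import Dict

# Hypothetical persona definitions (illustrative only): each persona is a
# set of concepts that, taken together, suggest that persona.
PERSONAS = {
    "high-school baseball fan": {
        "baseball", "athletic shoes", "high school math", "high school science",
    },
    "home cook": {"recipes", "kitchen equipment", "groceries"},
}


def update_profile(profile: Counter, concept: str) -> None:
    """Each search adds a little information: bump the count for its concept."""
    profile[concept] += 1


def persona_scores(profile: Counter) -> Dict[str, float]:
    """Score each persona by the share of the profile's searches it accounts for."""
    total = sum(profile.values())
    if total == 0:
        return {name: 0.0 for name in PERSONAS}
    return {
        name: sum(profile[c] for c in concepts) / total
        for name, concepts in PERSONAS.items()
    }


# Concepts assumed to have been extracted from one person's searches over time.
searched_concepts = [
    "baseball", "baseball", "athletic shoes",
    "high school math", "high school science", "baseball",
]

profile = Counter()
for concept in searched_concepts:
    update_profile(profile, concept)

scores = persona_scores(profile)
best = max(scores, key=scores.get)
print(best, scores[best])
```

A real system would presumably also weigh the frequency and time of day of the searches, as noted above; the sketch tracks only concept counts.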
Although this is a contrived example, it shows how a sequence of searches can be grouped together to incrementally shed light on a person’s demographic details and psychographic affinities. The goal, though, would not be to selectively look at any single person but to automate the classification in real time.
Profiles can be analyzed and clustered based on conceptual assignments and comparison of similarity measurements. In turn, each cluster represents a class of people who present themselves as having an interest in learning about specific kinds of topics. An ontological organization of these topics can be used to speculate about specific demographics. These inferences can contribute to building an increasingly precise (and hopefully accurate) perception of the individual. To achieve this goal, the analysis must at least take these facets into account:
Unique Identification: Each individual must be identified and recognized each time the search engine is used. This becomes increasingly easy as people link their interactions to named accounts. Each time you log into one of your online forums that incorporates a search capability, your searches can be logged. Checking your Gmail account, sending a tweet via a mobile Twitter app, or following a link in a LinkedIn email all “expose” your identity. That exposure can be linked to a device (such as your smartphone), a machine (via its network MAC address) or a network location (represented by an IP address). Once your email address has been linked to a residential IP address, one can assume that most, if not all, searches emanating from that IP address are representative of one individual or a collection of individuals living at the same location.
Management of Massive Data Logs: The search engine company must log the search terms the individual queried. Standard approaches to logging web interaction transactions have been around for years, and savvy businesses will certainly have captured that information to facilitate real-time ad placement, as I mentioned earlier.
Concept Categorization: Each search phrase can be categorized in relation to the concept ontology (for example, searches for phrases like “symptoms of flu,” “flu vaccine,” “swine flu,” or “H1N1” might all be categorized within the “influenza” concept category).
Data Organization: The person-search relationship logs are organized for analysis. This organization might take into account the number of times a person performed a search, the concepts embedded within the search, the relationships among the concepts sought, the times the person performed the search, as well as other relevant variables.
You might say that establishing these capabilities provides the staging area for the analysis, which can proceed over days, months or even, at this point, years. Developing or adapting algorithms for clustering, segmentation and classification using these data sets is the next step; look to future TechTarget articles, where we will explore these methods of analysis.
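As a rough illustration of the concept categorization and data organization facets, the Python sketch below maps search phrases to concepts, rolls a search log up into per-person concept counts, and compares two people with cosine similarity. Everything here is an assumption made for the example: the `CONCEPT_OF` lookup stands in for a real concept ontology, the log records are invented, and cosine similarity is just one commonly used similarity measurement, not necessarily the one a search company would choose.

```python
import math
from collections import Counter, defaultdict

# Hypothetical concept ontology, flattened to a phrase -> concept lookup.
CONCEPT_OF = {
    "symptoms of flu": "influenza",
    "flu vaccine": "influenza",
    "swine flu": "influenza",
    "h1n1": "influenza",
    "baseball bats": "baseball",
    "historical baseball statistics": "baseball",
}


def categorize(phrase):
    """Map a search phrase to a concept, or 'uncategorized' if it is unknown."""
    return CONCEPT_OF.get(phrase.strip().lower(), "uncategorized")


def build_profiles(log):
    """Organize (user_id, phrase) log records into per-user concept counts."""
    profiles = defaultdict(Counter)
    for user_id, phrase in log:
        profiles[user_id][categorize(phrase)] += 1
    return profiles


def cosine_similarity(a, b):
    """Cosine similarity between two concept-count profiles (0.0 if either is empty)."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented log records: (user identifier, search phrase).
log = [
    ("u1", "symptoms of flu"),
    ("u1", "H1N1"),
    ("u2", "flu vaccine"),
    ("u3", "baseball bats"),
]

profiles = build_profiles(log)
sim_12 = cosine_similarity(profiles["u1"], profiles["u2"])  # both searched influenza topics
sim_13 = cosine_similarity(profiles["u1"], profiles["u3"])  # no concepts in common
print(sim_12, sim_13)
```

Pairwise similarities like these are the kind of input a clustering step could work from. The sketch deliberately leaves out search times and relationships among concepts, which the data organization facet also calls for.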
SOURCE: Analyzing Search
Recent articles by David Loshin