
Extracting Business Intelligence from Social Media: Dealing with Reluctant Sources

Originally published April 22, 2014

Decision making under uncertainty has been a major topic of interest in operations research and statistics for many decades. In fact, one of the authors of this article dealt with this topic in a master’s thesis in graduate school many years ago. But these days the field takes on significant new challenges as business intelligence, which is at the heart of research, increasingly involves not just analyzing transactional data sources and records but looking at social media.

In fact, when it comes to conducting research, we no longer spend hours in a library walking through the book stacks or sifting through archives and microfilms with clunky floppy disks in our bags as we did in the past. Today we have the luxury of an easily accessible wealth of other sources fueled by Google, Bing and Yahoo, providing us with videos, news, gossip, opinion, insight, facts and fiction. Yet these open sources have their limitations: They can be difficult to authenticate, they may be incorrect, they may not tell the full story and the sheer volume of available data can be overwhelming. In addition, open sources are predominantly in English and hence may not adequately capture information coming from other countries, particularly those with limited Internet penetration or those ruled by authoritarian or repressive regimes that tightly curtail freedom of speech. Furthermore, and very relevant for decision making, these social media sources are fairly easy to manipulate or censor by anyone who has access to the tools to delete or alter important original postings or data.

In our experience, conducting research from social media in order to extract business intelligence has required the constant use of creativity and innovation, no matter what purposes, organizations or data were involved.

We can explain this best in the context of a research project one of us (Ashley) carried out in relation to an epidemic in a country whose government had attempted to control and curtail any public disclosure regarding the disease. Let’s call this country Oppressivestan for purposes of this article, a closed society where the government holds nearly total authority over its people and seeks to control most aspects of public and private life.

We wanted to collect information about the attitudes and realities of citizens in Oppressivestan who were living with this disease; but the government, through political pressure and actual repression, limited public speech in an effort to protect its image and also the actions of its political hierarchy.

First, we researched the state of the media and how things worked in Oppressivestan. We confirmed several instances of journalist intimidation and of government influence over most media in the country. The methods often involved reporter “self-censorship” as a way to avoid persecution. Many international human rights organizations reported that they were required to obtain permission from the government if an opposition party member wanted to speak in public. Moreover, we found that the government publicly accused commentators of defamation when they disagreed with official positions in statements made on the Internet, and intimidated their sources by suggesting that they, or their family members, could suffer legal consequences for online actions. The standard media for political discourse, such as blogs, forums and personal websites, all fell within this area of government control and censorship; hence, it became obvious that reaching valid conclusions – decision making under uncertainty – would require greater focus in our methods and the use of creative approaches in our research.

It is worth mentioning that the media, in its role of informing the public, has always been at odds with authoritarian governments. Even in countries that are very open and democratic, there are often confrontations between government and the press. Given that social media, the new media, is still focused on “informing the public,” it follows that there are going to be tensions. Moreover, because of the difficulties in controlling social media, there is a learning process and lots of cat-and-mouse games going on between bloggers, texters, tweeters and security forces around the world. And all of this is important as we return to the issue at hand, which is obtaining business intelligence through solid research from sources online.

So where does one begin in conducting and organizing that research?

To start, it is important that we recognize the value of creating a holistic corpus of investigation that constitutes a single entity. Within this entity we must construct a “spine” deriving strength from proven facts, or what could be called “vertebrae.” These form a solid backbone for the compilation of research on which summaries and hypotheses will ultimately rely.

In the case of Oppressivestan, that “spine” was a timeline that focused on the practical, factual evolution of the disease that listed years, quantifiable data and statistics, as well as laws passed and breaking news sources. This information was compiled as a set of time-stamped bullets that were later used to cross-reference against additional information. The bullets, in turn, were classified as depicting a brief history of the government’s stance on the widespread disease as identified on their own official website (and sub-sites such as the Ministry of Health) and any actions they had taken. We also identified international human rights organizations and noted how they described the situation in Oppressivestan and, in turn, the government’s reaction.
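A timeline of this kind can be kept as a small, sortable data structure. The following is only a minimal sketch in Python, not the format we actually used: the `TimelineEntry` fields, the category labels, the 14-day window and the sample entries are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class TimelineEntry:
    """One time-stamped 'vertebra' of the research spine (hypothetical schema)."""
    when: date
    category: str  # e.g. "law", "statistic", "news", "official statement"
    summary: str
    source: str    # where the fact was verified


def build_spine(entries):
    """Return the entries in chronological order."""
    return sorted(entries, key=lambda e: e.when)


def entries_near(spine, when, window_days=14):
    """Return spine entries dated within window_days of the given date."""
    window = timedelta(days=window_days)
    return [e for e in spine if abs(e.when - when) <= window]


# Placeholder data, invented purely for illustration.
spine = build_spine([
    TimelineEntry(date(2010, 6, 1), "law",
                  "Hypothetical disclosure law passed", "official gazette"),
    TimelineEntry(date(2009, 3, 15), "statistic",
                  "Hypothetical case count published", "Ministry of Health site"),
])

# A posting dated 2010-06-05 falls within two weeks of the law's passage.
nearby = entries_near(spine, date(2010, 6, 5))
print([e.summary for e in nearby])
```

Keeping the verified source alongside each entry means that any later cross-reference against a posting or narrative can be traced back to the fact it rests on.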

The next step was to identify and collate as many narratives as possible about the disease from individuals with strong ties to and/or within Oppressivestan. Most of these were, of course, within the social media. While it would have been desirable, in classic journalistic mode, to telephone and interview specific individuals whose names we had, it would have been very difficult and risky over a telephone system that was likely being monitored by the government. Hence, we compiled a list of alternative sources and resources to research:

  • Social Networks (Facebook, Myspace, Twitter, Bebo, PerfSpot, Tagged, MiGente, Twoo, Renren, etc.)

  • Media Sharing Websites (Digg, YouTube, Reddit, Flickr, etc.)

  • Blogs (http://www.blogcatalog.com/ provides an excellent array of blogs based on keywords)

  • Forums, Message Boards and Threads

  • RSS Feeds (http://ctrlq.org/rss/ provides an easy search engine to locate them)

  • Specialized Search Software (such as LexisNexis, which provided excellent results)

Soon we discovered that though the government had immense control and reach over mass media, dissenters always find a way to voice their opinions. As a result, we had to look for websites with the postings as they existed prior to having been taken down by the government, which was often happening in near real time, just as quickly as they went up. But if we think of the Internet somewhat as an enormous sandbox, it follows that if anyone ever makes a footprint or drops a rock in that sandbox, the imprint will somehow always be retrievable. That is where the social media came in so handy. While all of the above-listed resources were used to research the situation about the widespread disease and sentiment in Oppressivestan, another important alternate resource was the use of caches.

First, a reminder of what a cache is in this context. The dictionary defines it as “a hiding place, especially one concealing and preserving provisions or implements.” But in Internet parlance, it is “a mechanism for the temporary storage of web documents, such as HTML pages and images…” [1]

Furthermore, Google refers to caches as “a way of retrieving information from websites that have recently gone down.” [2] We like to think of a cache as “the impression of the rock that landed in our Internet sandbox.”

During our social media research of people within Oppressivestan, we came upon numerous links that did not respond when clicked, returning instead a message attributing the malfunction to “unforeseen circumstances” or some similar excuse. (Pointedly, a previous article noted that Chinese censors often leave messages such as “Sorry, the host you were looking for does not exist, has been deleted or is being investigated” and even leave police cartoons on the site. See Business Intelligence from Censorship.)

But let the mouse hover over the search result, and an arrow appears to the right, pointing to the next search result and revealing a thumbnail. One click on that thumbnail image and the cached page (a snapshot of the page taken before it stopped functioning) appears. Comparing the dates on those websites to the bulleted facts provided an interesting insight into what was occurring in Oppressivestan at the time. Since the time of our research, websites like Google Cache Browser and Internet Archive have come to provide an even easier way to access cached pages that have “disappeared” from the Internet. Cached information was essential to our research, allowing us to see the planning of protests, forum discussions on dissent and how websites were altered to reflect government agendas.
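For readers who want to look up archived copies programmatically, the sketch below queries the Internet Archive's Wayback Machine availability service. It is not what we used during the project. The endpoint and the shape of the JSON response reflect our understanding of that public service and should be checked against its current documentation; the function names are our own, and `example.com/page` is a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability service.
WAYBACK_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the lookup URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return (snapshot_url, snapshot_timestamp), or None if nothing is archived.

    Assumes a response of the form
    {"archived_snapshots": {"closest": {"available": true, "url": ..., "timestamp": ...}}}.
    """
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def lookup(page_url, timestamp=None):
    """Perform the network request and parse the answer."""
    with urlopen(availability_url(page_url, timestamp), timeout=10) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))


# Example (requires network access):
# print(lookup("example.com/page", "20120101"))
```

The snapshot timestamp returned by such a lookup is what would then be compared against the dated facts in the timeline.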

As we mentioned earlier, another important factor to consider is language. As if conducting research involving foreign intelligence and news sources weren’t challenging enough, the primary language of Oppressivestan is not English, so much of the information collected presented an additional barrier. It was important to obtain correct translations as well as pertinent definitions to ensure a clear understanding of events. Translations of entire websites were needed; and while several websites provided utilities to do so, we found the Google Translate website translation tool particularly useful. To ensure we were getting the best translation possible, we compared versions of the data from both Google Translate and Bing Translator.
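Comparing two machine translations can be partly automated by flagging passages where the versions diverge sharply, so that a human reviewer looks at those first. The sketch below uses Python's standard `difflib` module for a rough word-level comparison. It assumes the two translated texts have already been obtained by other means; the function names and the 0.6 threshold are hypothetical choices for illustration, not values we validated.

```python
from difflib import SequenceMatcher


def translation_agreement(version_a, version_b):
    """Return a similarity ratio from 0.0 to 1.0, comparing the texts word by word."""
    words_a = version_a.lower().split()
    words_b = version_b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()


def needs_review(version_a, version_b, threshold=0.6):
    """Flag a passage when the two translations agree less than the threshold."""
    return translation_agreement(version_a, version_b) < threshold


# Invented sample sentences standing in for two translation outputs.
first = "the clinic was closed today"
second = "the clinic was shut today"
print(needs_review(first, second))
```

A low score does not say which translation is wrong, only that the passage deserves a closer look, ideally by someone who reads the source language.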

Once the research was completed, the report had to be structured and a framework developed to present the collected information. With respect to our case study, first we recognized the major actors at play – primarily the government of Oppressivestan, the international organizations and community, the epidemic disease and the affected public. Once these were presented and positioned, the topic was introduced in depth and the data analyzed to demonstrate the impact on all the relevant points. Then the factual dates gathered early in the research were cross-referenced with social attitudes. In the second portion of the report, key public opinion was introduced to provide a picture of the problem buttressed with original analysis that did not exist elsewhere. The result of this process was the presentation of findings that provided a picture of the epidemic, its origin, lifecycle and impacts on the population and the economy that was as realistic as possible. We were able to dissect what the main actors were saying in multiple sources and contrast the narrative with facts and dates that laid out a clearer version of events.

Now, back to decision making under uncertainty. There is no one “correct answer” when drawing conclusions. One can only provide informed inferences and educated observations. Two readers could essentially arrive at two different conclusions, but the researcher’s job is to lay out enough detail about the data collected and the methodology so that any interpretation of the findings can be defended and documented. With respect to our study, the research identified years of unlawful discrimination by the government of Oppressivestan and showed that its people living with a certain disease faced censorship, sexism, abuse and an innate fear of official judgment or rejection. Yet this reality would not have been discovered had we not approached our research with tenacity and creativity.

Research shouldn’t be restricted by what we want the final outcome to be. We must always maintain flexibility and allow the data to speak for itself. It is our job as business intelligence practitioners to find the right data, no matter how challenging – and give it a voice.

End Notes:
  1. See Huston, G., September 1999, “Web Caching,” The Internet Protocol Journal, Volume 2, No. 3.

  2. See Blachman, N., December 28, 2011, “Google Cached Pages: What Are Cached Pages?”

SOURCE: Extracting Business Intelligence from Social Media

  • Dr. Ramon Barquin

    Dr. Barquin has been the President of Barquin International, a consulting firm, since 1994. He specializes in developing information systems strategies, particularly data warehousing, customer relationship management, business intelligence and knowledge management, for public and private sector enterprises. He has consulted for the U.S. Military, many government agencies and international governments and corporations.

    He had a long career in IBM with over 20 years covering both technical assignments and corporate management, including overseas postings and responsibilities. Afterwards he served as president of the Washington Consulting Group, where he had direct oversight for major U.S. Federal Government contracts.

    Dr. Barquin was elected a National Academy of Public Administration (NAPA) Fellow in 2012. He serves on the Cybersecurity Subcommittee of the Department of Homeland Security’s Data Privacy and Integrity Advisory Committee; is a Board Member of the Center for Internet Security and a member of the Steering Committee for the American Council for Technology-Industry Advisory Council’s (ACT-IAC) Quadrennial Government Technology Review Committee. He was also the co-founder and first president of The Data Warehousing Institute, and president of the Computer Ethics Institute. His PhD is from MIT. 

    Dr. Barquin can be reached at rbarquin@barquin.com.

    Editor's note: More articles from Dr. Barquin are available in the BeyeNETWORK's Government Channel


  • Ashley Cruz

    Ashley Cruz is a freelance researcher and analyst most recently employed by a European government administering immigration processes, with a role in the deterrence and intervention of prospective criminality, false identities, illegal narcotics, human trafficking and terrorism. She has also implemented and streamlined a Wall Street firm’s first web-based reporting system for managing and reporting data.

    A graduate of Stony Brook University in New York with a Bachelor’s in Political Science (Public Policy), she has a passion for the power of knowledge and its ability to create insight and influence change. She has actively worked for several nonprofit organizations and performed key grassroots campaign work on a federal protections bill addressing discrimination.

    As a formally trained research analyst, Ashley has conducted investigations and authored reports on governmental mistreatment of citizens, global health issues, and corporate spending and investment patterns in third-world countries. In particular, she specializes in developing approaches for gathering data from diverse sources.

Copyright 2004 — 2020. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC