Business Intelligence Network business intelligence resources

Blog: Pete Loshin

January 25, 2008

Time Travel Maps!

I'm fascinated by the physics of time travel, so the typo "time travel" instead of "travel time" caught my eye in the link I clicked on to get to Travel-time Maps and their Uses and to More travel-time maps and their uses. These are not maps for time travelers, but maps that illustrate the amount of time it takes to travel.

Very interesting and even helpful if you're in the UK: you can use these maps, for example, to figure out whether it's quicker to drive or take a train to a destination, or the fastest mode of transportation for rush hour commuting. But it's also a very neat illustration of how big piles of data can be turned into intelligence. And you don't need me to explain how that kind of intelligence can become "business intelligence" for any business that needs to allocate resources to get people or things from one place to another.

It's all brought to you by mySociety, a charitable project that develops its software as open source; if you're interested in having them do custom mapping for your business, they seem to be willing to do that for a fee (or a donation, I'm not sure how that works in the UK).

November 5, 2007

Opening Up the Internet: Craigslist + Yahoo! Pipes = Better Data Searching

We've really come a long way with the web and the Internet over the past dozen years or so. Back then, it was kind of a big deal to run screen-scraping software that could pull data off websites, or access corporate legacy mainframe systems through a webified front end.

Now, more and more of the web lives in some seriously big data stores, and more and more of the owners of those data stores are making data processing tools and APIs available to anyone who wants them, so we can have some nice little mashup applications combining, for example, maps and data with geographical components.

But here's something sort of new: a way to combine an already popular, useful and generally great website--in this case, Craigslist--with another popular, useful and great website--Yahoo! Pipes. The result is even better than either one.

Yahoo! Pipes is kind of like a web version of UNIX piping: a way to take the results of one command (output) and "pipe" it into another command as input. What you get is a very handy way to create very specific and powerful searches, and turn the results into useful information.
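To make the piping idea concrete, here's a minimal Python sketch of the same pattern: each stage takes a stream of lines and hands its output to the next stage. This is not how Yahoo! Pipes works internally; the names grep, head and pipe are just illustrative helpers I borrowed from the UNIX commands, and the listings are made up.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

# A stage consumes a stream of lines and produces a new stream.
Stage = Callable[[Iterable[str]], Iterator[str]]

def grep(pattern: str) -> Stage:
    """Keep only lines that contain `pattern`, ignoring case."""
    needle = pattern.lower()
    def stage(lines: Iterable[str]) -> Iterator[str]:
        return (line for line in lines if needle in line.lower())
    return stage

def head(n: int) -> Stage:
    """Keep only the first `n` lines of the stream."""
    def stage(lines: Iterable[str]) -> Iterator[str]:
        return islice(lines, n)
    return stage

def pipe(source: Iterable[str], *stages: Stage) -> Iterable[str]:
    """Feed the output of each stage into the next, like `a | b | c`."""
    data = source
    for stage in stages:
        data = stage(data)
    return data

# Made-up listings, standing in for a feed of post titles.
listings = [
    "Librarian I, San Jose",
    "Barista wanted",
    "Library page (part time)",
    "Law librarian",
]

# Roughly: cat listings | grep -i librar | head -2
result = list(pipe(listings, grep("librar"), head(2)))
print(result)
```

Because every stage is lazy, nothing is read from the source until the final list() asks for it, which is much the same reason UNIX pipes can handle streams of any length.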

So, here's the article that got me hooked: How to Actually Search Craigslist. As great as Craigslist is, it has some drawbacks. James Aaron, who wrote the article, is a student at San Jose State's School of Library and Information Science, and is looking for a job currently. He likes Craigslist, but, as he explains, it could be even more helpful if there were ways to search better:

There is no way to truncate searches, such as "librar*" to include librarian, library, libraries, etc. There is no way to perform Boolean AND, OR, NOT searches. There is no way to remove frequently occurring irrelevant items. There is no way to search two sub-regions at once. So, unless I want to perform 20 searches a day and receive MANY completely irrelevant hits, I basically have to browse.

The answer, he tells us, is Yahoo! Pipes, and he explains just how to use Pipes with Craigslist to make Craigslist that much more useful.
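For a sense of what those four missing features amount to, here's a rough Python sketch that applies them to a handful of made-up posts. It is not James's Pipes setup, and it doesn't talk to Craigslist at all; the region codes, the post titles and the helper names (term_to_regex, matches) are all assumptions of mine for illustration.

```python
import re

def term_to_regex(term: str) -> re.Pattern:
    """Turn 'librar*' into a word-prefix match; plain terms match whole words."""
    if term.endswith("*"):
        return re.compile(r"\b" + re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)

def matches(title: str, all_of=(), any_of=(), none_of=()) -> bool:
    """Boolean search: every AND term, at least one OR term, no NOT term."""
    def hit(term: str) -> bool:
        return term_to_regex(term).search(title) is not None
    return (
        all(hit(t) for t in all_of)
        and (not any_of or any(hit(t) for t in any_of))
        and not any(hit(t) for t in none_of)
    )

# Made-up posts; the region codes are hypothetical sub-region labels.
posts = [
    {"region": "sfc", "title": "Reference Librarian, part time"},
    {"region": "sby", "title": "Library assistant"},
    {"region": "sby", "title": "Music library software sales"},
    {"region": "eby", "title": "Archivist"},
    {"region": "pen", "title": "Librarian"},
]

# Two sub-regions at once, truncated OR terms, and one NOT term.
wanted_regions = {"sfc", "sby"}
hits = [
    p["title"]
    for p in posts
    if p["region"] in wanted_regions
    and matches(p["title"], any_of=["librar*", "archiv*"], none_of=["sales"])
]
print(hits)
```

The NOT list is what does the heavy lifting in practice: one recurring irrelevant word, filtered once, saves wading through the same bad hits every day.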

In other words, more evidence of just how much the entire web is evolving into the world's biggest ever data store, with the most powerful ever set of tools for extracting business intelligence.

How could you use this kind of capability to extract actionable knowledge from the web?

November 2, 2007

Mining Valuable Intelligence From Random Numbers

Somewhere in my stack of obsolete 3.5" floppy diskettes I've got a spreadsheet that contains some interesting raw data. Long ago I was in the habit of buying a bag of M&Ms from a vending machine in the corporate cafeteria every afternoon: before eating any, I would open the bag, sort the colors, count the M&Ms of each color, and record the totals in a spreadsheet.

The primary benefit I got from that activity was a nice set of data, from which I could infer some general rules about which were the most and least common M&M colors. I also got something to do during the afternoon lull to keep me from falling asleep.

It was the kind of job where most of my co-workers were very bright, but we often had time on our hands; conversation topics included arguing different strategies for getting rich by inventing something really cool--and strategies for winning the lottery.

Now that we have the Internet, and there's an endless supply of data sets to play with, here's a guy who actually came up with something useful on that whole lottery thing: Pattern Analysis of MegaMillions Lottery Numbers.

Can you use the information in this article to increase your odds of winning the big bucks? It's not clear: if the lottery number selection process is truly random, the answer is no. But you could use the numbers, and the techniques, as described in the article, to discover hidden influences on the selection process that might skew the results.

For me, though, the best part of this article is that it takes the question of whether lottery drawings are truly random and then applies a scientific approach to it. It also helps that all the data is available on the New Jersey lottery website, in both HTML and delimited formats for easier processing.
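If you want to try a basic version of that kind of test yourself, one standard approach is a chi-square goodness-of-fit check on how often each number comes up. The sketch below is my own illustration, not the method from the article, and the draws in it are invented. It also treats every number drawn as an independent pick, which is only an approximation, since a number can't repeat within a single drawing.

```python
from collections import Counter

def chi_square_uniform(draws, pool_size):
    """Chi-square statistic against 'every number 1..pool_size is equally likely'.

    Returns (statistic, degrees_of_freedom). A large statistic relative to
    the degrees of freedom suggests the frequencies are not uniform.
    """
    counts = Counter(n for draw in draws for n in draw)
    total = sum(counts.values())
    expected = total / pool_size  # expected count per number if uniform
    stat = sum(
        (counts.get(n, 0) - expected) ** 2 / expected
        for n in range(1, pool_size + 1)
    )
    return stat, pool_size - 1

# Invented draws from a tiny pool of four numbers, two picks per draw.
even_draws = [(1, 2), (3, 4), (1, 3), (2, 4)]
skewed_draws = [(1, 2), (1, 2)]

print(chi_square_uniform(even_draws, pool_size=4))
print(chi_square_uniform(skewed_draws, pool_size=4))
```

To turn the statistic into a verdict you would compare it against a chi-square table for the given degrees of freedom, and with real data you would want a lot more than a handful of drawings before reading anything into the result.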

September 28, 2007

Guerilla Knowledge Processing

No matter what our job descriptions, most of what we do every day revolves around manipulating data and turning it into knowledge/intelligence. Enterprises routinely budget hundreds of millions of dollars for this kind of thing, so we all know how difficult it can be.

But here's an example of how one guy turned his credit card statements into a vehicle for generating and tracking information--and the surprising discovery he made when the results didn't match his expectations. In A low-bandwidth, high-latency, high-cost, and unreliable data channel, this fellow, Ian Hixie, starts by noting that when he eats at a restaurant in the US, "you never pay what the bill says" because he always adds a tip. Thus, he concludes:

The net effect of this is that you basically get to decide how much you pay. Indeed, credit card bills at restaurants have a space where you fill in how much you want to pay.

The aha moment for Hixie came when he noticed that, to make things simple, he usually rounded this amount to a full dollar:

... there are data bits there, lying unused! It struck me that with every single restaurant transaction I could set the cents field to some number under my control, thus allowing me to communicate with myself at a later date!

Ian goes on to describe a protocol for encoding several different pieces of information about the restaurant into the decimal values available ($.00 through $.99). If you want to know how it worked, you'll have to go read Ian's article.

But what's really relevant here is that Ian did some applied business information processing by:

  1. First noticing that there were some bits available and under his control
  2. Deciding that he could encode some information into those bits in a way that he could use...
  3. ...to recover and use that information later on.
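Those three steps can be sketched in a few lines of Python. To be clear, this is not Ian's protocol (for that, read his article); it's a hypothetical scheme of my own in which the tens digit of the cents field holds a 0-9 rating and the ones digit holds a second 0-9 code, with the tip bumped up just enough to land on the right cents value. The function names and the 15 percent minimum tip are assumptions.

```python
def encode_total(bill_cents: int, rating: int, code: int, min_tip_pct: int = 15) -> int:
    """Smallest total (in cents) covering bill + minimum tip whose
    cents field equals rating * 10 + code."""
    if not (0 <= rating <= 9 and 0 <= code <= 9):
        raise ValueError("rating and code must each be a single digit")
    payload = rating * 10 + code
    # Bill plus the minimum tip, with the tip rounded up to a whole cent.
    floor = bill_cents + (bill_cents * min_tip_pct + 99) // 100
    dollars, cents = divmod(floor, 100)
    if cents > payload:
        dollars += 1  # roll to the next dollar so we never pay less than the floor
    return dollars * 100 + payload

def decode_total(total_cents: int) -> tuple:
    """Recover (rating, code) from the cents field of a statement line."""
    return divmod(total_cents % 100, 10)

# A $42.35 bill, rating 8, code 3: the total comes out as $48.83.
total = encode_total(4235, rating=8, code=3)
print(total, decode_total(total))
```

Two decimal digits is not a lot of bandwidth, which is exactly why the title of Ian's article is so apt.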

Ian discovered a bug in the system, but he's also conjectured a reason for the bug as well as a fix for it; hopefully we'll hear more about the second version of this system--and possible other uses for these data bits.

Where else are there opportunities in your work (or personal) life for adding value through knowledge processing?

August 7, 2007

Time Magazine sez "Online Snooping Gets Creepy"

According to Time Magazine, Online Snooping Gets Creepy. I'm not so sure about that. Web "snooping" has always been more or less creepy in some ways, and more or less useful in others.

Time points to a new wave of search engines that supposedly go beyond and behind the web content that Google indexes to give an uncannily complete profile of whoever you want to "investigate". These include ZoomInfo, PeekYou, Pipl, Wink, and Spock.

I tried them out (except for Spock, which was down for maintenance), using my own name since I can best judge the results. Here's what I found:

  • As far as PeekYou is concerned, I don't exist. Boo.
  • ZoomInfo came up with references to a lot of work I've done over the years that I sort of forgot about--as well as some "positions" with companies I'd never heard of. I even signed up for full (free) privileges, which wasn't too intrusive and gave me access to all the sources they cite. So, it's actually pretty useful, for me.
  • Pipl gave me a lot of information, including links to places where my publications have been cited, as well as contact information. Pretty good, but it looks like a front-end to existing engines, including Google. OK.
  • Wink found some stuff about me (including this blog), but seemed to miss a lot more. It does point to my LinkedIn profile, but otherwise it doesn't really find much else of interest.

How do they stack up to Google? Well, Google is still a more comprehensive search engine, pointing to a more complete set of my publications (books as well as articles published online), including lots of pointers to websites and blogs that seem to have "borrowed" my articles for their own use. Oh well.

If you're looking for someone's address, phone number or birthday, try Pipl; if you're looking for a terse and easy to understand (but possibly inaccurate) precis, try Wink. Otherwise, you might as well stick with Google, at least to start with.

August 6, 2007

Hot New Blog: High Scalability

High Scalability is a new (not quite one month old) blog aiming to "bring together all the lore, art, science, practice, and experience of building scalable websites into one place so you can learn how to build your system with confidence."

Don't be fooled into thinking it's all about CSS or Apache configurations or stuff like that. Consider this: An Unorthodox Approach to Database Design : The Coming of the Shard. Now, blogger Todd Hoff may or may not be onto something entirely novel and unique with his concept of database "sharding" to distribute database storage and computation, but it is definitely interesting.
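For anyone who hasn't run into the idea, here's a toy Python sketch of one common way to shard: hash each key and use the hash to pick which store holds the row. This is my own minimal illustration, not Todd's design or any particular site's; the ShardedStore class and shard_for function are hypothetical names, and plain dicts stand in for separate database servers.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

class ShardedStore:
    """Key-value store split across several independent shards."""

    def __init__(self, num_shards: int):
        # Each dict stands in for a separate database server.
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)

store = ShardedStore(num_shards=4)
for user in ["alice", "bob", "carol", "dave", "erin"]:
    store.put(user, {"name": user})

print(store.get("carol"))
print([len(shard) for shard in store.shards])
```

The catch with this simple modulo scheme is that changing the number of shards moves most keys to a different shard, which is one reason real systems tend to use lookup tables or consistent hashing instead.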

And he's got a lot of other good information here, such as a precis of the eBay Architecture, MySpace Architecture, and much more.

Great job, Todd, and keep it up!

May 27, 2007

IBM virtual Linux environment beta program

If you want to see what a virtual Linux environment looks like, check out the IBM System p Application Virtual Environment for x86 Linux. Follow the link to find out more about participating in the beta program (as well as more details about what it does and how it works).

Bill Andad at DANIWEB.com has more about it here.

December 15, 2006

Interesting Data Mining Blog

I just discovered a great blog to check out if you're interested in a practitioner's take on current practical and cutting-edge data mining applications: Data Mining: Text Mining, Visualization and Social Media blog.

Matthew Hurst, the intelligence behind the blog, is Director of Science and Innovation at Nielsen BuzzMetrics and also co-creator of BlogPulse.