Blog: Pete Loshin

January 25, 2008
Time Travel Maps!
I'm fascinated by the physics of time travel, but it was the typo "time travel" instead of "travel time" that caught my eye in the link I clicked on to get to Travel-time Maps and their Uses and its follow-up, More travel-time maps and their uses. These are not maps for time travelers, but maps that illustrate the amount of time it takes to travel. Very interesting and even helpful if you're in the UK: you can use these maps, for example, to figure out whether it's quicker to drive or take a train to a destination, or to find the fastest mode of transportation for rush hour commuting. But it's also a very neat illustration of how big piles of data can be turned into intelligence. And you don't need me to explain how that kind of intelligence can become "business intelligence" for any business that needs to allocate resources to get people or things from one place to another. It's all brought to you by mySociety, a charitable project that develops its software as open source; if you're interested in having them do custom mapping for your business, they seem to be willing to do that for a fee (or a donation, I'm not sure how that works in the UK).

November 5, 2007
Opening Up the Internet: Craigslist + Yahoo! Pipes = Better Data Searching
We've really come a long way with the web and the Internet over the past dozen years or so. Back then, it was kind of a big deal to run screen-scraping software that could pull data off websites, or access corporate legacy mainframe systems through a webified front end. Now, more and more of the web lives in some seriously big data stores, and more and more of the owners of those data stores are making data processing tools and APIs available to anyone who wants them, so we can have some nice little mashup applications combining, for example, maps and data with geographical components.
But here's something sort of new: a way to combine an already popular, useful and generally great website--in this case, Craigslist--with another popular, useful and great website--Yahoo! Pipes. The result is even better than either one. Yahoo! Pipes is kind of like a web version of UNIX piping: a way to take the results of one command (output) and "pipe" it into another command as input. What you get is a very handy way to create very specific and powerful searches, and turn the results into useful information.
So, here's the article that got me hooked: How to Actually Search Craigslist. As great as Craigslist is, it has some drawbacks. James Aaron, who wrote the article, is a student at San Jose State's School of Library and Information Science, and is currently looking for a job. He likes Craigslist, but, as he explains, it could be even more helpful if there were ways to search better:
There is no way to truncate searches, such as "librar*" to include librarian, library, libraries, etc.
There is no way to perform Boolean AND, OR, NOT searches.
There is no way to remove frequently occurring irrelevant items.
There is no way to search two sub-regions at once.
So, unless I want to perform 20 searches a day and receive MANY completely irrelevant hits, I basically have to browse.
The answer, he tells us, is Yahoo! Pipes, and he explains just how to use Pipes with Craigslist to make Craigslist that much more useful. In other words, more evidence of just how much the entire web is evolving into the world's biggest ever data store, with the most powerful ever set of tools for extracting business intelligence. How could you use this kind of capability to extract actionable knowledge from the web?

November 2, 2007
Mining Valuable Intelligence From Random Numbers
Somewhere in my stack of obsolete 3.5" floppy diskettes I've got a spreadsheet that contains some interesting raw data.
Long ago I was in the habit of buying a bag of M&Ms from a vending machine in the corporate cafeteria every afternoon: before eating any, I would open the bag, sort the colors, count the M&Ms of each color, and record the totals in a spreadsheet. The primary benefit I got from that activity was a nice set of data, from which I could infer some general rules about which were the most and least common M&M colors. I also got something to do during the afternoon lull to keep me from falling asleep. It was the kind of job where most of my co-workers were very bright, but we often had time on our hands; conversation topics included arguing over different strategies for getting rich by inventing something really cool--and strategies for winning the lottery.
Now that we have the Internet, and there's an endless supply of data sets to play with, here's a guy who actually came up with something useful on that whole lottery thing: Pattern Analysis of MegaMillions Lottery Numbers. Can you use the information in this article to increase your odds of winning the big bucks? It's not clear: if the lottery number selection process is truly random, the answer is no. But you could use the numbers, and the techniques described in the article, to discover hidden influences on the selection process that might skew the results. For me, though, the best part of this article is that it takes the question of whether lottery drawings are truly random and then applies a scientific approach to it. It also helps that all the data is available on the New Jersey lottery website, in both HTML and delimited formats for easier processing.

September 28, 2007
Guerilla Knowledge Processing
No matter what our job descriptions, most of what we do every day revolves around manipulating data and turning it into knowledge/intelligence. Enterprises routinely budget hundreds of millions of dollars for this kind of thing, so we all know how difficult it can be.
But here's an example of how one guy turned his credit card statements into a vehicle for generating and tracking information--and the surprising discovery he made when the results didn't match his expectations. In A low-bandwidth, high-latency, high-cost, and unreliable data channel, this fellow, Ian Hixie, starts by noting that when he eats at a restaurant in the US, "you never pay what the bill says" because he always adds a tip. Thus, he concludes:
"The net effect of this is that you basically get to decide how much you pay. Indeed, credit card bills at restaurants have a space where you fill in how much you want to pay."
The aha moment for Hixie came when he noticed that, to make things simple, he usually rounded this amount to a full dollar:
"... there are data bits there, lying unused! It struck me that with every single restaurant transaction I could set the cents field to some number under my control, thus allowing me to communicate with myself at a later date!"
Ian goes on to describe a protocol for encoding several different pieces of information about the restaurant into the decimal values available ($.00 through $.99). If you want to know how it worked, you'll have to go read Ian's article. But what's really relevant here is that Ian did some applied business information processing by:
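To make the cents-field idea a little more concrete, here is a minimal sketch in Python. It is not Ian's protocol (for that, read his article); the two-digit scheme, the ratings it stores, and the names encode_total and decode_total are all assumptions made up for illustration. Amounts are kept as whole cents to avoid floating-point rounding.

```python
# Hypothetical scheme, NOT Ian Hixie's actual protocol:
# the tens digit of the cents field holds a food rating (0-9),
# the ones digit holds a service rating (0-9).

def encode_total(bill_cents: int, tip_cents: int, food: int, service: int) -> int:
    """Return the total to write on the slip, in cents.

    The result is the smallest amount that is at least bill + tip
    and whose cents part equals the two-digit payload.
    """
    if not (0 <= food <= 9 and 0 <= service <= 9):
        raise ValueError("ratings must be single digits (0-9)")
    payload = food * 10 + service          # value for the cents field, 0..99
    minimum = bill_cents + tip_cents       # never pay less than intended
    dollars, cents = divmod(minimum, 100)
    if cents > payload:
        # The payload would undershoot the minimum; bump to the next dollar.
        dollars += 1
    return dollars * 100 + payload


def decode_total(total_cents: int) -> tuple[int, int]:
    """Recover (food, service) from the cents field of a statement line."""
    return divmod(total_cents % 100, 10)


if __name__ == "__main__":
    total = encode_total(bill_cents=2350, tip_cents=400, food=7, service=3)
    print(f"write ${total // 100}.{total % 100:02d} on the slip")
    print("decoded later from the statement:", decode_total(total))
```

Rounding up rather than down means the encoded message never shortchanges the tip, at a cost of less than a dollar per meal.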
Ian discovered a bug in the system, but he's also conjectured a reason for the bug as well as a fix for it; hopefully we'll hear more about the second version of this system--and possible other uses for these data bits. Where else are there opportunities in your work (or personal) life for adding value through knowledge processing?

August 7, 2007
Time Magazine sez "Online Snooping Gets Creepy"
According to Time Magazine, Online Snooping Gets Creepy. I'm not so sure about that. Web "snooping" has always been more or less creepy in some ways, and more or less useful in others. Time points to a new wave of search engines that supposedly go beyond and behind the web content that Google indexes to give an uncannily complete profile of whoever you want to "investigate". These include ZoomInfo, PeekYou, Pipl, Wink, and Spock. I tried them out, using my own name since I can best judge the results (except for Spock, which was down for maintenance), and here are the results:
How do they stack up to Google? Well, Google is still a more comprehensive search engine, pointing to a more complete set of my publications (books as well as articles published online), including lots of pointers to websites and blogs that seem to have "borrowed" my articles for their own use. Oh well. If you're looking for someone's address, phone number or birthday, try Pipl; if you're looking for a terse and easy to understand (but possibly inaccurate) precis, try Wink. Otherwise, you might as well stick with Google, at least to start with.

August 6, 2007
Hot New Blog: High Scalability
High Scalability is a new (not quite one month old) blog aiming to "bring together all the lore, art, science, practice, and experience of building scalable websites into one place so you can learn how to build your system with confidence." Don't be fooled into thinking it's all about CSS or Apache configurations or stuff like that. Consider this: An Unorthodox Approach to Database Design : The Coming of the Shard. Now, blogger Todd Hoff may or may not be onto something entirely novel and unique with his concept of database "sharding" to distribute database storage and computation, but it is definitely interesting. And he's got a lot of other good information here, such as a precis of the eBay Architecture, MySpace Architecture, and much more. Great job, Todd, and keep it up!

May 27, 2007
IBM virtual Linux environment beta program
If you want to see what a virtual Linux environment looks like, check out the IBM System p Application Virtual Environment for x86 Linux. Follow the link to find out more about participating in the beta program (as well as more details about what it does and how it works). Bill Andad at DANIWEB.com has more about it here.

December 15, 2006
Interesting Data Mining Blog
I just discovered a great blog to check out if you're interested in a practitioner's take on current practical and cutting-edge data mining applications: Data Mining: Text Mining, Visualization and Social Media blog. Matthew Hurst, the intelligence behind the blog, is Director of Science and Innovation at Nielsen BuzzMetrics and also co-creator of BlogPulse.