Business Intelligence Network business intelligence resources

Blog: Pete Loshin

November 2, 2007

Mining Valuable Intelligence From Random Numbers

Somewhere in my stack of obsolete 3.5" floppy diskettes I've got a spreadsheet that contains some interesting raw data. Long ago I was in the habit of buying a bag of M&Ms from a vending machine in the corporate cafeteria every afternoon: before eating any, I would open the bag, sort the colors, count the M&Ms of each color, and record the totals in a spreadsheet.

The primary benefit I got from that activity was a nice set of data, from which I could infer some general rules about which were the most and least common M&M colors. I also got something to do during the afternoon lull to keep me from falling asleep.

It was the kind of job where most of my co-workers were very bright, but we often had time on our hands; conversation topics included arguing different strategies for getting rich by inventing something really cool--and strategies for winning the lottery.

Now that we have the Internet, and there's an endless supply of data sets to play with, here's a guy who actually came up with something useful on that whole lottery thing: Pattern Analysis of MegaMillions Lottery Numbers.

Can you use the information in this article to increase your odds of winning the big bucks? It's not clear: if the lottery number selection process is truly random, the answer is no. But you could use the numbers, and the techniques, as described in the article, to discover hidden influences on the selection process that might skew the results.

For me, though, the best part of this article is that it takes the question of whether lottery drawings are truly random and then applies a scientific approach to it. It also helps that all the data is available on the New Jersey lottery website, in both HTML and delimited formats for easier processing.
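
For anyone who wants to try this at home, here is a minimal Python sketch of one standard way to test whether drawn numbers look uniform: a chi-square goodness-of-fit test. It is not necessarily the method the article uses. The function names, the toy draws and the pool size are all made up for illustration; real drawings would come from the lottery's delimited data file.

```python
from collections import Counter
from math import sqrt


def chi_square_uniform(draws, pool_size):
    """Chi-square goodness-of-fit statistic against a uniform distribution.

    draws: iterable of drawings, each an iterable of ball numbers in 1..pool_size.
    Returns (statistic, degrees_of_freedom).
    """
    counts = Counter(ball for draw in draws for ball in draw)
    total = sum(counts.values())
    expected = total / pool_size  # every ball equally likely if draws are fair
    statistic = sum(
        (counts.get(ball, 0) - expected) ** 2 / expected
        for ball in range(1, pool_size + 1)
    )
    return statistic, pool_size - 1


def rough_z_score(statistic, dof):
    """How far the statistic sits from what chance alone would produce.

    A chi-square variable has mean dof and variance 2 * dof, so this is only
    a rough yardstick; a chi-square table gives a proper p-value.
    """
    return (statistic - dof) / sqrt(2 * dof)


# Hypothetical toy data: four drawings of three balls from a pool of six.
toy_draws = [(1, 2, 3), (4, 5, 6), (1, 2, 3), (4, 5, 6)]
stat, dof = chi_square_uniform(toy_draws, pool_size=6)
print(stat, dof, rough_z_score(stat, dof))
```

A statistic far above the degrees of freedom would hint that some numbers come up more often than chance suggests. One caveat: balls within a single drawing are picked without replacement, so this test is an approximation rather than an exact fit for real lottery data.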

July 1, 2007

Conjectures on Blog Categories and Blog Maturity

When I first started blogging here, I had a list of categories that I thought would cover most of my entries. Based entirely on my personal and quirky blog reading experiences, I suspect other bloggers have done the same thing: come up with a bunch of topics they want to write about, and add those to their blog's "categories" list. They probably also write about other stuff at first, things that are interesting to them at the time, and add new categories to cover those ad hoc topics.

I'm thinking on line here, so bear with me.

There are a number of different quantifiable variables here that I'd like to consider:

  1. The number of entries in the blog
  2. The number of categories to which each blog entry is assigned.
  3. The number of categories in the blog. There's more data here:
    • The earliest date on which an entry is linked to each category.
    • The last date on which an entry is linked to each category.
    • The overall number of blog entries linked to each category.

My conjecture here is pretty simple: that the Pareto Principle guides the distribution of most recent posts and categories in which they are posted. In other words, in mature blogs, roughly 80% of all entries will be logged under roughly 20% of the categories. Or, in *other* other words, as time goes on, bloggers tend to focus their writing on a small subset of the topics they originally intended to cover.
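
If the data were available, checking the conjecture would be simple. Here is a rough Python sketch, assuming each blog entry can be reduced to a list of category names; the function name, the 20% default and the sample entries are all hypothetical.

```python
from collections import Counter
from math import ceil


def top_category_share(entry_categories, top_fraction=0.2):
    """Share of category assignments that land in the busiest categories.

    entry_categories: one list of category names per blog entry.
    An entry filed under two categories counts once for each of them.
    """
    counts = Counter(cat for cats in entry_categories for cat in cats)
    if not counts:
        return 0.0
    # Always look at one category at minimum, even for tiny blogs.
    top_n = max(1, ceil(len(counts) * top_fraction))
    top_total = sum(n for _, n in counts.most_common(top_n))
    return top_total / sum(counts.values())


# Hypothetical blog: 20 entries spread over 5 categories.
sample = [["bi"]] * 16 + [["linux"], ["security"], ["web"], ["misc"]]
print(top_category_share(sample))
```

A result near 0.8 for a mature blog would support the 80/20 guess; a result near 0.2 would mean entries are spread evenly across categories.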

I further suspect that as bloggers become more adept at writing, they also tend to be better at distilling the essence of their message--and as a result, multiple-topic postings should decline as the length of time the blog is maintained increases.

Now, if only I could figure out a way to extract that kind of data into a usable data set, I'd be on my way to a possibly cool new piece of information.

June 29, 2007

How NOT to Protect Sensitive Data

If you work as a bank teller, I'm pretty sure you can't take your cash drawer home to count out your currency. Likewise, I don't think jewelers allow their employees to take precious metals or stones home and pharmacists probably don't have the option of taking drugs home to fill prescriptions.

Most companies whose employees handle valuable commodities have strict security protocols intended to prevent losses due to carelessness as well as outright theft.

Except the IT industry, apparently.

It seems to be perfectly OK for employees--and contractors, consultants and various other third-party non-employees--to walk out the door with corporate databases loaded onto laptops or portable hard drives, with predictable results when those laptops or hard drives are lost/stolen.

When laptops with sensitive data get lost and/or stolen, it doesn't matter how conscientiously you've protected your personal information from identity thieves. You are at risk because someone who should have known better acted irresponsibly. Maybe it was a human resources clerk at your current employer--or maybe at a company you haven't worked for since the Reagan administration.

Maybe it was someone at a hospital where you received emergency medical treatment, or the insurance company that paid your claim, or your university. Or someone who works for a government agency.

Whoever did it may never be held accountable. And you may not even hear about it until you get a letter informing you that your data may have been compromised and you can sign up for a free credit monitoring service, sponsored by the company or organization that lost your data in the first place.

To get an idea of the scope of the problem, check out numbrX Security Beat, "an online record of reported personal, private and confidential data breaches which can lead to identity theft and credit fraud."

And remember, the breaches you read about on numbrX are probably only the tip of the iceberg: these are only the breaches that have been reported publicly.

June 1, 2007

Making Sense out of Data

Go read this article by Matthew Haughey: How Ads Really Work: Superfans and Noobs, and then think about how you can turn data into knowledge.

If that doesn't convince you to drop everything and go read the article, here's my quick summary:

In one sentence, what Matt (re-)discovered is the old 80/20 rule, also known as the Pareto Principle, or power law (this one's an article about power laws and blogs).

Matt was using Google Analytics and found that most of his ad revenue came from "noobs" (one-time visitors who are on the search for something), with most of his loyal visitors ("superfans") generating a disproportionately low volume of ad revenue.

So, what can you do with this data? Matt decided it made sense to give his loyal fans an ad-free experience because they didn't click on ads anyway. Win-win: he got a higher click-through rate because all the pages served to his superfans didn't actually have any ads AND he was able to give potential superfans an incentive to opt for premium membership.
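
The arithmetic behind the higher click-through rate is easy to sketch in Python. The numbers below are invented for illustration and are not Matt's actual figures; the segment names simply follow his labels.

```python
def overall_ctr(segments, exclude=()):
    """Click-through rate (clicks / ad impressions) across visitor segments.

    segments: mapping of segment name -> dict with "impressions" and "clicks".
    exclude: segment names that are no longer served ads.
    """
    served = [s for name, s in segments.items() if name not in exclude]
    impressions = sum(s["impressions"] for s in served)
    clicks = sum(s["clicks"] for s in served)
    return clicks / impressions if impressions else 0.0


# Made-up traffic: superfans see lots of ads but rarely click them.
traffic = {
    "noobs": {"impressions": 40_000, "clicks": 800},
    "superfans": {"impressions": 60_000, "clicks": 60},
}

print(overall_ctr(traffic))                         # ads shown to everyone
print(overall_ctr(traffic, exclude=("superfans",)))  # ads hidden from superfans
```

Dropping the superfans removes a big pile of impressions but only a handful of clicks, so the rate goes up while revenue barely moves.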

Not really a big deal, just an example of using common sense when you're crunching numbers.

April 9, 2007

Personal Finance, Open Source Style

Check out this article, 8 Free Personal Finance Management Programs, at one of my new favorite websites, the Consumerist.

The thing for me about financial software is that you've got to have a lot more discipline, know-how and ambition than I've got (apparently) to get all your data loaded in. So even though I once actually bought a copy (years ago) of Quicken, I save a lot more money by not using one of the open source solutions than I do by buying, and not using, a commercial one.

I know this hasn't much to do with enterprise computing or business intelligence, but it does have a lot to do with the reason so many people can't migrate away from Windows: they're locked into using Quicken, MS Money or some other proprietary application that hasn't yet attracted an open source alternative.

When adequate open source alternatives for personal finances and tax preparation become widely usable, the potential for widespread migration to Linux is huge.

February 19, 2007

Rules to Live By From Joel on Software

My take on Joel on Software is that you either love him or hate him, but I caught this piece on Seven steps to remarkable customer service, and was struck by how those seven tips were not just good for people doing customer service but for anyone--because, after all, isn't EVERYTHING we do in our work lives ultimately "customer service"?

The article follows the Seven Meta-Rules for Creating Simplified Lists of Rules, but at least 80% of it can be summed up in the phrase: "Don't do things to other people you wouldn't want them to do to you."

December 15, 2006

Interesting Data Mining Blog

I just discovered a great blog to check out if you're interested in a practitioner's take on current practical and cutting-edge data mining applications: Data Mining: Text Mining, Visualization and Social Media blog.

Matthew Hurst, the intelligence behind the blog, is Director of Science and Innovation at Nielsen BuzzMetrics and also co-creator of BlogPulse.

June 8, 2006

Really, REALLY Big Databases

It's easy to lose sight of the real magnitude of the collections of data we can now slice and dice, but every now and then something happens to remind me.

For example, there was last month's news about a missing laptop and external hard drive containing detailed personal information about 26.5 million veterans--as well as up to 80% of the active duty armed forces. That's about one tenth the population of the US, and it could fit in a carry-on bag. So what does a really big database, one that calls for serious hardware, even look like?

May 1, 2006

Web Spurs Move to Useful (but Boring) Headlines

The best way to generate web traffic these days is to place high in the search engines; the BBC reports that Search users 'stop at page three'. Most users expect to find the answers to their questions on the first page of search results, so if you want to be the page they click to, you've got to score high.

A pretty obvious way to score high, get views, and retain readers is to use headlines that are clear, relevant, and likely to accurately reflect the contents of your pages. So, according to this note on Slashdot, This Boring Headline is Written for Google, the New York Times reports that website editors are increasingly writing headlines that are clear, relevant and accurately reflect the contents of the articles they top.

What a great idea! It makes sense, and the news isn't that fresh. It makes sense to write and print catchy, intriguing, funny or odd headlines when you need to snag readers who are browsing headlines at the corner newsstand; when your readers are brought to you by Google.com or some other search engine, you'd better write a headline that will be seen, no matter how boring.

So being clever, sarcastic, funny or even literate is no longer a requirement for headline writers. It's good to be getting back to the fundamentals of journalism. Writing a good headline is hard.

February 13, 2006

The Four Hundred Million Dollar Mistake

Someone clearly messed up big time: Indiana House Wrongly Valued at $400 Million. Somehow, someone--they're blaming an "outside user", apparently--got into the system and changed the assessed value of a $121,900 house to $400,000,000.

This year's tax bill for the homeowner, usually around $1,500, came in at $8 million.

What would have been a harmless error could cause serious disruption, including layoffs, for the tax districts which were counting on that $8 million for their budgets.

So, who's going to get blamed for this one? Some possibilities:

  • "Someone" from outside who got into the system and made the change. That's who the county IT director is betting on.
  • The IT director also is pointing her finger at the county auditor's office, which she said was notified about the error and told how to fix it.
  • The treasurer's office, which noticed the billing error and tried to fix it, but that wasn't enough.
  • The IT director's predecessor(s), for having failed to remove the program under suspicion and/or for deploying a program that didn't properly implement security checking.
  • The IT director, for being in the wrong place at the wrong time.

It's tempting to blame a lack of sanity checking in the original code, but implementing it in a system that might be dealing with residential as well as commercial and industrial properties could be a problem: large factories could easily be valued at hundreds of millions of dollars and generate multi-million dollar tax bills.
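
One way around that objection is to check each change against the property's own previous value instead of against a fixed dollar ceiling. The Python sketch below is only an illustration of that idea, not a description of how the county's system works; the function name and the tenfold threshold are assumptions.

```python
def flag_suspicious_assessment(previous_value, new_value, max_ratio=10.0):
    """Return True when a new assessed value needs a human to look at it.

    The check is relative: a factory moving from $350 million to $400 million
    passes, while a house jumping from $121,900 to $400 million does not.
    max_ratio is an arbitrary illustrative threshold.
    """
    if previous_value <= 0 or new_value <= 0:
        return True  # missing or non-positive values always get reviewed
    ratio = new_value / previous_value
    return ratio > max_ratio or ratio < 1 / max_ratio


print(flag_suspicious_assessment(121_900, 400_000_000))      # the house
print(flag_suspicious_assessment(350_000_000, 400_000_000))  # a large factory
```

A check like this does nothing for brand-new properties with no history, and it only helps if somebody actually reviews the flagged records.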

More likely, the fault is in overburdened and understaffed IT departments that have to deploy new systems as well as support (and eventually remove) old systems that they may not even be aware exist.

February 7, 2006

What Would YOU Pay to Link to a News Story?

Last week I commented on how Microsoft wasn't planning to publish a patch for the Kama Sutra/Blackworm/MyWife worm until next week; it turned out not to be that big a deal.

But imagine my surprise when I noticed that the news source for the original article was playing some games: they'll email the article to all your friends for you, in the process collecting all of your email addresses. Or, they'll sell you a "license" to email the URL for as little as $5.00. If you prefer, you can pay a measly $2.50 to "license" the link on your own website--a better deal, because if you wanted to email the URL to 200 people you'd have to pay $50.00.

The costs go up even faster if you want to license an article, or even just excerpt an article, to be used in a book or newsletter; the whole thing is done through a third-party clearance company and presumably the publisher and the clearance company split the proceeds and leave the original author out in the cold.

Rather than increasing profits, this whole thing tends to reduce the likelihood that anyone would want to link to this publisher's articles, or that other authors would cite their articles. Why bother with the cost and nuisance of this "license", or even worse, worry about legal action resulting from what would normally be considered "fair use"?

February 1, 2006

Security Outrage and an Easy Answer

OK, so Microsoft apparently isn't going to be releasing a patch to the world until February 14 for the Kama Sutra/Blackworm/MyWife worm. Super, unless your time machine is in the shop, since that particular bit of malware is scheduled to strike on Friday. Here's Microsoft's security advisory on the worm. According to the summary posted on Slashdot, Microsoft customers who pony up for premium services can get the fix before the worm hits, but otherwise you're out of luck.

Easiest solution? Get yourself a current Linux distribution, install it, and access your important data with no problems tomorrow. It would be best if you could install Linux on a fresh hard drive, and then copy your data over from your old corrupted (with Windows) hard drive, but chances are good you can create a Linux partition that coexists with your Windows partition if you pay attention during the installation.

Alternatively, you could change the way your system boots through your BIOS, so you can boot to a clean and virus-free CD, and then do what you need to protect your data on your hard drive.

I found this whole little story amusing and interesting, but I dug up an even more outrageous bit in the original news story Slashdot linked to. Stay tuned; I've got to run but will be back later on this afternoon to write it up.

Bill Gates Gets Special Treatment by IRS

We all know that the very, very, very rich are different than we are: they have more money. Lots more. So much more, in fact, that their taxes are apparently handled on special, exclusive computers.

According to the international mainstream news media (as reported in Australia's NEWS.com.au via a report by Agence France-Presse), Bill Gates just has Too much money to tax.

Reportedly, Gates stated at a conference in Lisbon: "My tax return in the United States has to be kept on a special computer because their normal computers can't deal with the numbers."

Much hilarity ensues on Slashdot's report of the report, Bill Gates' Taxes Require Special Computer, where the discussion starts off with some speculation over whether the IRS had to switch from a Windows platform to a Macintosh to handle the big numbers.