Blog: Pete Loshin
Really, REALLY Big Databases
It's easy to lose sight of the real magnitude of the collections of data we can now slice and dice, but every now and then something happens to remind me. For example, there was last month's news about a missing laptop and external hard drive containing detailed personal information about 26.5 million veterans--as well as up to 80% of the active duty armed forces. That's about one tenth the population of the US, and it could fit in a carry-on bag.
So what does a really big database, one that calls for serious hardware, even look like? The Phone Company (whatever that is, depending on the current legal and corporate situation) does some of the most serious databasing around: they keep track of all the hundreds or thousands of phone calls that each of their tens of millions of customers makes every month. That means a lot of the work done at AT&T Labs has to do with recording and retrieving lots of information in very big data stores.
For example, there's Project Daytona, for "Managing Data at AT&T Scale". I really like the example they give of a "big" database:
"For example, Daytona is managing over 312 terabytes of data in a 7x24 production data warehouse whose largest table contains over 743 billion records as of Sept 2005. Indeed, for this database, Daytona is managing over 1.924 trillion records; it could easily manage more but we ran out of data."
The emphasis is mine; but even this kind of database is dwarfed by the one that the government may (or may not) be building from data supplied (or not) by ISPs.
Internet Protocol (IP) packets are almost a model of relational database records: there are fields for source and destination IP addresses, identifiers for the application protocol being used, whether or not there is encryption in use, even identifiers to help the computers reconstruct the data carried in a stream of packets. With all that information, you can sift and sort the packets to figure out what servers some computer is using, the contents of email messages sent from or to a particular user, or more. That is, if you can compile all the packets sent over all the Internet backbones; the size of such a database, though, is mind-boggling.
An IP packet is usually no more than about 1,500 bytes, so it's not hard to come up with some back-of-the-envelope estimates for how big a database of all the packets sent in this country would be. Let's say the average Internet user generates 150 megabytes of traffic a day--that's about 100,000 full-size packets per day. Let's say they log on every weekday--that's about 2 million packets per month. And let's say there are about 200 million Internet users (roughly 2 out of 3 Americans)--that means about 400 million million (400,000,000,000,000) records to store for just one month's worth of Internet traffic. Finally, let's say the average packet is 1,000 bytes. So one month's worth of Internet traffic would be about 400,000,000,000,000,000 bytes--or about 400 petabytes' worth of disk. A year's worth would require as much as five exabytes of storage (about 5,000,000,000,000,000,000 bytes), and contain almost 5,000,000,000,000,000 (5 quadrillion) records. I wonder whether Daytona is up to that task?
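For anyone who wants to check the envelope, here is a minimal Python sketch of the same estimate using the post's round numbers. The constant names and the figure of 20 weekdays per month are assumptions made for illustration. Note that, as in the text, the packet count is derived from the 1,500-byte maximum while the storage total uses the 1,000-byte average, so the result is only a rough order of magnitude.

```python
# Back-of-the-envelope estimate of a database holding every IP packet
# sent by US Internet users. All inputs are the post's round numbers;
# WEEKDAYS_PER_MONTH = 20 is an assumption ("every weekday", rounded).

BYTES_PER_USER_PER_DAY = 150_000_000  # 150 megabytes of traffic per user per day
MAX_PACKET_BYTES = 1_500              # rough upper bound on packet size
AVG_PACKET_BYTES = 1_000              # assumed average size of a stored packet
WEEKDAYS_PER_MONTH = 20
USERS = 200_000_000                   # roughly 2 out of 3 Americans
MONTHS_PER_YEAR = 12

PETABYTE = 10**15
EXABYTE = 10**18

# Per-user packet counts
packets_per_day = BYTES_PER_USER_PER_DAY // MAX_PACKET_BYTES
packets_per_user_month = packets_per_day * WEEKDAYS_PER_MONTH

# One record per packet, across all users
records_per_month = packets_per_user_month * USERS
bytes_per_month = records_per_month * AVG_PACKET_BYTES

records_per_year = records_per_month * MONTHS_PER_YEAR
bytes_per_year = bytes_per_month * MONTHS_PER_YEAR

print(f"packets per user per day:   {packets_per_day:,}")
print(f"packets per user per month: {packets_per_user_month:,}")
print(f"records per month:          {records_per_month:,}")
print(f"storage per month:          {bytes_per_month / PETABYTE:,.0f} PB")
print(f"records per year:           {records_per_year:,}")
print(f"storage per year:           {bytes_per_year / EXABYTE:.1f} EB")
```

By these figures a year comes to about 4.8 quadrillion records and 4.8 exabytes, which is some 2,500 times the 1.924 trillion records quoted for the Daytona warehouse.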
Comments
That is a good site without anything important for us students.
Posted by: akm123 | March 24, 2008 3:19 PM