Google: Organizing .0034% of the world’s information.
January 21st, 2007
I have been doing some research on information: specifically, how much information there is in general, and how much of that information is searchable and indexable online. This is not an easy number to come up with, and it depends heavily on how you define “information” in the first place. Phone calls, IMs, and emails are produced in huge volume, but only certain portions of that data are actually interesting (e.g., Presidential phone calls vs. my personal email). It seems that questions about the size of the web and the information universe were asked fairly often 4, 5, or 6 years ago, when people were still getting comfortable with the net. They would ask AOL “how big is the internet?” and some folks at the time tried to figure it out.
The favored unit of measure for massive amounts of data is the terabyte (this is actually a pretty funny Wikipedia entry). A terabyte is the equivalent of 1000 gigabytes, or 1 trillion bytes (10^12). I spent a fair chunk of yesterday afternoon trying to find a reliable source that had researched this in the last few years. I used an interesting service (www.chacha.com) in my search. They have been getting a good amount of press recently, so I thought I would give it a try. You go to the site and are linked via IM with a “search consultant” who helps run your search for you. It seems like they screen Google results and give you what they think is best. They make $0.83 a search but only get paid for the first ten minutes (unclear how that meshes, but whatever). ChaCha turned up nothing I hadn’t seen already, but it was cool to try it out. This is what I found:
- Alexa (owned by Amazon): 100 terabyte index
- AT&T: 300 terabyte “Daytona” index of customer data (that they apparently share with the NSA)
- Library of Congress: 136 terabytes
- The “Surface Web” in 2003: 167 terabytes
The most interesting thing I found was a transcript of a speech given by Google CEO Eric Schmidt to the Association of National Advertisers on October 8, 2005. Mr. Schmidt said:
…how much information is there in the world? A study that was done last year indicated roughly five million terabytes. How much is indexable, searchable today? Current estimate: about 170 terabytes. So again we’re back in that two or three percent of the indexed and searchable world.
Takeaways:
1. Google had access to 170 terabytes of data in 2005 and surmised there to be 5 million terabytes available. That is not a whole lot of coverage.
2. Eric Schmidt and his speechwriters need to check their math. 170 of 5 million is .0034%, not 2 or 3%.
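For anyone who wants to check the arithmetic, here is a minimal Python sketch. The two figures are the ones Mr. Schmidt quoted; the function name `percent_of` and the constant names are just my own labels for illustration.

```python
# Figures quoted in the speech, both in terabytes.
INDEXED_TB = 170          # "indexable, searchable today"
WORLD_TB = 5_000_000      # "roughly five million terabytes"


def percent_of(part: float, whole: float) -> float:
    """Return part as a percentage of whole (hypothetical helper)."""
    return part / whole * 100


# Share of the world's information that was indexed, as a percentage.
share = percent_of(INDEXED_TB, WORLD_TB)

# How many terabytes the index would need to hold to reach 2% coverage.
tb_for_two_percent = WORLD_TB * 2 // 100

print(f"Indexed share: {share:.4f}%")
print(f"Terabytes needed for 2%: {tb_for_two_percent:,}")
```

By this reckoning the index would need to hold about 100,000 terabytes to reach the low end of "two or three percent," which is several hundred times the 170 terabytes quoted.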
EDIT: To be clear, I love Google and used it for all the research in this. However, I think the volume of actual information out there vs. the volume accessible via the web is difficult to comprehend and if Google’s numbers are right, they are very surprising.
Also, one could argue that Google might have organized the .0034% of the world’s information that is interesting and applicable…
