Shift Happens
February 13th, 2007 § Leave a Comment
My good buddy Jake sent me this presentation, which was originally written by Karl Fisch and was adapted by Scott McLeod. When I first watched it, I was thinking how I would love to see a footnoted version…and then I found it. Pretty powerful stuff. I bet this will go viral pretty fast.
I found the figure of 1.5 exabytes of new information each year especially interesting in light of Schmidt’s 2005 figure of 5 million terabytes. 1.5 exabytes = 1.5 million terabytes, so we are now up around 6.5 million terabytes in the world, and Google has indexed approximately 500 of these, which puts them back down to about .0077%, but that’s up from .0034%. It seems to me that Schmidt and Co. probably extrapolated their figures from the ’03 Berkeley report where the 1.5 exabyte figure came from, which I referenced in my .0034% post, but there I missed the 1.5 exabyte growth number. All 100 pages of the report can be downloaded in .pdf form here.
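For anyone who wants to check the unit conversion and the resulting share, here is a minimal Python sketch. It assumes decimal (SI) units, and the helper name share_indexed_pct is made up for illustration; the inputs are just the figures quoted above.

```python
# Assumption: decimal (SI) units, so 1 exabyte = 10^18 bytes and
# 1 terabyte = 10^12 bytes, i.e. 1,000,000 terabytes per exabyte.
TB_PER_EB = 1_000_000


def share_indexed_pct(indexed_tb, total_tb):
    """Return the indexed share of the total, as a percentage."""
    return indexed_tb / total_tb * 100


schmidt_total_tb = 5_000_000        # Schmidt's 2005 figure
new_info_tb = 1.5 * TB_PER_EB       # 1.5 exabytes of new information
total_tb = schmidt_total_tb + new_info_tb
google_indexed_tb = 500             # Marissa Mayer's ballpark

share = share_indexed_pct(google_indexed_tb, total_tb)
print(f"Total: {total_tb:,.0f} TB")
print(f"Indexed share: {share:.4f}%")
```

Adding only one year of growth to the 2005 total is my own simplification; the real total depends on how many years of new information you count.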
Google up to .01% of the world’s information
February 7th, 2007 § Leave a Comment
Marissa Mayer did a phenomenal job of fielding questions after her very informative talk today at the Kellogg Technology Conference. I was able to sneak in a question during the session and asked her if she could ballpark the number of terabytes that Google had indexed to date. Her guess was 500. If she is right, then they are up 330 terabytes, or 194%, from 170 where Schmidt thought they were 16 months ago. That’s pretty good! However, if you take all of the numbers from Schmidt’s speech as valid, specifically that 5 million terabytes of data exist, then they have now organized .01% of the world’s information, up from .0034%. Interestingly, Marissa spent not an insignificant portion of her talk focusing on the acquisition of offline data via projects like Google Books, Picasa, etc…
As you can see, I have convinced myself that 500 terabytes, in the scheme of things, isn’t that much. This may be crazy, especially considering I don’t really trust my conceptualization of how much data is in a gigabyte, let alone a terabyte. However, the fact that 1TB servers are commonplace and Apple announced last month that they were shipping 10.5 terabyte servers (for a mere $12,399) makes me think I’m not too far out in left field on this. Pictured to the left is the 200 terabyte GLOW system that the University of Wisconsin physics department has put together. Yes, the information Google has organized is likely a larger percentage of the “useful” information out there. Yes, what you take your definition of “information” to be makes a huge impact on this analysis. Yes, the information Google has indexed may be on the “light” side of the spectrum. And yes, these servers I’m talking about are HUGE. But the fact is, all of the information that Google has indexed could be put onto roughly 48 Apple Xserve RAID servers, or 2.5 of these behemoths.
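The server counts are simple division, sketched below in Python. The capacities are the ones quoted above; the function name units_needed is hypothetical, and rounding up to whole servers is my assumption about how to count them.

```python
import math


def units_needed(data_tb, unit_capacity_tb):
    """Whole storage units needed to hold data_tb (rounded up)."""
    return math.ceil(data_tb / unit_capacity_tb)


INDEXED_TB = 500        # Marissa Mayer's ballpark for Google's index
XSERVE_RAID_TB = 10.5   # capacity quoted for the Apple server
GLOW_TB = 200           # capacity quoted for the GLOW system

xserve_units = units_needed(INDEXED_TB, XSERVE_RAID_TB)
glow_equivalents = INDEXED_TB / GLOW_TB

print(f"Apple servers needed: {xserve_units}")
print(f"GLOW systems needed: {glow_equivalents}")
```

This ignores redundancy, formatting overhead, and everything else a real storage setup would need, so treat it as a floor rather than a plan.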
The question that follows for me: why is Google building all these massive computing centers if all of their information can be stored in an area the size of a large walk-in closet? The answer to this was also covered by Marissa in her presentation: she mapped out what happens when you run a search on Google and showed how Google searches hundreds of millions of sites in less than a second. So these data centers are needed to get nearly instant access to an amount of data that could be stored on 48 commercially available computers.
Edit: If Google continues at the 194% growth rate, they will hit 5 million terabytes in roughly 28 years…that’s a lot better than 300 years…
Markets, markets, everywhere…
January 31st, 2007 § 1 Comment
In the last 2 weeks or so, there has been a flurry of interesting niche sites on TechCrunch that have piqued my interest.
- Farecast: Airline price insurance
- Price Protector: Price protection alerts
- PicksPop: Pop culture betting
- PicksPal: Sports betting
- SocialPicks: Stock picking
- Gottabet: Bet on anything
- Weatherbill: Bet on the weather
- My Currency: Crowd home & property valuations
- ScoopLive: Paparazzi photos
These sites let users either cash in on what they know, or protect themselves from what they don’t. They allow bets and hedges to be placed on outcomes that are difficult to predict and inherently create markets for information.
I have been thinking a great deal recently about the commoditization of information in the Google age. Information that is in the public domain is extremely useful, but also “worthless” in the sense that it doesn’t really generate alpha. Roger Ehrenberg of Monitor110 and Information Arbitrage has had quite a few interesting posts on the value of unique information and the commoditization of information within the public domain. He has also based his business around harvesting and, in a sense, creating unique information.
As more and more information becomes free and accessible, more and more value will be placed on information that remains unknown. The sites above create a marketplace for slices of the 99% of the world’s information that is not searchable via Google. Some of these sites may face some rough sledding because their business models require volume and liquid markets to generate accurate and efficient prices. That said, the market value of anything is whatever someone else is willing to pay for it. Services that link parties that have placed disparate values on the same item (a la eBay) are the ones that create effective markets and generate real value for end users.
Google: Organizing .0034% of the world’s information.
January 21st, 2007 § Leave a Comment
I have been doing some research on information, specifically, how much information there is in general, and how much of that information is searchable and indexable online. This is not an easy number to come up with, and is very dependent on what you choose your definition of “information” to be in the first place. Phone calls, IMs, and emails are produced in huge volume, but only certain portions of that data are actually interesting (e.g., Presidential phone calls v. my personal email). It seems that questions about the size of the web and the information universe were asked fairly often 4, 5, or 6 years ago when people were still getting comfortable with the net. They would ask AOL “how big is the internet?” and some folks at the time tried to figure it out.
The favored unit of measure for massive amounts of data is the terabyte (this is actually a pretty funny wikipedia entry). A terabyte is the equivalent of 1000 gigabytes, or 1 trillion bytes (10^12). I spent a fair chunk of yesterday afternoon trying to find some sort of reliable source that had researched this in the last few years. I used an interesting service – www.chacha.com – in my search. They have been getting a good amount of press recently, so I thought I would give it a try. You go to the site and are linked with a “search consultant” via IM who helps run your search for you. It seems like they screen Google results and give you what they think is best. They make $0.83 a search but only get paid for the first ten minutes (unclear how that meshes, but whatever). Chacha turned up nothing I hadn’t seen already, but it was cool to try it out. This is what I found:
- Alexa (owned by Amazon): 100 terabyte index
- AT&T: 300 terabyte “Daytona” index of customer data (that they apparently share with the NSA)
- Library of Congress: 136 terabytes
- The “Surface Web” in 2003: 167 terabytes
The most interesting thing I found was a transcript of a speech given by Google CEO Eric Schmidt to the Association of National Advertisers on October 8, 2005. Mr. Schmidt said:
…how much information is there in the world? A study that was done last year indicated roughly five million terabytes. How much is indexable, searchable today? Current estimate: about 170 terabytes. So again we’re back in that two or three percent of the indexed and searchable world.
Takeaways:
1. Google had access to 170 terabytes of data in 2005 and surmised there to be 5 million terabytes available. That is not a whole lot of coverage.
2. Eric Schmidt and his speechwriters need to check their math. 170 of 5 million is .0034%, not 2 or 3%.
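The math in the second takeaway can be checked with a few lines of Python. This is only a sketch using the two figures from the speech; the helper name pct is made up for illustration.

```python
def pct(part, whole):
    """Return part as a percentage of whole."""
    return part / whole * 100


INDEXED_TB = 170        # "indexable, searchable today" per the speech
WORLD_TB = 5_000_000    # "roughly five million terabytes"

coverage = pct(INDEXED_TB, WORLD_TB)
print(f"Coverage: {coverage:.4f}%")

# What "two or three percent" of 5 million terabytes would actually be.
# Integer arithmetic keeps these exact.
two_pct_tb = WORLD_TB * 2 // 100
three_pct_tb = WORLD_TB * 3 // 100
print(f"2% to 3% would be {two_pct_tb:,} to {three_pct_tb:,} TB")
```

Put another way, the quoted 170 terabytes would have to be several hundred times larger to land in the 2 to 3 percent range.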
EDIT: To be clear, I love Google and used it for all the research in this. However, I think the volume of actual information out there vs. the volume accessible via the web is difficult to comprehend and if Google’s numbers are right, they are very surprising.
Also, one could argue that Google might have organized the .0034% of the world’s information that is interesting and applicable…
weatherbill
January 12th, 2007 § Leave a Comment
TechCrunch had a post on an interesting company a while back. Weatherbill will apparently let people bet on the weather. I guess this will give the online gambling community something to do when the gambling execs come back to the US from the Bahamas to visit their moms and they get arrested.

