Jean Véronis
Aix-en-Provence
(France)



Sunday, January 23, 2005

Web: Google searching 9,105,590,456 pages [en]






As I announced in November, Google doubled its index to eight billion pages, and posted proudly on the home page:
Searching 8,058,044,651 pages
One small problem: this figure has not changed since, although the index keeps growing. Here is a screenshot of today's page (Google.fr):


Google obviously indexes new pages every day. For example, this blog's pages join the index very quickly, usually in less than 48 hours (see query). Even if we suppose that Google adds only blogs to its index, and even if it indexes only a fraction of the six new blogs that are created each second (the Technorati site lists more than six million blogs at the moment), the index size should change rapidly.

At the same time, however, Google does update the counts for individual words. I ran the same queries for 16 words on November 22, 2004 and on January 22, 2005:

Word          Nov. 22, 2004   Jan. 22, 2005
Aznar             1,690,000       1,600,000
Bernadette        1,920,000       2,250,000
Blair            14,100,000      15,800,000
Chirac            3,120,000       3,280,000
Claude           15,600,000      17,900,000
Coluche             161,000         193,000
Corona            6,750,000       7,430,000
Jacques          19,000,000      21,400,000
Jospin              669,000         768,000
Poutine             272,000         316,000
Raffarin            752,000         893,000
Saddam           11,100,000      12,400,000
Sarkozy             838,000         695,000
Thatcher          2,140,000       2,770,000
Veronis              62,600          60,100
Zidane            1,090,000       1,280,000

There is a near-perfect correlation between the counts obtained on these two dates (coefficient of determination > 0.999!):



The slope of the regression line (1.13) gives us the growth between November 22 (shortly after Google published its index size) and January 22. This enables us to estimate the new size of the index (8,058,044,651 × 1.13). I am thus happy to announce it: Google's index exceeds nine billion pages. Google should thus post:
Searching 9,105,590,456 pages
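
For readers who want to check the arithmetic, here is a minimal sketch of the calculation in Python. It assumes an ordinary least-squares fit with an intercept (the post does not say which kind of fit was used), and the names `COUNTS`, `fit_line` and `PUBLISHED_INDEX_SIZE` are illustrative only. With the 16 pairs of counts from the table, the slope should come out close to the 1.13 quoted above.

```python
# Hit counts taken from the table in the post: (Nov. 22, 2004, Jan. 22, 2005).
COUNTS = {
    "Aznar": (1690000, 1600000),
    "Bernadette": (1920000, 2250000),
    "Blair": (14100000, 15800000),
    "Chirac": (3120000, 3280000),
    "Claude": (15600000, 17900000),
    "Coluche": (161000, 193000),
    "Corona": (6750000, 7430000),
    "Jacques": (19000000, 21400000),
    "Jospin": (669000, 768000),
    "Poutine": (272000, 316000),
    "Raffarin": (752000, 893000),
    "Saddam": (11100000, 12400000),
    "Sarkozy": (838000, 695000),
    "Thatcher": (2140000, 2770000),
    "Veronis": (62600, 60100),
    "Zidane": (1090000, 1280000),
}


def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept; also returns R squared.

    Assumption: a fit with an intercept; the post does not specify this.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


nov = [before for before, _ in COUNTS.values()]
jan = [after for _, after in COUNTS.values()]
slope, intercept, r_squared = fit_line(nov, jan)

# Figure displayed on Google's home page since November 2004.
PUBLISHED_INDEX_SIZE = 8_058_044_651

# Scale the published size by the slope rounded to two decimals, as in the post.
estimate = round(PUBLISHED_INDEX_SIZE * round(slope, 2))

print(f"slope = {slope:.2f}, R^2 = {r_squared:.4f}")
print(f"estimated index size: {estimate:,} pages")
```

Because the fit is dominated by the most frequent words (Jacques, Claude, Blair, Saddam), the few words whose counts went down (Aznar, Sarkozy, Veronis) barely affect the slope.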
The size of the index thus increased by roughly a billion pages in two months. I have no means of knowing whether the progression is linear, but one can safely predict that the index will reach 10 billion pages before the end of March.
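
The end-of-March prediction can be sanity-checked with a rough extrapolation. The sketch below is only an illustration under two assumed growth models, since the actual shape of the growth is unknown: a constant number of pages per day (linear) and a constant daily growth rate (exponential). The function names and constants are hypothetical.

```python
import math
from datetime import date, timedelta

# The two measurement dates and the two index sizes discussed in the post.
START, END = date(2004, 11, 22), date(2005, 1, 22)
SIZE_START = 8_058_044_651  # published by Google in November 2004
SIZE_END = 9_105_590_456    # estimate derived from the regression slope
TARGET = 10_000_000_000

DAYS = (END - START).days  # length of the observation window in days


def linear_crossing():
    """Date the target is reached if a fixed number of pages is added per day."""
    pages_per_day = (SIZE_END - SIZE_START) / DAYS
    days_needed = math.ceil((TARGET - SIZE_END) / pages_per_day)
    return END + timedelta(days=days_needed)


def exponential_crossing():
    """Date the target is reached if the index grows at a fixed daily rate."""
    daily_log_rate = math.log(SIZE_END / SIZE_START) / DAYS
    days_needed = math.ceil(math.log(TARGET / SIZE_END) / daily_log_rate)
    return END + timedelta(days=days_needed)


print("linear model:     ", linear_crossing())
print("exponential model:", exponential_crossing())
```

Under either assumption the 10 billion mark falls before the end of March 2005, which is all the prediction above requires.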

Why doesn't Google update its home page? If they intend to hide the growth from their competitors, it is rather ridiculous since, as this post shows, it can be estimated very simply.

This small annoyance is less serious than the bug in advanced search that I reported the other day, but all these sloppy details end up casting suspicion on quality control at Google. Of course, for the moment only professionals are concerned with these things. They do not make the slightest difference for queries about yellow pages or Britney Spears (see this post).

5 Comments:

Nathan Weinberg wrote...

Actually Jean, when Google updated the numbers on its front page in November, eagle-eyed watchers (including myself) noticed that for the briefest time, a Google search for "the" yielded over 10 billion results, before Google smacked it back down to the exact same number it says on its front page (which is, of course, statistically impossible). It stands to reason that Google has anywhere from 10 to 13 billion pages already in its index, but is hiding the number from its competitors.

January 23, 2005 18:34
Jean Véronis wrote...

I just saw your follow-up on this topic on InsideGoogle, which I recommend to readers of this post:

http://google.blognewschannel.com/index.php/archives/2005/01/23/google-at-how-many-billion-9-11/

Fascinating, indeed! Many thanks for the additional info.

January 23, 2005 20:16
Jean Véronis wrote...

In a comment on InsideGoogle's follow-up to this story, Philip Lenssen cites a press release from Google which seems to indicate that they count PDF files etc. as Web pages (which is in their interest anyway, if they want to impress the world with large figures):

http://google.blognewschannel.com/index.php/archives/2005/01/23/google-at-how-many-billion-9-11/

In any case, that doesn't change my point. There has been a 13% progression, i.e. ca. one billion pages, which is not reflected in the main count.

January 23, 2005 22:50
Anonymous wrote...

More than 50% of the URLs shown in Google results are:

1. URLs without titles, descriptions or content. These are URLs that are restricted via the robots.txt or pages that have never been or will never be fully indexed because of bugs within their indexing system. Example:

http://www.google.com/search?num=10&hl=en&lr=&safe=off&c2coff=1&q=site%3Ausatoday.com+olympics+saltlake&btnG=Search

Almost 50% of these results are empty. Shame on you Google!

2. Supplemental Result - Because of Google's limitation on the total number of URLs they can store in an index, Google now has at least two separate indexes. This is so they can say they are bigger. But the Supplemental index is rarely used. It's just there so they can say they have more URLs than Yahoo!.

In reality Google does not have over 8 billion "indexed" URLs. Yes, they possibly have over 8 billion URLs in their index(es), but only a percentage are actually fully indexed pages or pages you can search for and find.

Google will update their logo for all special events (new year, their anniversary, Olympics and so on), but the only time they update the "Searching 8,058,044,651 web pages" statement is when they feel threatened, like they did when Yahoo! announced the purchase of Overture, AltaVista, Inktomi and bla, bla, bla.

Google's technologies are great in the classroom, but terrible in the real world. PageRank is the easiest algorithm to manipulate. It's also easy to steal another site's PageRank, slamming your competition to the bottom of the search results.

January 24, 2005 16:36
ben wrote...

Fascinating. Of course the maths is beyond me, so I'll take your word for it.

I also suspect that indexed does not mean the same as displayed in their results. Apparently Google spiders and indexes large numbers of a site's pages without necessarily displaying them in its results, depending on how particular websites perform. Low bounce rates and more pages suddenly appear... Can't quite work out the rationale for it yet.

April 17, 2009 16:59
