Jean Véronis
Aix-en-Provence
(France)



Tuesday, February 08, 2005

Web: Google's missing pages: mystery solved?




Read follow up

28 feb - MSN cheating too?
7 mar - Yahoo indexes more pages than Google
13 mar - Google adjusts its counts
23 mar - 5 billion "the" have disappeared overnight
25 mar - A snapshot of the update



In previous articles, I pointed out two strange problems with Google counts (here and here). Pages seem to massively disappear:
  • If you type Chirac OR Sarkozy, you get half the number of results you get for Chirac alone, which may have a political explanation... but is a weird approach to Boolean logic.
  • If you search for "the" in English pages only, you get 1% of the number you get for all languages together. Does this mean that "the" is 99 times more frequent in languages other than English? Of course not.
Where have the missing pages gone? This is the question I try to address in this article. A possible scenario is that the real index used by Google is considerably smaller than the officially announced counts. The detailed experiment reported below yields a precise estimate: the real index would be about 60% of the announced size, i.e. ca. 5 billion pages. This scenario is of course entirely hypothetical, but it would explain both the discrepancy in the English page counts and the strange behaviour of Google's Boolean operators.

Let me say it right away, in order to save commentators' time: this does not mean that Google is a bad search engine (and I actually have it as my browser's home page). For most users, counts are useless, and what... counts for them is whether they find the right results quickly and accurately or not. Figures are relevant only for experts, but in this case, these have some reasons to wonder.

An experiment

In this new experiment I do not use frequent words such as "the", because frequent words are likely to be processed in a special way by any search engine. They are probably on a special stoplist, and their occurrences are probably not fully indexed. Instead, I have used 50 English words drawn randomly from the mid-range frequencies of a 1-million-word corpus of English text (accumulated, alive, ancestor, bushes, etc.). I have eliminated words for which I knew of obvious homographs in other languages (such as patio).

The figure below plots the counts given by Google for English pages vs the entire Web (the part known to Google, of course) [see complete results here -- all figures in this study were obtained on February 6th].


The slope of the regression line indicates that the English results represent 56% of the results for the entire Web for the same words. Of course, I may have missed some collisions of homographs across languages, and some of the words are probably cited in non-English pages as well, but these factors should be marginal and, in any case, different for each word. If almost half of the occurrences of these words were located in non-English pages, there should be a considerable amount of dispersion in the plot. Instead, there is a very strong correlation between the two counts, with a coefficient of determination R² equal to 0.96. Such a high correlation is statistically impossible under that assumption, and some systematic factor must explain it. One possibility would be extremely poor behaviour of the language detection algorithm used by Google, but this is very unlikely: we would see evidence of it in almost every other result, and that is far from being the case. Google's language detection is fairly robust, if not perfect.
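For readers who want to repeat the measurement on their own word lists, here is a minimal sketch of the computation behind the slope and R². It assumes an ordinary least-squares fit with an intercept (the plot does not say which fit was used), and the counts in the example are synthetic placeholders, not the figures collected for this study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, with R^2.

    xs -- counts for the whole Web, one per word
    ys -- counts for the language-restricted search, same word order
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences of at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r2


# Synthetic placeholder counts (NOT the study's data): every word is given
# exactly 56% of its hits in English pages, so the fit is perfect by design.
all_web = [100_000, 250_000, 400_000, 800_000]
english = [0.56 * x for x in all_web]

slope, intercept, r2 = fit_line(all_web, english)
print(f"slope={slope:.2f} intercept={intercept:.1f} R2={r2:.3f}")
```

With real counts, a slope near 0.56 combined with an R² near 1 is the pattern discussed above: the share is nearly the same for every word, rather than varying from word to word.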

On the other hand, if we look at Yahoo's results for the same word list, we get a much more expected pattern [see complete results here]:


The correlation is very high too (higher, indeed), but this is normal because the results are almost identical: English results represent 92% of the whole. This figure is in line with our linguistic knowledge.

Results for French are very similar. I built a French word list on the same principle, and ran it through Google and Yahoo. Google gives a 58% share of results located in French pages, and again a high correlation, slightly lower (R² = 0.86) but still incompatible with a large proportion of results lying outside the pages categorised as French. Individual word behaviour should produce a much more random pattern [see complete results here].



Yahoo behaves just as it did for English. The proportion of results located in French pages is even higher (97%), which is expected, since English, as an international language, tends to be cited in more documents than French.



A possible scenario

Many experts believe (see for example here) that Google's database is composed of (at least) two parts. One part is a full index; the other contains URLs and other information for pages that Google knows about, but whose content has not been indexed (only the words in their URLs are possibly indexed). I have no means of knowing whether this hypothesis is correct (although Google admitted it publicly until 2002), but it could explain the strange behaviour reported above.

Let's call the two hypothetical parts A and B respectively; together they make up the whole database D (D = A + B):



We can then build a possible scenario. When we query Google with a word X in any language, it looks it up in its index, i.e. part A, and extrapolates the count to match the size of the entire database D. However, when we restrict the search to a given language, it does not extrapolate, because pages in part B are not indexed and not categorised by language. Only the results from A are reported. Of course, it would have been possible to assume that the language proportions observed in A hold for the entire database D, and to extrapolate anyway, but the Google engineers didn't think of it, or didn't think it was important.
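The scenario can be written down as a toy model. This is a sketch of the hypothesis only, not a description of Google's actual code: the function name, its parameters and the input numbers are all made up for illustration.

```python
def reported_count(hits_in_a, index_share, language_share=None):
    """Count displayed under the hypothetical scenario (not Google's code).

    hits_in_a      -- matching pages found in the indexed part A
    index_share    -- assumed size of A relative to the database D (|A| / |D|)
    language_share -- fraction of the hits in A that are in the requested
                      language, or None for an "any language" query
    """
    if not 0 < index_share <= 1:
        raise ValueError("index_share must be in (0, 1]")
    if language_share is None:
        # Unrestricted query: extrapolate from A to the whole database D.
        return hits_in_a / index_share
    # Language-restricted query: part B carries no language tag,
    # so only the hits found in A are reported, without extrapolation.
    return hits_in_a * language_share


# Illustrative inputs only: 600,000 hits in A, A = 60% of D,
# and 92% of the hits located in English pages.
any_language = reported_count(600_000, 0.60)
english_only = reported_count(600_000, 0.60, language_share=0.92)
observed_ratio = english_only / any_language
print(round(any_language), round(english_only), round(observed_ratio, 3))
```

In this model the ratio an outside observer measures is language_share × index_share, whatever the word. That is why a constant ratio across words, well below the true language share, is what the scenario predicts.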

We can compute a fairly good estimate of parts A and B, using my calculations above. According to Yahoo (if we agree to trust it), 92% of the results for my English word list are located in English pages. If we apply the same proportion to Google, this means that the index, i.e. part A, is 0.56 / 0.92 = 60.9% the size of D. Interestingly enough, if we do the same computation using the French list, we get an estimate of 0.58 / 0.96 = 60.4%. These figures are so close that it would be surprising if this were pure coincidence.
Under the scenario outlined above, the real size of Google's index is therefore ca. 60% of the entire database, and the reported numbers are inflated by roughly two thirds (1/0.60 - 1 ≈ 0.67).
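The arithmetic can be checked in a few lines. The inputs are the shares quoted in this article: the 56% slope measured on Google for the English list, 58% for the French list, and the Yahoo shares 0.92 and 0.96 used in the computation above. The function name is only illustrative.

```python
def index_share(google_share, reference_share):
    """Estimate |A| / |D| from the language share seen on Google and the
    share seen on a reference engine assumed to report honest counts."""
    return google_share / reference_share


english_estimate = index_share(0.56, 0.92)
french_estimate = index_share(0.58, 0.96)

# If only 60% of the announced database is really indexed, the reported
# counts are larger than the real ones by 1/0.60 - 1.
inflation = 1 / 0.60 - 1

print(f"English list: {english_estimate:.1%}")
print(f"French list:  {french_estimate:.1%}")
print(f"Inflation:    {inflation:.1%}")
```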
This is difficult to match to absolute numbers, because nobody knows the exact size of Google's database. In November 2004, Google announced that it was searching 8,058,044,651 web pages. The number on the engine's main page has not changed since then, but I showed on January 23 that the index had grown by a factor of 1.13 since the announcement (read here). An estimate made on February 6th gives a growth factor of 1.14. This would correspond to a current database size of ca. 9.2 billion pages, i.e. a real index size (part A) of 5.5 billion. However, some observers noticed that, for a short while before the November announcement, Google reported 10.8 billion results for a query on "the". This would indicate an even larger database, unless it simply means that at some point Google had considered an even larger inflation factor. We will probably never know.
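As a quick check of these absolute figures, the sketch below multiplies the announced page count by the February 6th growth factor and then by the 60% index share. It inherits every assumption of the scenario; the constant names are only illustrative.

```python
ANNOUNCED_PAGES = 8_058_044_651  # figure announced by Google in November 2004
GROWTH_FACTOR = 1.14             # growth estimated on February 6th
INDEX_SHARE = 0.60               # hypothetical share of D that is really indexed

database_size = ANNOUNCED_PAGES * GROWTH_FACTOR
index_size = database_size * INDEX_SHARE

print(f"Database D: {database_size / 1e9:.1f} billion pages")
print(f"Index A:    {index_size / 1e9:.1f} billion pages")
```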

A new light on Googlean logic

The hypothetical scenario above also nicely explains the Googlean logic problem. Remember that X OR Y returns fewer results than X alone (see details). Even weirder, both X OR X and X (AND) X also return fewer results than X itself. I queried Google for X OR X and X (AND) X for each word X in my English list (with the "any language" setting). The results for both queries are almost identical for all words [see complete results here] and, very surprisingly, they are almost identical to the number of results for X in the English pages only (coefficient of determination R² > 0.999!).


It is likely that Google does the boolean computations (union and intersection of lists) on the basis of the real index, i.e. part A. This would explain why X OR X and X (AND) X yield the same results as the search in English pages when X is an (almost exclusive) English word. The same occurs with French words [see complete results here]. This fact probably went unnoticed until now because if you use words that can appear in many languages (homographs such as patio, or proper names such as Chirac or Bush), the pattern is blurred.

In all likelihood, the Google engineers simply forgot to plug in the extrapolation routine at the end of the Boolean module! Therefore, if you want to know the real index count for any word, simply type it twice:

Word                     Count
stuttering               749,000
stuttering stuttering    452,000

The second line is likely to be the real count...
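The ratio between the doubled query and the single query can be compared with the 60% estimate. The sketch below uses the stuttering counts from the table above and the stocks counts reported in the comments to this post. The function name is only illustrative, and two words are of course not a statistical test.

```python
def doubled_ratio(single_count, doubled_count):
    """Share of the reported count that survives when the word is typed twice."""
    return doubled_count / single_count


stuttering = doubled_ratio(749_000, 452_000)
stocks = doubled_ratio(30_200_000, 18_700_000)

print(f"stuttering: {stuttering:.3f}")
print(f"stocks:     {stocks:.3f}")
```

Both ratios fall close to the 60% index share estimated earlier, which is what the scenario predicts if the doubled query skips the extrapolation step.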




11 Comments:

Anonymous wrote...

Very interesting study. Re: the repeated keyword searches, I'm not sure your analysis is completely accurate. For example, stocks vs. stocks stocks produces entirely different results (notably removing many of the expected results). If you look at the descriptions for stocks stocks, you will see that it looked for sites that happened to have a paragraph like this: "Learn The Basics of Stocks. Stocks, stocks and more about stocks."

-WebConnoisseur

11 February 2005, 00:30
Jean Véronis wrote...

Re: stocks stocks

The counts seem to be in line with my findings:

Stock 30,200,000
Stocks stocks 18,700,000

However, you are right, the ordering of results is very different. It seems that when you type a multi-term query A B C... without quotes, Google gives an advantage to pages containing the exact string "A B C...". In fact, it seems that you get a mixture of what you would get with A B C without quotes and "A B C" within quotes. This makes great sense, because most users do not put quotes around multi-term queries.

11 February 2005, 09:28
Anonymous wrote...

Thanks for posting this -- very interesting.

I've noticed another discrepancy that you may be interested in investigating:

We have been trying to get counts for double-byte character words in Chinese and Japanese, and wrote a simple program (VB.net) that sent the appropriate URL calls and pulled out the counts. The counts that we get this way, however, are several orders of magnitude different from what you get entering the same word manually, i.e. entering the word in the web form and hitting search. It appears to make a difference only for double-byte characters, not single-byte ones.

Weird.

14 March 2005, 18:22
Anonymous wrote...

What strikes me is your insistence on saying "billet"... and, I apologise, I must admit it makes me laugh a little.

People tried "blogue" and "joueb" to avoid the English "post", which is as simple as it is economical.
Nothing worked.

"Courriel" still lags behind "mél"
and struggles against "e-mail"...

I never managed to get "adrélec" or "adrelec" accepted for "adresse électronique" (run a search on Google and you will understand why I say this).

Freedom is sacred: so do as you please, but "billet" is a lost cause... Let me know what you think in a little while.
No, one does not "billette" "billets", one posts posts... there you go.
"The law of Franglais always prevails" ;-(
Too bad for those who will not march in step ;-(3)

08 April 2005, 02:12
Anonymous wrote...

I think google has a crappy language recognition system. So it can very rarely be 100% sure whether a certain page is English or not. And it only displays a page as "English" when it's 100% sure that it is. Therefore, among all the English pages it indexes, it probably only recognizes a small number of them as "English". That's why when you search for "English pages only" it displays results among the small number of pages it recognizes as "English".

15 April 2005, 15:39
Anonymous wrote...

It is very well-known that all major search engines return estimates of the count. There are multiple papers in conferences such as WWW2005 that tell you how they use a sampling technique to do the estimate. The problem with estimations, of course, is that they're sometimes wildly off.

28 May 2005, 10:23
Anonymous wrote...

-yo

18 August 2005, 21:55
Anonymous wrote...

Hmm.... Good article. Thanks for research.

22 September 2005, 12:11
Anonymous wrote...

Interesting Interesting Interesting stuff!

17 February 2006, 17:03
Anonymous wrote...

thanks for post

08 July 2007, 14:35
Anonymous wrote...

Interesting. And yet,

"english pages search" = (.com/.org/.mit/..) + badLanguageRecoSystem;

Vs.

"French web" = (.fr + .ca(quebec)) + EvenWorseLangRecoSys;

Finnish web = .fi..?

etc?

Just a thought.
Thanks for the data!

20 October 2007, 18:22
