Jean Véronis
Aix-en-Provence
(France)



Tuesday, August 16, 2005

Yahoo: Missing pages? (2)



Since I published the first part of this study, the affair of Yahoo's missing pages has caused quite a stir. Google has announced that its researchers don't believe the figures announced by its competitor (see here), and a detailed study carried out by the NCSA (University of Illinois at Urbana-Champaign) seems to confirm quite clearly the phenomenon that I described in my previous post: for searches that return fewer than 1000 pages, Google systematically returns more results than Yahoo, which seems to contradict the idea that Yahoo's index is two and a half times the size of Google's [23 Aug -- The NCSA has issued a strong disclaimer and the study has been revised; see original version and details].



Unfortunately, the study carried out by the researchers at the NCSA has several shortcomings. Firstly, as I showed in my previous post, Yahoo's indexing of long documents is nowhere near as deep as Google's. As a result, even if Yahoo is not lying about the size of its index in terms of the number of documents, this could partly explain the smaller number of documents returned for certain search requests. Sometimes the document may well be in the database, but it cannot be found by keywords that do not appear at the beginning of the document. This is the case, for instance, for the PDF document "Depression and soul-loss", which is returned by Google when searching for inabilities hydrocephalic, but which is not returned by Yahoo for the same search, despite the fact that it is in Yahoo's database (see here).



However, the NCSA study contains an even more worrying error in its methodology, which completely invalidates its conclusions. The authors chose words at random from the computer dictionary ispell and typed them in pairs into the two search engines. This is an absurd strategy, for the chances of real documents containing two words chosen at random from a very large dictionary are virtually zero. The researchers in question are almost certain to find more artefacts (lists of words and spam) than anything else. If one of the two search engines produces fewer of these, we can but salute its filtering mechanism; in no way can we extrapolate these figures to make comments about its behaviour in general and about the size of its index.
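To make the sampling problem concrete, here is a minimal Python sketch of the strategy described above. It is not the NCSA authors' code: the function names, the toy dictionary and the toy pages are all hypothetical, made up for illustration. It draws random word pairs and lists which pages contain both words.

```python
import random


def sample_word_pairs(words, n_pairs, seed=None):
    """Draw n_pairs pairs of two distinct words, uniformly at random."""
    rng = random.Random(seed)
    return [tuple(rng.sample(words, 2)) for _ in range(n_pairs)]


def matching_documents(pair, documents):
    """Return the indices of the documents that contain both words of the pair."""
    first, second = pair
    matches = []
    for index, text in enumerate(documents):
        vocabulary = set(text.lower().split())
        if first in vocabulary and second in vocabulary:
            matches.append(index)
    return matches


# Toy data (hypothetical): 200 short "real" pages of 5 words each,
# plus one page that is simply a dump of the whole dictionary.
dictionary = [f"word{i}" for i in range(1000)]
pages = [" ".join(dictionary[i:i + 5]) for i in range(0, 1000, 5)]
pages.append(" ".join(dictionary))  # the dictionary dump, at index 200

for pair in sample_word_pairs(dictionary, 5, seed=0):
    print(pair, matching_documents(pair, pages))
```

In this toy setting the dictionary dump matches every pair, while a "real" page matches only when both words happen to fall on the same page, which is rare. A comparison built on such queries therefore mostly measures how many word lists each engine has indexed.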

We can see, for instance, that for the first search carried out by the NCSA researchers - carbolization clambers - the only results returned by Google (and which Yahoo does not find) are pages consisting of simple lists of words, most of which seem to be copies of the ispell dictionary itself.

The following document is a typical example:
It consists of a 1.3 MB file containing 134,175 words that seems to be a copy of ispell. It is not returned by Yahoo for the same search and indeed doesn't seem to figure in the Yahoo database. The Yahoo database, on the other hand, does contain five other (apparently identical) documents that Google does not contain (found via the search wspears dictionary site:www.cs.uwyo.edu):
It is interesting to note that these documents are the only ones among the 29 returned by my search that are not indexed in the Yahoo database, which only includes their URL. Either Yahoo recognises, for instance from a signature calculation, that this is the ispell dictionary, or else it has a filter that allows it to detect documents that are merely lists of words (which is not too difficult to imagine). This is perfectly intelligent behaviour, and much to the search engine's credit.
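As an illustration of why such a filter is not too difficult to imagine, here is a rough Python sketch. It is purely hypothetical and says nothing about how Yahoo actually works: the stopword list, the thresholds and the function name are my own assumptions. The idea is that ordinary prose is full of function words and repeated words, whereas a dictionary dump has almost none of either.

```python
# Small, hand-picked list of English function words (an assumption for this sketch).
STOPWORDS = frozenset({
    "the", "a", "an", "of", "and", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "was",
})


def looks_like_word_list(text, min_tokens=50, max_stop_ratio=0.05,
                         min_unique_ratio=0.9):
    """Guess whether a text is a bare list of words rather than prose.

    Heuristic only: a text is flagged when it is long enough, contains
    very few function words, and almost never repeats a word.
    All thresholds are arbitrary illustrative values.
    """
    tokens = text.lower().split()
    if len(tokens) < min_tokens:
        return False  # too short to judge
    stop_ratio = sum(token in STOPWORDS for token in tokens) / len(tokens)
    unique_ratio = len(set(tokens)) / len(tokens)
    return stop_ratio <= max_stop_ratio and unique_ratio >= min_unique_ratio


word_list = " ".join(f"word{i}" for i in range(200))
prose = "the cat sat on the mat and it was happy with that " * 20

print(looks_like_word_list(word_list))
print(looks_like_word_list(prose))
```

A real engine would presumably combine several signals, and a signature (hash) comparison against known copies of ispell would be an even simpler test for exact duplicates.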

Readers can consult the list of search terms provided by the authors, and can see for themselves that, in the vast majority of cases retained (i.e. those with fewer than 1000 results), the results in question are lists and spam. Results that prove to be an exception to this rule, such as cultist email, have been eliminated by the authors because they return more than 1000 results.

By carrying out their research in this way, the NCSA researchers have shown just one thing: that Google has a greater capacity to index lists of words, including the ispell dictionary, and spam. In no way does it prove that the Yahoo index is smaller (in terms of number of documents indexed) than that of Google.

Quite the contrary; if we look at the same sites as those where Yahoo "forgets" the copies of ispell, we can see how it generally indexes a far higher number of relevant documents than its competitor. For example, on the site www.cs.uwyo.edu mentioned above, Yahoo announces 1630 results for the search wspears site:www.cs.uwyo.edu, and I checked that the first 1000 really do exist. Google only returns 289 (or 249 if we exclude "similar results"). In fact, from about the 200th result onwards, the results returned are simply URLs where the content is not indexed, while the first 1000 in Yahoo are all indexed. Here, we have a factor of 5 to 1 in favour of Yahoo ...

The NCSA study contains another considerable bias, which the authors themselves are aware of, since they quite wisely present their working assumptions right at the beginning of their article:
The study operates under two working assumptions. The first is that both the Yahoo! and the Google search engine return all the results that match the particular keywords and does not do any filtering beyond removing duplicate results.
The thing is, everything seems to suggest that these conditions are not respected. I will demonstrate, in the third part of this article, how this problem invalidates the NCSA study and others of a similar nature.



Post-Scriptum

18 Aug -- Very interestingly the authors have just modified their text and have deleted the phrase "and does not do any filtering beyond removing duplicate results"... [thanks to Serge Courrier who alerted me about this modification]


Follow-up




5 Comments:

Anonymous said...

All you have to say is certainly interesting. It seems that Google and Yahoo have distinctly different views on what a user should see in his results. Google seems to search documents more by the simple words contained in the document, whereas Yahoo, it seems, tries to return pages that are on the subject of the user's query.

Very interesting indeed. Keep up the blog,

Aryeh Hillman
thenewcloo at gmail dot com

22 August, 2005 18:39
Jean Véronis said...

Frank (Surreal dreams)> I guess it's a matter of taste. Many people seem to emphasize that relevance is indeed what matters, though.

22 August, 2005 20:20
Anonymous said...

The problem when searching is that you want relevant results. I will NEVER go through 1000 results on a search.

Keep in mind, just because an engine returns more results doesn't mean that they're relevant results. It all depends on how and what you're searching for.

For the most part, what I'm looking for is nearly identical between searching Yahoo and Google. I tend to favor Google when searching for technical information, and favor Yahoo for localized information (e.g. restaurant menus, etc.). There's no rhyme or reason for this methodology, but all I can say is that I don't see a huge difference between the two.

Great article.

22 August, 2005 22:28
Anonymous said...

Note that for the search "wspears dictionary site:www.cs.uwyo.edu", Yahoo does not need to have the 5 results indexed at all. It might have seen links to them, and maybe stored the URL. That is enough to satisfy the search you did.

However, saying that a file is in the index simply because the engine knows the URL wouldn't be fair. So while it MIGHT be that Yahoo detected the files as ispell dictionaries or spam or anything else, it might just as well simply not have downloaded them.

22 August, 2005 22:40
Anonymous said...

It is interesting to note that there is a new version of the study which addresses the dictionary problem:
http://vburton.ncsa.uiuc.edu/indexsize.html

This version shows Google "wins" 84% of the time, and returns 65% more results on average.

23 August, 2005 00:13