[Xapian-discuss] UTF8 support plans (without stemming)

Thu Apr 28 16:52:37 BST 2005

Craig Macdonald wrote:

>> Well, these two questions relate to each other: Xapian is strong in
>> 'probabilistic IR' and that approach kind of needs some sort of 
>> stemming.
>
> I don't totally agree with that. We've had some success in applying 
> only the first two steps of the English (Porter) stemmer
> to large English web corpora. Many submissions to last year's TREC 
> Terabyte track didn't use stemming at all.
>    http://www.google.co.uk/search?q=2004+trec+terabyte+stemming
> It would also appear to be a similar approach to what Google is doing. 
> The first two steps only drop plurals and tense suffixes.
>
When you are looking for enough hits in a near-infinite document set, the 
drop in recall can be hidden: as long as there are enough results, the 
user never knows what they are missing, because they were never going to 
look at all the good results anyway.

In a smaller document set, or where the user knows what results to 
expect (sometimes the same thing), the missing results can become very 
annoying.
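
To make the trade-off concrete, here is a rough sketch of what "only the 
first two steps" might look like. It assumes that phrase means steps 1a 
and 1b of the original Porter algorithm (plurals, then -ed/-ing); that 
reading is my assumption, and the function names are made up for 
illustration. This is not Xapian code.

```python
# Sketch of a "light" stemmer: Porter steps 1a and 1b only (assumed reading
# of "the first two steps"). All names here are hypothetical.

def _is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start or after a vowel
        return i == 0 or not _is_consonant(word, i - 1)
    return True

def _measure(stem):
    # Porter's m: the number of vowel-to-consonant transitions in the stem
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = _is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def _contains_vowel(stem):
    return any(not _is_consonant(stem, i) for i in range(len(stem)))

def _ends_double_consonant(stem):
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and _is_consonant(stem, len(stem) - 1))

def _ends_cvc(stem):
    n = len(stem)
    return (n >= 3 and _is_consonant(stem, n - 3)
            and not _is_consonant(stem, n - 2)
            and _is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step1a(w):
    # Plurals: sses -> ss, ies -> i, ss -> ss, s -> ""
    if w.endswith("sses") or w.endswith("ies"):
        return w[:-2]
    if w.endswith("ss"):
        return w
    if w.endswith("s"):
        return w[:-1]
    return w

def _tidy(stem):
    # Clean-up applied after -ed or -ing has been removed
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if _ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]
    if _measure(stem) == 1 and _ends_cvc(stem):
        return stem + "e"
    return stem

def step1b(w):
    # Tense suffixes: eed -> ee (if m > 0), then -ed / -ing after a vowel
    if w.endswith("eed"):
        return w[:-1] if _measure(w[:-3]) > 0 else w
    for suffix in ("ed", "ing"):
        if w.endswith(suffix):
            stem = w[:-len(suffix)]
            return _tidy(stem) if _contains_vowel(stem) else w
    return w

def light_stem(word):
    return step1b(step1a(word.lower()))

for w in ("stems", "stemmed", "stemming", "relational"):
    print(w, "->", light_stem(w))
```

With this sketch "stems", "stemmed" and "stemming" all reduce to "stem", 
but "relational" is left untouched, since derivational suffixes like 
that are handled by the later steps omitted here. Terms that only those 
later steps would conflate are where the recall goes missing.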

Sam