[Xapian-discuss] Stopword addition and stemming

Avi Rappoport avi-list at searchtools.com
Wed Nov 17 20:07:51 GMT 2010


>Hmm, interesting.  I'm wondering how good an idea this would be for a
>general-usage search engine (specifically to prevent the
>phrase-search-time penalty for "co.us")?  Shooting from the hip I
>think it's a great trade-off.  I just *know* folks are going to search
>for [bob_co.us] and then wonder why the page is not responding
>promptly.
>
>Can you think of a downside to doing this?

Yes: the more stopwords, the more confusing the search results.

In general-purpose search engines, overdoing the stopwords is usually
a bad idea.  With short queries, you never know which word is going to
be vital.  For example, searching for [The Who] on wordpress.com gives
pretty awful results, and searching for [The The] fails completely.

I highly recommend using the data from your search logs to guide you,
rather than shooting from the hip.  How often do the tokens "us" and
"co" appear in the logs?  Do they always appear together, or in
phrases with other terms?  If you have facets, are these terms useful
for faceting?
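
As a rough sketch of that log check, assuming the log can be reduced
to one query string per entry: the Python below counts how many
queries contain each candidate token, how many contain the two side by
side, and how many contain one of them in some other position.  The
tokenizer, the function name candidate_report and the sample queries
are all made up for illustration; none of this is Xapian's own
tokenization.

```python
import re
from collections import Counter

# Assumption: a crude tokenizer for log analysis only. It lowercases the
# query and splits on anything that is not a letter or digit, so
# "bob_co.us" becomes ["bob", "co", "us"].
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(query):
    return TOKEN_RE.findall(query.lower())


def candidate_report(queries, first="co", second="us"):
    """Count queries containing each candidate, together and apart."""
    counts = Counter()
    for query in queries:
        tokens = tokenize(query)
        counts["queries"] += 1
        has_first = first in tokens
        has_second = second in tokens
        if has_first:
            counts[first] += 1
        if has_second:
            counts[second] += 1
        # "Adjacent" means first immediately followed by second.
        adjacent = any(
            a == first and b == second for a, b in zip(tokens, tokens[1:])
        )
        if adjacent:
            counts["adjacent"] += 1
        elif has_first or has_second:
            # A candidate appears, but not as the first+second pair.
            counts["apart"] += 1
    return counts


# Hypothetical sample queries standing in for a real search log.
SAMPLE_QUERIES = [
    "bob_co.us",
    "denver co us schools",
    "contact us",
    "co op grocery",
    "xapian stemming",
]

report = candidate_report(SAMPLE_QUERIES)
for key in ("queries", "co", "us", "adjacent", "apart"):
    print(key, report[key])
```

If the "apart" count is not tiny compared with "adjacent", the tokens
are doing real work in other queries and are poor stopword candidates.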

HathiTrust has a great set of articles about how they use n-grams
for the most frequent words, so they don't have to index every single
one.
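
One way to picture the n-gram idea, as a sketch and not necessarily
what HathiTrust actually runs: keep every token, and also emit a
bigram term whenever one of two neighbouring tokens is in a list of
very frequent words.  A phrase such as [The Who] can then be matched
through the much rarer bigram term.  The function name common_grams,
the "_" joiner and the word list are assumptions for illustration.

```python
def common_grams(tokens, common_words):
    """Return the tokens plus bigrams for pairs that involve a common word."""
    terms = []
    for i, token in enumerate(tokens):
        terms.append(token)
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            # Only pairs touching a common word get a bigram term.
            if token in common_words or following in common_words:
                terms.append(token + "_" + following)
    return terms


# Hypothetical list of very frequent words.
COMMON = {"the", "of", "a"}

print(common_grams(["the", "who", "live"], COMMON))
print(common_grams(["the", "the"], COMMON))
```

The words stay searchable here, unlike with a stopword list; the cost
is a larger term dictionary.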
-- 
Complete Guide to Search Engines for Web Sites and Intranets
    <http://www.searchtools.com>


