[Xapian-discuss] Stopword addition and stemming
Avi Rappoport
avi-list at searchtools.com
Wed Nov 17 20:07:51 GMT 2010
>Hmm, interesting. I'm wondering how good an idea this would be for a
>general-usage search engine (specifically to prevent the
>phrase-search-time penalty for "co.us")? Shooting from the hip I
>think it's a great trade-off. I just *know* folks are going to search
>for [bob_co.us] and then wonder why the page is not responding
>promptly.
>
>Can you think of a downside to doing this?
Yes: the more stopwords, the more confusing the search results.
In general-purpose search engines, it's usually a bad thing to overdo the
stopwords. With short queries, you never know what's going to be
vital. For example, searching for The Who on wordpress.com is
pretty awful, and searching for The The fails completely.
I highly recommend using the data from your search logs to guide you,
rather than shooting from the hip. How often do the tokens "us" and
"co" appear in the logs? Are they always together, or in phrases
with other terms? If you have facets, are these terms useful for
faceting?
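
As a rough illustration of the log questions above, here is a minimal Python sketch. It assumes the search log is available as a plain list of query strings; the function name token_stats, the tokenizing regex, and the sample queries are all hypothetical and have nothing to do with Xapian's own tokenizer.

```python
import re
from collections import Counter

def token_stats(queries, tokens=("co", "us")):
    """Count how often candidate stopwords show up in logged queries.

    Hypothetical helper: splits on anything that is not a letter or
    digit, which is only an approximation of a real indexer's tokenizer.
    """
    stats = Counter()
    for query in queries:
        terms = re.findall(r"[a-z0-9]+", query.lower())
        present = [t for t in tokens if t in terms]
        if not present:
            continue
        for t in present:
            stats[t] += 1                      # queries containing this token
        if len(present) == len(tokens):
            stats["all_together"] += 1         # every candidate in one query
        if any(t not in tokens for t in terms):
            stats["with_other_terms"] += 1     # candidate alongside other terms
    return stats

# Made-up sample log, for illustration only.
sample_log = ["bob_co.us", "co.us", "the who", "visit us", "denver co"]
stats = token_stats(sample_log)
print(dict(stats))
```

If the candidates mostly appear alongside other terms, as in the made-up sample, dropping them as stopwords risks changing what those queries mean.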
HathiTrust has a great set of articles about how they're using ngrams
for the most frequent words so they don't have to index every single
one.
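
To make the ngram idea concrete, here is a small Python sketch of one common variant: keep every word as a term, and additionally emit a two-word term whenever either word in an adjacent pair is on a frequent-word list. This is only my reading of the general technique, not HathiTrust's actual implementation; the function name common_grams, the "_" joiner, and the tiny frequent-word set are assumptions.

```python
def common_grams(tokens, common):
    """Emit each token, plus a joined bigram for any adjacent pair
    in which at least one token is in the `common` set."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common or nxt in common:
                out.append(tok + "_" + nxt)   # e.g. "the_who"
    return out

# Hypothetical frequent-word list, for illustration only.
frequent = {"the", "of", "a"}

print(common_grams(["the", "who"], frequent))
print(common_grams(["the", "the"], frequent))
print(common_grams(["search", "engine"], frequent))
```

A phrase search for The Who can then look up the single, much rarer term the_who instead of intersecting position lists for two very common words, and nothing has to be thrown away as a stopword.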
--
Complete Guide to Search Engines for Web Sites and Intranets
<http://www.searchtools.com>