Stopwords in SearchWorks - to be or not to be

We've been examining whether or not to restore stopwords to the SearchWorks index. Stopwords are words ignored by a search engine when matching queries to results. Any list of terms can be a stopword list; most often the stopwords comprise the most commonly occurring words in a language, occasionally limited to function words (articles, prepositions) as opposed to content words (verbs, nouns).

The original usage of stopwords in search engines was to improve index performance (query matching time and disk usage) without degrading result relevancy (and possibly improving it!). It is common practice for search engines to employ stopwords; in fact Solr (http://lucene.apache.org/solr), the search engine behind SearchWorks, has English stopwords turned on as the default setting.

In our implementation of SearchWorks, there was no compelling reason to change most of the default Solr settings; thus, since SearchWorks's inception we have been using the following stopword list: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, s, such, t, that, the, their, then, there, these, they, this, to, was, will, with.
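To make the effect of this list concrete, here is a minimal sketch of stopword removal applied to a query. It assumes simple lowercasing and whitespace splitting, which is much cruder than Solr's real analysis chain; the function name remove_stopwords is hypothetical and not part of SearchWorks or Solr.

```python
# Stopword list quoted in the text above.
STOPWORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or s such "
    "t that the their then there these they this to was will with".split()
)


def remove_stopwords(query: str) -> list[str]:
    """Lowercase, split on whitespace, and drop stopwords (simplified analysis)."""
    return [term for term in query.lower().split() if term not in STOPWORDS]


print(remove_stopwords("to be or not to be"))         # []
print(remove_stopwords("the pearl"))                  # ['pearl']
print(remove_stopwords("archaeology in literature"))  # ['archaeology', 'literature']
```

Under this simplification, a query made up only of stopwords leaves nothing to match, and "the pearl" becomes indistinguishable from "pearl".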

What follows is an analysis of how stopwords currently affect SearchWorks, and what might happen if we restore them, making every word significant for every search.

Executive Summary of Analysis

The SearchWorks metadata group (see https://consul.stanford.edu/display/NGDE/SearchWorks) believes that restoring stopwords to SearchWorks could improve results in up to 18% of searches, and will degrade results only in the small number of searches with more than 6 terms.

How Many Terms are there in User Queries (not including facet link clicking)?

Over 50% of the query strings for SearchWorks are 1 or 2 terms.
Over 75% of the query strings are 1, 2 or 3 terms.
Over 90% of the query strings for SearchWorks have 6 or fewer terms.

These figures include any stopwords occurring in queries.

Source: Google Analytics for Oct 2011, analyzed by Casey Mullin (https://consul.stanford.edu/display/NGDE/Log+Analysis+Workspace - link at the bottom).

What Percentage of Query Strings have Stopwords?

For November 2011:
There were 142,869 searches
Stopwords appeared in searches 26,076 times

So, stopwords appeared in roughly 18% of searches.

(per analysis by Casey, sent in email to gryph-search on Dec 14, 2011; this information will be in the analytics on consul, once they are updated for November 2011).
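The 18% figure follows directly from the two November counts. The short check below assumes that 26,076 is the number of searches containing at least one stopword, which the wording above does not state precisely.

```python
# November 2011 figures quoted above.
total_searches = 142_869
searches_with_stopwords = 26_076  # assumption: counts searches, not individual stopword occurrences

share = searches_with_stopwords / total_searches
print(f"{share:.1%}")  # 18.3%
```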

Do the Stopwords Currently Used in Queries Imply the Users are Trying Boolean Searches?

The 10 stopwords appearing most often in queries are:

November 2011:
the -- 7578 occurrences
of -- 6582
and -- 4106
in -- 2298
a -- 1137
to -- 1033
for -- 695
on -- 685
an -- 289
with -- 231

"or" and "not" do not appear in many queries, while "and" is not the most frequent stopword, nor close to it in occurrences. I interpret this to mean stopwords in queries are NOT intended as boolean operators.

(per analysis by Casey, sent in email to gryph-search on Dec 14, 2011; this information will be in the analytics on consul, once they are updated for November 2011).
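A tally like the one above can be sketched in a few lines. This is not the analysis Casey ran: the sample queries are simply example searches taken from this page, standing in for a real query log, and the function name stopword_counts is hypothetical.

```python
from collections import Counter

# Stopword list used by SearchWorks, as given earlier on this page.
STOPWORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or s such "
    "t that the their then there these they this to was will with".split()
)


def stopword_counts(queries):
    """Count how often each stopword occurs across a list of query strings."""
    counts = Counter()
    for query in queries:
        counts.update(t for t in query.lower().split() if t in STOPWORDS)
    return counts


# Illustrative sample only; a real run would read the query log instead.
sample_queries = [
    "the man from nowhere",
    "death and taxes",
    "there will be blood",
    "dorothy and the wizard of oz",
    "to be or not to be",
]

counts = stopword_counts(sample_queries)
for word, n in counts.most_common():
    print(word, n)
```

If users were attempting boolean searches, "and", "or" and "not" would be expected to dominate such a tally, rather than "the" and "of".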

What About "Minimum Must Match"?

Solr allows us to cleverly fudge boolean AND or OR with a setting it calls "mm" or "Minimum Must Match". Our mm value says that if the query has 4 terms or fewer, all must match, but if a query has 5 or more terms, 90% (rounded down) must match. It was suggested a while back that we increase the mm threshold to 6 (from 4). Restoring stopwords to significance makes it more important to increase the mm threshold, as there are more significant words in our queries. Given that over 90% of queries have 6 or fewer terms, 6 seemed an appropriate threshold.

see https://consul.stanford.edu/display/NGDE/How+Search+Works+in+SearchWorks to learn more about mm
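The mm rule described above can be sketched as a small function. This is an illustration of the rule as stated on this page, not Solr's own code; the name min_should_match and its parameters are hypothetical. In Solr's mm syntax, a rule of this shape would presumably be written as something like 4<90% (and 6<90% for the proposed threshold), though the exact SearchWorks setting is an assumption here.

```python
def min_should_match(num_terms: int, threshold: int = 4, percent: int = 90) -> int:
    """Number of query terms that must match under the rule described above.

    Up to `threshold` terms: all must match.
    Above `threshold`: `percent` of the terms, rounded down, must match.
    """
    if num_terms <= threshold:
        return num_terms
    return num_terms * percent // 100  # integer division rounds down


# Current threshold (4) versus the proposed threshold (6).
for n in (4, 5, 6, 7, 10):
    print(n, min_should_match(n), min_should_match(n, threshold=6))
```

For example, "Lectures on the Calculus of Variations and Optimal Control Theory" has 10 terms once stopwords are significant, so 9 must match under either threshold. With the four stopwords dropped it has 6 terms, of which 5 must match under the current threshold of 4.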

What is Improved by Restoring Stopwords to the Index?

1. searches comprised only of stopwords now retrieve results (improved recall)
examples:
- to be or not to be (with or without quotes)

2. precision is greatly improved for short searches including stopwords
examples:
- pearl vs. the pearl
- the one
- a zukofsky (author Zukofsky, title "a")
- there will be blood
Prod: 12678 results (as a title search, 5013)
Test: 31 results (as a title search, 5)
- OR spectrum (a periodical)
- Jazz: an introduction

3. subject links distinguish "in" from "and", etc.
archaeology in literature is no longer conflated with archaeology and literature

4. improved results for languages having lexical words overlapping English stopwords

What is Degraded by Restoring Stopwords to the Index?

1. long queries (over 6 terms) with a lot of stopwords have reduced precision
BUT: the words occurring as a phrase float to the top.
example: Lectures on the Calculus of Variations and Optimal Control Theory

What Else Have Testers Reported?

Note that "test" has stopwords restored (every word is significant), while "prod" still applies the stopword list (stopwords are ignored).

Kathy Kerns (email of Dec 2): known item searches - in all cases, test kept the correct result at the top or tied or improved its relevance.

Casey Mullin (email of Dec 2): children in literature as a subject search - test is much better.

Phyllis Kayten (feedback of Dec 5):
searched for: dorothy and the wizard of oz (known item search) - title sought was actually dorothy and the wizard *IN* Oz; test did not retrieve it (due to increased precision), but prod did.
searched for: the man from nowhere - test is much better
searched for: death and taxes - first result on test is worse, but next results are good. First result on prod isn't perfect either.

Linda Yamamoto (feedback of Dec 9):
searched for: Lectures on the Calculus of Variations and Optimal Control Theory (known item search) - correct result is first in both prod and test. The other results: prod is better (but this is one of those long title searches).
C# and C++ searches don't work (unrelated -- this is a special character searching issue and has nothing to do with stopwords)

Vitus Tang (feedback of Dec 2):
"A potential problem of the stopword change is that title access points (aka uniform title) constructed according to AACR2 are without initial articles. So, for instance, the access point for the series "The NASA history series" is "NASA history series". A query that includes the initial article will not affect the search result in current production SW because "the" is eliminated as a stopword, but will affect the search result when stopwords are treated as significant words. On test, a phrase title search for "The NASA history series" retrieves 76 records. The same search on production retrieves 125 records. The test search still retrieves some of the records that belong to this series because the transcribed series statement, which is in the 490 field, includes the initial article, but not all of them do. The series access points in the 830 field are all without the initial article. [Symphony browse series retrieves 94 results.]"

Naomi's notes on Vitus's feedback: in gryphon-search, many of the records we examined had the "wrong" information in the field (it included the initial article, and it shouldn't have). Sooo … our data is dirty -- shocking, but true. It would be nice to know if series searches are common outside of Library Staff.

Additional Comments

SearchWorks employing stopwords gives imperfect search results. SearchWorks restoring stopwords, so that every term is significant, gives different imperfect search results. Socrates gives yet different imperfect search results. The back end algorithms for determining what results match a query will always be fairly opaque to the end users - the algorithms are complicated. Moreover, users will have typos and other mistakes in their queries no matter what we do, and it seems unlikely we can consistently rescue them from themselves.

Solr gives us incredible control over our search engine's algorithm. There are many many knobs we can twiddle in our quest to improve the relevancy of search results. A few of the possibilities include:

a. tweak mm -- require a higher percentage of matching terms when there are more than 6 terms in the query

b. increase phrase boosting -- this would float results to the top that have the query terms occurring close together (and presumably in the same order). The current boost *seems* high enough, but we have never performed any empirical tests.

c. reduce phrase slop (currently 3) -- implies words need to occur closer together in the results. Not clear exactly how phrase boosting and phrase slop interact.

d. adjust the relative boosting of fields (give even more weight to title field matches, etc.)

e. adjust the situations where the length of the INDEXED string affects the score of matches. (query "my cat" scores higher for title "my cat" than for "my cat and dog")

(See https://consul.stanford.edu/display/NGDE/Glossary for more information.)
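Item e can be illustrated with the length normalization used by Lucene's classic TF-IDF similarity, where a field's norm is 1/sqrt(number of indexed terms). This is a sketch under the assumption that SearchWorks uses that default similarity; it ignores index-time boosts and the lossy encoding Lucene applies when storing norms, and the function name length_norm is hypothetical.

```python
import math


def length_norm(num_indexed_terms: int) -> float:
    """Classic Lucene-style length normalization: shorter fields score higher."""
    return 1.0 / math.sqrt(num_indexed_terms)


# Title "my cat" has 2 indexed terms either way.
print(round(length_norm(2), 3))  # 0.707

# Title "my cat and dog": 3 indexed terms if "and" is dropped as a stopword,
# 4 indexed terms if stopwords are restored.
print(round(length_norm(3), 3))  # 0.577
print(round(length_norm(4), 3))  # 0.5
```

Under this assumption, restoring stopwords makes titles containing stopwords count as longer, which slightly widens the scoring gap in favor of the exact short title.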