We've been examining whether or not to restore stopwords to the SearchWorks index. Stopwords are words ignored by a search engine when matching queries to results. Any list of terms can be a stopword list; most often the stopwords comprise the most commonly occurring words in a language, occasionally limited to certain functions (articles, prepositions vs. verbs, nouns).

The original usage of stopwords in search engines was to improve index performance (query matching time and disk usage) without degrading result relevancy (and possibly improving it!). It is common practice for search engines to employ stopwords; in fact Solr (http://lucene.apache.org/solr), the search engine behind SearchWorks, has English stopwords turned on as the default setting.

In our implementation of SearchWorks, there was no compelling reason to change most of the default Solr settings; thus, since SearchWorks's inception we have been using the following stopword list: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, s, such, t, that, the, their, then, there, these, they, this, to, was, will, with.
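
To make the effect of that list concrete, here is a minimal sketch of stopword removal in Python. It is not how Solr implements its stop filter; the `tokenize` and `remove_stopwords` helpers are hypothetical names, and the tokenizer is a deliberately crude assumption (lowercase, split on anything that is not a letter or digit). Only the stopword list itself comes from the post.

```python
import re

# Stopword list exactly as quoted in the post above.
STOPWORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or s such "
    "t that the their then there these they this to was will with".split()
)


def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters.

    This is a simplifying assumption, not Solr's actual analysis chain.
    """
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword list."""
    return [tok for tok in tokens if tok not in stopwords]


if __name__ == "__main__":
    for query in ("To be or not to be", "The Who", "A tale of two cities"):
        print(query, "->", remove_stopwords(tokenize(query)))
```

Under these assumptions, a query made up entirely of stopwords, such as "To be or not to be", is left with no terms at all, which is the kind of case that motivates the analysis.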

What follows is an analysis of how stopwords are currently affecting SearchWorks, and what might happen if we restore stopwords to SearchWorks, making every word significant for every search.

The Digital Production Group is very excited about an upcoming project featuring the personal papers of "Laura Bassi, a noted 18th-century Italian scientist and Europe's first female professor," with Project Manager Cathy Aster at the helm.

More information to come, but in the meantime take a look at this recent article in the Stanford University News.

The (meta)data underneath SearchWorks is largely based on our MARC records from Symphony. MARC records are exported from Symphony, then slurped up by an application called SolrMarc, which transforms the MARC data into an index for the Solr search engine used by SearchWorks.

SolrMarc is open source software made available by Bob Haschart of the University of Virginia Libraries. SolrMarc is used by all(?) VuFind sites as well as most Blacklight sites built on MARC data (e.g. SearchWorks). SolrMarc has been great for us -- it gave us an enormous jump start for SearchWorks. Bob is also a great guy, and made me a "committer" almost immediately -- so I can make contributions to the open source code.

But.

Open Source Software does best when there is a critical mass of developers: group wisdom rocks, as does sharing the work. To date, SolrMarc is very much Bob's project, despite a number of committers such as myself. There are some ... interesting ... practices as to how SolrMarc is organized and how it is tested. I've even contributed a bit to some of its squirreliness. Occasionally, changes to the SolrMarc codebase break the code I've written especially for Stanford.

In February approximately 7,000 objects representing six collections were accessioned to the Stanford Digital Repository (SDR), bringing the total number of objects in SDR to nearly 250,000.

  1. Buckminster Fuller collection: 5,200 slides
  2. Kitai topographical maps: 1,600 maps
  3. McLaughlin Maps, California as an Island: 114 maps
  4. R. Stuart Hummel collection: 52 items
  5. Eliasaf Robinson collection addendum: 1 gazette
  6. Islamic prayer book, 1228 H: 1 manuscript

More details, including links to sample images, are listed below.

Inclusion in the Stanford Digital Repository ensures that these materials are available to researchers and scholars (while upholding appropriate access restrictions), now and in the future through a secure, sustainable stewardship environment.

While many of these objects are already discoverable via SearchWorks, others will get SearchWorks records in the coming months. However, all materials are currently available via the item’s PURL (a persistent URL that ensures these materials are available from a single URL over the long term, regardless of changes in file location or application technology).

In March, approximately 2,100 objects representing three collections were accessioned to the Stanford Digital Repository (SDR).

  • R. Stuart Hummel collection: ~ 2,100 items
  • The Life of Saint Catherine, Codex M0381: 1 manuscript
  • Special collection requests: 1 thesis

More details, including links to sample images, are listed below.

While many of these objects are already discoverable via SearchWorks, others will get SearchWorks records in the coming months. However, all materials are currently available via the item’s PURL (a persistent URL that ensures these materials are available from a single URL over the long term, regardless of changes in file location or application technology).

In May, approximately 1,400 images representing eighteen mostly 15th- and 16th-century books were accessioned to the Stanford Digital Repository (SDR). These items are part of Special Collections' goal to digitize and make more accessible materials considered "Beautiful Books". John Mustain is the collection contact for the materials listed below.

All of these books were previously discoverable via SearchWorks, but viewing these non-circulating materials required a visit to Special Collections. Access to digitized images of these books is now available via the item’s PURL (a persistent URL that ensures these materials are available from a single URL over the long term, regardless of changes in file location or application technology).

DLSS has released the source code for two of its library infrastructure projects:

Argo, Stanford's administrative "hydra head" for Fedora, provides a viewing, reporting and administrative interface for objects in a Fedora repository. It is also coupled with Stanford's lightweight and engine-free workflow system ("WorkDo") to provide a workflow visualization and control mechanism. WorkDo is a Hydra- and Fedora-compatible system that chains small scripts ("robots") and microservices into complex processes to complete both human- and machine-based task flows.

dor-services is a Ruby gem that exposes Stanford’s Fedora-based Digital Object Registry (DOR) services and content models to both Hydra and non-Hydra processes. In addition to functional access to DOR’s Registration, Workflow, Identifier, Search, Metadata, Digital Stacks, and Preservation Ingest services, the dor-services library also defines a number of discrete modules that can be mixed into Hydra object models to extend their functionality. Each module is named according to a salient characteristic that it imparts to a digital object, and defines both object methods (what the object can do) as well as expectations (what metadata the object needs to provide) in order to properly represent that characteristic.
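
The mix-in idea can be sketched in a few lines. The example below is written in Python rather than Ruby and does not reproduce the dor-services API: the class names (`Describable`, `Shelvable`, `DigitalObject`, `Book`), the `required_metadata` attribute, and the sample identifier are all hypothetical, chosen only to illustrate a module that contributes both methods and metadata expectations.

```python
class Describable:
    """Hypothetical mixin: the object carries descriptive metadata."""

    required_metadata = ("title",)

    def label(self):
        return self.metadata["title"]


class Shelvable:
    """Hypothetical mixin: the object can be placed at a delivery location."""

    required_metadata = ("shelf_path",)

    def shelve(self):
        return f"shelved at {self.metadata['shelf_path']}"


class DigitalObject:
    """Hypothetical base object holding an identifier and a metadata dict."""

    def __init__(self, pid, metadata):
        self.pid = pid
        self.metadata = dict(metadata)

    def missing_metadata(self):
        """List the metadata keys expected by mixed-in modules but not provided."""
        needed = []
        for cls in type(self).__mro__:
            # Look only at attributes declared directly on each class.
            for key in vars(cls).get("required_metadata", ()):
                if key not in self.metadata and key not in needed:
                    needed.append(key)
        return needed


class Book(Describable, Shelvable, DigitalObject):
    """An object model that opts into two characteristics."""


if __name__ == "__main__":
    book = Book("example:1", {"title": "Sample"})
    print(book.label())
    print(book.missing_metadata())
```

In this sketch each mixin is named for the characteristic it adds, supplies the methods that go with it, and declares what metadata it expects, so an object model can be checked against the expectations of everything it mixes in.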

In June, approximately 68,000 images representing nearly 300 items across several collections were accessioned to the Stanford Digital Repository (SDR). The items include:

  • Archives Parlementaires (81 books, 64,800 pages)
  • Classic Papyrii (44 fragments, 88 images)
  • Stanford Oral History Project (140 interviews, 2110 files)
  • Special Collections Materials (18 photo collections, 900 images)

While many of these objects are already discoverable via SearchWorks, others will get SearchWorks records in the coming months. However, all materials are currently available via the item’s PURL (a persistent URL that ensures these materials are available from a single URL over the long term, regardless of changes in file location or application technology).
