Collection development

Our collection development guidance is intended to fulfill the following objectives:

  • complement discipline-specific collection development policies;
  • help curators decide what and, more importantly, what not to collect; and
  • ensure that comparatively limited web archiving resources are deployed only for the most valuable content.

Focus on at-risk content

All web content is in some sense at risk; this is, in fact, the raison d'être for web archiving. Particular categories of web content are more at risk, however, because they are of time-limited interest or purpose, are subject to government censorship, or are disseminated by immature organizations, among other reasons. Spontaneous events, including disasters, revolutions, and trending social topics, may briefly occupy the public spotlight, then fade from view. This unique and ephemeral content is especially deserving of our attention.

Complement existing collecting strengths

We have collecting strengths in particular areas, reflected in the research we support, our staffing for different subjects, our Special Collections, our relationships with donors and alumni, our geography, and our institutional history. We add value when we consider web archiving as a potential component of a broader collecting plan and create web archives to complement other extant and prospective collections.

Observe resource constraints

A format-agnostic collection development policy will more than likely designate a broader range of web content as in scope for collecting than is practically feasible, given available web archiving resources. We should be mindful of the collection dimensions that are most likely to increase costs. These include not just the number of nominated websites but also their complexity (which demands additional staff time for crawl configuration and quality assurance) and their contents (e.g., large files like video balloon storage requirements).

Consider what others are collecting

We are a member of an international community whose collective goal is collecting, preserving, and providing access to the historical web. Given the cumulative and growing volume of information that has ever existed on the Web, even our aggregated efforts capture only a small fraction of it. We should therefore strive to identify existing web archives that overlap with areas where we intend to archive the Web ourselves and minimize duplication of effort. An enhancement to this approach is finding ways to give our users seamless access to those external resources, such as through topic guides, SearchWorks, or Memento.

Web archive holdings are not documented systematically in terms of subject area, temporal coverage, language, top-level domain, or other identifiers, though research is underway that should simplify this. In the meantime, places to consult to discover existing web archives include: Archive-It's collections portal, California Digital Library's Web Archive Service collections portal, the International Internet Preservation Consortium's list of member archives, the Wikipedia List of Web archiving initiatives, the Internet Archive Wayback Machine, and the UK Web Archive Memento aggregator service. Curators can often learn about, or contribute to, planned web archives through their discipline-specific communities of practice. If we discover overlap with another web archive, we should also consider the depth and frequency of its captures to determine whether archiving the content ourselves is still worthwhile.
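As a rough illustration of weighing another archive's capture frequency, the sketch below summarizes a list of capture timestamps for a single resource. It assumes the timestamps have already been gathered by hand from the other archive and are in the 14-digit YYYYMMDDhhmmss form used in Wayback-style URLs; the function names and sample values are hypothetical. It speaks only to frequency, not to how deeply a site was crawled.

```python
from datetime import datetime
from statistics import median


def parse_capture(ts):
    """Parse a 14-digit capture timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def capture_summary(timestamps):
    """Summarize how often a resource was captured by another archive."""
    captures = sorted(parse_capture(ts) for ts in timestamps)
    if not captures:
        return {"count": 0, "first": None, "last": None, "median_gap_days": None}
    # Gaps, in days, between each pair of consecutive captures.
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(captures, captures[1:])]
    return {
        "count": len(captures),
        "first": captures[0],
        "last": captures[-1],
        "median_gap_days": median(gaps) if gaps else None,
    }


# Hypothetical sample: roughly monthly captures followed by a long gap.
sample = ["20140101000000", "20140201000000", "20140301000000", "20140915000000"]
summary = capture_summary(sample)
print(summary["count"], summary["median_gap_days"])
```

A long median gap, or a last capture well in the past, may suggest that the existing archive does not cover the content closely enough to rule out our own collecting.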

Consider the access conditions of what others are collecting

National libraries, in particular, create web archives under legal frameworks that permit only limited access (e.g., on-premises or for designated research). While we should generally avoid duplicating the archiving of web content that another organization has already preserved, the prospect that the organization will not make it accessible should count in favor of our archiving it as well.

Assess value to researchers

A fundamental challenge in selecting content is that its potential utility increases over time, as the risk of change to or loss of the original content grows and the archive takes on historical context. Through their relationships with faculty and their awareness of the web resources that have been vital to research within a given subject area, curators are best positioned to identify the content that matters for future research.

Enable specific research use cases

The hope is that the rest of these guidelines yield web archive collections that are ultimately useful to researchers, broadly construed. Given that goal, specific researcher use cases that happen to contravene some of the guidance are still worth considering. For example, we were approached by researchers in the Political Science department with a scholarly use case for archiving 2014 congressional campaign websites. Had we strictly followed the guideline not to duplicate other organizations' collecting efforts, it would not have made sense to support the project, since the Library of Congress builds comprehensive election web archives every cycle. However, we also determined that the Library of Congress data could not be made available to the researchers within the time frame they needed, so we worked out another way to enable the archiving in partnership with Archive-It.

Consider the appropriate archiving service

Even when we have concluded that some web content is worth saving, web archiving isn't necessarily the most appropriate mechanism for capturing, storing, and re-presenting it. Web archiving is best suited to cases where either the "object" of interest is a website, consisting of an arbitrary number of files that must be stored and re-presented in their original relationship to each other, or it is important to preserve the temporal context of the web content, such that it could later be addressed by date using Memento. The technical limitations of web archiving may also point toward alternative approaches.
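To make the idea of temporal addressing concrete: under the Memento protocol (RFC 7089), a client asks a TimeGate for the capture of a web address closest to a desired date by sending an Accept-Datetime header. The sketch below only builds such a request and does not send it. The TimeGate base URL is an assumption for illustration, and the function name is hypothetical; substitute whichever TimeGate or aggregator applies.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: a TimeGate base URL that accepts the original address appended to it.
TIMEGATE_BASE = "https://web.archive.org/web/"


def memento_request(original_url, when, timegate_base=TIMEGATE_BASE):
    """Return (url, headers) for a Memento TimeGate request.

    `when` should be timezone-aware; it is converted to UTC because
    Accept-Datetime uses the HTTP date format, which is expressed in GMT.
    """
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return timegate_base + original_url, headers


url, headers = memento_request(
    "http://example.com/", datetime(2014, 11, 4, tzinfo=timezone.utc)
)
print(url)
print(headers["Accept-Datetime"])
```

Sending these headers with any HTTP client would, on a Memento-compliant server, typically yield a redirect to the capture nearest the requested date.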

Everyday Electronic Materials (EEMs) was built partly with web-accessible documents in mind. It is a better solution for that use case since the documents are discrete and preserving their temporal context is unimportant. Web-based video can be difficult to capture using crawler-based tools and demands outsized amounts of storage relative to other web objects. The Stanford Media Preservation Lab is improving its support for digital video formats.

Prefer archiving content over links to content

Historical web addresses are valuable primarily, though not exclusively, for facilitating access to historical web resources. The web address is the key to discovering the range of temporal snapshots for a particular resource within Wayback; it can facilitate discovery of resources stored in other web archives; and it incidentally tells us about the content beyond the edges of what was captured.
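The role of the web address as a lookup key can be sketched as follows. The example assumes the URL layout used by the public Internet Archive Wayback Machine, in which a 14-digit timestamp and the original address are appended to a base path, and an asterisk in place of the timestamp lists all captures. Other Wayback deployments may differ, so the base URL and function names here are illustrative only.

```python
from datetime import datetime

# Assumption: base path of a Wayback deployment (here, the public Wayback Machine).
WAYBACK_BASE = "https://web.archive.org/web/"


def snapshot_url(original_url, when, base=WAYBACK_BASE):
    """Address of the capture nearest `when` for the given web address."""
    return f"{base}{when.strftime('%Y%m%d%H%M%S')}/{original_url}"


def all_captures_url(original_url, base=WAYBACK_BASE):
    """Address of the listing of every capture for the given web address."""
    return f"{base}*/{original_url}"


print(snapshot_url("http://example.com/", datetime(2014, 11, 4)))
print(all_captures_url("http://example.com/"))
```

In both cases the original address is the only thing a user needs to know in order to find what was captured, which is why recording historical addresses has value even apart from the content.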

On the other hand, accessing the historical web resources themselves presupposes that they have been archived. If there is value in capturing an aggregation of links on a particular topic, it is likely on account of the value of the websites that those links point to, so those websites should be considered for collection themselves. Depending on their origin, externally curated lists of links can be a useful selection tool. For example, the Library of Congress uses the effectively crowd-sourced Wikipedia list of U.S. think tanks to seed their Public Policy Topics collection.

Prefer current and esoteric content

Current content is less likely to be represented in existing web archives than content that has been on the Web for a while. It also matters how content is linked to; on average, search engine results and shortened links are less prevalent in web archives than are resources that many stable websites link to. We should prefer content that is contemporary, not likely to be extensively linked to, or both.