Web archiving

'The Dish (HDR)' (under CC BY-NC-SA 2.0)

A couple of weeks ago, Stanford University Libraries hosted Dame Wendy Hall, Jim Hendler, and other web scientists affiliated with the Web Science Trust for a briefing on the Web Observatory initiative and a follow-on workshop organized by Lisa Green from Common Crawl. The notion of a Web Observatory implies a center proffering scientific instruments, but for the analysis of web data rather than natural phenomena. Indeed, the group's vision is that Web Observatories provide access to web datasets, projects, and tools. Eventually, a network of Web Observatories might offer both an interoperable architecture and distributed infrastructures for sharing and analysis of web datasets. The initiative touches on several areas of interest and investment by Stanford University Libraries, including data curation, web archiving, and supporting social science research.

Social science research increasingly depends on computational methods and digital primary materials. As a case in point, the listserv of the Association of Internet Researchers (AoIR), an organization for social science research on networked communications, features regular discussions on web data collection and analysis. A perusal of those conversations underscores the dearth of reusable web datasets and the one-off nature of new datasets that are created. In the context of research data more broadly, it is for this and other reasons that research libraries increasingly offer data curation services. Persistent access to well-described data is only one part of the puzzle, though; as Victoria Stodden noted in the 2013 Forum on the Future of Scientific Publishing, the review, reproduction, and/or reinterpretation of computational analyses also demands the continued availability of the employed applications (PDF). The Web Observatory architecture natively recognizes this requirement.

The web archiving community meanwhile collectively hosts petabytes of historical web data and grapples with the specification of the fundamental set of services (PDF) to support common research use cases. Common Crawl itself provides access to hundreds of terabytes of web data through the Amazon Web Services Public Data Sets platform. Working with (a manageable subset of) this corpus was the focus of the follow-on workshop. The research that the Common Crawl data is more broadly enabling (including by Stanford-affiliated researchers) is a useful demonstration of the interest in web datasets, the kinds of services that researchers may be interested in, and the potential of the Web Observatories initiative.

As we continue to develop our web archiving services, in particular, we will look for opportunities to align with and contribute to the Web Observatories framework.

'Step 7' (under CC BY-NC-ND 2.0)

A major challenge for web archivists is that downstream archiving has low visibility among upstream web content creators. And yet deliberate and inadvertent architectural decisions made by web content creators strongly impact the ease or difficulty with which their websites can be captured and faithfully re-presented. A non-trivial byproduct of webmasters helping to ensure their content is archived for their own later use is that the Web itself becomes more archivable, to everyone's benefit.

logo of the International Internet Preservation Consortium

Web archivists Ahmed AlSum and Nicholas Taylor and LOCKSS Chief Scientist David Rosenthal recently attended the International Internet Preservation Consortium (IIPC) General Assembly, an annual meeting of national libraries, research universities, non-profits, and service providers engaged in web archiving. This was the first General Assembly we all attended since Stanford University Libraries (SUL) joined the IIPC, though we had all previously attended meetings under the auspices of other organizations.

'Material' (under CC BY-NC 2.0)

Congressional campaign websites are valuable primary source material for historians, social scientists, and the public to better understand the evolution of political communication in the Web era. Campaign websites also afford unique opportunities for the mass collection of materials that would previously have been difficult to acquire outside of the candidate's district. While it is a truism that the Web is constantly changing and broken links are an inevitable outcome, campaign websites are predictably ephemeral given their time-limited purpose.