Web archiving

logo of the International Internet Preservation Consortium

Web archivists Ahmed AlSum and Nicholas Taylor and LOCKSS Chief Scientist David Rosenthal recently attended the International Internet Preservation Consortium (IIPC) General Assembly, an annual meeting of national libraries, research universities, non-profits, and service providers engaged in web archiving. This was the first General Assembly we all attended since Stanford University Libraries (SUL) joined the IIPC, though we had all previously attended meetings under the auspices of other organizations.

Niels Brügger's closing remarks best captured the emergent theme of the meeting: how can we best serve researchers, broadly construed? The word clouds on the fourth and fifth slides of his presentation (PPT) helped to visualize how the focus of the international web archiving community has shifted over the past decade.

In keeping with the emphasis on understanding how web archives are being used, the open day (PDF) consisted of presentations by researchers working with historical web content. Examples included an initiative to create distributed web science research centers (PPT), the user demographics of shuttering consumer web services (PDF), the proffering of web archive datasets on cloud infrastructure (PPT), and an architecture for archiving cited web addresses in scholarly publications deposited into a repository.

The presentations and discussions from the member-only days (PDF) have not been systematically gathered, but some are available. There were discussions about collaborative or, at least, mutually informed collection development; models of close collaboration between researchers and web archiving organizations; the exchange of best practices for full-text indexing; and updates on the OpenWayback collaborative development effort.

The last day and a half consisted of open workshops (PDF) on topics including crawl engineering, the web archiving tool landscape, the role and responsibilities of curators, and novel crawler architectures for capturing dynamic content or facilitating creation of precise corpora through interactive archiving. I co-organized the Curator Tools Fair (PDF) with Abbie Grotke and presented on strategic web archive collection development.

SUL will be assuming an increasing role in the IIPC in the coming year. I have stepped up as co-lead of the Access Working Group along with Daniel Gomes; we will continue to contribute to a technical proposal for profiling of web archives to enable scalable Memento aggregation, and we are exploring co-hosting the next General Assembly in the San Francisco Bay Area in collaboration with California Digital Library and Internet Archive.

'Material' (under CC BY-NC 2.0)

Congressional campaign websites are valuable primary source material for historians, social scientists, and the public to better understand the evolution of political communication in the Web era. Campaign websites also afford unique opportunities for the mass collection of materials that would previously have been difficult to acquire outside of the candidate's district. While it is a truism that the Web is constantly changing and broken links are an inevitable outcome, campaign websites are predictably ephemeral given their time-limited purpose.