Capture

Improving the ability of archival crawlers to capture your website will also tend to improve discovery of your content by search engine crawlers, enhance website performance by decreasing load caused by robots and other clients, and save storage space.

Make links transparent

An archival crawler finds your content by following links, typically starting from the home page, and it can't archive what it hasn't discovered. Links that depend on JavaScript execution or are embedded in binary files (e.g., Flash, PDF, Word documents, or Excel spreadsheets) tend to be opaque to an archival crawler, so ensure that they are also discoverable in a way that doesn't depend on those technologies. You can further improve the discoverability of your content by enumerating the resources you'd like crawlers to know about in an XML sitemap, a user sitemap, and/or RSS feeds.
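As a rough illustration of enumerating resources for crawlers, the following sketch builds a minimal XML sitemap with Python's standard library. The function name build_sitemap and the example.org addresses are hypothetical; a real sitemap would list your own pages and could carry optional elements such as a last-modified date.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(addresses):
    """Return a sitemap XML document (as a string) listing the given addresses."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in addresses:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical addresses, including a document that might otherwise be
# reachable only through a JavaScript-driven menu.
sitemap_xml = build_sitemap([
    "https://www.example.org/",
    "https://www.example.org/reports/annual.pdf",
])
print(sitemap_xml)
```

The resulting file is typically served from the site root and referenced from robots.txt so that crawlers can find it.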

Represent web app states with links

The breadth and complexity of a modern web application is often belied by the paucity of unique web addresses it presents in the course of its use. Building the application in such a way that distinct states are represented by distinct and fixed web addresses permits users with a shared link to bypass arbitrary interactions to get to the desired destination, facilitates more precise citation, and provides a more granular target for annotation. Critically, it also makes the website more accessible to both search engine and archival crawlers.
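One way to picture this is a pair of helpers that turn an application state into a fixed web address and back again. This is only a sketch under assumed names: BASE, state_to_url, url_to_state, and the state keys are all hypothetical, and a real application might use path segments rather than query parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://www.example.org/catalog"  # hypothetical application address

def state_to_url(state):
    """Encode an application state as a web address.

    Keys are sorted so that the same state always yields the same address.
    """
    return BASE + "?" + urlencode(sorted(state.items()))

def url_to_state(url):
    """Recover the application state from an address built by state_to_url."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}

state = {"view": "map", "year": "1906", "page": "3"}
link = state_to_url(state)
print(link)  # https://www.example.org/catalog?page=3&view=map&year=1906
```

A user or crawler given that link can reach the same view directly, without replaying the interactions that produced it.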

Use one link for each resource

Every web resource is available through at least one web address. For archiving, it is additionally preferable that every web resource be available through no more than one web address. Archival crawlers often de-duplicate captured content based on a combination of web address and checksum. When either of those values varies from what was recorded in a previous crawl, the resource is considered new. Some content management systems allow for the same resource to be served using different web addresses, which will result in superfluous requests from crawlers and increased archival storage requirements.
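The effect of duplicate addresses on de-duplication can be sketched as follows. This is a simplified model rather than the logic of any particular crawler: the function names, the choice of SHA-1, and the example.org addresses are assumptions made for illustration.

```python
import hashlib

def capture_key(address, body):
    """Identify a capture by its web address plus a checksum of the body."""
    return (address, hashlib.sha1(body).hexdigest())

def count_new_captures(seen, responses):
    """Record each response and return how many were treated as new."""
    new = 0
    for address, body in responses:
        key = capture_key(address, body)
        if key not in seen:
            seen.add(key)
            new += 1
    return new

seen = set()
page = b"<html><body>Annual report</body></html>"

# First crawl: the page is found at one address.
first_new = count_new_captures(seen, [("https://www.example.org/report", page)])

# Second crawl: the same bytes are also served from an alternate address.
second_new = count_new_captures(seen, [
    ("https://www.example.org/report", page),
    ("https://www.example.org/index.php?page=report", page),
])

print(first_new, second_new)  # 1 1
```

In the second crawl the unchanged page is correctly skipped at its original address, but the identical content at the alternate address is stored again as if it were new.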

Be careful with robots directives

You may already use the robots exclusion standard to convey machine-readable preferences to search engine crawlers. Most web archiving initiatives obey these instructions as well, overriding them only when they would substantially interfere with archiving. Directives that have historically been appropriate for search engine crawlers - e.g., excluding directories containing scripts and style and layout instructions - are becoming less so. These exclusions have long been problematic in the archiving context, as they may prevent the capture of assets that are essential to faithfully re-presenting the archived website.

Aside from not directing the crawler away from vital resources, the robots exclusion standard can be affirmatively employed to improve archiving efforts. Use a site-level robots.txt file to link to an XML sitemap or specify a sustainable crawler request interval. Some website sections may programmatically generate an arbitrary number of links; liberally ward crawlers away from them using a site-level robots.txt file, a page-level <meta> tag, or rel="nofollow" link attributes.
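The sketch below shows what such a robots.txt file might look like and checks how a crawler would interpret it, using Python's standard urllib.robotparser module (site_maps() requires Python 3.8 or later). The paths, the five-second interval, and the "archive-bot" user agent are hypothetical. Note that Crawl-delay is a widely recognized extension rather than part of the core standard, so not every crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: advertise a sitemap, request a crawl interval,
# and keep crawlers out of an endlessly paginated calendar section while
# leaving scripts and stylesheets open to capture.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /calendar/

Sitemap: https://www.example.org/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "archive-bot"  # hypothetical crawler name
calendar_allowed = parser.can_fetch(agent, "https://www.example.org/calendar/2031/01/")
css_allowed = parser.can_fetch(agent, "https://www.example.org/css/site.css")
delay = parser.crawl_delay(agent)
sitemaps = parser.site_maps()

print(calendar_allowed, css_allowed, delay, sitemaps)
```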

Mind content license terms

In the absence of specific provisions in U.S. copyright law to cover web content archiving by third-party cultural heritage institutions, U.S. web archiving organizations resort to seeking explicit permission and/or asserting fair use. Boilerplate website terms of use or inadvertently strident copyright statements may deter archiving efforts by increasing the perception of legal risk. If you are amenable to archiving of your web content, consider making it available under an open content license or at least ensure that terms of use and copyright statements are not antagonistic to archiving.

Return reliable response codes

Web archiving tools exhibit varying levels of success at reconstructing dynamically generated web addresses from within JavaScript. Sometimes, this process results in the generation of nonexistent web addresses. Configuring your web servers to return reliable HTTP status codes and, in particular, to avoid soft 404 responses (error pages served with a 200 OK status) will facilitate detection and minimization of superfluous requests from archival and other crawlers.
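To make the distinction concrete, here is a minimal WSGI application that answers unknown paths with a genuine 404 status instead of a soft 404. It is a sketch only: the PAGES table, the app function, and the status_for helper are hypothetical, and a production site would configure this in its web server or framework.

```python
# Hypothetical site content, keyed by request path.
PAGES = {
    "/": b"<h1>Home</h1>",
    "/about": b"<h1>About</h1>",
}

def app(environ, start_response):
    """Serve known pages with 200 and everything else with a real 404."""
    body = PAGES.get(environ.get("PATH_INFO", "/"))
    if body is None:
        # A soft 404 would send "200 OK" here with an error page as the body,
        # leaving crawlers unable to tell that the address does not exist.
        status = "404 Not Found"
        body = b"<h1>Not found</h1>"
    else:
        status = "200 OK"
    start_response(status, [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

def status_for(path):
    """Call the app directly and return the status line it produced."""
    captured = {}
    def start_response(status, headers):
        captured["status"] = status
    app({"REQUEST_METHOD": "GET", "PATH_INFO": path}, start_response)
    return captured["status"]

print(status_for("/about"))         # 200 OK
print(status_for("/no-such-page"))  # 404 Not Found
```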

Implement caching enhancements

Web clients, including archival crawlers, take advantage of various HTTP response headers to minimize requests for content that hasn't changed since it was last cached: Content-Length, Last-Modified, and ETag. Research suggests that server responses related to caching can be surprisingly unreliable, but the correct implementation of these HTTP headers will reduce superfluous requests from all types of clients.
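The following sketch illustrates how an ETag lets a returning client skip an unchanged resource. It is deliberately simplified: make_etag and respond are hypothetical helpers, the tag is derived from a SHA-256 hash of the body, and the comparison ignores details such as weak validators and comma-separated lists in If-None-Match.

```python
import hashlib

def make_etag(body):
    """Derive a quoted entity tag from the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body, request_headers):
    """Return (status, headers, body), honoring an If-None-Match header."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if request_headers.get("If-None-Match") == etag:
        # The client's cached copy is current, so send no body at all.
        return 304, headers, b""
    headers["Content-Length"] = str(len(body))
    return 200, headers, body

page = b"<html><body>Annual report</body></html>"

# First visit: full response, and the client remembers the ETag.
status1, headers1, _ = respond(page, {})

# Revisit with the remembered ETag: nothing has changed.
status2, _, body2 = respond(page, {"If-None-Match": headers1["ETag"]})

# Revisit after the page was edited: the tag no longer matches.
status3, _, _ = respond(page + b"<!-- edited -->", {"If-None-Match": headers1["ETag"]})

print(status1, status2, status3)  # 200 304 200
```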

Minimize reliance on external assets necessary for presentation

It is increasingly easy to take advantage of externally hosted assets, such as fonts or JavaScript libraries. Doing so can improve website performance and the availability and caching of those resources and, in a limited number of cases, can also help archiving. For Internet Archive's ongoing crawls of the public Web, centrally hosted files used by many different websites only need to be captured once. The rest of the web archiving community doesn't collect so broadly, however, and the major risk is that external hosts may be less disposed to archivability and may instruct crawlers not to collect assets that are necessary to faithfully re-present the archived website.

Serve reusable assets from a common location

The key rationale for hosting some resources externally - performance - should also motivate serving reusable local assets from a single location. Content management systems sometimes instantiate each new sub-site with its own complement of the standard theme assets. Storing these in a common location referenced by each of the sub-sites allows for more efficient client caching, simultaneously improving website performance and archivability.