Data formats and APIs
The ISO standard Web ARChive (WARC) and its predecessor ARChive (ARC) are the file formats typically associated with web archives. They are container formats, designed to store arbitrary internet content and associated network communications. W/ARC files may contain much more data than is relevant to specific research investigations, which has prompted experimentation with more specialized and lightweight derivative datasets such as those described below.
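To make the container idea concrete, the following is a minimal sketch of how one uncompressed WARC record breaks down into a version line, named header fields, and a content block. The sample record and the parse_warc_record helper are hand-written for illustration and are not taken from any tool mentioned here. Real files are usually gzip-compressed and hold many records, so in practice an existing library such as warcio is a better starting point.

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, content block)."""
    # The header section ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the size of the content block in bytes.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# Illustrative record; the URL, date, and record ID are made up.
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/hello.txt\r\n"
    b"WARC-Date: 2020-01-01T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, world!"
    b"\r\n\r\n"
)

version, headers, block = parse_warc_record(SAMPLE_RECORD)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], block)
```

The same header-then-block layout applies whatever the payload is, which is why a W/ARC file can hold HTML, images, and protocol metadata side by side.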
A Web Archive Transformation (WAT) file contains W/ARC record header metadata, HTML <meta> tag contents, and hyperlinks, extracted from a W/ARC file and formatted as JSON. For each resource stored in a given W/ARC file, it can provide information such as content length, date of capture, MIME type, and URL. Internet Archive provides utilities for working with WAT files and documentation on usage, and Common Crawl provides some additional format details.
Longitudinal Graph Analysis (LGA) files contain a complete, time-indexed list of which URLs link to which other URLs in a given web archive corpus. Internet Archive provides documentation on the format and usage.
Web Archive Named Entities (WANE) files list the names of people, organizations, and places extracted using the Stanford Named Entity Recognizer. Entities are provided for each URL, at each timestamp for which that URL is represented in the archive. Internet Archive provides documentation on the format and usage.
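As a rough illustration of working with WAT data, the sketch below pulls the capture URL, capture date, and anchor links out of one WAT JSON payload. The key names (Envelope, WARC-Header-Metadata, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata, Links) reflect one reading of the published WAT layout and should be checked against the format documentation. The sample payload is hand-written, and extract_links is a hypothetical helper.

```python
import json

# Hand-written, heavily trimmed example of a WAT JSON payload (assumed layout).
SAMPLE_WAT_JSON = """
{
  "Envelope": {
    "WARC-Header-Metadata": {
      "WARC-Target-URI": "http://example.com/",
      "WARC-Date": "2020-01-01T00:00:00Z"
    },
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"path": "A@/href", "url": "http://example.org/about"},
            {"path": "IMG@/src", "url": "/logo.png"}
          ]
        }
      }
    }
  }
}
"""


def extract_links(wat_payload: str):
    """Return (target URI, capture date, list of anchor link URLs)."""
    envelope = json.loads(wat_payload)["Envelope"]
    warc_headers = envelope.get("WARC-Header-Metadata", {})
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    # Keep only links that came from <a href="..."> elements.
    anchors = [
        link["url"]
        for link in links
        if link.get("path", "").startswith("A@") and "url" in link
    ]
    return warc_headers.get("WARC-Target-URI"), warc_headers.get("WARC-Date"), anchors


target, captured, anchors = extract_links(SAMPLE_WAT_JSON)
print(target, captured, anchors)
```

Chained .get() calls with empty defaults keep the sketch tolerant of records that have no HTML metadata, such as captures of images or request records.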
WET (parsed text)
WARC Encapsulated Text (WET) or parsed text consists of the extracted plaintext, delimited by archived document. Each record retains the associated URL and timestamp. Common Crawl provides details on the format and Internet Archive provides documentation on usage, though they use different names for the format.
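The sketch below shows one way to walk an uncompressed WET byte stream and yield the URL, timestamp, and plaintext of each document. It assumes that WET records follow the usual WARC record layout and that text records carry WARC-Type: conversion, as in Common Crawl's files. The make_record helper and the sample data are hypothetical and exist only to build test input.

```python
from typing import Iterator, Tuple


def iter_wet_records(data: bytes) -> Iterator[Tuple[str, str, str]]:
    """Yield (url, timestamp, text) for each text record in a WET byte stream."""
    pos = 0
    while pos < len(data):
        head_end = data.find(b"\r\n\r\n", pos)
        if head_end == -1:
            break
        header_lines = data[pos:head_end].decode("utf-8").split("\r\n")
        headers = {}
        for line in header_lines[1:]:  # skip the "WARC/1.0" version line
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        start = head_end + 4
        end = start + int(headers["Content-Length"])
        # Only "conversion" records hold extracted text; skip warcinfo etc.
        if headers.get("WARC-Type") == "conversion":
            yield (
                headers.get("WARC-Target-URI", ""),
                headers.get("WARC-Date", ""),
                data[start:end].decode("utf-8"),
            )
        pos = end + 4  # skip the two CRLFs that terminate each record


def make_record(warc_type: str, url: str, date: str, text: str) -> bytes:
    """Hypothetical helper that builds a sample record for demonstration."""
    body = text.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"WARC-Date: {date}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = (
    make_record("warcinfo", "", "2020-01-01T00:00:00Z", "software: example")
    + make_record("conversion", "http://example.com/a", "2020-01-01T00:00:01Z", "First page text")
    + make_record("conversion", "http://example.com/b", "2020-01-01T00:00:02Z", "Second page text")
)
records = list(iter_wet_records(sample))
for url, date, text in records:
    print(date, url, text)
```

Slicing by Content-Length rather than searching for the next record header avoids mis-splitting when the extracted text itself contains blank lines.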
Crawl inDeX (CDX) files allow Wayback-like replay platforms to relate requests for specific mementos to assets stored in W/ARC files. Because CDX is a de facto rather than a formal standard, the formats used by different systems may vary (see the format documentation from Internet Archive and from Ilya Kreymer, for example), but indices typically contain URL, timestamp, MIME type, HTTP status code, digest, and content length. The Internet Archive Wayback Machine allows querying of CDX indices through a web service, and OpenWayback and pywb can be configured to do the same.
Internet Archive provides a Wayback Machine availability API that reports, in JSON, whether a given URL has been archived and, for its most recent archived version, the access point, timestamp, and HTTP status code.
The Memento Time Travel service federates queries to distributed, Memento-compliant web archives. The Time Travel APIs provide functionality similar to, though broader than, that of the Wayback Availability JSON API, and they work across multiple web archives (including the Internet Archive Wayback Machine).
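To illustrate how an index line maps onto those fields, here is a minimal sketch that reads a space-delimited CDX file whose first line is a legend of single-letter field codes. The code-to-name mapping covers the common 11-field layout as understood here, and the friendly names are a choice made for this example. The sample line, digest, and file name are made up, and other systems may use different columns or JSON-based variants.

```python
# Single-letter legend codes for the common 11-field CDX layout (assumed),
# mapped to descriptive names chosen for this example.
CDX_FIELD_NAMES = {
    "N": "urlkey",      # canonicalized URL used as the sort key
    "b": "timestamp",   # capture time, YYYYMMDDhhmmss
    "a": "original",    # original URL
    "m": "mimetype",
    "s": "statuscode",
    "k": "digest",
    "r": "redirect",
    "M": "metatags",
    "S": "length",      # compressed record size
    "V": "offset",      # offset of the record in the W/ARC file
    "g": "filename",    # W/ARC file holding the record
}


def parse_cdx(text: str):
    """Parse CDX text with a legend line into a list of dicts."""
    lines = text.strip().splitlines()
    legend = lines[0].split()
    if not legend or legend[0] != "CDX":
        raise ValueError("expected a legend line starting with 'CDX'")
    names = [CDX_FIELD_NAMES.get(code, code) for code in legend[1:]]
    return [dict(zip(names, line.split(" "))) for line in lines[1:] if line]


# Illustrative index with one made-up entry.
SAMPLE_CDX = (
    " CDX N b a m s k r M S V g\n"
    "com,example)/ 20200101000000 http://example.com/ text/html 200 "
    "AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHH - - 1234 5678 example-00000.warc.gz\n"
)

entries = parse_cdx(SAMPLE_CDX)
entry = entries[0]
print(entry["timestamp"], entry["original"], entry["filename"], entry["offset"])
```

A replay tool would use the filename, offset, and length values to seek directly to the matching record instead of scanning the whole W/ARC file.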
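A minimal sketch of using this API follows. It builds a request URL and reads the closest snapshot out of a response body. The endpoint and the archived_snapshots / closest keys follow the API as documented by Internet Archive at the time of writing and may change. The sample response is hand-written, and the function names are hypothetical. No network request is made here; the URL could be fetched with urllib.request or any HTTP client.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(url: str, timestamp: str = "") -> str:
    """Build a request URL; timestamp is optional, in YYYYMMDDhhmmss form."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text: str):
    """Return the 'closest' snapshot dict, or None if the URL is not archived."""
    snapshots = json.loads(response_text).get("archived_snapshots", {})
    return snapshots.get("closest")


# Hand-written example of a response body (values are illustrative).
SAMPLE_RESPONSE = """
{
  "archived_snapshots": {
    "closest": {
      "available": true,
      "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
      "timestamp": "20060101064348",
      "status": "200"
    }
  }
}
"""

request_url = availability_url("example.com", "20060101")
snapshot = closest_snapshot(SAMPLE_RESPONSE)
print(request_url)
print(snapshot["available"], snapshot["timestamp"], snapshot["status"])
```

An empty archived_snapshots object is treated as "not archived", so callers can test the return value for None.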
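The sketch below shows two ways a client might express "this URL as of this date". One is a Time Travel JSON API request URL, and the other is the Accept-Datetime header defined by the Memento protocol (RFC 7089). The /api/json/ path pattern follows the Time Travel documentation as understood here and should be verified before use. The function names are hypothetical, and no request is sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed base URL of the Time Travel service.
TIME_TRAVEL_BASE = "http://timetravel.mementoweb.org"


def time_travel_json_url(uri: str, when: datetime) -> str:
    """Build a Time Travel JSON API URL for the memento closest to `when`."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{TIME_TRAVEL_BASE}/api/json/{stamp}/{uri}"


def accept_datetime_header(when: datetime) -> dict:
    """Build the Accept-Datetime request header used in Memento negotiation."""
    utc = when.astimezone(timezone.utc)
    # Memento expects an HTTP-date in GMT, e.g. "Wed, 01 Jan 2020 00:00:00 GMT".
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


when = datetime(2020, 1, 1, tzinfo=timezone.utc)
api_url = time_travel_json_url("http://example.com/", when)
header = accept_datetime_header(when)
print(api_url)
print(header)
```

The header form is sent to a Memento TimeGate, which redirects to the capture nearest the requested datetime, while the JSON API returns a description of the nearest captures for the client to choose from.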