"We had no idea that we were making history and were just trying to get the job done in our 'spare' time',” Louise Addis, one of the WWWizards team who developed the SLAC website from 1991, said during our conversation about the restoration of SLAC's earliest website. Last May, Nicholas Taylor, web archiving service manager, told me, "SLAC has a historical collection of webpages that may be the first website in the US. Can we help them to find a home for this archive?” As Web archivist, I felt that I found a treasure. I replied, "Of course, Stanford Web Archive Portal should be the home."
One of the major use cases for the Web Archiving Service is preserving Stanford University web content. The earliest SLAC website represents the oldest such content we could find; dating to 1991, it is the first website in the US, so we started there. The Stanford Web Archiving Service launched its portal this week, featuring SLAC's earliest website, which had been kept on SLAC servers for many years. This Halloween, it comes back to life. Our task was to convert the original collection of scattered files into an accessible, browsable website with temporal navigation. In this post, I will discuss the technical challenges of the restoration process and the lessons learned from it.
First, let’s review what we received. The SLAC archive is a set of filesystem backups taken from the SLAC backup system. Joan Winters and Jean Deken took these copies periodically from January's backup between 1992 and 1999. Each year's backup was accompanied by a context directory that contained Readme files describing the backup process and a list of last-modified dates for each page. The rest of the sub-directories varied from year to year. Initially, the WWW system was deployed on a VM-based mainframe, which used minidisks and hosted only a single level of pages. Later, the system was deployed on UNIX, reading from the AFS file system. The archive had a mix of web pages and system artifacts such as UNIX backup notes and console history. Manual exploration was required to separate the directories holding the actual pages from the other artifacts.
The Stanford Web Archive Portal uses Open Wayback 2.0, an open source tool that queries and replays archived websites based on URI look-up. The Open Wayback architecture depends on two types of files: content files and index files. Content files may be ARCs or WARCs, which are compressed files that combine multiple webpages. The Stanford Web Archive Portal implementation depends on a CDX format index file, which contains fields for the URI, the capture timestamp, and the record's location in the content file. It is no surprise that the SLAC archive didn't have any of these files. So, the challenge was how to convert the SLAC content to WARC files, and then how to index these pages with the right timestamps to make them browsable by the user.
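To make the role of the index more concrete, here is a minimal Python sketch of reading a CDX file. It assumes the common space-delimited layout in which the first line is a legend (` CDX` followed by one letter per column) and uses one widely seen 11-column legend, where `a` is the original URI, `b` the capture timestamp, `V` the offset in the content file, and `g` the content file name. The sample record, host name, and file name are invented for illustration and are not taken from the SLAC archive.

```python
from typing import Dict, Iterable, Iterator


def parse_cdx(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per CDX record, keyed by the legend letter of each column."""
    letters = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if letters is None:
            # The first non-blank line must be the legend, e.g. " CDX N b a ...".
            if fields[0] != "CDX":
                raise ValueError("expected a ' CDX ...' legend line first")
            letters = fields[1:]
            if len(set(letters)) != len(letters):
                # Legends with repeated letters would need positional handling.
                raise ValueError("this sketch needs unique legend letters")
            continue
        if len(fields) != len(letters):
            raise ValueError(f"expected {len(letters)} fields: {line!r}")
        yield dict(zip(letters, fields))


# Illustrative data only (hypothetical host and file name).
SAMPLE = [
    " CDX N b a m s k r M S V g",
    "example.org/find/page.html 19920115093000 "
    "http://example.org/FIND/page.html text/html 200 - - - 512 1024 "
    "example.warc.gz",
]

if __name__ == "__main__":
    for record in parse_cdx(SAMPLE):
        print(record["a"], record["b"], record["g"], record["V"])
```

Each record answers the three questions a replay tool asks: which URI, captured when, and where in which content file.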
The Restoration Strategy and Tools
My strategy was to determine three pieces of information for each page: page content, timestamp, and URI. The page content is the actual file available in the SLAC archive. The captured timestamp is the last-modified date of this file, as recorded in the file listing kept in the context directory of the UNIX backup. The big challenge was how to determine the URI for each page; this challenge is discussed in detail later. I converted the web pages to WARC format using wget with its WARC option. I wrote Ruby scripts to modify the CDX that wget generates automatically. We replayed the archived website using a customized version of Open Wayback 2.0.
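As an illustration of the timestamp step, the following Python sketch converts a last-modified date into the 14-digit `YYYYMMDDhhmmss` form used for Wayback capture timestamps. The listing line format (`DATE TIME PATH`) is an assumption made for this example; the real listing files in the context directories may be laid out differently, and the function names are hypothetical.

```python
from datetime import datetime
from typing import Tuple


def to_wayback_timestamp(last_modified: str,
                         fmt: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Convert a last-modified date string to a 14-digit capture timestamp."""
    return datetime.strptime(last_modified, fmt).strftime("%Y%m%d%H%M%S")


def parse_listing_line(line: str) -> Tuple[str, str]:
    """Split a hypothetical 'DATE TIME PATH' listing line into (path, timestamp)."""
    date_part, time_part, path = line.split(maxsplit=2)
    return path, to_wayback_timestamp(f"{date_part} {time_part}")


if __name__ == "__main__":
    # Illustrative input only, not a line from the SLAC backup.
    print(parse_listing_line("1991-12-06 14:05:00 FIND/slac.html"))
```

Passing a different `fmt` string adapts the conversion to whatever date layout the listing actually uses.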
Crawling the Archive
The SLAC archive was a set of flat files, and Open Wayback needs them combined into WARC files. I used the UNIX wget command, which downloads a list of URIs and can write the output in WARC format. wget also provides a CDX index for the generated WARC. The first step was to prepare the pages' URIs. For this, I parsed the backup context files to extract the candidate page URIs. Then, I ran wget against this list. I moved the generated WARC to the Open Wayback data store as the content source, and I modified the generated CDX to add the historical URI and timestamp.
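The crawl step can be sketched as a small Python wrapper around wget. The options `--input-file`, `--warc-file`, and `--warc-cdx` are standard wget options; the file names are hypothetical, and the sketch assumes the flat files are already reachable over HTTP (for example from a local web server), a detail this post does not specify.

```python
import subprocess
from typing import List


def build_wget_command(url_list_file: str, warc_name: str) -> List[str]:
    """Build a wget command that fetches every URI in a list into one WARC."""
    return [
        "wget",
        f"--input-file={url_list_file}",  # one candidate URI per line
        f"--warc-file={warc_name}",       # wget adds the WARC file extension
        "--warc-cdx",                     # also write a CDX index for the WARC
    ]


def crawl(url_list_file: str, warc_name: str) -> int:
    """Run the crawl and return wget's exit code (non-zero if any fetch failed)."""
    return subprocess.run(build_wget_command(url_list_file, warc_name),
                          check=False).returncode


if __name__ == "__main__":
    # Hypothetical file names, shown without running the crawl.
    print(" ".join(build_wget_command("candidate-uris.txt", "slac-1992")))
```

Keeping the command construction separate from its execution makes it easy to inspect or log the exact command before running a crawl.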
In our first meeting, I asked the SLAC team, “What was the home page in 1991?” They answered, “There was nothing called ‘home page’ at that time.” So I had to figure out what the URLs for these pages had been. We applied many techniques:
- Inlinks from the collection itself - I extracted the outlinks from all pages and tried to match these links with the available pages. I used the page name as the key.
- Search Engines - I queried the popular search engines with the predicted URLs to find evidence for them.
- Source code documentation - the early WWW engineers were eager to document their activities and decisions in www.history, such as installing new servers and migrating pages from one domain to another. I read these notes to get insights into their technical decisions. Sometimes, I had to read the comments in the page source code. You can read a comprehensive list of comments in the source code of slac.html (view-source).
- The Internet Archive Wayback Machine was another source for historical URLs, specifically after 1995, when captures became available.
- Publications - there are a few publications that talk about the development of the early web. Some of these publications included links to old URLs.
- Interviews with SLAC staff - we conducted a couple of interviews with the engineers who were involved with SLAC's earliest websites, including George Crane and Joan Winters.
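The first technique, matching inlinks by page name, can be sketched in Python as follows. This is an illustration rather than the scripts actually used: the function names are hypothetical, the comparison is made case-insensitive as an assumption, and only absolute links are kept because only they reveal a historical host and path.

```python
from html.parser import HTMLParser
from posixpath import basename
from typing import Dict, List, Set
from urllib.parse import urlsplit


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags (tag and attribute names arrive lowercased)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def page_name(link_or_path: str) -> str:
    """Use the lowercased file name as the matching key."""
    return basename(urlsplit(link_or_path).path).lower()


def candidate_urls(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each archive path to the absolute links in the collection that share its page name."""
    by_name: Dict[str, List[str]] = {}
    for path in pages:
        by_name.setdefault(page_name(path), []).append(path)
    matches: Dict[str, Set[str]] = {path: set() for path in pages}
    for html in pages.values():
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            name = page_name(link)
            if not name or not urlsplit(link).scheme:
                continue  # skip relative links and links without a file name
            for path in by_name.get(name, []):
                matches[path].add(link)
    return matches
```

A page with exactly one candidate is an easy case; pages with none or several candidates are where the other techniques in the list come in.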
After defining the URL for each page, I wrote Ruby scripts that mapped each SLAC archive URL to its actual historical URL.
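The original scripts were written in Ruby and are not reproduced here. As a rough Python sketch of the same idea, the code below rewrites the URL and timestamp columns of CDX records using a mapping from the crawled URL to its historical URL and timestamp. The column positions, host names, and sample values are assumptions for illustration; a real CDX file should be checked against its legend line first.

```python
from typing import Dict, List, Tuple

# crawled URL -> (historical URL, 14-digit timestamp)
Mapping = Dict[str, Tuple[str, str]]


def rewrite_cdx_line(line: str, mapping: Mapping,
                     url_col: int = 0, ts_col: int = 1) -> str:
    """Replace the URL and timestamp fields of one CDX record, if it is mapped."""
    fields = line.split(" ")
    crawled = fields[url_col]
    if crawled not in mapping:
        return line  # leave unmapped records untouched
    historical_url, timestamp = mapping[crawled]
    fields[url_col] = historical_url
    fields[ts_col] = timestamp
    return " ".join(fields)


def rewrite_cdx(lines: List[str], mapping: Mapping) -> List[str]:
    """Rewrite all records, keep the legend first, and re-sort the records."""
    legend = [line for line in lines if line.startswith(" CDX")]
    records = [rewrite_cdx_line(line, mapping)
               for line in lines if line and not line.startswith(" CDX")]
    # CDX indexes are normally kept sorted so that look-ups can search them.
    return legend + sorted(records)


if __name__ == "__main__":
    # Illustrative values only (hypothetical hosts, paths, and dates).
    cdx = [" CDX a b m",
           "http://localhost:8000/1992/slac.html 20140101000000 text/html"]
    urls = {"http://localhost:8000/1992/slac.html":
            ("http://example.org/FIND/slac.html", "19920115093000")}
    print("\n".join(rewrite_cdx(cdx, urls)))
```

Only the index is rewritten here; the content file itself is left as the crawl produced it.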
Though Open Wayback was designed to replay old websites, it didn’t work as expected with the really old websites we had. Here are some problems that we found:
- For historical reasons, the earliest acceptable date in Open Wayback was hardcoded to 1996, because the oldest web content preserved by the Internet Archive is from 1996. The search didn’t return any results because it treated dates from 1991 to 1995 as invalid entries.
- The early website included capital letters in its URLs (e.g., /FIND/), which caused a problem when it was indexed. I had to modify the indexer and the Open Wayback search module to use lowercase letters.
- The Stanford banner in the Wayback instance caused another problem. Open Wayback allows customization of the banner that displays the capture date/time and navigation features. The combination of a banner styled with the latest CSS and historical pages with no style at all confused Internet Explorer, which fell back to rendering in “Quirks mode” and broke the standard banner. I fixed it by adding a special header to the page response that forces IE to use the latest standards.
- The last problem was the familiar one of file type obsolescence: the first logo used by SLAC was in XBM format, which is no longer supported by modern browsers.
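The first two problems come down to how a look-up key and a date are validated. The Python sketch below illustrates both ideas; it is not Open Wayback code, and the names and constants are hypothetical. It lowercases URLs so that the indexer and the search module agree on one key, and it makes the earliest accepted year a parameter instead of a fixed 1996.

```python
from datetime import datetime

DEFAULT_EARLIEST_YEAR = 1996  # the kind of hardcoded floor described above
SLAC_EARLIEST_YEAR = 1991     # floor needed for the SLAC pages


def lookup_key(url: str) -> str:
    """Lowercase the URL so /FIND/ and /find/ resolve to the same index key."""
    return url.strip().lower()


def is_valid_timestamp(timestamp: str, earliest_year: int) -> bool:
    """Accept a 14-digit timestamp only if it parses and is not before the floor."""
    try:
        parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    except ValueError:
        return False
    return parsed.year >= earliest_year


if __name__ == "__main__":
    ts = "19920115093000"  # illustrative capture timestamp
    print(lookup_key("http://example.org/FIND/Slac.html"))
    print(is_valid_timestamp(ts, DEFAULT_EARLIEST_YEAR))
    print(is_valid_timestamp(ts, SLAC_EARLIEST_YEAR))
```

The key point is that the same normalization has to be applied when the index is written and when it is queried; otherwise look-ups miss.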
QA is how you ensure you did the right thing. In this website restoration, we didn't know firsthand what the Web looked like in the past. So in this step, we reached out to the people who created these websites and could still remember them. We shared the initial site with them and received their feedback.
The first rule for the time machine traveler is, “Don’t change the history, just observe.” I didn’t change the content of any page, even though some of the pages had syntax errors or broken links that could have been fixed easily. The idea behind this was the vision of the web archive: browse the web as it appeared in the past, even if it appears broken. A second limitation was that the early website was designed to query the SPIRES and BINLIST databases, and current web archive access methods don't support retrieval of such dynamic webpages.
Finally, I hope these thoughts about the restoration process will help other archivists in dealing with non-standard web archive materials that need to be replayed by the Wayback Machine. We’re interested in hearing from other Stanford University units or departments about historical web content they may have saved and would like to donate to Stanford University Libraries to maintain in our Stanford Web Archive Portal.