Content inventorying

We are pleased to offer the Stanford University community a free service for stanford.edu web content inventorying. This service can help webmasters and web content owners understand what content lives on their site and where, plan and prepare for major site updates, and assess some aspects of search engine optimization, particularly those that overlap with archivability.

Our reports provide a URL list and count, broken down by host or by MIME type; data volume; a list of out-link hosts; and a list of resources discovered but excluded from crawling due to robots.txt directives. The archiving tools we use execute JavaScript, which makes them suitable for websites that use AJAX or whose content a link-following crawler alone could not otherwise traverse. Snapshots for content inventorying also benefit from the ongoing configuration enhancements we make to avoid crawler traps, including limitless calendar pages and false-positive URLs.
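
To give a sense of what a breakdown by host and MIME type looks like, here is a minimal Python sketch. It is illustrative only and is not the tooling behind our reports: the crawl records, the example.edu hostnames, and the function names summarize and excluded_by_robots are all hypothetical, and the robots.txt check uses Python's standard urllib.robotparser rather than our crawler's own logic.

```python
from collections import Counter
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical crawl records as (URL, MIME type) pairs; not real report data.
records = [
    ("https://www.example.edu/", "text/html"),
    ("https://www.example.edu/about", "text/html"),
    ("https://www.example.edu/files/report.pdf", "application/pdf"),
    ("https://media.example.edu/logo.png", "image/png"),
]


def summarize(records):
    """Count URLs per host and per MIME type."""
    by_host = Counter(urlsplit(url).hostname for url, _ in records)
    by_mime = Counter(mime for _, mime in records)
    return by_host, by_mime


def excluded_by_robots(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


by_host, by_mime = summarize(records)
for host, count in by_host.most_common():
    print(host, count)
for mime, count in by_mime.most_common():
    print(mime, count)
```

Calling excluded_by_robots with the text of a site's robots.txt and a list of discovered URLs returns the URLs that a compliant crawler would skip, which corresponds to the excluded-resources list in the reports.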

In turn, our collaboration with Stanford University webmasters and web content owners helps us improve the quality of our stanford.edu archiving, because we learn about archiving gaps and pending changes from the content experts.

Here is the process for taking advantage of the service:

  1. Request web content inventorying for one or more stanford.edu websites through our Contact form.
  2. We will run a one-time crawl of the site(s), outside of our usual crawl schedule (which is typically at least quarterly).
  3. We will send you a link to the resulting reports with a login for read-only access.
  4. Let us know if you have any difficulty interpreting the reports, or if you discover that content was missing from the crawl. We can run additional crawls as needed.

Typical turnaround time for web content inventorying requests is about one week.