Since its inception in the early 1970s, email has become a durable form of communication – one that presents a massive problem for donors, repositories, and researchers. Over 140 billion email messages are sent every day, and many, if not all, have research value as part of an archival collection. Email is used for more than just communication. It is used for collaboration, planning, sharing, conducting transactions, and as an aid to memory – a self-archive. It documents relationships – personal, business, and communal. Our reliance on and daily use of email over the past 40 years has produced rich archival material, with the secondary benefit of recording social networks in the header information of senders and recipients.
The Department of Special Collections at SUL proposes to address important facets of stewarding email archives that have not been tackled in previous projects. Characteristics of email, such as its relatively stable, standardized format and its inherent structure – header, body, attachments – make it an ideal candidate for automated tools that support archival workflows such as appraisal and processing, and that benefit users through discovery and delivery.
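To make the header, body, and attachments structure concrete, here is a minimal sketch using Python's standard `email` package. It is only an illustration and not how MUSE or ePADD is implemented; the sample message, the addresses, and the function names `build_sample` and `describe` are assumptions made up for this example.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def build_sample() -> str:
    """Build a small, made-up message with one attachment (illustrative only)."""
    msg = EmailMessage()
    msg["From"] = "Alice Example <alice@example.org>"
    msg["To"] = "Bob Example <bob@example.org>"
    msg["Subject"] = "Draft poem"
    msg.set_content("Here is the draft we discussed.")
    msg.add_attachment(
        b"%PDF-1.4 placeholder",
        maintype="application",
        subtype="pdf",
        filename="draft.pdf",
    )
    return msg.as_string()


def describe(raw: str) -> dict:
    """Split a raw message into its three structural parts."""
    msg = message_from_string(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        # Header: who wrote to whom, and about what
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "subject": str(msg["Subject"]),
        # Body: the plain-text part, if there is one
        "body": body.get_content().strip() if body else "",
        # Attachments: file name and MIME type of each attached file
        "attachments": [
            (part.get_filename(), part.get_content_type())
            for part in msg.iter_attachments()
        ],
    }


if __name__ == "__main__":
    print(describe(build_sample()))
```

Because every message can be split this way without human intervention, the same routine can be run across an entire collection.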
History of Email Projects at SUL
In 2010, when Stanford University Libraries was involved in the AIMS Project, one born-digital collection, that of poet Robert Creeley, contained over 50,000 email messages. That project allowed us the luxury of both time and dedicated staff to explore various methods for appraising and processing born-digital materials.
After several attempts with different software, our AIMS Project Digital Archivist, Peter Chan, discovered a program under development at Stanford called MUSE. MUSE is a Java-based software program that uses Natural Language Processing tools to run automated processes on an email archive, such as extracting entities (names, places). The designer, Sudheendra Hangal, was a Ph.D. candidate in Stanford’s Computer Science Department. Sudheendra was looking for individuals to test the program and take part in a follow-up interview. Several staff members in Special Collections were interviewed during this process. From there we began collaborating in an ad hoc way, funneling enhancement requests for MUSE through Peter.
Current Project – Email: Process Appraise Discover Deliver (ePADD)
This spring we decided to go a step further and proposed to design a repository-based email software program from the ground up, based in part on MUSE. Our end goal is to produce an open-source tool that will allow repositories and individuals to interact with email archives before and after they have been transferred to a repository. It would consist of four modules, each based on a different functional activity: Processing (arrangement and description), Appraisal (collection development), Discovery (online via the web), and Delivery (access). Much of this is outside the scope of our current internal funds and will require extramural funds, which we are pursuing.
But preliminary work has begun on building a working prototype for the Discovery Module. In order to accomplish this, we requested and received two awards from the Payson J. Treat Fund for Library Program Development and Research and began a programming effort with Ixora Technology (founded by another SU graduate, Chaiyasit Manovit). Peter Chan is the technical lead for a team of programmers with Sudheendra Hangal acting as our technical consultant. The archival team is led by Glynn Edwards and consists of Aimee Morgan (University Archives) and staff at collaborating institutions (Columbia University, Oxford University, the Smithsonian Institution Archives, and the New York Public Library).
The Discovery Module is designed to deliver metadata extracted through automated processing of unrestricted emails in two of our email collections: Robert Creeley (poet) and Richard Fikes (Stanford Computer Science faculty).
Specific goals in the Treat proposals:
- Deliver access to correspondents (from header information) and to entities (names, places, events) so that researchers can more easily judge how useful the email archives would be for their research
- Deliver summaries of extents of emails (incoming/outgoing) and attachments (formats, quantities)
- Resolve entities as much as possible during the processing of the collections
- Do all of this in as automated a process as possible, in part through the use of Natural Language Processing tools
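The header-based goals above (correspondents, incoming/outgoing extents, attachment formats) can be sketched with the Python standard library alone. This is a hedged illustration and not the ePADD implementation: the function names `summarize` and `sample_messages`, the sample messages, and the rule "a message is outgoing if the collection owner appears in From" are all assumptions. Entity extraction and resolution require NLP tooling and are left out.

```python
from collections import Counter
from email import message_from_string, policy
from email.message import EmailMessage
from email.utils import getaddresses


def summarize(raw_messages, owner_addresses):
    """Tally extents, correspondents, and attachment formats from headers."""
    owners = {a.lower() for a in owner_addresses}
    extents = Counter()             # incoming vs. outgoing message counts
    correspondents = Counter()      # messages exchanged per correspondent
    attachment_formats = Counter()  # attachment counts per MIME type

    for raw in raw_messages:
        msg = message_from_string(raw, policy=policy.default)
        senders = [a.lower() for _, a in getaddresses(msg.get_all("From", []))]
        recipients = [
            a.lower()
            for _, a in getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        ]
        # Assumption: the owner appearing in From marks the message as outgoing.
        outgoing = any(s in owners for s in senders)
        extents["outgoing" if outgoing else "incoming"] += 1
        others = recipients if outgoing else senders
        correspondents.update(a for a in others if a and a not in owners)
        for part in msg.iter_attachments():
            attachment_formats[part.get_content_type()] += 1

    return extents, correspondents, attachment_formats


def sample_messages():
    """Three made-up messages, one of them carrying a PDF attachment."""
    with_attachment = EmailMessage()
    with_attachment["From"] = "Bob <bob@example.org>"
    with_attachment["To"] = "owner@example.org"
    with_attachment.set_content("See attached.")
    with_attachment.add_attachment(
        b"placeholder bytes",
        maintype="application",
        subtype="pdf",
        filename="notes.pdf",
    )
    return [
        "From: owner@example.org\nTo: alice@example.org, bob@example.org\n\nHi both.\n",
        "From: Alice <alice@example.org>\nTo: owner@example.org\n\nHi back.\n",
        with_attachment.as_string(),
    ]


if __name__ == "__main__":
    for counter in summarize(sample_messages(), ["owner@example.org"]):
        print(dict(counter))
```

A real collection would need more care than this sketch takes, for example with owners who used several addresses over time or with malformed headers.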
We did not want to deliver:
- The full content of the emails online, in order to protect third-party privacy (only metadata of individual emails is displayed)
- Complete email addresses of correspondents
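One way to honor both restrictions is to publish a reduced record per message. The sketch below is an assumption about how that could look, not a description of the Discovery Module: the masking rule (first character of the local part plus the domain), the choice of fields, and the names `mask_address` and `public_record` are all hypothetical.

```python
from email import message_from_string, policy
from email.utils import getaddresses


def mask_address(address: str) -> str:
    """Hide most of the local part, e.g. alice@example.org -> a***@example.org.

    The masking rule is an illustrative assumption, not the project's rule.
    """
    local, sep, domain = address.partition("@")
    if not sep or not local:
        return "***"
    return f"{local[0]}***@{domain}"


def public_record(raw: str) -> dict:
    """Return header metadata only: no body text and no complete addresses."""
    msg = message_from_string(raw, policy=policy.default)

    def masked(field):
        return [
            {"name": name, "address": mask_address(addr)}
            for name, addr in getaddresses(msg.get_all(field, []))
        ]

    return {
        "from": masked("From"),
        "to": masked("To"),
        "attachment_count": sum(1 for _ in msg.iter_attachments()),
    }


if __name__ == "__main__":
    raw = "From: Alice <alice@example.org>\nTo: owner@example.org\n\nPrivate text.\n"
    print(public_record(raw))
```

The body is never copied into the record, so third-party content cannot leak through the display layer by accident.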
This fall, the SUL archival team began focusing on the ePADD project: commenting on iterations, drafting functional requirements, and creating UI specifications for all four modules (Processing, Appraisal, Discovery, and Delivery). In February, the SU team will begin the work of finalizing and documenting the requirements and user interfaces with our external collaborators. The software specifications will be designed by archivists from each of the five collaborating institutions and will incorporate feedback on the requirements and user-interface design.
Just before winter break, Sudheendra Hangal’s dissertation on MUSE was published and is available online.