Stanford University Libraries develops new software to give researchers unprecedented access to email archives
The privacy and access challenges of archives containing electronic communications of enduring historical value are addressed in the Libraries’ latest release of ePADD.
Despite the rapid growth of email use since its inception 40 years ago, and the increasing presence of email within research collections, the vast majority of email archives of modern historical figures remain inaccessible to researchers. Repositories that seek to make email content available for research face significant copyright and privacy issues and can be daunted by the sheer volume of email transferred.
“Email archives provide access to significant historical events at a level of detail that has rarely been available in the past,” said Roberto Trujillo, Frances & Charles Field Curator of Special Collections and Director, Department of Special Collections and University Archives at Stanford. “Making archival email available to scholars and students is a priority for Stanford Libraries since the majority of collections we acquire today include both paper-based and born-digital components.”
Not satisfied with the commercial options available, Stanford Libraries set out to develop a software solution that enables the appraisal, processing, discovery, and delivery of email. The result is ePADD, an open source program built with grant funding from the National Historical Publications and Records Commission.
A game changer
The software, released today by Stanford Libraries, draws on recent developments in machine learning to support automated archival workflows and enable advanced research techniques. ePADD builds on the browsing, search, and visualization features of MUSE, a precursor program for mining and visualizing personal email archives.
“Implementing named entity recognition and other natural language processing techniques to enable advanced review, browsing, and search functions advances ePADD far beyond the realm of most existing software programs related to email,” said Sudheendra Hangal, Professor of Computer Science at Ashoka University in India and creator of MUSE.
Hangal, who is also a technical advisor to the ePADD project, believes the software offers the archival community a real solution to an age-old problem: “These additions are a real game changer in terms of enabling the public, including journalists and those working within digital humanities, to make sense of and creatively utilize email in their research.”
Email vs. traditional correspondence
To illustrate the differences between working with email archives and working with traditional paper archives, Glynn Edwards, Head of Technical Services for Special Collections and Manager of the Born-Digital Program for Stanford Libraries, references the archive of American poet Robert Creeley, which is held at Stanford Libraries. “The Creeley archive, which consists of 7,000 paper letters and over 155,000 email messages spanning about 13 years, represents one of the smaller email collections currently held at Stanford,” said Edwards.
Edwards, who also serves as the ePADD Project Director, explains that if the Creeley correspondence were entirely paper-based, archivists might first organize the messages chronologically or by name of correspondent and then include them in a finding aid, so that researchers could more easily identify and access relevant materials. “This standard process for archivists does not translate well for handling email collections, which—as in the case of Creeley—can involve hundreds of thousands of communications for archivists to review,” said Edwards.
If traditional archival processes were followed for email collections, appraisal and processing tasks would consume so many repository resources that the underlying collections might never be made available for researchers, suggests Edwards. In fact, several collections do remain inaccessible for use and discovery, which was a motivating factor for Stanford Libraries to release an early version of ePADD.
According to Edwards, natural language processing allows ePADD to automate the extraction of named entities such as people, organizations, and places. Archivists can also create lexicons to pull out specific terms and subjects across diverse correspondence. “These functionalities make discovery and delivery possible for large email collections, and even more importantly, researchers interested in working with email now have access to the same tools,” said Edwards.
The concept for ePADD emerged from Stanford Libraries’ extensive involvement in numerous projects aimed at devising solutions for archival issues surrounding born-digital materials.
“Our goal in developing ePADD was to expedite the processing of Stanford’s email archives and automate the process as much as possible,” said Trujillo. “Not only has that been accomplished in the first version of ePADD, but because it is open source, allowing for ready adoption by other institutions, we can also benefit from other institutions making their own collections accessible to the Stanford community.”
The ePADD software and installation/user guide are now available for download at the community website.
Stanford Libraries is invested in the long-term success of ePADD and is currently submitting proposals to support future development.
Press Contact: Gabrielle Karampelas, Stanford Libraries | 650-492-9855