Working with students on library collections as data

July 20, 2022
Catherine Nicole Coleman

Library collections provide exciting opportunities for students, particularly those interested in computational linguistics, computer vision, machine learning and data science, to apply methods they are learning in their classes to real-world problems. Stanford's world-renowned AI institute, HAI, and interdisciplinary data science program, Stanford Data Science, attract students from around the world to learn about the latest computational techniques applied to a mind-bending array of projects. Much of that work begins with text, images, audio and video recordings, all things the library has in large supply. When the library creates conditions to support this kind of student work with collections, we have a tremendous amount to gain. Five years ago, subject specialists in the Stanford libraries identified a number of projects that would benefit from this work. Since then, that list has continued to grow. But if we do not take the time to lay the groundwork, the opportunity will pass us by.

First, these projects need to originate with and be led by the bibliographers, archivists, and research support staff who know the content and the domain we are serving. If, for example, the goal is to make a collection of Edwardian novels more discoverable, the subject specialist who selected the materials and knows the faculty who use them will play a crucial role in determining what would make the collection more discoverable. It may be that the task is to mine the digitized text of the novels to understand the mood or the writing style. Or it may be that running topic modeling across the entire corpus is the best way to make the collection accessible. The people essential to shaping the project are the people who know the content and best understand the desired output. Collaborating with students on these projects involves training students in existing library tasks, like shelving books or processing collections. But introducing computational methods also opens up the possibility of doing things that we would never have the time to do by hand.

The second requirement for success is providing students with a flexible, adaptable computational environment where they can work with our collections. We cannot expect students to work with our collections on their own laptops; in many cases, due to licensing, copyright, or deed restrictions, that would be prohibited anyway. But trying to do this work within our existing software development and deployment infrastructure is not a good solution either. If software development environments are like construction sites, where work is carefully planned so that results are robust, reliable, and secure, computing environments for data science are more like kitchens. There is a lot of experimentation and adaptation of recipes to the ingredients. This is why cloud services that can grow and shrink to match the needs are so helpful. Jupyter notebooks, where the process can be documented and the 'recipes' stored for others to return to, have become essential. When the students return to classes after the summer or graduate, we want to have a record of what they have done: a record that documents their contribution to our work and helps us to build upon that work for the future.