Integrating Generative AI Tools in Library Collections: Challenges and Solutions


Silhouettes of two heads, face to face, generated by DALL·E 2, a generative AI tool


During the 2022-23 Annual Meeting of the Academic Council, a panel of Stanford faculty discussed ChatGPT, generative AI, and their implications for the future of teaching and learning. My own reflections have centered on applying generative AI to web and email archives, the areas where I have the most expertise. Here I describe some of the hurdles encountered when integrating generative AI tools and propose strategies to address them.

Understanding the Challenges

1. Knowledge Cut-off

The first challenge is the "knowledge cut-off" of AI models like ChatGPT. The May 24, 2023 version of ChatGPT, built on the GPT-3.5 large language model, was trained on an extensive volume of data, but that data extends only to September 2021. On its own, the model has no access to information published after that date, which limits its usefulness in a rapidly evolving knowledge landscape.

2. Artificial Hallucination

The second issue is artificial hallucination, a problem inherent in ChatGPT and similar generative AI products. It refers to cases where the AI's response sounds confident but is not justified by its training data. In a library context, this could lead to the dissemination of information that is incorrect or not backed by reliable sources.

3. Data Privacy and Security

Finally, the data privacy and security concerns linked to these AI models cannot be overlooked. These systems generally process data by sending it to a cloud-based server, which potentially exposes it to breaches. That is a substantial concern for libraries, particularly when our collections include personal archives and materials with restricted user licenses.

Addressing the Challenges with Effective Solutions

1. Extending Knowledge with Plugins

To address the knowledge cut-off, plugins can be added that enable ChatGPT to retrieve information from a specified URL. This lets the model draw on material beyond its training data and stay current with live information. Google has also introduced an experimental feature that integrates generative AI with Google Search. In these cases, the effective knowledge gap shrinks to the lag between when information is published online and when it is indexed by Google Search.
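As a rough illustration of the idea behind URL-reading add-ons, the Python sketch below pulls the visible text out of a web page and places it in front of a question. It is not how any particular ChatGPT plugin is implemented; the function names (html_to_text, fetch_page_text, build_prompt) and the max_chars limit are assumptions made for this example, and the resulting prompt would still have to be sent to whichever model you use.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_page_text(url, timeout=10):
    """Download a page and reduce it to plain text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))


def build_prompt(question, page_text, max_chars=4000):
    """Put (truncated) page text ahead of the question for the model to use."""
    context = page_text[:max_chars]
    return (
        "Use the web page text below, which may be newer than your "
        "training data, to answer the question.\n\n"
        f"Web page text:\n{context}\n\n"
        f"Question: {question}"
    )
```

In use, something like build_prompt("What are the current opening hours?", fetch_page_text(url)) would produce the text to submit. The truncation is deliberately crude; a real add-on would need a more careful way to fit long pages into the model's input limit.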

2. Controlled Responses

The artificial hallucination issue can be addressed by setting controls that restrict the AI's responses to the data it has been given. This helps ensure that the generated information is reliable and backed by sound data.
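One common way to set such controls is to hand the model only a few relevant passages from the collection and instruct it to answer from those alone. The Python sketch below shows the shape of that approach under simplifying assumptions: passages are ranked by plain word overlap (a stand-in for a real search index), and the names select_passages, build_grounded_prompt, and REFUSAL are invented for this example. Instructions like these reduce hallucination but do not guarantee it away, so answers should still be checked against the cited passages.

```python
import re

# Fixed reply the model is asked to give when the passages lack the answer.
REFUSAL = "I could not find that in the provided documents."


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_passages(question, passages, top_k=2):
    """Return up to top_k passages sharing at least one word with the question.

    Word overlap is a deliberately simple assumption; it also counts
    common words such as "the", which a real retriever would down-weight.
    """
    q_words = tokenize(question)
    scored = [(len(q_words & tokenize(p)), i, p) for i, p in enumerate(passages)]
    relevant = [s for s in scored if s[0] > 0]
    relevant.sort(key=lambda s: (-s[0], s[1]))  # best score first, then original order
    return [p for _, _, p in relevant[:top_k]]


def build_grounded_prompt(question, passages, top_k=2):
    """Build a prompt limited to the selected passages, or None if none match."""
    selected = select_passages(question, passages, top_k)
    if not selected:
        return None  # nothing relevant: better to decline than to ask the model
    numbered = "\n".join(f"[{n}] {p}" for n, p in enumerate(selected, start=1))
    return (
        "Answer using only the numbered passages below and cite the passage "
        "numbers you used. If the passages do not contain the answer, "
        f'reply exactly: "{REFUSAL}"\n\n'
        f"{numbered}\n\nQuestion: {question}"
    )
```

Returning None when nothing matches keeps the decision to decline in the application rather than leaving it to the model, and the numbered passages give readers something concrete to verify an answer against.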

3. Ensuring Data Privacy

Finally, to address data privacy and security concerns, tools are available that let users install and run question-answering systems on their local machines. The data then remains within the organization and is not transmitted to external servers over the internet. Local processing significantly reduces the risk of data leaks and provides a safer information environment in the library.

I've written a pair of blog posts: "Navigating Through Archived Websites: From Text Matching to Generative AI-Enhanced Q&A" and "Evolving Email Archive Investigation: From Full Text Search to Generative AI-Aided Q&A". They demonstrate how generative AI tools can be incorporated into web and email archives. My aim with these posts is to give readers a brief overview of the challenges and potential solutions when applying generative AI to library collections.

Note: This blog post was created with the assistance of ChatGPT.


Further reading