How to Prevent Data Leakage in LangChain Vectorstores

I wish the LangChain documentation was clearer!

Introduction

I love LangChain. I really do! But I hate LangChain's documentation. I really do. I also understand why it is that way: things have been moving super fast in the LLM space, and LangChain has done an amazing job of keeping up, adding support for new LLM features and applications as quickly as they come out. I feel it does so at the expense of good documentation and examples, though. So while it has some amazing functionality, sometimes it's not very clear what is happening behind the scenes, and the user often ends up making assumptions. Naturally, those assumptions can sometimes be wrong, leading to unexpected behavior such as data leaks.

I recently ran into a data leakage issue while trying to run a RAG application using LangChain's Chroma vectorstore. I was able to figure out the issue, but I was a bit surprised and felt that if the documentation were better I would not have run into it at all. So I wanted to share it here, and hopefully it will help you avoid running into the same issue.

The Setup

Let us say you are given 2 research papers ["Generative Adversarial Nets", "Attention Is All You Need"] and are tasked with writing a RAG application that, for each paper, answers a list of questions: ["What are the contribution of this paper?", "Give the names of all the authors?"]

This sounds pretty straightforward, right? So let's say we go ahead and write the following code (I used this LangChain example as my reference code):

from langchain_community.document_loaders import PyPDFLoader
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

SYSTEM_PROMPT = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

def rag_qa(pdf_paths, questions):
    llm = ChatOpenAI(model="gpt-4o")
    embedding_model = OpenAIEmbeddings()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    outputs = {}
    for paper_path in pdf_paths:
        outputs[paper_path] = {}
        loader = PyPDFLoader(paper_path)
        docs = loader.load()
        chunks = text_splitter.split_documents(docs)
        vectorstore = Chroma.from_documents(documents=chunks, embedding=embedding_model)
        retriever = vectorstore.as_retriever()
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", SYSTEM_PROMPT),
                ("human", "{input}"),
            ]
        )
        question_answer_chain = create_stuff_documents_chain(llm, prompt)
        rag_chain = create_retrieval_chain(retriever, question_answer_chain)
        for q in questions:
            results = rag_chain.invoke({"input": q})
            outputs[paper_path][q] = results['answer']

    return outputs

So we have a function rag_qa that takes a list of local PDF file paths and the list of questions to be answered as inputs and runs the typical RAG setup process.

For each paper, the function follows these steps sequentially:

loads the pdf text -> creates chunks from pdf text -> gets the embeddings for each chunk -> stores embeddings in chroma vectorstore -> constructs a RAG based QA chain -> returns the LLM answers for each question

The Issue

Everything looks fine, right? Now let's run the code and print the outputs to see what we get:

outputs = rag_qa(pdf_paths=["attention_is_all_you_need.pdf", 
                            "generative_adversarial_nets.pdf"], 
                 questions=["What are the contribution of this paper?", 
                            "Give the names of all the authors?"])
for p in ["attention_is_all_you_need.pdf",
          "generative_adversarial_nets.pdf"]:
    print("-------------------------------------")
    print(f"Displaying results for {p}")
    print("QUESTION: What are the contribution of this paper?")
    print("ANSWER: " + outputs[p]["What are the contribution of this paper?"])

And the output is:

-------------------------------------
Displaying results for attention_is_all_you_need.pdf
QUESTION: What are the contribution of this paper?
ANSWER: The paper "Attention Is All You Need" introduces the 
Transformer architecture, which is based solely on attention 
mechanisms and eliminates the need for recurrence and convolutions. 
Key contributions include the proposal of scaled dot-product 
attention, multi-head attention, and a parameter-free position 
representation. The paper also demonstrates that the 
Transformer generalizes well to other tasks, such as 
English constituency parsing, with both large and limited training data.
-------------------------------------
Displaying results for generative_adversarial_nets.pdf
QUESTION: What are the contribution of this paper?
ANSWER: The paper "Attention Is All You Need" introduces 
the Transformer model, a new network architecture based 
solely on attention mechanisms, eliminating the need for 
recurrence and convolutions. It demonstrates that the 
Transformer generalizes well to other tasks, such as 
English constituency parsing, with both large and limited training data. Key contributions include the introduction of scaled dot-product attention, multi-head attention, and a parameter-free position representation.

Hmm, something seems off. The answer for the "Attention Is All You Need" paper seems correct, but the answer for the "Generative Adversarial Nets" paper (which introduced the GAN setup, not the Transformer architecture) is wrong. Somehow the info from "Attention Is All You Need" is leaking into the QA for "Generative Adversarial Nets". Such data leakage is bad news: beyond the answers simply being wrong, it can also pose privacy issues for your application.
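A quick way to spot this kind of leak without reading every answer is to look at where the retrieved chunks came from. The sketch below is only an illustration: foreign_sources is a hypothetical helper name, and it assumes each retrieved chunk carries a metadata["source"] entry holding the PDF path (which PyPDFLoader normally sets) and that the retrieved chunks are available under results["context"] in the chain output. The stand-in documents are made up so the snippet runs on its own.

```python
from types import SimpleNamespace


def foreign_sources(context_docs, expected_source):
    """Return the source paths found in the retrieved chunks that are not expected_source."""
    return {
        doc.metadata.get("source")
        for doc in context_docs
        if doc.metadata.get("source") != expected_source
    }


# Stand-in documents for illustration only. In the pipeline above these
# would be results["context"] from rag_chain.invoke({"input": q}).
retrieved = [
    SimpleNamespace(metadata={"source": "attention_is_all_you_need.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "generative_adversarial_nets.pdf", "page": 1}),
]

leaked = foreign_sources(retrieved, "generative_adversarial_nets.pdf")
print(leaked)  # {'attention_is_all_you_need.pdf'}
```

A non-empty result while answering questions about one paper means chunks from another paper reached the prompt.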

Can you guess what is causing this leakage? After a bit of debugging I found out that the following line is the culprit:

vectorstore = Chroma.from_documents(documents=chunks, embedding=embedding_model)

Once the first for loop iteration (for "Attention Is All You Need") has finished and the second iteration (for "Generative Adversarial Nets") has begun, execution reaches this line again:

Chroma.from_documents(documents=chunks, embedding=embedding_model)

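I assumed that this would return a new vectorstore that retains no embedding data from the previous iteration of the for loop (i.e., from "Attention Is All You Need"). However, that is not the case: the vectorstore returned contains the new embeddings as well as the previous iteration's. A likely explanation, which the documentation does not spell out, is that when no collection_name is passed every call falls back to the same default collection, and within one Python process the in-memory Chroma client is shared, so the second call adds to the existing collection instead of creating a fresh one. We can confirm the retention by adding a couple of simple prints that show how many embedding vectors are added in each iteration of the for loop and how many entries are present in the vectorstore after each one. After these changes the code looks like the following: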

from langchain_community.document_loaders import PyPDFLoader
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

SYSTEM_PROMPT = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

def rag_qa(pdf_paths, questions):
    llm = ChatOpenAI(model="gpt-4o")
    embedding_model = OpenAIEmbeddings()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    outputs = {}
    for paper_path in pdf_paths:
        outputs[paper_path] = {}
        loader = PyPDFLoader(paper_path)
        docs = loader.load()
        chunks = text_splitter.split_documents(docs)
        vectorstore = Chroma.from_documents(documents=chunks, embedding=embedding_model)
        num_vectors_in_vectorstore = len(vectorstore.get()["ids"])
        print("------------------------------")
        print(f"Currently running iteration for {paper_path}")
        print(f"Number of chunk embeddings in this iteration : {len(chunks)}")
        print(f"Number of entries in vectorstore at this iteration : {num_vectors_in_vectorstore}")
        retriever = vectorstore.as_retriever()
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", SYSTEM_PROMPT),
                ("human", "{input}"),
            ]
        )
        question_answer_chain = create_stuff_documents_chain(llm, prompt)
        rag_chain = create_retrieval_chain(retriever, question_answer_chain)
        for q in questions:
            results = rag_chain.invoke({"input": q})
            outputs[paper_path][q] = results['answer']

    return outputs

outputs = rag_qa(pdf_paths=["attention_is_all_you_need.pdf", "generative_adversarial_nets.pdf"], questions=["What are the contribution of this paper?", "Give the names of all the authors?"])

And the output is:

------------------------------
Currently running iteration for attention_is_all_you_need.pdf
Number of chunk embeddings in this iteration : 52
Number of entries in vectorstore at this iteration : 52
------------------------------
Currently running iteration for generative_adversarial_nets.pdf
Number of chunk embeddings in this iteration : 39
Number of entries in vectorstore at this iteration : 91

As we can see, during the second iteration the vectorstore still held all 52 chunk embeddings from the previous iteration (52 + 39 = 91), which is what causes the leakage.

The Fix

A simple fix is to "clear" the vectorstore after each iteration! This can be done by calling vectorstore.delete_collection() at the end of each iteration, which removes all the vector embeddings that were added to the vectorstore. So the code now looks like this:

from langchain_community.document_loaders import PyPDFLoader
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

SYSTEM_PROMPT = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

def rag_qa(pdf_paths, questions):
    llm = ChatOpenAI(model="gpt-4o")
    embedding_model = OpenAIEmbeddings()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    outputs = {}
    for paper_path in pdf_paths:
        outputs[paper_path] = {}
        loader = PyPDFLoader(paper_path)
        docs = loader.load()
        chunks = text_splitter.split_documents(docs)
        vectorstore = Chroma.from_documents(documents=chunks, embedding=embedding_model)
        num_vectors_in_vectorstore = len(vectorstore.get()["ids"])
        # print("------------------------------")
        # print(f"Currently running iteration for {paper_path}")
        # print(f"Number of chunk embeddings in this iteration : {len(chunks)}")
        # print(f"Number of entries in vectorstore at this iteration : {num_vectors_in_vectorstore}")
        retriever = vectorstore.as_retriever()
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", SYSTEM_PROMPT),
                ("human", "{input}"),
            ]
        )
        question_answer_chain = create_stuff_documents_chain(llm, prompt)
        rag_chain = create_retrieval_chain(retriever, question_answer_chain)
        for q in questions:
            results = rag_chain.invoke({"input": q})
            outputs[paper_path][q] = results['answer']
        vectorstore.delete_collection()

    return outputs

outputs = rag_qa(pdf_paths=["attention_is_all_you_need.pdf", "generative_adversarial_nets.pdf"], questions=["What are the contribution of this paper?", "Give the names of all the authors?"])
for p in ["attention_is_all_you_need.pdf",
          "generative_adversarial_nets.pdf"]:
    print("-------------------------------------")
    print(f"Displaying results for {p}")
    print("QUESTION: What are the contribution of this paper?")
    print("ANSWER: " + outputs[p]["What are the contribution of this paper?"])

And the output for this is:


-------------------------------------
Displaying results for attention_is_all_you_need.pdf
QUESTION: What are the contribution of this paper?
ANSWER: The paper "Attention Is All You Need" presents the Transformer 
model, which relies solely on attention mechanisms, eliminating the need 
for recurrence and convolutions. The authors demonstrated that 
the Transformer generalizes well to various tasks, including English 
constituency parsing, with both large and limited training data. 
Additionally, the paper introduces key innovations such as 
scaled dot-product attention, multi-head attention, and a 
parameter-free position representation.
-------------------------------------
Displaying results for generative_adversarial_nets.pdf
QUESTION: What are the contribution of this paper?
ANSWER: The paper introduces a new framework for estimating generative 
models through an adversarial process, which involves training a 
generative model \( G \) to capture the data distribution and a 
discriminative model \( D \) to distinguish between real and generated 
data. This adversarial training corresponds to a minimax two-player game. 
Additionally, the paper demonstrates the viability of this framework, 
suggesting potential research directions in semi-supervised learning, 
efficiency improvements, and learned approximate inference.
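One caveat with calling delete_collection() at the end of the loop body is that it is skipped if anything in the iteration raises. A possible variation, sketched here under my own assumptions rather than taken from the LangChain docs, is to give each paper its own uniquely named collection and wrap the cleanup in a context manager. The names unique_collection_name and temporary_vectorstore and the factory argument are hypothetical and exist only for this illustration; collection_name is a parameter that Chroma.from_documents accepts.

```python
import uuid
from contextlib import contextmanager


def unique_collection_name(prefix="paper"):
    """Build a collection name that differs on every call."""
    return f"{prefix}-{uuid.uuid4().hex}"


@contextmanager
def temporary_vectorstore(chunks, embedding_model, factory=None):
    """Yield a vectorstore with its own collection and delete that collection on exit."""
    if factory is None:
        # Imported lazily so the helper can be defined without the package installed.
        from langchain_chroma import Chroma

        factory = Chroma.from_documents
    vectorstore = factory(
        documents=chunks,
        embedding=embedding_model,
        collection_name=unique_collection_name(),
    )
    try:
        yield vectorstore
    finally:
        # Runs even if a question fails halfway through the iteration.
        vectorstore.delete_collection()


# Possible use inside the loop of rag_qa:
#
# with temporary_vectorstore(chunks, embedding_model) as vectorstore:
#     retriever = vectorstore.as_retriever()
#     ...build the chain and answer the questions...
```

The unique name keeps papers apart even if a cleanup is missed, and the finally block keeps collections from piling up over a long run.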

As you can see, the issue is fixed! I hope this helps you avoid inadvertent data leakage while building RAG applications. Do let me know if you have any further questions. See you next time!