Using ChatGPT on Local Data
What if you want GPT to give you answers based on your own local data?
Well, the folks over at LangChain have all the components of a solution.
For importing the data, we use UnstructuredFileLoader and JSONLoader to load text and JSON files. We run a RecursiveCharacterTextSplitter on the raw docs to produce documents, from which we then build a FAISS (Facebook AI Similarity Search) index. Using that index, we create a ConversationalRetrievalChain.
(Note: whilst the above paragraph sounds simple enough — indeed it is — this did take a few hours of research to find the right things to use.)
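(For reference, the snippets that follow rely on these imports; they're consolidated from the complete listing at the end of the post.)

import os
import glob

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import UnstructuredFileLoader, JSONLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory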
#gets a list of documents from raw documents
def getDocuments(self, rawDocs: list) -> list:
    text_splitter = RecursiveCharacterTextSplitter()
    return text_splitter.split_documents(rawDocs)

#loads data from the data dir
def loadData(self):
    documents = []
    #get text files at the location and load using UnstructuredFileLoader
    for file in glob.glob(self.dataLocation + "/*.txt"):
        print("Loading file %s" % file)
        documents.extend(self.getDocuments(UnstructuredFileLoader(file).load()))
    #get json files at the location and load using JSONLoader
    for file in glob.glob(self.dataLocation + "/*.json"):
        print("Loading file %s" % file)
        documents.extend(self.getDocuments(JSONLoader(file_path=file, jq_schema='.', text_content=False).load()))
    #create the index - here be data
    self.index = FAISS.from_documents(documents, self.embeddings)
    #create the question & answer chain - a specific llm, index and memory
    self.qa_chain = ConversationalRetrievalChain.from_llm(
        self.llm,
        retriever=self.index.as_retriever(),
        memory=self.memory
    )
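Two things worth noting in loadData. First, RecursiveCharacterTextSplitter runs with its defaults above; if your documents are large you may want to size the chunks yourself. A sketch (the values are illustrative, not tuned):

#same helper, with explicit chunk sizing
def getDocuments(self, rawDocs: list) -> list:
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return text_splitter.split_documents(rawDocs)

Second, jq_schema='.' tells JSONLoader to take each file's JSON wholesale. The schema is a jq expression, so you could pull out just part of a structure instead, e.g. only the color names from the JSON data shown later (the file name here is hypothetical):

#load only the color values from an array of {color, hexValue} objects
loader = JSONLoader(file_path="../data/colors.json", jq_schema=".[].color", text_content=True)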
Now querying that data is very straightforward:
#take a question, invoke the q&a chain and return the answer
def query(self, question: str) -> str:
    result = self.qa_chain.invoke({"question": question})
    return result['answer']
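If you also want to see which chunks an answer came from, the chain supports return_source_documents=True. One gotcha I believe applies here: the memory then needs output_key="answer" so it knows which of the chain's outputs to store. A sketch, untested against this exact code:

#memory must be told which output to store when sources are returned
self.memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)
self.qa_chain = ConversationalRetrievalChain.from_llm(
    self.llm,
    retriever=self.index.as_retriever(),
    memory=self.memory,
    return_source_documents=True
)
result = self.qa_chain.invoke({"question": question})
for doc in result['source_documents']:
    print(doc.metadata)  #e.g. the source file of each retrieved chunk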
Finally, some code to make this happen:

if __name__ == '__main__':
    #init all the things
    aiData = AiData("gpt-3.5-turbo", os.environ['OPENAI_API_KEY'], "../data")
    #load the things
    aiData.loadData()
    #use stdin to ask away to your heart's content
    while True:
        #prompt the user to enter a question
        question = input("Ask a question or type 'exit': ")
        if question.lower() == "exit":
            break
        #if this doesn't make sense, go to a doctor now
        answer = aiData.query(question)
        print("Answer: ", answer)
The data I gave it:
Tilly and Leon are dogs.
Tilly's date of birth is 20th May 2014
Leon's date of birth is 9th July 2018
[
{
"color": "red",
"hexValue": "#f00"
},
{
"color": "green",
"hexValue": "#0f0"
},
{
"color": "blue",
"hexValue": "#00f"
},
{
"color": "cyan",
"hexValue": "#0ff"
},
{
"color": "magenta",
"hexValue": "#f0f"
},
{
"color": "yellow",
"hexValue": "#ff0"
},
{
"color": "black",
"hexValue": "#000"
}
]
The text from this page:
Some example questions
Question: what is the age difference between Tilly and Leon?
Tilly was born in May 2014, and Leon was born in July 2018. To find the age difference, we need to calculate the difference in years between their birthdates. Tilly is older, so we subtract Leon’s birth year from Tilly’s birth year: 2018–2014 = 4 years. Therefore, the age difference between Tilly and Leon is 4 years.
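(As an aside, the model's arithmetic is easy to sanity-check with the standard library, using the dates from the text file above:)

from datetime import date

tilly = date(2014, 5, 20)
leon = date(2018, 7, 9)
days = (leon - tilly).days
print("Difference: %d days (~%.1f years)" % (days, days / 365.25))
#Difference: 1511 days (~4.1 years)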
Question: what is the hex value of green?
The hex value of green is #0f0.
Question: what are the main problems when upgrading to python 3 in bullet form
- Python 3 broke backward compatibility with Python 2, making it difficult for Python 2 code to run unmodified on Python 3.
- The changes in Python 3, such as the print function becoming a built-in function, removal of the input function, and changes in integer division functionality, required significant modifications to existing code.
- The tool “2to3” was introduced to help with the translation of Python 2 code to Python 3, but there were aspects of translation that the tool could not handle.
- Projects requiring compatibility with both Python 2 and 3 were recommended to have one source for Python 2.x and use compatibility modules for Python 3.x, which could be challenging to manage.
- Python 3 introduced new features and removed backward compatibility features, which required developers to adapt their code to the new syntax and functionality.
Whereas it doesn’t know about anything beyond the local data:
Question: what is the capital of France?
I don’t know.
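That “I don’t know” comes from the chain’s default question-answering prompt, which instructs the model to answer from the retrieved context. If you want to tighten or change that behaviour, you can override the prompt via combine_docs_chain_kwargs. A sketch, untested against this setup:

from langchain.prompts import PromptTemplate

#pin the model to the retrieved context only
qa_prompt = PromptTemplate.from_template(
    "Answer only from the context below. If the answer is not there, "
    "say \"I don't know.\"\n\nContext: {context}\n\nQuestion: {question}"
)
self.qa_chain = ConversationalRetrievalChain.from_llm(
    self.llm,
    retriever=self.index.as_retriever(),
    memory=self.memory,
    combine_docs_chain_kwargs={"prompt": qa_prompt}
)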
Privacy Concerns?
To be clear, the local data does get sent to OpenAI: the document chunks go to the embeddings API when the index is built, and the most relevant chunks are sent again as context with each question. The FAISS index itself stays on your machine, and OpenAI states that API data isn’t used for training by default, but this is not a fully private setup.
Comment below if you think otherwise.
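If you need the documents themselves to never leave your machine, one option (not used in this post) is to swap OpenAIEmbeddings for a local embedding model, e.g. HuggingFaceEmbeddings from langchain_community, which runs a sentence-transformers model on your own hardware:

from langchain_community.embeddings import HuggingFaceEmbeddings

#runs locally; needs the sentence-transformers package installed
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.from_documents(documents, embeddings)

Even then, the questions and retrieved chunks still go to OpenAI via ChatOpenAI, so a fully private setup would need a local LLM as well.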
Want to run this yourself?
All you’ll need is Docker set up, the files below (or https://github.com/nfitbh/ai-local-data), and an OPENAI_API_KEY in your environment. Also create a directory called data and put .txt and .json files in there:
- data
[txt and json files here]
- src
main.py
Dockerfile
docker-run.sh
requirements.txt
Complete code here:
src/main.py:
import os
import glob

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import UnstructuredFileLoader, JSONLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
class AiData:
    def __init__(self, llmName: str, openAiApiKey: str, dataLocation: str):
        #set up some config
        #use OpenAI embeddings (you'll need an OPENAI_API_KEY)
        self.embeddings = OpenAIEmbeddings()
        #store the init data
        self.dataLocation = dataLocation
        self.openAiApiKey = openAiApiKey
        self.llm_name = llmName
        #init the openai things
        self.llm = ChatOpenAI(
            model_name=self.llm_name,
            temperature=0,
            openai_api_key=self.openAiApiKey
        )
        #init the conversation memory
        self.memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True
        )
        #we don't have an index yet (but this is an important thingy)
        self.index = None
    #gets a list of documents from raw documents
    def getDocuments(self, rawDocs: list) -> list:
        text_splitter = RecursiveCharacterTextSplitter()
        return text_splitter.split_documents(rawDocs)

    #loads data from the data dir
    def loadData(self):
        documents = []
        #get text files at the location and load using UnstructuredFileLoader
        for file in glob.glob(self.dataLocation + "/*.txt"):
            print("Loading file %s" % file)
            documents.extend(self.getDocuments(UnstructuredFileLoader(file).load()))
        #get json files at the location and load using JSONLoader
        for file in glob.glob(self.dataLocation + "/*.json"):
            print("Loading file %s" % file)
            documents.extend(self.getDocuments(JSONLoader(file_path=file, jq_schema='.', text_content=False).load()))
        #create the index - here be data
        self.index = FAISS.from_documents(documents, self.embeddings)
        #create the question & answer chain - a specific llm, index and memory
        self.qa_chain = ConversationalRetrievalChain.from_llm(
            self.llm,
            retriever=self.index.as_retriever(),
            memory=self.memory
        )

    #take a question, invoke the q&a chain and return the answer
    def query(self, question: str) -> str:
        result = self.qa_chain.invoke({"question": question})
        return result['answer']
if __name__ == '__main__':
    #init all the things
    aiData = AiData("gpt-3.5-turbo", os.environ['OPENAI_API_KEY'], "../data")
    #load the things
    aiData.loadData()
    #use stdin to ask away to your heart's content
    while True:
        #prompt the user to enter a question
        question = input("Ask a question or type 'exit': ")
        if question.lower() == "exit":
            break
        #if this doesn't make sense, go to a doctor now
        answer = aiData.query(question)
        print("Answer: ", answer)
Dockerfile:
ARG PYTHON_VERSION=3.11.4
FROM python:${PYTHON_VERSION}-slim as base
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
WORKDIR /app
RUN mkdir -p /data
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=requirements.txt,target=requirements.txt \
pip3 install -r requirements.txt
COPY src/*.py /app
COPY data/* /data
CMD python3 /app/main.py
requirements.txt:
langchain==0.2.5
langchain-chroma==0.1.1
langchain-cli==0.0.25
langchain-community==0.2.5
langchain-core==0.2.9
langchain-openai==0.1.9
langchain-text-splitters==0.2.1
faiss-cpu==1.8.0
unstructured==0.14.7
unstructured-client==0.23.7
jq==1.7.0
docker-run.sh:
docker build . -t ai-data-docker
docker run -e OPENAI_API_KEY=${OPENAI_API_KEY} --rm -it --name ai-data ai-data-docker