Using ChatGPT on Local Data

Wilf (Neil Wilkinson)
5 min read · Jun 23, 2024

What if you want GPT to give you answers on local data?

Well, the folks over at LangChain have all the components of a solution.

To import the data, we use UnstructuredFileLoader and JSONLoader to load text and JSON files. We run a RecursiveCharacterTextSplitter on the raw docs to split them into smaller documents, from which we build a FAISS (Facebook AI Similarity Search) index. Using that index, we then create a ConversationalRetrievalChain.

(Note: whilst the above paragraph sounds simple enough, and indeed it is, it did take a few hours of research to find the right things to use.)

    #gets a list of documents from raw documents
    def getDocuments(self, rawDocs: list) -> list:
        text_splitter = RecursiveCharacterTextSplitter()
        return text_splitter.split_documents(rawDocs)

    #loads data from the data dir
    def loadData(self):
        documents = []
        #get text files at the location and load using UnstructuredFileLoader
        for file in glob.glob(self.dataLocation+"/*.txt"):
            print(f"Loading file {file}")
            documents.extend(self.getDocuments(UnstructuredFileLoader(file).load()))
        #get json files at the location and load using JSONLoader
        for file in glob.glob(self.dataLocation+"/*.json"):
            print(f"Loading file {file}")
            documents.extend(self.getDocuments(JSONLoader(file_path=file, jq_schema='.', text_content=False).load()))

        #create the index - here be data
        self.index = FAISS.from_documents(documents, self.embeddings)

        #create the question & answer chain - a specific llm, index and memory
        self.qa_chain = ConversationalRetrievalChain.from_llm(
            self.llm,
            retriever=self.index.as_retriever(),
            memory=self.memory
        )
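
To make the splitting step a little more concrete, here is a minimal sketch of the idea behind chunking with overlap. It is plain Python and not LangChain's implementation: the real RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and newlines before falling back to raw characters. The helper name chunk_text and the sizes used here are made up for illustration.

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Cut text into fixed-size windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window moves forward
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "Tilly and Leon are dogs. Tilly's date of birth is 20th May 2014."
for chunk in chunk_text(sample):
    print(repr(chunk))
```

The overlap means a sentence cut in half at a chunk boundary still appears whole in at least one neighbouring chunk, which helps the later similarity search.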

Now querying that data is very straightforward:

    #take a question, invoke the q&a chain and return the answer
    def query(self, question: str) -> str:
        result = self.qa_chain.invoke({"question": question})
        return result['answer']
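
Roughly speaking, a retrieval chain does two things for each question: it fetches the chunks most similar to the question, then hands those chunks plus the question to the model. The sketch below imitates that flow in plain Python. Word overlap stands in for embedding similarity, and the names words, retrieve and build_prompt, the stop-word list and the prompt wording are all hypothetical, not LangChain's actual internals.

```python
import re

# Tiny stop-word list so filler words do not dominate the match (illustrative only)
STOP_WORDS = {"what", "is", "the", "of", "a", "and", "are"}


def words(text: str) -> set[str]:
    """Lower-case word set, a crude stand-in for an embedding vector."""
    return set(re.findall(r"[a-z0-9#]+", text.lower())) - STOP_WORDS


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks sharing the most words with the question."""
    q = words(question)
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text a chat model would receive (wording is illustrative)."""
    joined = "\n".join(context)
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )


chunks = [
    "Tilly and Leon are dogs.",
    "Tilly's date of birth is 20th May 2014",
    '{"color": "green", "hexValue": "#0f0"}',
]
question = "what is the hex value of green?"
print(build_prompt(question, retrieve(question, chunks, k=1)))
```

With k=1 only the colour entry ends up in the prompt, which is how an answer can quote #0f0 without the model ever having seen the rest of the data for that question.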

Finally some code to make this happen:

if __name__ == '__main__':

    #init all the things
    aiData = AiData("gpt-3.5-turbo", os.environ['OPENAI_API_KEY'], "../data")

    #load the things
    aiData.loadData()

    #use stdin to ask away to your heart's content
    while True:
        #prompt the user for a question
        question = input("Ask a question or type 'exit': ")

        if question.lower() == "exit":
            break

        #if this doesn't make sense, go to a doctor now
        answer = aiData.query(question)
        print("Answer:", answer)

The data I gave it:

Tilly and Leon are dogs.
Tilly's date of birth is 20th May 2014
Leon's date of birth is 9th July 2018
[
    {
        "color": "red",
        "hexValue": "#f00"
    },
    {
        "color": "green",
        "hexValue": "#0f0"
    },
    {
        "color": "blue",
        "hexValue": "#00f"
    },
    {
        "color": "cyan",
        "hexValue": "#0ff"
    },
    {
        "color": "magenta",
        "hexValue": "#f0f"
    },
    {
        "color": "yellow",
        "hexValue": "#ff0"
    },
    {
        "color": "black",
        "hexValue": "#000"
    }
]

The text from this page:

Some example questions

Question: what is the age difference between Tilly and Leon?

Tilly was born in May 2014, and Leon was born in July 2018. To find the age difference, we need to calculate the difference in years between their birthdates. Tilly is older, so we subtract Leon’s birth year from Tilly’s birth year: 2018–2014 = 4 years. Therefore, the age difference between Tilly and Leon is 4 years.
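
The model only compares the years. As a quick cross-check, the exact gap can be worked out with Python's standard datetime module. The two dates come from the data above; the variable names are just for illustration.

```python
from datetime import date

tilly = date(2014, 5, 20)
leon = date(2018, 7, 9)

# Subtracting two dates gives a timedelta holding the gap in days
gap = leon - tilly
print(gap.days)  # 1511

# Whole years: take one off if Leon's month/day falls before Tilly's
years = leon.year - tilly.year - ((leon.month, leon.day) < (tilly.month, tilly.day))
print(years)  # 4
```

So the answer of 4 years holds, with a little over seven weeks to spare.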

Question: what is the hex value of green?

The hex value of green is #0f0.

Question: what are the main problems when upgrading to python 3 in bullet form

- Python 3 broke backward compatibility with Python 2, making it difficult for Python 2 code to run unmodified on Python 3.
- The changes in Python 3, such as the print function becoming a built-in function, removal of the input function, and changes in integer division functionality, required significant modifications to existing code.
- The tool “2to3” was introduced to help with the translation of Python 2 code to Python 3, but there were aspects of translation that the tool could not handle.
- Projects requiring compatibility with both Python 2 and 3 were recommended to have one source for Python 2.x and use compatibility modules for Python 3.x, which could be challenging to manage.
- Python 3 introduced new features and removed backward compatibility features, which required developers to adapt their code to the new syntax and functionality.

However, it doesn’t know about anything outside the local data:

Question: what is the capital of France?
I don’t know.

Privacy Concerns?

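The FAISS index itself is built and held locally, but as far as I can tell the local data does not stay fully private. OpenAIEmbeddings sends each document chunk to OpenAI’s embeddings API when the index is created, and on every question the chain passes the retrieved chunks to the chat model as context. So treat anything you put in the data directory as shared with OpenAI.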

Comment below if you think otherwise.

Want to run this yourself?

All you’ll need is Docker set up, the files below (or a clone of https://github.com/nfitbh/ai-local-data) and an OPENAI_API_KEY in your environment. Then create a directory called data and place your .txt and .json files in it.

- data
    [txt and json files here]
- src
    main.py
- Dockerfile
- docker-run.sh
- requirements.txt
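
Before building the image it can be worth checking that both prerequisites are in place. The following is an optional helper and not part of the repository; the function name preflight and its messages are invented for this sketch.

```python
import glob
import os


def preflight(data_dir: str = "data") -> list[str]:
    """Return a list of problems; an empty list means ready to go."""
    problems = []
    # The container is started with this variable passed through
    if not os.environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # The loader only looks for .txt and .json files in the data directory
    files = glob.glob(os.path.join(data_dir, "*.txt")) + glob.glob(
        os.path.join(data_dir, "*.json")
    )
    if not files:
        problems.append(f"no .txt or .json files found in {data_dir!r}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("Problem:", problem)
```

Run it from the project root; no output means the key is set and there is at least one file to index.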

Complete code here:

src/main.py:

import os
import glob

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_community.document_loaders import JSONLoader, UnstructuredFileLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


class AiData:

    def __init__(self, llmName: str, openAiApiKey: str, dataLocation: str):
        #set up some config
        #use OpenAI (reads OPENAI_API_KEY from the environment)
        self.embeddings = OpenAIEmbeddings()

        #store the init data
        self.dataLocation = dataLocation
        self.openAiApiKey = openAiApiKey
        self.llm_name = llmName

        #init the openai things
        self.llm = ChatOpenAI(
            model_name=self.llm_name,
            temperature=0,
            openai_api_key=self.openAiApiKey
        )

        #init the conversation memory
        self.memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True
        )

        #we don't have an index yet (but this is an important thingy)
        self.index = None

    #gets a list of documents from raw documents
    def getDocuments(self, rawDocs: list) -> list:
        text_splitter = RecursiveCharacterTextSplitter()
        return text_splitter.split_documents(rawDocs)

    #loads data from the data dir
    def loadData(self):
        documents = []
        #get text files at the location and load using UnstructuredFileLoader
        for file in glob.glob(self.dataLocation+"/*.txt"):
            print(f"Loading file {file}")
            documents.extend(self.getDocuments(UnstructuredFileLoader(file).load()))
        #get json files at the location and load using JSONLoader
        for file in glob.glob(self.dataLocation+"/*.json"):
            print(f"Loading file {file}")
            documents.extend(self.getDocuments(JSONLoader(file_path=file, jq_schema='.', text_content=False).load()))

        #create the index - here be data
        self.index = FAISS.from_documents(documents, self.embeddings)

        #create the question & answer chain - a specific llm, index and memory
        self.qa_chain = ConversationalRetrievalChain.from_llm(
            self.llm,
            retriever=self.index.as_retriever(),
            memory=self.memory
        )

    #take a question, invoke the q&a chain and return the answer
    def query(self, question: str) -> str:
        result = self.qa_chain.invoke({"question": question})
        return result['answer']

if __name__ == '__main__':

    #init all the things
    aiData = AiData("gpt-3.5-turbo", os.environ['OPENAI_API_KEY'], "../data")

    #load the things
    aiData.loadData()

    #use stdin to ask away to your heart's content
    while True:
        #prompt the user for a question
        question = input("Ask a question or type 'exit': ")

        if question.lower() == "exit":
            break

        #if this doesn't make sense, go to a doctor now
        answer = aiData.query(question)
        print("Answer:", answer)

Dockerfile:

ARG PYTHON_VERSION=3.11.4
FROM python:${PYTHON_VERSION}-slim as base

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /app

RUN mkdir -p /data

#RUN apt-get update -y
#RUN apt install python3 -y
#RUN pip install --upgrade pip
RUN --mount=type=cache,target=/root/.cache/pip \
    --mount=type=bind,source=requirements.txt,target=requirements.txt \
    pip3 install -r requirements.txt

COPY src/*.py /app/

COPY data/* /data/

CMD python3 /app/main.py

requirements.txt:

langchain==0.2.5
langchain-chroma==0.1.1
langchain-cli==0.0.25
langchain-community==0.2.5
langchain-core==0.2.9
langchain-openai==0.1.9
langchain-text-splitters==0.2.1
faiss-cpu==1.8.0
unstructured==0.14.7
unstructured-client==0.23.7
jq==1.7.0

docker-run.sh:

docker build . -t ai-data-docker
docker run -e OPENAI_API_KEY=${OPENAI_API_KEY} --rm -it --name ai-data ai-data-docker

Wilf (Neil Wilkinson)

A self-professed nerd that works as a CTO by day and a gamer/programmer by night