[Day 83] Summary of my PDF RAG from scratch

 Hello :)
Today is Day 83!


A quick summary of today:
  • Played around with llama-index for PDF parsing, and then organised the GitHub repo to be a bit more presentable


Below I will go over the project.


PDF used: Lloyds Banking Group's Relationship T&C from their website. It includes high-level information about accounts, and information about the rights of the user and the Bank.

Embedding model: mixedbread-ai/mxbai-embed-large-v1 from Hugging Face

Language model: google/gemma-2b-it from Hugging Face

Repo structure:


PDF parsing

  1. For the main RAG chat (rag.py) -> PyMuPDF and langchain's RecursiveCharacterTextSplitter
  2. For the dev RAG (dev_rag.ipynb) -> llama-index and langchain's RecursiveCharacterTextSplitter
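
As a rough illustration of what the splitter's chunk_size and chunk_overlap parameters control, here is a minimal pure-Python sketch. It is not langchain's RecursiveCharacterTextSplitter, which also tries to break on separators such as paragraphs and newlines; this simplified stand-in only cuts fixed-size character windows. The function name chunk_text and the sample text are made up for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Cut text into character windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Simplified stand-in for a real text splitter (no separator handling).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so we never emit a tail that is fully contained in the previous chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Hypothetical sample: 2500 characters of filler text.
sample = "a" * 2500
print([len(c) for c in chunk_text(sample)])                     # [1000, 1000, 500]
print([len(c) for c in chunk_text(sample, chunk_overlap=200)])  # [1000, 1000, 900]
```

With an overlap of 0 every character lands in exactly one chunk, so a sentence that straddles a boundary is split between two chunks; a non-zero overlap repeats some text so that such a sentence is more likely to appear whole in at least one chunk.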

RAG chat

I call the main part 'RAG chat' - it consists of preprocess_pdf.py and rag.py. With them, a user can read and preprocess a PDF, and then chat with it from the terminal.

Here is a demo I recorded from my terminal:



Dev PDF chat:
I experimented a bit with chunk_size and chunk_overlap - the params of langchain's text splitter - and finally settled on 1000 and 0. Those seemed to provide decent results when using llama-index as the PDF reader. Below are the sample queries and the answers provided by the model.
However, the results are not consistent. If the model starts its answer with the words 'the context', it ends up saying that the context does not provide information about the query, but sometimes when I re-run the code the answer actually gets generated. This could be due to several issues. One idea that came to mind was adding beam search: even if the model starts to generate 'the context ...' in one output, I believe the actual answer would appear in one of the others. In fact, if we add return_context=True to the ask function, the top 1-2 contexts contain the answer, but the model does not seem to see it.
Some other possible reasons: the way the context is provided (i.e. it is not clear to the model), a bad query, a bad base_prompt (how I tell the model to read the context), and probably other reasons that I am not aware of (or cannot recall) at the moment.
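
To make the debugging idea above more concrete, here is a small, hedged sketch of a retrieve-then-ask loop in plain Python. Everything in it is hypothetical: the signature of ask, the helper names (retrieve, build_prompt, pick_answer), the prompt wording, the toy 3-dimensional vectors and the made-up chunks are assumptions for illustration, not the code from the repo. The generate argument stands in for the language model and is assumed to return several candidate answers (as beam search with multiple returned sequences could); pick_answer then prefers a candidate that does not open with 'the context'.

```python
import math
from typing import Callable, Sequence

REFUSAL_PREFIX = "the context"  # opening words of the failed answers described above


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunk_vecs, chunks, top_k=2):
    """Return the top_k chunks ranked by cosine similarity to the query."""
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return [chunks[i] for i in ranked[:top_k]]


def build_prompt(query: str, contexts: Sequence[str]) -> str:
    """Number each context item so its boundaries are explicit in the prompt."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{numbered}\n\nQuestion: {query}\nAnswer:")


def pick_answer(candidates: Sequence[str]) -> str:
    """Prefer the first candidate that does not start with the refusal phrase."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    for cand in candidates:
        if not cand.strip().lower().startswith(REFUSAL_PREFIX):
            return cand
    return candidates[0]  # every candidate was a refusal; keep the first


def ask(query, query_vec, chunk_vecs, chunks,
        generate: Callable[[str], Sequence[str]],
        top_k=2, return_context=False):
    """Retrieve context, generate candidates, and optionally return the context."""
    contexts = retrieve(query_vec, chunk_vecs, chunks, top_k)
    answer = pick_answer(generate(build_prompt(query, contexts)))
    return (answer, contexts) if return_context else answer


# Toy, made-up data: real vectors would come from the embedding model.
chunks = ["Fees are listed in the tariff.",
          "You may close your account at any time.",
          "We may change these terms with notice."]
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.2, 1.0]]
query_vec = [0.1, 0.9, 0.1]


def fake_generate(prompt: str):
    """Stand-in for the LLM: returns two hard-coded candidate answers."""
    return ["The context does not provide information about this.",
            "You may close your account at any time."]


answer, contexts = ask("Can I close my account?", query_vec, chunk_vecs,
                       chunks, fake_generate, return_context=True)
print(answer)
print(contexts)
```

Returning the contexts next to the answer makes it easy to tell a retrieval miss (the answer is not in the top chunks) from a generation miss (the answer is there, but the model still refuses). Filtering on a fixed phrase is a blunt heuristic, so it is probably best treated as a diagnostic aid rather than a fix.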

Instructions for setup:
I see that other repos, and projects in general, include setup instructions. I added some as well and it looks nice haha

Closing thoughts:
This particular project uses an out-of-the-box embedding model and LLM, but the world out there might require me to make those from scratch too. I am still learning about RAG. I learned a lot about retrieval models from the ACL talk (Days 74, 75, 76), and I am looking forward to applying that knowledge and seeing it in practice.

The GitHub repo link is at the top. Any help is appreciated ^^

That is all for today!

See you tomorrow :)
