[Day 83] Summary of my PDF RAG from scratch
Hello :)
Today is Day 83!
A quick summary of today:- Played around with llama-index and parsing, and then organised the github repo to be a bit more presentable
Below I will go over the project.
Used pdf: Lloyds Banking Group's Relationship T&C from their website, includes high-level information about accounts, and information about the user and the Bank's rights.
Embedding model: mixedbread-ai/mxbai-embed-large-v1 from huggingface
Language model: google/gemma-2b-it from hugging face
Repo structure:
PDF parsing:
- For the main RAG chat (rag.py) -> using PyMuPDF and langchain's RecursiveCharacterSplitter
- For the dev RAG (dev_rag.ipynb) -> llama-index and langchain's RecursiveCharacterSplitter
RAG chat
I call the main 'RAG chat' - these are preprocess_pdf.py and rag.py, and with them, using the terminal, a user can read, preprocess a PDF,
Here is a demo I recorder from my terminal
I experimented a bit with chunk_size and chunk_overlap - the params of langchain's text splitter, and finally I settled on 1000 and 0, and those seemed to provide decent results when using llama-index as the PDF reader. Below are the sample queries and the answers provided by the model
However the results are not consistent, if a model starts to predict the words 'the context' the end would be that the context does not provide information about the query, but sometimes when I re-run the code, the answer actually gets generated. This could be due to several issues, but one idea that came to mind was adding beam search, because even if the model starts to generate 'the context ...' in one output, I believe in one of the others - the actual answer would appear. Actually, if we add return_context=True to the ask function, the top 1-2 context has the answer, but the model does not see it. Some other reasons - the way the context is provided (i.e. it is not clear to the model), bad query, bad base_prompt (how I tell the model to read for context), probably other reasons of which I am not aware at the moment (or I am not recalling them as reasons for problem).
Instructions for setup:
Instructions for setup:
I see in other repos and just in general, setup instructions are added. I added as well and it looks nice haha
Closing thoughts:
This particular one uses out-of-the-box embedding and LLM, but the world out there might require me to make those from scratch too. I am still learning about RAG, I learned a lot about retrieval models from the ACL talk (Day 74, 75, 76), and I am looking forward to applying and seeing that knowledge in practice.
The github repo link is at the top, any help is appreciated ^^
That is all for today!
See you tomorrow :)