Posts

Showing posts from March 24, 2024

[Day 83] Summary of my PDF RAG from scratch

Hello :) Today is Day 83!

A quick summary of today: I played around with llama-index and PDF parsing, and then organised the GitHub repo to be a bit more presentable.

Below I will go over the project.

Used PDF: Lloyds Banking Group's Relationship T&C from their website. It includes high-level information about accounts, and information about the user and the Bank's rights.

Embedding model: mixedbread-ai/mxbai-embed-large-v1 from Hugging Face

Language model: google/gemma-2b-it from Hugging Face

Repo structure:

PDF parsing:
For the main RAG chat (rag.py) -> PyMuPDF and langchain's RecursiveCharacterTextSplitter
For the dev RAG (dev_rag.ipynb) -> llama-index and langchain's RecursiveCharacterTextSplitter

RAG chat: I call the main part the 'RAG chat'. It consists of preprocess_pdf.py and rag.py, and with them, using the terminal, a user can read, preprocess a PDF,

Here is a demo I recorded from my terminal

Dev PDF chat: I experimented a bit with chunk_size and chunk_overl