50+ days of Machine Learning

Posts

Showing posts from March 22, 2024

[Day 81] RAG from scratch - chunking is very important!

3/22/2024 11:57:00 pm

Hello :) Today is Day 81! A quick summary of today: Continued with my custom RAG from scratch project - extracting knowledge from a bank's terms and conditions ( github repo ). So, yesterday my main struggle was reading tables. I remembered Gemini can read pictures, so I gave the table below to Gemini to try to read it and give me the text. The output was not very good. Gemini could not read this table very well, and such tables were common in yesterday's PDF. Yes, this is an image, but since even a powerful LLM like Gemini could not read this table and output it as text, I gave up on this particular PDF, for now at least, and looked for one with more straightforward text and fewer tables. The newly chosen PDF is here: on github. Now... given an allegedly simpler PDF, I used the code as it was to get outputs. But the results were just... really bad. Most times, even though the scores were high and the top-1 included the exact answer, the output was ~'the context does not ...

[Day 80] Starting to write my own RAG from scratch on a bank's T&C pdf

3/22/2024 01:06:00 am

Hello :) Today is Day 80! A quick summary of today: based on yesterday's RAG from scratch, I decided to create a new one, based on a PDF chosen by me. I uploaded the code so far to this github repo , and below is an overview of my progress so far. Firstly, what document? I wanted to use a document that is shorter (the tutorial used a 1200-page doc), but a bit more complex - including tables and more numbers. So I chose a bank's 32-page terms and conditions document ( source ). The file itself is in the github repo. Sample content. To start, I decided to follow the tutorial's code, but shorten it and make it more comfortable to use. To embed the text I tried using mixedbread-ai/mxbai-embed-large-v1 (new SoTA) and all-mpnet-base-v2 (from the tutorial). Each model has to be used following specific instructions, for example, the all-mpnet one directly outputs normalised scores, but the mixedbread-ai one outputs raw scores, and cosine similarity function needs to be...
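To illustrate why normalisation matters when comparing embeddings, here is a minimal sketch in plain Python. It is not the code from the repo: the two short vectors and the helper names (`dot`, `norm`, `normalise`, `cosine_similarity`) are made up for illustration, and real embeddings would come from the embedding model. It shows that a raw dot product grows with vector length, while cosine similarity (or a dot product of unit-length vectors) only reflects direction.

```python
import math

def dot(a, b):
    # Plain dot product: sensitive to the length of both vectors.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean (L2) length of a vector.
    return math.sqrt(dot(a, a))

def normalise(a):
    # Scale a vector to unit length so that dot product == cosine similarity.
    n = norm(a)
    return [x / n for x in a]

def cosine_similarity(a, b):
    # Dot product divided by both lengths: depends only on the angle.
    return dot(a, b) / (norm(a) * norm(b))

# Hypothetical "raw" (un-normalised) embeddings pointing in the same direction.
raw_query = [3.0, 4.0]
raw_chunk = [6.0, 8.0]

raw_dot = dot(raw_query, raw_chunk)                         # large, length-dependent
cos = cosine_similarity(raw_query, raw_chunk)               # close to 1.0
unit_dot = dot(normalise(raw_query), normalise(raw_chunk))  # matches cos

print(raw_dot, cos, unit_dot)
```

So if a model already returns unit-length vectors, a dot product is enough for ranking; if it returns raw vectors, either normalise them first or use cosine similarity so that scores from different chunks are comparable.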