[Day 203] Starting LLM zoomcamp module 4 - Monitoring

 Hello :)
Today is Day 203!


A quick summary of today:
  • meeting with my lab mate for the KB AI competition project
  • starting LLM zoomcamp module 4


Today we met to talk about neo4j and doing EDA for the Kukmin Bank AI competition. We discussed the benefits of docker and docker-compose, and how to use neo4j. Our next step is EDA. That might sound a bit odd given that yesterday I already created a GNN, but there I just used the raw data without many engineered features; the point was to make sure I could get even a simple GNN working at all.

My lab mate suggested splitting the dataset's columns in half and checking what might need to be cleaned or changed in each half. In total there are 22 columns, so we split them down the middle.

These are the columns assigned to me: trans_date_trans_time, cc_num, merchant, category, amt, first, last, gender, street, city, state, zip, lat, long, city_pop, job, dob, trans_num, unix_time, merch_lat, merch_long, is_fraud
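To get the EDA going, a minimal first-pass sketch might look like the following (the filename and the column subset here are just placeholders for the competition data):

```python
import pandas as pd

# hypothetical filename for the competition's transactions data
df = pd.read_csv("transactions.csv", parse_dates=["trans_date_trans_time", "dob"])

# a subset of the columns assigned to me
cols = ["trans_date_trans_time", "amt", "category", "city_pop",
        "lat", "long", "merch_lat", "merch_long", "is_fraud"]

print(df[cols].dtypes)                              # types to fix (dates, categories)
print(df[cols].isna().sum())                        # missing values per column
print(df[cols].nunique())                           # cardinality of each column
print(df["is_fraud"].value_counts(normalize=True))  # class balance
```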


On another note, today I covered about 90% of LLM zoomcamp's 4th module. Below is a summary.

We started by loading a document where each item contains a question, the course it comes from, the document it belongs to, the original answer, and the answer from the LLM (answer_llm is the newly added field).
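A minimal sketch of that loading step, assuming the records were saved to a hypothetical results.csv with one row per question and column names matching the fields listed above:

```python
import pandas as pd

# hypothetical filename; each row holds one evaluated question
df = pd.read_csv("results.csv")

# fields described above: the LLM's answer, the original answer,
# the source document id, the question text, and the course
print(df[["answer_llm", "answer_orig", "document", "question", "course"]].head())
```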

To get an LLM answer for each doc (~1800 docs in total) we used gpt-3.5-turbo, gpt-4o and gpt-4o-mini and compared them. The fastest was gpt-3.5-turbo, which ran in just a few minutes, while the slowest was gpt-4o, which took ~2 hours. gpt-4o cost $9.79, while gpt-3.5-turbo cost $0.79.
We then compared the distributions of cosine similarity between the original answer and the LLM answer for each model.
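A sketch of how that similarity can be computed per record, reusing df from the loading sketch above and assuming a sentence-embedding model from sentence-transformers (the specific model name is just an example):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")  # example embedding model

def cosine(u, v):
    # cosine similarity between two embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

df["cosine"] = [
    cosine(model.encode(orig), model.encode(llm))
    for orig, llm in zip(df["answer_orig"], df["answer_llm"])
]

df["cosine"].hist(bins=50)  # the distribution that gets compared across models
```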

The recently released gpt-4o-mini, which is cheaper than even gpt-3.5-turbo, cost only $0.28 for the same run despite being a smarter model than gpt-3.5, and its cosine similarity distribution looks close to gpt-4o's.


So gpt-4o-mini seems comparable to gpt-4o while being way cheaper.

The next topic was about using an LLM as a judge for answers

We developed two prompts:


1. When we have the original answer:

"""

You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer compared to the original answer provided.
Based on the relevance and similarity of the generated answer to the original answer, you will classify
it as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Original Answer: {answer_orig}
Generated Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the original
answer and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
"""

and


2. When we want to do online (in prod-like) evaluation:

"""

You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
"""


Here is one example case with an actual question and answer filled in:

"""

You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer compared to the original answer provided.
Based on the relevance and similarity of the generated answer to the original answer, you will classify
it as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Original Answer: Scikit-learn offers another way: precision_recall_fscore_support
Example:
from sklearn.metrics import precision_recall_fscore_support
precision, recall, fscore, support = precision_recall_fscore_support(y_val, y_val_pred, zero_division=0)
(Gopakumar Gopinathan)
Generated Question: What is the syntax for using precision_recall_fscore_support in Python?
Generated Answer: The syntax for using `precision_recall_fscore_support` in Python is as follows:

```python
from sklearn.metrics import precision_recall_fscore_support
precision, recall, fscore, support = precision_recall_fscore_support(y_val, y_val_pred, zero_division=0)
```

Please analyze the content and context of the generated answer in relation to the original
answer and provide your evaluation in parsable JSON without using code blocks:

{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}
"""

and GPT's response:

Prompt engineering is important here: GPT can sometimes return data in a format other than the one we need, so finding a prompt that produces consistently parsable output is key.
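A sketch of how the judge call and the parsing step might look, using the OpenAI client; llm_judge is a hypothetical helper name, and the fallback branch is one way to handle responses that are not valid JSON:

```python
import json
from openai import OpenAI

client = OpenAI()

def llm_judge(prompt: str) -> dict:
    # send the evaluation prompt and try to parse the JSON verdict
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    raw = response.choices[0].message.content
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # the model occasionally adds extra text around the JSON
        return {"Relevance": "UNKNOWN", "Explanation": raw}
```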

We can turn the evaluations into a DataFrame for further analysis and visualisation.
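For example (evaluations is assumed to be the list of parsed JSON verdicts from the judge above):

```python
import pandas as pd

df_eval = pd.DataFrame(evaluations)          # columns: Relevance, Explanation
print(df_eval["Relevance"].value_counts())   # counts of RELEVANT / PARTLY_RELEVANT / NON_RELEVANT
```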


One example NON_RELEVANT case was classified that way because answer_llm does not answer the question and is not close to the original answer.


Then we set up a simple chatbot UI using Streamlit. The conversation data is saved into a database for further analysis and monitoring, and the user feedback (+1 or -1 clicked in Streamlit) is stored in its own feedback table.
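A rough sketch of that flow, using sqlite3 purely to keep the example self-contained (rag() is a stand-in for the RAG pipeline from earlier modules, and the real app's schema may well differ):

```python
import sqlite3
import streamlit as st

# sqlite used here only so the sketch runs without extra setup
conn = sqlite3.connect("monitoring.db")
conn.execute("""CREATE TABLE IF NOT EXISTS conversations
                (id INTEGER PRIMARY KEY, question TEXT, answer TEXT)""")
conn.execute("""CREATE TABLE IF NOT EXISTS feedback
                (id INTEGER PRIMARY KEY, conversation_id INTEGER, feedback INTEGER)""")

question = st.text_input("Ask a question about the course")
if st.button("Ask") and question:
    answer = rag(question)  # hypothetical RAG pipeline from earlier modules
    st.write(answer)
    cur = conn.execute("INSERT INTO conversations (question, answer) VALUES (?, ?)",
                       (question, answer))
    conn.commit()
    st.session_state["conv_id"] = cur.lastrowid  # remember which row to attach feedback to

col1, col2 = st.columns(2)
if col1.button("+1") and "conv_id" in st.session_state:
    conn.execute("INSERT INTO feedback (conversation_id, feedback) VALUES (?, 1)",
                 (st.session_state["conv_id"],))
    conn.commit()
if col2.button("-1") and "conv_id" in st.session_state:
    conn.execute("INSERT INTO feedback (conversation_id, feedback) VALUES (?, -1)",
                 (st.session_state["conv_id"],))
    conn.commit()
```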

The last bit of the module is related to Grafana, but I did not get to cover that today.


That is all for today!

See you tomorrow :) 
