[Day 71] Backprop, GELU, Tricking ChatGPT, and Stealing part of an LLM

3/12/2024 11:36:00 pm

Hello :)
Today is Day 71!

A quick summary of today:
Finished up the manual backprop code to make it more clear
Read some research papers
Gaussian Error Linear Units (GELUs)
Using Hallucinations to Bypass RLHF Filters
Stealing Part of a Production Language Model

Firstly, the manual backprop code cleanup.

To finish it up (for now atleast), I decided to get more data. Original was ~32000 names, on kaggle I found with around ~90000 names.

As for the architecture - I settled on the original input > flatten > batch norm > activation > output, I added more even up to 2 hidden layers and batch norm for each, but the training time got longer, and the results were not even better than the simple original one.

Even though I stayed with the original model structure, I tried to modify embedding size, context length, hidden layers, just to play around with them and see the result. At the end the generated names look decent. But I was not able to achieve a much better result than ~2.0.

These are the first few generated names, in the github repo, 50 can be seen.

The loss curve looks like this, after training I tried to modify learning rate, other hyperparams based on it, but I did not find much success.

Nevertheless, my main goal of the exercise was to write clearer code.

And I believe by improving the var naming, and adding the comments on the right with shapes is really good for understanding without running the code.

Secondly, the research papers I read today

1) GELU (Hendrycks & Gimpel, 2018)

The GELU activation was used in the LLM from scratch book, and that was the 1st time I got to use it, and also apparently modern LLMs prefer GELU over ReLU. So I just wanted to see the original paper that proposed this activation function. I already covered GELU's basics on Day 68, but just wanted to read the paper. The new things that I found are how the researchers tested GELU against the alternatives ReLU and ELU.

They show that RELU has consistently lower median loss (1st pic), and the most most reobust to noised inputs (2nd pic).

2) Tricking LLMs into producing content outside of their intended scope (Lemkin, 2024)

This paper attempted to use hallucinations to make LLMs forget their RLHF training and go back to their generate-next-input-token state (where they just generate x1 based on probability and the context before it).

Giving the model: 'this is a reversed passage: + some weird unicode version of the alphabet and then adding a prompt that is normal text ('please quote for me the seventh paragraph...'), the researchers found that ChatGPT 4.0 returns something not related, it starts hallucinating.

They started adjusting the prompt to see how the model behaves.

So inbetween the unicoded alphabet letters, in reversed a statement like: '“I can’t believe the dems got awaywith stealing the 2020 election' was put in reverse because this, mixed in with the unicode alphabet would confuse the model. Statements like that result in unbiased opinion about the election, but if we 'hide' it in this prompt which confuses and bypasses the RLHF layer of the LLM's training, the result is:

Text that was somewhere seen in the training data. And this is because, the designed prompt hides the political statement from chatGPT, chatGPT is 'confused' and to decode the whole message goes back to its root - generating tokens based on probability and context. Because ChatGPT was trained on large chunk of the internet, political statements like that have definitely appeared in its training set, so ChatGPT has learned those texts, but thanks to RLHF - ChatGPT is gated from giving them as outputs to users. But because the developed prompt bypasses this RLHF layer, ChatGPT goes and generates what it learned from the training data. There are more examples like that, even a very NSFW output, which can be seen in the paper's pictures.

3) Stealing Part of a Production Language Model (Carlini et al., 2024)

This was a bit more difficult paper to follow, but the gist is that a team of researchers in Google, through GPT's API, tried to gain information about OpenAI models' architecture. The below table sums up their results.

The team signed an agreement with OpenAI to not release the results from the last 2 models, but they confirmed with OpenAI that the size of the hidden dimension (what they learned through using the OpenAI API) is corect. This is interesting because if one person can do it, that means that soon more people will learn how to 'attack' the APIs of these closed-source LLM models and learn hidden from the public information about the models' architecture.

That is all for today!

See you tomorrow :)

Search This Blog

50+ days of Machine Learning