[Day 73] MBR and FUDGE - decoding mechanisms; pre vs post layer normalization
Hello :) Today is Day 73!

A quick summary of today: I covered lecture 6 (Generation algorithms) and lecture 9 (Experimental design and human annotation) from the CMU 11-711 Advanced NLP course, from which I found out about:

- Minimum Bayes Risk (MBR)
- FUDGE decoding
- Why pre layer normalization is better than post layer normalization in transformers

1) Minimum Bayes Risk decoding for modern generation techniques (Bertsch and Xie et al., 2023)

When we get to the generation step of a language model, where it predicts and outputs the next token in a sequence, we can use methods such as greedy decoding or beam search to select tokens that have a high probability. But MBR, which was originally proposed by Bickel and Doksum in 1977, questions whether we actually want the highest-probability token.

Outputs with low probability tend to be worse than outputs with high probability. But if we only compare the top outputs, it is less clear which one is better, as the outputs with the top probabilities might look similar (for example...