[Day 138] Fine-tuning SpeechT5 using a very small Glaswegian dataset

5/19/2024 12:46:00 am

Hello :)
Today is Day 138!

A quick summary of today:
Fine-tuned Microsoft's SpeechT5 to create a Glaswegian TTS model and published it on huggingface (using our project's dataset).

I started the day by meeting with my collaborator (as I am writing this I realised I forgot to ask them if I can mention them by name today ...) for an hour to discuss my progress last week on the Scottish phrases audio clips and how we can move forward.

After the short clips, the next step could be transcribing podcasts, or having him or his friends do voice recordings for our dataset, because transcribing podcasts might be too much manual labour.

During the call M asked me how the whole model pipeline works, including how the text and audio get embedded. After the meeting I decided to dig a bit and see how we would actually use a good dataset *once* we have one.

I found this course from huggingface that introduces working with audio data. One of the units is specifically about TTS. From there I learned about a library called TTS and tried to transform the data we have so far so I could feed it into one of their models, but with no success. Then I found a public colab notebook from the same library that involves the basic steps of uploading .wav files -> clicking a create dataset button -> clicking a train model button -> clicking a generate button. It used gradio, so it was a kind of low-code demo solution. I did not completely like it, because the generated audio was slow (probably data problems) and it was not something I could easily share on huggingface or another public platform.

Before training any kind of model I wanted to see what it's like to upload a dataset to huggingface. Using their docs I finally managed to push a simple version. As for the data, instead of repeating each phrase 3 times (as in the original youtube videos) I included each phrase only once - mainly for simplicity. After some hurdles ~

I did it ^^
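
To make the "each phrase only once" and upload steps a bit more concrete, here is a minimal sketch, not the exact code I ran. It assumes the clips sit in one folder and that the dataset is loaded with the `audiofolder` loader, which reads a `metadata.csv` with a `file_name` column. The helper names, file names, phrases and repo id are made up for illustration.

```python
import csv
from pathlib import Path


def unique_phrases(clips):
    """Keep only the first clip for each phrase (ignoring case and extra spaces)."""
    seen = set()
    kept = []
    for file_name, text in clips:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append((file_name, text.strip()))
    return kept


def write_metadata(clips, data_dir):
    """Write metadata.csv next to the .wav files, in the layout audiofolder reads."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / "metadata.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "transcription"])
        writer.writerows(clips)
    return path


def push_dataset(data_dir, repo_id):
    """Upload the folder as a dataset. Needs `datasets` installed and a logged-in token."""
    from datasets import load_dataset  # lazy import so the helpers above run without it

    dataset = load_dataset("audiofolder", data_dir=str(data_dir))
    dataset.push_to_hub(repo_id)  # repo_id is hypothetical, e.g. "your-name/glaswegian-audio"
```

Typical use would be `write_metadata(unique_phrases(clips), "clips/")` followed by `push_dataset("clips/", repo_id)`, where `clips` is a list of `(file_name, transcription)` pairs.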

Finally, I randomly found this notebook that fine-tunes SpeechT5 using huggingface's API.

It goes through cleaning and preparing the dataset (I actually finished the dataset's upload to huggingface after seeing that this notebook allows for that), tokenization, training (which took about 1 hour), and then evaluating.
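
As a rough idea of what the text cleaning part can look like: the SpeechT5 tokenizer works on characters, so one typical check is to find characters in the transcriptions that the tokenizer does not know and replace them. The sketch below uses a toy vocabulary and made-up phrases. In a real run the vocabulary would come from the tokenizer (for example via `processor.tokenizer.get_vocab()`), and the replacement list is my assumption, not the notebook's.

```python
def unsupported_characters(texts, vocabulary):
    """Return characters used in the texts that are missing from the vocabulary.

    Spaces are ignored because the tokenizer handles them separately.
    """
    dataset_chars = set("".join(texts))
    return dataset_chars - set(vocabulary) - {" "}


def clean_text(text, replacements):
    """Apply (source, target) replacements in order."""
    for source, target in replacements:
        text = text.replace(source, target)
    return text


# Toy vocabulary and made-up phrases, for illustration only.
vocabulary = set("abcdefghijklmnopqrstuvwxyz'")
texts = ["whit’s fur ye", "awa’ an bile yer heid"]
replacements = [("’", "'")]  # curly apostrophe -> plain apostrophe

missing = unsupported_characters(texts, vocabulary)
cleaned = [clean_text(t, replacements) for t in texts]
print(missing)
print(cleaned)
```

After cleaning, running `unsupported_characters` again should return an empty set. If it does not, the replacement list needs more entries.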

After some trial and error, I managed to push a fine-tuned model to huggingface ^^

Sadly, the inference widget's Compute button on the model page generates an error, but when I used Python in a notebook, I managed to get a clip out.
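
For reference, getting a clip out in Python can look roughly like the sketch below. It is not my exact notebook code. It assumes the `transformers` SpeechT5 classes, the `microsoft/speecht5_hifigan` vocoder, the base `microsoft/speecht5_tts` processor (fine-tuning does not change the tokenizer), and a speaker embedding tensor of shape (1, 512) that you supply yourself. `model_id` stands for whatever fine-tuned checkpoint you pushed, and `save_wav` is a small helper I added for illustration.

```python
import struct
import wave


def synthesize(text, model_id, speaker_embedding):
    """Generate speech samples with a SpeechT5 checkpoint (heavy imports are lazy)."""
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained(model_id)
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text=text, return_tensors="pt")
    # speaker_embedding: torch tensor of shape (1, 512), e.g. an x-vector
    speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
    return speech.tolist()


def save_wav(samples, path, sample_rate=16000):
    """Write float samples in [-1, 1] as 16-bit mono PCM (SpeechT5 outputs 16 kHz audio)."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
```

A call would then be something like `save_wav(synthesize("some phrase", model_id, speaker_embedding), "clip.wav")`.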

Important note! The dataset is terribly small, 118 audio clips, with a total of 250 seconds haha. But it is a nice POC of how we can do it after we have a sensible dataset.
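
To put those numbers in perspective, 250 seconds over 118 clips is only about 2 seconds per clip. Below is a small sketch for checking this kind of thing directly from a folder of clips. It assumes plain PCM .wav files (which is what Python's built-in `wave` module can read), and the function names are just mine.

```python
import wave
from pathlib import Path


def wav_seconds(path):
    """Duration of one PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def summarize(folder):
    """Return (number of clips, total seconds, mean seconds per clip)."""
    durations = [wav_seconds(p) for p in sorted(Path(folder).glob("*.wav"))]
    total = sum(durations)
    mean = total / len(durations) if durations else 0.0
    return len(durations), total, mean


# The figures quoted above: 250 seconds spread over 118 clips.
print(round(250 / 118, 2))  # 2.12
```
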
At the end of the notebook there is a way to make some kind of a demo with gradio on huggingface, and I finally managed to get it to work!!! :party:

As a final note, when I woke up I read a bit more of Grokking ML - about metrics for classification models and the naive Bayes model.

That is all for today!

See you tomorrow :)
