Meta has recently published its new transcription model Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages.

However, I am somewhat sceptical about it, particularly given Whisper's surprisingly poor performance in their Table 5.

In essence, the model processes the raw waveform with CNN layers (learned filters) rather than feeding it spectrograms. Why did they make this choice? What are the advantages of learned filters over hand-crafted spectrogram features?
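
To make sure I understand the difference, here is roughly how I picture the two frontends. This is my own toy sketch, not Meta's code, and the CNN configuration is just a guess at a wav2vec-2.0-style feature encoder:

```python
import torch
import torch.nn as nn
import torchaudio

waveform = torch.randn(1, 16000)  # one second of fake 16 kHz audio

# (a) Fixed spectrogram frontend (Whisper-style): a hand-designed mel filterbank
# turns the waveform into an 80-bin time-frequency representation.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)
spec_features = mel(waveform)  # shape: (1, 80, frames)

# (b) Learned CNN frontend (wav2vec-2.0-style guess): a stack of strided 1-D
# convolutions whose filters are trained jointly with the rest of the model,
# so the "filterbank" itself is learned from data instead of being fixed.
cnn_frontend = nn.Sequential(
    nn.Conv1d(1, 512, kernel_size=10, stride=5), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),
)
learned_features = cnn_frontend(waveform.unsqueeze(0))  # shape: (1, 512, frames)

print(spec_features.shape, learned_features.shape)
```

Is the advantage simply that the learned filters can adapt to whatever acoustic cues matter across so many languages, while a fixed mel filterbank cannot?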

I have also read that the model does not predict words but instead outputs characters. I am wondering whether the model has actually learned anything about language, or whether the network has simply learned to map sounds to letters, which would mean it has essentially no understanding of context. It seems to me that the model's strength lies in handling niche, low-resource languages rather than the standard high-resource ones, especially because I find it hard to imagine any language-specific feature surviving when a single model covers 1600+ languages.
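
To make that question concrete, this is how I picture purely character-level decoding with no word or language model involved. Again, this is my own toy sketch, not Meta's code; the tiny vocabulary and the greedy CTC-style collapse are just assumptions for illustration:

```python
import torch

vocab = ["<blank>", " ", "a", "c", "t"]      # hypothetical tiny character set
logits = torch.randn(20, len(vocab))         # fake per-frame scores from an encoder

frame_ids = logits.argmax(dim=-1).tolist()   # greedy best character per frame

# CTC-style collapse: merge repeated frames, then drop blanks.
decoded = []
prev = None
for i in frame_ids:
    if i != prev and i != 0:
        decoded.append(vocab[i])
    prev = i

# With random logits the output is meaningless; the point is that text emerges
# purely from per-frame character choices, with no explicit notion of words.
print("".join(decoded))
```

If the model really works like this, is any notion of context only coming from the encoder's receptive field rather than from anything resembling a language model?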
