Meta has recently published its new transcription model, "Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages".
However, I am somewhat sceptical about it, particularly given Whisper's poor performance in Table 5.
In essence, it processes the raw audio with CNN layers rather than converting it to spectrograms first. Why did they decide to do so? What are the advantages of learned filters over hand-crafted spectrogram features?
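To make the contrast concrete, here is a minimal sketch of the two kinds of frontend. This is not Meta's code; the layer sizes are my own, loosely wav2vec 2.0-style, just to show the difference between a fixed filterbank and learned filters.

```python
import torch
import torch.nn as nn
import torchaudio

waveform = torch.randn(1, 16000)  # one second of fake 16 kHz audio

# Fixed frontend: hand-designed mel filterbank, no trainable parameters.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
spec_features = mel(waveform)  # shape: (1, 80, frames)

# Learned frontend: strided 1D convolutions applied to the raw waveform.
# The filters are parameters trained end to end with the rest of the model,
# so they can adapt to whatever acoustic cues help the task.
conv_frontend = nn.Sequential(
    nn.Conv1d(1, 512, kernel_size=10, stride=5), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),
)
conv_features = conv_frontend(waveform.unsqueeze(1))  # shape: (1, 512, frames')
```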
I have also read that the model does not predict words but simply 'hears' characters. I am wondering whether the model has actually learned anything about language, or whether the CNN has simply learned to map sounds to letters, which would mean it has no real understanding of context. It seems to me that the model's strength lies in working with specialized languages rather than standard ones, especially since I can't imagine any language-specific feature surviving across 1600 languages.
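For reference, this is roughly what I understand by 'hearing characters': a generic character-level, CTC-style greedy decoding sketch. The alphabet and logits are made up, and I am not claiming this is how Omnilingual ASR actually decodes.

```python
import torch

# Hypothetical character inventory; a real model would cover far more symbols.
chars = ["<blank>", " ", "c", "a", "t"]

# Fake per-frame logits over the character set (8 frames, 5 symbols).
logits = torch.randn(8, len(chars))

# Greedy CTC-style decoding: argmax per frame, collapse repeats, drop blanks.
ids = logits.argmax(dim=-1).tolist()
decoded = []
prev = None
for i in ids:
    if i != prev and i != 0:  # index 0 is the blank symbol
        decoded.append(chars[i])
    prev = i
print("".join(decoded))

# Any word-level knowledge (e.g. that "cat" is a real word) would have to come
# from the encoder's context or an external language model, not from this step.
```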