Meta has recently published its new transcription model Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages.

However, I am somewhat sceptical about it, particularly given Whisper's surprisingly poor performance in their Table 5.

In essence, the model processes the raw waveform with CNN layers (learned filters) rather than feeding it spectrograms. Why did they make this choice? What are the advantages of learned filters over hand-crafted spectrogram features?
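
To make sure I understand the difference, here is roughly how I picture the two frontends. This is my own toy sketch, not Meta's code, and the CNN configuration is just a guess at a wav2vec-2.0-style feature encoder:

```python
import torch
import torch.nn as nn
import torchaudio

waveform = torch.randn(1, 16000)  # one second of fake 16 kHz audio

# (a) Fixed spectrogram frontend (Whisper-style): a hand-designed mel filterbank
# turns the waveform into an 80-bin time-frequency representation.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)
spec_features = mel(waveform)  # shape: (1, 80, frames)

# (b) Learned CNN frontend (wav2vec-2.0-style guess): a stack of strided 1-D
# convolutions whose filters are trained jointly with the rest of the model,
# so the "filterbank" itself is learned from data instead of being fixed.
cnn_frontend = nn.Sequential(
    nn.Conv1d(1, 512, kernel_size=10, stride=5), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),
)
learned_features = cnn_frontend(waveform.unsqueeze(0))  # shape: (1, 512, frames)

print(spec_features.shape, learned_features.shape)
```

Is the advantage simply that the learned filters can adapt to whatever acoustic cues matter across so many languages, while a fixed mel filterbank cannot?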

I have also read that the model does not predict words but instead outputs characters. I am wondering whether the model has actually learned anything about language, or whether the network has simply learned to map sounds to letters, which would mean it has essentially no understanding of context. It seems to me that the model's strength lies in handling niche, low-resource languages rather than the standard high-resource ones, especially because I find it hard to imagine any language-specific feature surviving when a single model covers 1600+ languages.
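
To make that question concrete, this is how I picture purely character-level decoding with no word or language model involved. Again, this is my own toy sketch, not Meta's code; the tiny vocabulary and the greedy CTC-style collapse are just assumptions for illustration:

```python
import torch

vocab = ["<blank>", " ", "a", "c", "t"]      # hypothetical tiny character set
logits = torch.randn(20, len(vocab))         # fake per-frame scores from an encoder

frame_ids = logits.argmax(dim=-1).tolist()   # greedy best character per frame

# CTC-style collapse: merge repeated frames, then drop blanks.
decoded = []
prev = None
for i in frame_ids:
    if i != prev and i != 0:
        decoded.append(vocab[i])
    prev = i

# With random logits the output is meaningless; the point is that text emerges
# purely from per-frame character choices, with no explicit notion of words.
print("".join(decoded))
```

If the model really works like this, is any notion of context only coming from the encoder's receptive field rather than from anything resembling a language model?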
