5 comments:
> You’d want parallel speech data: the same utterance spoken in English, Portuguese, Japanese, Arabic, Mandarin, and dozens more languages.
There is no such thing as parallel speech data. The idea that parallel text is a thing is dubious in the first place: Japanese, for example, has a recognized "translation tone," a voice-of-text distinct to translated Western texts. Translation between distinct human languages is a concept born of practical necessity rather than something with a concrete theoretical basis.
The Babel fish in The Hitchhiker's Guide to the Galaxy is supposed to be mind-reading. It doesn't rely on spoken utterances at all; it reads the minds of creatures within its telepathic range and feeds the language portions into its host's brain, thereby achieving zero-lag realtime translation. Whether or not Douglas Adams knew that general universal translation is impossible, the fictional fish doesn't contradict the reality that zero-lag interpretation through audio is basically impossible.
You CAN probably do parallel voice-to-voice speech if you're okay with something like a 30-second delay. But if you want no-pause, zero-delay voice-to-voice, remember that people often think as they speak, and not all languages share the same word order: you'd literally need a crystal ball that reads dice before they're rolled.
As somebody fluent in quite a few languages, I can say that language definitely affects the way one thinks about things. Translation will always be imperfect because it maps between different conceptual spaces; it is not some sort of mechanical replacement.
You've missed the point that it's already possible; you can demo it right there, on the website.
Very cool. I learned something new about why EMA (exponential moving average) is needed:
> EMA-based training dynamics like JEPA’s don’t optimize any smooth mathematical function, yet they provably converge to useful, non-collapsed representations.
All the papers say EMA avoids “representation collapse” without justifying it. Didn’t realize there were any theoretical results here.
Roughly, when you train a model to align its predictions with its own predictions in some way, the simplest "correct" solution is to output a single value for all inputs, aka representation collapse. That guarantees the predicted representations agree, which is technically what the objective asks for, but it's degenerate.
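A minimal sketch of the degeneracy (hypothetical names and shapes, just to make the point concrete): if the loss only rewards agreement between representations of two views, a constant encoder achieves zero loss while encoding nothing.

```python
import numpy as np

rng = np.random.default_rng(0)

def alignment_loss(encoder, x1, x2):
    """Mean squared distance between representations of two views."""
    z1, z2 = encoder(x1), encoder(x2)
    return float(np.mean((z1 - z2) ** 2))

# Two "views" of diverse inputs (e.g. two augmentations of the same batch).
x1 = rng.normal(size=(64, 16))
x2 = x1 + 0.1 * rng.normal(size=(64, 16))

# A collapsed encoder: ignores the input, outputs one constant vector.
collapsed = lambda x: np.ones((x.shape[0], 8))

# A non-collapsed baseline for comparison: a random linear map.
W = rng.normal(size=(16, 8))
linear = lambda x: x @ W

print(alignment_loss(collapsed, x1, x2))  # exactly 0.0: perfect agreement, useless representation
print(alignment_loss(linear, x1, x2))     # positive: informative, yet penalized by the raw objective
```

The collapsed encoder "wins" under this objective, which is exactly why something extra (like an EMA target) is needed.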
EMA helps because the target network changes more slowly than the learning network, which prevents rapid collapse by forcing predictions to align with what a historical average of the model would predict. That's a harder and more informative task: the model can't trivially output one value and have it match the EMA target, so it learns more useful representations.
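The slow-target mechanism is just an exponential moving average over parameters. A sketch (the function name and the tau value of 0.996 are illustrative, though tau near 1 is typical):

```python
def ema_update(target_params, online_params, tau=0.996):
    """Move each target parameter a small step toward the online network.

    tau close to 1.0 means the target changes slowly, so the prediction
    target drifts gradually instead of chasing the online network directly.
    """
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

# Toy example with scalar "parameters": even after 100 updates toward a
# fixed online value of 1.0, the target still lags well behind it.
target = [0.0]
online = [1.0]
for _ in range(100):
    target = ema_update(target, online)
print(target[0])  # roughly 1 - 0.996**100, i.e. about a third of the way there
```

That lag is the point: the target is a weighted average of the whole training history, so the learning network can't collapse it instantly.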
EMA has a long history in deep learning (many GANs use it, TD-learning methods like DQN, many JEPA papers, etc.), so authors often omit a defense of it out of over-familiarity, or sometimes cargo-culting. :)