Google introduces Translatotron 2: An End-to-End Speech-to-Speech Translation Model

Google introduced Translatotron 2, a new version of its Translatotron model, in July 2021.

Speech translation systems have developed rapidly over the last several decades, with one goal in mind: helping people around the globe communicate with each other despite language differences.

In April 2019, Google introduced the original Translatotron in the paper “Direct speech-to-speech translation with a sequence-to-sequence model.” Now, two years later, Google has released a new version, taking the next step in machine learning and the translation industry.

Translatotron 2

Last month, in July 2021, Google released the paper “Translatotron 2: Robust direct speech-to-speech translation,” which describes experiments in which Translatotron 2 outperformed the original Translatotron.

Translatotron 2 comprises three main parts, connected by an attention module:

1. A source speech encoder

2. A target phoneme decoder

3. A target mel-spectrogram synthesizer.
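
The data flow through these three parts can be sketched in a few lines of Python. This is only a toy illustration, not Google's implementation: the function names, the tiny phoneme inventory, the number of mel bins, and the arithmetic inside each function are all hypothetical stand-ins for what are, in the real model, trained neural networks connected by attention.

```python
from typing import List, Tuple

Frame = List[float]  # one feature vector per time step

PHONEMES = ["o", "l", "a"]  # hypothetical toy target-phoneme inventory
N_MELS = 4                  # hypothetical number of mel bins (real systems use many more)


def speech_encoder(source_frames: List[Frame]) -> List[Frame]:
    """Stand-in encoder: turns each source-speech frame into a hidden vector."""
    return [[0.5 * x for x in frame] for frame in source_frames]


def phoneme_decoder(encoded: List[Frame]) -> List[int]:
    """Stand-in decoder: emits one target-phoneme index per encoded frame."""
    return [int(sum(abs(v) for v in frame) * 10) % len(PHONEMES) for frame in encoded]


def mel_synthesizer(phoneme_ids: List[int], encoded: List[Frame]) -> List[Frame]:
    """Stand-in synthesizer: builds one mel-spectrogram frame per phoneme.

    It receives the encoder output as well as the phonemes, so in this toy
    the output depends on the source speech and not only on the phonemes.
    """
    mel = []
    for pid, frame in zip(phoneme_ids, encoded):
        energy = sum(frame) / len(frame)
        mel.append([energy + 0.1 * pid * b for b in range(N_MELS)])
    return mel


def translate(source_frames: List[Frame]) -> Tuple[List[int], List[Frame]]:
    """Run the three stages in order: encoder, phoneme decoder, synthesizer."""
    encoded = speech_encoder(source_frames)
    phoneme_ids = phoneme_decoder(encoded)
    return phoneme_ids, mel_synthesizer(phoneme_ids, encoded)


source = [[0.2, 0.4], [0.6, 0.8], [1.0, 1.2]]  # made-up source features
phoneme_ids, mel = translate(source)
print([PHONEMES[i] for i in phoneme_ids])
print(len(mel), "frames x", len(mel[0]), "mel bins")
```

The point of the sketch is only the ordering: source speech goes into the encoder, the decoder predicts target phonemes, and the synthesizer produces the target mel-spectrogram, which a vocoder would then turn into audio.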

The model is trained jointly on two objectives: speech-to-speech translation and speech-to-phoneme translation.
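
As a rough sketch of what training on two objectives at once can look like, the snippet below adds a spectrogram loss and a phoneme loss into a single value. The specific loss functions (mean squared error and cross-entropy), the `phoneme_weight` parameter, and all the numbers are assumptions chosen for illustration; they are not taken from the Translatotron 2 paper.

```python
import math
from typing import Dict, List


def spectrogram_loss(predicted: List[List[float]], target: List[List[float]]) -> float:
    """Mean squared error over all frames and mel bins (an assumed loss)."""
    errors = [(p - t) ** 2
              for p_frame, t_frame in zip(predicted, target)
              for p, t in zip(p_frame, t_frame)]
    return sum(errors) / len(errors)


def phoneme_loss(predicted_probs: List[Dict[str, float]], target: List[str]) -> float:
    """Average cross-entropy of the correct phoneme at each step (an assumed loss)."""
    return -sum(math.log(step[ph]) for step, ph in zip(predicted_probs, target)) / len(target)


def joint_loss(spec: float, phon: float, phoneme_weight: float = 1.0) -> float:
    """Combine both objectives; phoneme_weight is a hypothetical hyperparameter."""
    return spec + phoneme_weight * phon


# Made-up predictions and targets for a two-frame, one-phoneme example.
spec = spectrogram_loss([[0.0, 1.0], [2.0, 3.0]], [[0.0, 1.0], [2.0, 5.0]])
phon = phoneme_loss([{"h": 0.5, "o": 0.5}], ["h"])
total = joint_loss(spec, phon, phoneme_weight=2.0)
print(spec, phon, total)
```

Minimizing a combined value like this pushes the model to get both the output audio and the intermediate phoneme sequence right.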

One of the key differences between the two versions concerns voice retention. The original Translatotron could synthesize translations that kept the sound of the original speaker’s voice, but it could also generate speech in a different voice, making it ripe for potential misuse in, for example, deepfakes. Deepfakes are artificial media in which a person in an existing image or video is replaced with someone else, and many people copy content and replace the audio with an artificial or machine-generated voice. This trend is troubling because such fakes might sway opinion during an election or implicate a person in a crime, so Google’s initiative to curb it is commendable.

To tackle this problem, Translatotron 2 translates and synthesizes the translated audio only in the source speaker’s own voice; unlike the original, it cannot generate speech in a different speaker’s voice.

Translatotron 2 also surpasses the original Translatotron in translation quality and naturalness, and it “substantially” reduces undesired artefacts such as babbling and extended pauses.
