Google’s “speech-to-text” engine is at the heart of what has become one of the coolest features of the last few years. Now, with Android apps and even newer web applications that use the x-webkit-speech attribute, we can basically “talk” to our computers. This feature lets our computers and smartphones do great things, such as giving us directions to the local movie theater or restaurant, or translating what we say into another language in real time.
The more familiar technology, text-to-speech (often abbreviated as TTS), has been in common use for many years. It creates a speaking voice by converting written text into intelligible speech through a speech synthesizer. Its complement, speech-to-text, also known as speech recognition, entered most people’s lives only recently; it attempts to convert the audio produced by our voices into intelligible phrases. I write “attempts to” because, as is usually the case with emerging technologies, it does not function perfectly yet.
When I was a child, we used to play a game called “telephone” with several participants: the first participant would recite a phrase, and the phrase would be repeated from one participant to the next until the last one stated what he or she had heard. Usually the phrase the last participant recited was rather different from the original, a real sign that even our own brains are not perfect when it comes to interpreting spoken words.
I am American, English is my mother tongue, and I also speak Italian, though with an accent. Surprisingly, Google Translate’s speech-to-text understands my Italian better than it understands my English.
How is it possible that Google Translate understands my voice better in Italian when I use speech-to-text?
The answer lies in the intricacy of the English language itself. Because English has absorbed words from so many sources, such as old German, Latin, and French, it has many words that sound alike. Caress and cherish, for example, are not homophones, but they sound similar and have related yet distinct meanings. The two words come from the same Latin root, carus, yet over many years they passed through several languages, such as Italian and Old French, which changed their spelling, pronunciation, and meaning.
The famous word ghoti was also created to show just how complicated English pronunciation can be. It is a made-up respelling of fish, using gh as in tough, o as in women, and ti as in nation. Google’s speech-to-text engine does not have a problem with my English; it has a problem with English.
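To make the ghoti trick concrete, here is a minimal Python sketch of how the respelling is assembled. The names `GHOTI_PARTS` and `respell` are my own, invented purely for illustration, and the "sounds" are written as ordinary letters rather than proper phonetic symbols.

```python
# Each entry: (letters used in "ghoti", the sound they stand for,
# an English word in which those letters make that sound).
# The sound labels are informal spellings, not phonetic notation.
GHOTI_PARTS = [
    ("gh", "f", "tough"),
    ("o", "i", "women"),
    ("ti", "sh", "nation"),
]


def respell(parts):
    """Join the borrowed letters and the sounds they represent."""
    spelling = "".join(letters for letters, _, _ in parts)
    sounds = "".join(sound for _, sound, _ in parts)
    return spelling, sounds


spelling, sounds = respell(GHOTI_PARTS)
print(spelling, "is pronounced like", sounds)  # ghoti is pronounced like fish
```

The same three letter groups make entirely different sounds in other words, which is exactly why a listener, human or machine, cannot rely on English spelling alone.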
Italian, ahh, beautiful Italian, the language often used in opera, is an entirely different story. The alphabet has only 21 letters, and the language has far fewer distinct sounds. Because it derives mainly from one language, Latin, far fewer words sound the same. Officially, there are no diphthongs, and every word is pronounced exactly as it is written; there is almost no debate about the correct way of saying a word, excluding dialects and accents.
If a speech-to-text program hears the Italian word internazionalizzazione, it can only mean one thing: internationalization. Beyond that, even short words such as beh, ahimè, cioè, and ciò have clearly different sounds.
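The difference can be pictured with a toy lookup table. This is only a sketch of the idea, not how Google's engine or any real recognizer works: the lexicon, the rough "heard" spellings used as keys, and the function names are all hypothetical and chosen just for illustration.

```python
# Hypothetical toy lexicon: each key is a rough spelling of what was
# "heard", each value lists the written words that could match it.
TOY_LEXICON = {
    "en": {
        "too": ["to", "too", "two"],
        "thair": ["their", "there", "they're"],
    },
    "it": {
        "internazionalizzazione": ["internazionalizzazione"],
    },
}


def candidates(lang, heard):
    """Return every written word that matches the heard sound."""
    return TOY_LEXICON.get(lang, {}).get(heard, [])


def is_ambiguous(lang, heard):
    """A sound is ambiguous when more than one word matches it."""
    return len(candidates(lang, heard)) > 1


print(is_ambiguous("en", "too"))
print(is_ambiguous("it", "internazionalizzazione"))
```

In this toy setup the English sound leaves the program choosing among several words, while the Italian one points to a single word, which is the situation I am describing above.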
In conclusion, despite my best attempts to speak Italian like a native, sometimes I just cannot pronounce words with 100% accuracy. Take the “gl” sound in the name of the Sardinian city of Cagliari: it is pronounced with a “y” sound, roughly cal-yeah-ree, and despite years of practice, I may just never get it right.
Google’s speech-to-text engine understands me though.