


The science that goes into making machines talk like humans is very complex because our speech patterns are so nuanced. So it's not surprising that it has taken well over 200 years for AI voices to get from the first speaking machine, which was able to simulate only a few recognizably human utterances, to a Samuel L. Jackson voice clone delivering the weather report on Alexa today. Talking assistants like Siri, Google Assistant, and Alexa now sound more human than before, and thanks to advances in AI, we've reached a point where it's sometimes difficult to distinguish synthetic voices from real ones.

Despite such innovative developments, disparities still exist in these text-to-speech (TTS) systems, specifically in how certain accents and words in particular languages are delivered. For example, voice AI mispronounces words in many Asian, African, and Latin languages. In fact, researchers at Stanford University recently published a study in which they reported findings from testing five automatic speech recognition systems: the average word error rate for white subjects was 19 percent, compared to 35 percent for black subjects speaking in African-American Vernacular English.

So, why does such bias still exist in TTS systems? While there are numerous possible reasons, a major cause boils down to the bias that prevails in the algorithms powering these TTS services. Many TTS systems are trained on limited sets of data, making them less effective. For example, the most commonly used dataset for TTS is LibriSpeech, a corpus of approximately 1,000 hours of 16 kHz read English speech.
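To get a feel for what that corpus looks like, here is a minimal sketch that loads a small LibriSpeech split with torchaudio and inspects one utterance. The directory path and choice of the "dev-clean" split are illustrative assumptions, not part of the study or the article; the download itself is a few hundred megabytes.

```python
import os
import torchaudio

# Fetch the small "dev-clean" split of LibriSpeech into ./data (illustrative path).
os.makedirs("./data", exist_ok=True)
dataset = torchaudio.datasets.LIBRISPEECH(root="./data", url="dev-clean", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate)     # 16000 -- LibriSpeech audio is 16 kHz, as noted above
print(transcript)      # the read-speech text paired with this utterance
print(waveform.shape)  # (channels, samples)
```

Browsing a few transcripts this way makes the limitation concrete: the corpus is read English from audiobooks, so accents and vocabulary outside that range are simply underrepresented.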

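And to make the 19 percent versus 35 percent figures above concrete, word error rate (WER) is the standard metric behind them: the number of word substitutions, insertions, and deletions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. A minimal sketch in plain Python, with a made-up example sentence:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Two edits (one substitution, one deletion) over five reference words -> 0.4
print(word_error_rate("the weather is sunny today", "the weather be sunny"))
```

A WER of 35 percent means roughly one word in three is wrong, which is the kind of gap the Stanford study measured between speaker groups.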