Robots are now just as good at transcribing speech as humans.
In a paper published yesterday, a team of Microsoft engineers in the Artificial Intelligence and Research division reported that their speech recognition system reached a word error rate (WER) of 5.9 percent, roughly on par with human transcribers.
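For readers unfamiliar with the metric: word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. The function name and sample sentences are made up for illustration, and it is not the scoring setup used in the Microsoft paper, which this article does not describe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch: real evaluations usually normalize text
    (case, punctuation, contractions) before scoring; this does not.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a six-word reference gives 2 / 6.
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

By this definition, a WER of 5.9 percent works out to roughly six word errors for every 100 words in the reference transcript.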
“We’ve reached human parity,” said Xuedong Huang, the company’s chief speech scientist. “This is a historic achievement.”
The milestone, which follows decades of testing, comes on the heels of last month’s ‘close but no cigar’ WER of 6.3 percent and figures to have wide-reaching implications as the battle for digital assistant supremacy heats up. Cortana, Xbox, and Windows could see the biggest initial impact.
To achieve these levels of accuracy, researchers employed deep neural networks trained on large amounts of data, called training sets, that teach a system to recognize patterns in human input such as sounds and images.
Researchers want to be clear that parity is far from perfection. In this case, it just means the system is as good as humans, and humans are far from flawless.
Moving forward, the team hopes to achieve even higher levels of accuracy and to ensure that speech recognition works better in real-world situations, such as in noisy restaurants, on crowded streets, and in powerful winds. In the future, the team dreams of a system that will not just recognize speech, but truly understand it.
We’re still a ways off, but the future is a world where we no longer have to understand computers; they have to understand us.