Microsoft sets a record in speech recognition technology

Microsoft recently announced that it has reached a new milestone in the speech recognition technology being developed in its labs. According to the company, its best single system has achieved a word error rate (WER) of just 6.3%, which is a world record.
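For readers unfamiliar with the metric, WER is commonly computed as the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; the function name and the simple whitespace tokenisation are assumptions made for illustration, and it is not Microsoft's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level edit distance / number of reference words."""
    ref = reference.split()  # assumption: plain whitespace tokenisation
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

By this definition, a transcript that gets one word wrong in a 16-word sentence scores a WER of 6.25%, which is roughly the level of accuracy being reported here.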

Just last week, at the Interspeech conference, IBM announced that it had achieved a WER of 6.6%, but in the rapidly changing world of technology, even that figure is now passé.

“This new milestone benefited from a wide range of new technologies developed by the AI community from many different organizations over the past 20 years,” said Xuedong Huang, Microsoft’s chief speech scientist.

Some analysts also believe these technologies could soon reach a point where computers understand the words people are saying as well as another human would. That aligns with Microsoft’s strategy of providing more personal computing interfaces through technologies like Cortana and Skype; only last week, Microsoft CEO Satya Nadella focused on making bots come to life.

The speech research is also significant to Microsoft’s overall artificial intelligence (AI) strategy of providing systems that can anticipate users’ needs instead of responding to their commands, and to the company’s ambitions for providing intelligent systems that can augment how humans work in real life.

“It’s a simple concept, yet it’s very powerful in its impact. It is about taking the power of human language and applying it more pervasively to all of our computing,” said Satya Nadella.

Another major contributor to this success is the development of Microsoft’s Computational Network Toolkit (CNTK). The toolkit implements optimizations that enable deep learning algorithms to run faster than before. A key step forward was a breakthrough in parallel training on graphics processing units, or GPUs.

Although GPUs were initially designed for computer graphics, researchers have found that they can also be ideal for processing complex algorithms like the ones used to understand speech.

AnkitChawla@TWC