In this episode we discuss voice activation technology. Most voice processing is done in the cloud. Now, there are a lot of good reasons to NOT send all of our conversations off to the cloud, but we do it anyway because it’s significantly cheaper to do it there in the cloud. But what if there were some unexpected, inexpensive alternative for doing voice processing at the edge? Now there is.
Topping the show this week, we’re going to take a look at smart devices and voice input. Voice input is becoming the preferred human-machine interface, with Siri and Alexa and Google Assistant and the like getting built into more and more devices: smartphones, cable TV remotes, and more.
Voice input is being sent to data centers, and that represents a lot of traffic on the net and a lot of volume in data centers. For voice-enabled products, having the capability means never fully powering down and instead idling in power-consuming anticipation of your next command. Voice input also creates the temptation to violate consumers’ privacy.
EE Times ran two stories this week. One is by Anne-Francoise Pele, the newest addition to our editorial staff, about a company that has developed a MEMS sensor that will detect voice cues and wake up a sleeping device almost instantaneously, with the goal of minimizing power consumption. The other is by Sally Ward-Foxton, about a company called PicoVoice that has devised a way to perform voice processing at the edge easily and inexpensively. Here’s international editor Junko Yoshida with Sally.
JUNKO YOSHIDA: This is something I have always been wondering about: when are we moving from this natural-language-processing-in-the-cloud model to voice inference on edge devices? So I guess this is exactly that case, is it?
SALLY WARD-FOXTON: Right. Yeah. So if we had an appliance or something, maybe a coffee maker or a fridge, and we used it 10 times a day and processed that voice in the cloud, at today’s rates that would apparently cost around $15 a year per device. That's quite a lot for the appliance manufacturer over the lifetime of the device and across the number of appliances. So that would have to be balanced against how many of these expensive coffee capsules you can sell for your coffee maker or whatever.
The point is, if it’s some kind of smart appliance where you've already got a small CPU in there, now you might not need to use the cloud at all. You can use the compute power you've already got in the device.
For these cloud companies, Amazon and Google, the cost might not be $15 a year, since it’s their own cloud service that they are using. But yeah, with Amazon's business model, it helps them to sell things, and they make that money back, right?
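The cost figures quoted above can be checked with a little arithmetic. The sketch below takes the two numbers from the conversation (10 uses a day, about $15 a year per device) and derives the implied cost per request. The device lifetime and fleet size are purely hypothetical values chosen for illustration; neither appears in the conversation.

```python
# Back-of-envelope check of the cloud voice-processing cost quoted above.
USES_PER_DAY = 10          # from the conversation
COST_PER_YEAR_USD = 15.0   # from the conversation ("around $15 a year per device")
DAYS_PER_YEAR = 365


def implied_cost_per_request(cost_per_year, uses_per_day, days=DAYS_PER_YEAR):
    """Cost of a single cloud request implied by a yearly per-device cost."""
    return cost_per_year / (uses_per_day * days)


def lifetime_cloud_cost(cost_per_year, years, devices=1):
    """Total cloud cost for a fleet of devices over their lifetime."""
    return cost_per_year * years * devices


per_request = implied_cost_per_request(COST_PER_YEAR_USD, USES_PER_DAY)
print(f"Implied cost per request: ${per_request:.4f}")

# Hypothetical assumptions, not from the source: 7-year life, 100,000 units.
ASSUMED_LIFETIME_YEARS = 7
ASSUMED_FLEET_SIZE = 100_000
total = lifetime_cloud_cost(COST_PER_YEAR_USD, ASSUMED_LIFETIME_YEARS, ASSUMED_FLEET_SIZE)
print(f"Fleet lifetime cost under these assumptions: ${total:,.0f}")
```

Under these assumed numbers, a fraction of a cent per request still adds up to millions of dollars across a product line, which is the trade-off Sally describes.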
JUNKO YOSHIDA: So the cost is a big driver, I could see this. But what other motivation do people have to move from cloud to edge?
SALLY WARD-FOXTON: So privacy is a really, really big one. There's been a bit of a scandal recently with the Amazon Echo, where it turned out there were human reviewers listening, eavesdropping on people’s conversations through Alexa. And it’s not just them; other manufacturers of smart assistants have been at it as well. They use humans to transcribe some of the conversations. Basically, they label the data so they can use it to train the models. There was a backlash against this from unhappy consumers. Their doctor’s appointments had been recorded and stuff. Obviously, it was not good.
So user privacy is a big reason NOT to connect devices to the cloud.
There are other things like security, data security. Say you're doing something more like transcription, where your own device is doing full transcription. Maybe it's in a meeting room, recording meetings and transcribing the minutes. If that’s company information, you might not want to send that to the cloud for whatever reason. You might want to do that on the device.
There are other reasons, too: you might need really strict latency, or a certain level of reliability that you can control. But yes, cost and privacy are really the big reasons.
JUNKO YOSHIDA: So let's talk a little bit about this PicoVoice. It's a startup based in Canada. And it claims it has newly developed a machine learning model for speech-to-text transcription, as you say, that runs on a small CPU. Can you give me some specifics? How much compute power or memory does the PicoVoice model require?
SALLY WARD-FOXTON: Right. So there are three different models. There’s a wake word engine, which is detecting a specific phrase that wakes up the rest of the system. There’s a speech-to-intent engine, which operates kind of in this limited domain that’s relevant to the application. And there’s a third model which does full speech-to-text transcription.
So for the speech-to-intent, which is where it understands spoken commands in a particular domain, maybe it’s a smart lighting system; it understands things to do with lighting. Maybe you want to turn the lights on and off. It understands those commands, changing the colors. But if you ask it about politics or economics, it doesn't understand that. You don’t have to have specific phrases, other than the wake word, but it only understands things to do with lighting. That model was less than half a megabyte. So that’s what you’d be doing on your sub-$1 microcontroller.
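The limited-domain behavior described above can be illustrated with a toy sketch. This is not PicoVoice's engine or API: a real speech-to-intent model maps audio to an intent, whereas this hypothetical example works on text that has already been transcribed and uses simple word matching. The intent names, word lists, and the function `parse_lighting_command` are all invented for illustration.

```python
import re
from typing import Dict, Optional

# Hypothetical vocabulary for a lighting-only domain (not from the source).
DOMAIN_WORDS = {"light", "lights", "lamp"}
COLORS = {"red", "green", "blue", "white"}


def parse_lighting_command(text: str) -> Optional[Dict[str, str]]:
    """Return an intent for lighting commands, or None when out of domain."""
    tokens = re.findall(r"[a-z]+", text.lower())

    # Anything that does not mention the lights is outside the domain.
    if not DOMAIN_WORDS.intersection(tokens):
        return None

    if "off" in tokens:
        return {"intent": "turn_off"}
    for token in tokens:
        if token in COLORS:
            return {"intent": "change_color", "color": token}
    if "on" in tokens:
        return {"intent": "turn_on"}
    return None


for phrase in (
    "Turn the lights off",
    "Could you switch on the lamp please",
    "Make the lights blue",
    "What do you think about the economy?",
):
    print(phrase, "->", parse_lighting_command(phrase))
```

As in the conversation, no fixed phrasing is required, but a question about economics simply yields no intent.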
For the full speech-to-text transcription, where it understands absolutely everything, 200,000 words, which is the entirety of the English language, the demo they had running was on a Raspberry Pi Zero, without an internet connection. That’s more like a $5 kind of system. The CPU on that uses an ARM11 core, a kind of classic ARM core from years back, so it’s nothing fancy. I don't know the exact size of the model, but yeah, it's still a very resource-constrained environment.
JUNKO YOSHIDA: I was just reading your colleague Anne-Francoise Pele's story this week. She wrote a story about a piezoelectric MEMS microphone company. It's called Vesper. And she interviewed the CEO, and the CEO was kind of alluding to the future, when artificial intelligence will be embedded in the sensor itself. So what other devices, whether MCUs or sensors, have you heard are heading in a similar direction?
SALLY WARD-FOXTON: Yeah. Definitely. I mean, artificial intelligence is coming to microcontrollers, and that’s a fact. Google is making a version of TensorFlow called TensorFlow Lite that is specifically for microcontrollers, very small devices, so no doubt about that.
In terms of sensor nodes, similar to PicoVoice, there is a company in Seattle called XNOR. Same as PicoVoice, they use the instruction set of the CPU. XNOR, as you might imagine, uses the exclusive NOR instruction, but their models are built for image processing, object detection, face recognition, whereas PicoVoice is for speech. So it’s not just natural language processing that is coming to microcontrollers. Image processing is coming, too. XNOR had some good demos showing image recognition, maybe like person detection, on tiny little boards, something like a sensor node, where there were no batteries. They were using energy harvesting. So I could easily imagine something like that in a sensor node somewhere, yes. Definitely.
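The conversation does not say how XNOR's models use the exclusive NOR instruction. One widely described trick in binarized neural networks, shown here only as an assumption about the general idea, is to store weights and activations as +1/-1 signs packed into bits, so that a dot product becomes an XNOR followed by a bit count. The helper names `pack_signs` and `binary_dot` are hypothetical.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int; bit i is set when values[i] is +1."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask      # XNOR: 1 wherever the signs agree
    matches = bin(agree).count("1")        # popcount
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - n


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

fast = binary_dot(pack_signs(a), pack_signs(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
print(fast, reference)
```

On real hardware the same idea processes a whole machine word of signs per instruction, which is why it suits small CPUs without floating-point or multiply-accumulate muscle.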
JUNKO YOSHIDA: I remember... I'm old enough to remember when a fridge or an elevator started to talk. Probably it was more than a decade ago. They weren’t listening to us, but they somehow, out of the blue, blurted out and warned us when the door of our fridge was ajar or when the elevator door was about to close. I found it incredibly annoying. Do you think that a coffee maker or a washing machine suddenly listening to your command is a good thing?
SALLY WARD-FOXTON: Yeah, with all these new technologies, you’ve got to use it judiciously. You’ve got to think carefully about what consumers will accept or even what consumers will enjoy using, and what is going to quickly become annoying. Especially if it’s every appliance in your kitchen suddenly piping up with something, and they are all talking at once, that’s going to be annoying. It may be less annoying if they only speak when spoken to. Perhaps their speech could be minimized, like maybe they understand you. Maybe they just make a beep or some kind of response. Yeah, it doesn't have to be a vocal response. I guess it’s up to device manufacturers really to find the right balance there, between useful and annoying.
JUNKO YOSHIDA: Your example is really interesting, Sally, because you said multiple devices, right? Think about that. You go into the kitchen. It's not just your coffee maker, your toaster, your fridge. Everything is listening to you. What if they all understood what you're saying and started to respond?
SALLY WARD-FOXTON: Yeah so your fridge is trying to make you a coffee, and washing machine's trying to make toast. No, they all have to operate in their own little domains I guess. It's a crazy thought to imagine the future where you're just speaking out loud to all these devices. Absolutely crazy.
JUNKO YOSHIDA: I have to think, how many consumers really want this? Just because technology can do this doesn't necessarily mean it's a good idea to add it. I'm not pouring cold water on this thing, but I think it might not be a bad idea to step back a little and think about whether there is really such a need. I think what happens is that companies will start building these things anyway and shove them down our throats.
SALLY WARD-FOXTON: Absolutely. So there's something like a lighting system, where you might be sitting down and the light switch is on the other side of the room or something, and you want to turn it on and off with voice. But where it's a washing machine or a toaster, where you have to physically be at the device anyway to control it, how useful is that? I don't know. But I think device manufacturers maybe will go a bit crazy and add it to try to differentiate their products at first. And maybe we'll see some pushback. Who knows?
JUNKO YOSHIDA: Yeah. Or the first thing I’m going to look for is the disable button.
SALLY WARD-FOXTON: Right? Hopefully you can turn it off.
JUNKO YOSHIDA: Yeah. All right. Well thank you so much. It’s always fun to talk to you, Sally.
SALLY WARD-FOXTON: Thanks, Junko. Have a great day.