You may have previously seen videos of early versions of the Amulet software being used with Kinect. Now the music functionality is finished and we’re ready to give it away. If it proves popular, we’ll release a paid version that includes all the other functionality currently in our Amulet Voice Remote product, and more…
This video shows how it works:
I thought I’d write a post on the Amulet blog to explain a bit of the background behind how and why “Amulet Voice Kinect” works the way it does.
The design of Amulet Voice Kinect addresses two issues that can have a negative effect on the usability of speech recognition solutions.
- The first issue particularly affects speech control of media: the media you’re playing is loud enough that it swamps the user’s voice, so you lose control over the system.
- The second issue is where noise or unintended speech produces a “false positive” and so the system does something that you didn’t tell it to do.
Our existing product, the Amulet Voice Remote, solves these two problems by ensuring that the user’s mouth is close to a microphone: to speak, you tilt the remote and raise it to your lips. The system only listens while the remote is tilted, and the user’s voice can’t be swamped by environmental noise or loud music because the mic is right next to their mouth whenever it’s in use.
Amulet Voice Kinect doesn’t have the benefit of a close-proximity mic or an on/off switch. The Microsoft Kinect it’s used with does have some very impressive echo-cancellation technology built in, and if you keep your media at moderate volume levels it works well. In practice, though, it’s all too easy to get into a situation where a particular song is too loud, the system can’t hear you, and you can’t recover without resorting to a keyboard or remote control.
So we have addressed that issue in Amulet Voice Kinect by dipping the music volume when the user is talking. How does the system know the user is talking? They signal it by raising a hand (the left hand, in this case). This also addresses the second issue: if the user hasn’t signalled that they’re giving a command, the system isn’t listening, so there’s zero chance of picking up any false positives.
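The ducking behaviour can be sketched as a tiny state machine: while the hand-raised signal is active, drop playback to a low level; when it clears, restore the previous volume. This is an illustrative sketch in Python, not our actual implementation (the class and parameter names like `duck_level` are made up; the real product talks to the media player’s mixer):

```python
class VolumeDucker:
    """Dips playback volume while the user is signalling a command.

    Illustrative sketch only -- volume is just a value in [0.0, 1.0].
    """

    def __init__(self, duck_level=0.2):
        self.duck_level = duck_level   # volume while the user is talking
        self.volume = 1.0              # current playback volume
        self._saved = None             # volume to restore afterwards

    def on_hand_raised(self):
        if self._saved is None:        # ignore repeated raise events
            self._saved = self.volume
            self.volume = min(self.volume, self.duck_level)

    def on_hand_lowered(self):
        if self._saved is not None:
            self.volume = self._saved
            self._saved = None
```

Because speech recognition only runs between the raise and lower events, the same signal that ducks the music also gates the recogniser, which is what makes false positives impossible by construction.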
So raising your hand is one way of implementing this signalling that you want to speak. Some systems use a keyword so you end up saying things like “Computer, do this” or “Xbox, do that”. While that does help with the second issue of minimising false positives, it doesn’t help with the first issue of the media playing so loudly that you can’t be heard; at that point, it doesn’t matter what you’re saying or barking at the system!
So using Kinect, the logical design decision might be to use the in-built skeletal recogniser to recognise, say, a raised-hand gesture, and we did try that first. If you watch some of our earlier videos you can see it working like that, but unfortunately there is one pretty big drawback. The skeletal recogniser works well if you use it while standing in front of the sensor, but in my experience it doesn’t work consistently and reliably if you sit down. It’s particularly poor if you lie down or lounge around on a couch, and as that’s basically the modus operandi of a speech remote-control type application, we had to drop any notion of using the skeletal recogniser.
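For what it’s worth, the hand-raise test itself is trivial when the skeleton *is* tracked: compare the height of the left-hand joint to the head joint. Here’s a sketch with the joints as a plain dict of (x, y, z) positions, which stands in for the typed skeleton data the Kinect SDK actually exposes (the joint names and the `margin` parameter are my own invention for illustration):

```python
def left_hand_raised(joints, margin=0.05):
    """Return True if the tracked left hand is above the head.

    `joints` maps joint names to (x, y, z) positions in metres,
    with y increasing upwards -- a stand-in for Kinect skeleton data.
    `margin` stops the signal flickering when the hand hovers right
    at head height.
    """
    hand = joints.get("hand_left")
    head = joints.get("head")
    if hand is None or head is None:   # joint not tracked this frame
        return False
    return hand[1] > head[1] + margin
```

The snag described above is exactly that `joints` often comes back incomplete or wrong when the user is seated or lying down, so the comparison never gets a reliable input.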
So what to try next? I considered using the RGB camera with some open-source computer vision routines; there’s loads of stuff available out there and I’ve seen some great software that tracks faces and eyes. The thing that put me off the RGB camera was whether it would work in the dark. A remote control system that only works in a well-lit room would be as bad as one that doesn’t work when the music is too loud!
So I figured the thing to do would be to use some of those same computer vision routines with the Kinect depth camera. The depth camera works in the dark, as instead of measuring pixel light intensity it uses infra-red and outputs pixel distance, so you essentially get the shape of someone, even in complete darkness.
I’ve seen Haar classifiers (or “a cascade of boosted classifiers based on Haar-like features”) from OpenCV used very effectively to recognise faces, but I couldn’t find any evidence of people using them with a depth stream. I decided to try that; you can read more details in Part 2 of this blog article.
Or just jump to the free download right now on our Amulet Voice Kinect page.
– Steve Collins