Author Topic: How to interface with TcpMic? (Read 4906 times)

roel_v · « **on:** September 19, 2015, 05:58:10 PM »

Hi,

I'm experimenting with VoxCommando to drive my home automation system. Getting commands right is the 'easy' part (for some values of 'easy'); my main problem is getting coverage throughout the house. I have a server in the basement but no way to hook up mics directly to that machine, so in my 'production' build (I'm just testing right now) I'd need 'satellite' mics in each room and a way to transmit audio to VoxCommando. What I'm thinking right now is to have my USB mic (MXL AC404 USB) hooked up to a Raspberry Pi, have the Raspberry Pi do the preprocessing (equalizing, filtering) and then pipe the raw audio to my main server running VoxCommando. I want to write my own (C++) daemon on the Raspberry Pi that reads from the sound card, does the preprocessing through ALSA filters or, if worst comes to worst, my own signal processing code, and sends the raw sound data to the VoxCommando server.

Now I have the following questions:

- The TcpMic input format doesn't seem to be documented anywhere; in fact all documentation just points to a wiki page that doesn't exist any more. Do I just open a socket and send 16-bit PCM data there? Will that also work for an 'always on' situation, i.e. do I just send a never-ending stream of audio data?

- How do I handle multiple 'senders'? Do I need to start multiple VoxCommando instances (haven't tested if that's even possible) or can the TcpMic plugin listen on multiple ports (doesn't look so from the UI)? Would you be willing to modify it so that it can, or give me the source of the TcpMic plugin so that I can build my own custom version?

- Once I get multiple senders working (presuming that's possible), how does that work out in VoxCommando? My goal is of course location awareness: when I say 'lights on' in the office, I want it to turn on the office lights, and the same in the living room. Let's assume that no mic can pick up sounds from 'other' rooms. Is there a way to pass the 'source id' to the voice command and have it decide which actions to perform? My goal in the end is just to do a single GET request to my (home-built) home automation system.

Thanks.

cheers,

roel

jitterjames · « **Reply #1 on:** September 19, 2015, 07:07:13 PM »

Hi roel_v

Welcome to VoxCommando. If you have access to an Android device you should try VoxWav to get a better idea of how your software would need to work. It's not necessary but it might help conceptually.

If you wrote something for the rPi that worked with the TCPMic plugin I'm sure many users would be very happy so you should not have too much trouble finding beta testers.

I am happy to send you the source code for the TCP-Mic plugin. It should help you even if you don't decide to modify it. In theory it should be able to handle multiple simultaneous inputs but for some reason it does not and I haven't spent the time to troubleshoot it, but it can accept multiple inputs from different devices as long as they don't overlap. Originally VoxWav and the TcpMic plugin were not designed for always on open air use. The idea was to press a button or tilt your phone when you wanted to speak and that is still how I use it. If you do want to do an always on solution you will still need to decide when to start and stop streaming audio to the plugin. This could be done using volume levels which is how VoxWav does it, or if you are able to use basic speech recognition on the pi you could use a keyword or phrase to tell it when you want to give a command. So although VoxWav can process multiple voice commands in a single longer sound stream, that stream must be of finite length, it can't just send continuously or it will never be processed and will just fill up the buffer.

So given that you have to break your audio up into "chunks" anyway, one option that might be easier to implement would be to just save a wav file to a folder that VoxCommando is set to watch. Whenever it sees a new wav file appear in that folder it will automatically open it and process the audio.
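
A minimal sketch of the watched-folder approach, in Python for brevity. It assumes the chunk is 16-bit mono PCM at 16 kHz and that the watcher only reacts to `.wav` files; neither is confirmed in this thread, and `save_chunk_as_wav` and the `.part` suffix are made-up names. The file is written under a temporary name and renamed at the end, so the watcher should not pick up a half-written file. Rename behaviour over an SMB share may differ from a local disk.

```python
import os
import tempfile
import wave


def save_chunk_as_wav(pcm_bytes, watch_dir, name, sample_rate=16000):
    """Write one finite chunk of 16-bit mono PCM to watch_dir/name."""
    final_path = os.path.join(watch_dir, name)
    # Write under a non-.wav temporary name in the same folder first
    # (assumption: the watcher ignores files that do not end in .wav).
    fd, tmp_path = tempfile.mkstemp(suffix=".part", dir=watch_dir)
    os.close(fd)
    with wave.open(tmp_path, "wb") as wav:
        wav.setnchannels(1)            # mono (assumed)
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)  # 16 kHz (assumed)
        wav.writeframes(pcm_bytes)
    # Rename only once the file is complete.
    os.replace(tmp_path, final_path)
    return final_path
```

Calling `save_chunk_as_wav(chunk, "/mnt/voxcommando/wav", "office-0001.wav")` would drop one finished file into the share; the path and file name here are placeholders.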

The TcpMic plugin can accept additional information before the audio that it receives which can include the "name" of the device that is sending it. An event is triggered immediately before the recognition of audio takes place.
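
For the location-awareness part, once the device name is available next to the recognized command, the single GET request could be built along these lines. This is only a sketch: how VoxCommando exposes the name to a command is not described in this thread, and the device names, the `room` and `command` query parameters, and the base URL are all hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical mapping from the name each satellite sends to a room.
DEVICE_ROOMS = {"pi-office": "office", "pi-living": "living_room"}


def build_request_url(base_url, device_name, command):
    """Return the GET URL for a command spoken into the given device."""
    room = DEVICE_ROOMS.get(device_name)
    if room is None:
        raise KeyError(f"unknown device: {device_name}")
    # urlencode escapes spaces and other unsafe characters in the values.
    return f"{base_url}?{urlencode({'room': room, 'command': command})}"
```

With this mapping, 'lights on' coming from `pi-office` and from `pi-living` produce two different URLs, so the home automation system can decide which lights to switch.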

I can also give you some code that is written in java for the sending part if that helps.

If you have more questions, it might be easier for us to have a quick chat on Skype or Google Hangouts. PM me if you like. I could probably be free in the next few days if you are.

Thanks for the heads-up on the missing wiki page. Nime5ter has re-added that page, but it may be a bit out of date and it does not include technical specs on how to send data to the plugin. It is written more for general user setup, which you can probably figure out for yourself anyway.

http://voxcommando.com/mediawiki/index.php?title=Plugin_TcpMic

roel_v · « **Reply #2 on:** September 20, 2015, 05:16:46 AM »

Hi,

Thanks for your elaborate response. I asked my question with one big implicit assumption, which was that VoxCommando did online processing of incoming sound, i.e. that it would start processing sounds as they came in. Looking back, I could have deduced myself that it doesn't work that way, so I might as well, as you mention, copy wav files over SMB to the VoxCommando machine. My main reason for interfacing with VoxCommando directly was to reduce latency, but I'm probably getting ahead of myself and should start with something that runs and then shift my focus to making it faster.

Of course, as you say, when feeding wav files the problem becomes how to chunk the data. Judging from https://wolfpaulus.com/journal/embedded/raspberrypi2-sr/ , the speed of speech recognition with sphinx on the rPi2 isn't something to write home about. It would mean that I'd have to record all sound, buffer it and also pipe it to sphinx, then as soon as sphinx recognizes a command start word look up what audio came after it, find the next silence, then copy the audio between the start word and the silence to VoxCommando. At which point, to be honest, VoxCommando doesn't add much any more.

I mean, at that point I might as well do all processing on the rPi and call the home automation server directly. Well, maybe for 'free form' recognition VoxCommando would be better (like when I'd say 'Jarvis google how many inhabitants in Ecuador' or whatever); for me most initial applications are just keyword based though.

Thinking about it though, maybe some of the delay in the video above comes from waiting for a long enough period of silence; and maybe the time it takes to process the sound is linear to the size of the input corpus, in which case it could be made much faster by doing more aggressive chunking and having to recognize only one trigger word. The heavy lifting could then be offloaded to the beefier hardware that VoxCommando runs on, and where it can use faster speech recognition software.

Anyway, just thinking out loud here; it's clear that I have a lot of experimentation to do. I have a bunch of work before I'd get to writing software that would directly interface with TCPMic, so no need for you to invest your time in that now. I'll get back to you when I get to that point. Or maybe someone else will beat me to it and I can just use that.

Thanks for your time so far.

cheers,

roel

jitterjames · « **Reply #3 on:** September 20, 2015, 09:43:04 AM »

If you are not able to try VoxWav yourself you can watch this video to get an idea of the latencies. I do a few commands with the watch that show how it works when your silence thresholds are properly calibrated:

In your case it sounds like using volume is the way to go. However, the main problem with this technique is that it needs to be somewhat adaptable. I have not perfected it, but one thing that I found I had to do was to have the silence threshold increase over time while you are recording, so that if there is a lot of background noise the threshold will keep rising until the recording is forced to stop. This is still not ideal because it means that when there is a lot of noise it will still be constantly streaming audio, but broken up into pieces. If you had multiple mics doing this simultaneously it would probably be problematic.
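
The rising-threshold idea can be sketched as follows, in Python for brevity. This is not the VoxWav code: the RMS level measure, the function names and every number here (`base_threshold`, `rise_per_frame`, `silence_frames`) are assumptions chosen for illustration, and the input is assumed to be fixed-size frames of 16-bit little-endian mono PCM.

```python
import math
import struct


def rms(frame):
    """Root-mean-square level of a frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def segment(frames, base_threshold=500.0, rise_per_frame=25.0, silence_frames=10):
    """Yield lists of frames, one list per finite recording."""
    chunk, quiet, threshold = [], 0, base_threshold
    for frame in frames:
        level = rms(frame)
        if not chunk:
            # Idle: start recording once the level reaches the base threshold.
            if level >= base_threshold:
                chunk, quiet, threshold = [frame], 0, base_threshold
            continue
        chunk.append(frame)
        # While recording, the silence threshold keeps rising, so steady
        # background noise eventually counts as silence.
        threshold += rise_per_frame
        quiet = quiet + 1 if level < threshold else 0
        if quiet >= silence_frames:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

Fed with constant noise above the base threshold, this keeps producing chunks back to back rather than one endless recording, which is the 'constantly streaming, but broken up into pieces' behaviour described above.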

One thing to note is that once the .wav file actually gets saved, VoxCommando will open and process it in a fraction of a second. On a fast machine it seems almost instantaneous. Given that the RasPi is probably not so fast for either processing or file transfer, you will probably find that it's much faster to use the TcpMic plugin. And if you use the plugin then you won't need to worry about how to create a wav file. The other advantage of using the plugin (and the main reason I created it), is that it's much easier for the end user to set up because you don't need to worry about samba shares and permissions etc. You can also use UDP to ask for available machines running VC and get their IP addresses.

Then again if there are existing libraries or sample code available for recording to .wav format on a RasPi you should definitely give that a shot first to get some momentum going.

Let me know what I can do to help.

Nodo · « **Reply #4 on:** February 25, 2016, 03:09:19 PM »

Quote from: jitterjames on September 19, 2015, 07:07:13 PM

Hi roel_v

I am happy to send you the source code for the TCP-Mic plugin. It should help you even if you don't decide to modify it. In theory it should be able to handle multiple simultaneous inputs but for some reason it does not and I haven't spent the time to troubleshoot it, but it can accept multiple inputs from different devices as long as they don't overlap. Originally VoxWav and the TcpMic plugin were not designed for always on open air use. The idea was to press a button or tilt your phone when you wanted to speak and that is still how I use it. If you do want to do an always on solution you will still need to decide when to start and stop streaming audio to the plugin. This could be done using volume levels which is how VoxWav does it, or if you are able to use basic speech recognition on the pi you could use a keyword or phrase to tell it when you want to give a command. So although VoxWav can process multiple voice commands in a single longer sound stream, that stream must be of finite length, it can't just send continuously or it will never be processed and will just fill up the buffer.

Good day James,

I totally understand that this topic has not been active for a long time, but I thought that instead of starting a new thread I should try here first.

I have to say that the amount of effort in coding all the pieces and plugins is amazing. If you do not mind: I was trying to better understand the TcpMic plugin when I found this thread, and I couldn't help but ask for the source code, since it would help me decide whether or not to try a Raspberry Pi daemon. In all honesty, socket programming is quite a nightmare.

Keep up the great work,
Nodo

jitterjames · « **Reply #5 on:** February 26, 2016, 06:26:00 PM »

I sent you a PM.

Nodo · « **Reply #6 on:** February 27, 2016, 09:14:20 AM »

Quote from: jitterjames on February 26, 2016, 06:26:00 PM

I sent you a PM.

Thank you. Totally appreciated.
