OK, let me ask a more specific question, since perhaps the problem can be solved another way.
Is there any difference between using VoxCommandoSP's Kinect streaming mode and grabbing the audio through the Kinect API myself with the same settings (automatic beamforming, noise suppression, etc.), routing it to a virtual recording device (e.g. with VAC), and having VoxCommandoSP listen to that device?*
In other words: does VoxCommando explicitly grab the Kinect-processed audio stream and feed it into its usual speech recognition pipeline, or does it rely on the Kinect API to return the recognized words?
*I have actually tried this, but the recognition rates are not as good as with VoxCommandoSP's Kinect streaming mode.
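For concreteness, this is roughly the signal path I mean with the VAC route, sketched in Python with a generic recognizer just to illustrate the idea. The device name, index lookup, and the pocketsphinx engine are placeholders of my own, not anything VoxCommandoSP actually does internally:

```python
# Sketch of the VAC route: Kinect-processed audio is routed to a virtual
# recording device, and a generic recognizer listens to that device.
# The device name and the pocketsphinx engine are stand-ins; VoxCommandoSP
# uses its own speech engine. This only illustrates the signal path.
import speech_recognition as sr

# Find the virtual cable among the recording devices (name is an assumption).
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    if "Virtual Audio Cable" in name or "Line 1" in name:
        vac_index = index
        break
else:
    raise RuntimeError("virtual recording device not found")

recognizer = sr.Recognizer()
with sr.Microphone(device_index=vac_index, sample_rate=16000) as source:
    recognizer.adjust_for_ambient_noise(source)  # calibrate on the routed stream
    audio = recognizer.listen(source)            # capture one utterance from the VAC device

try:
    print(recognizer.recognize_sphinx(audio))    # offline engine as a stand-in
except sr.UnknownValueError:
    print("no recognition")
```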