Author Topic: Best arrangement and/or standardization of phrases in commands (Read 3653 times)

marcusvdt · « **on:** April 08, 2015, 01:10:52 AM »

I'm currently translating and adapting much of the standard Kodi configuration that came with VC.
On a side note, Portuguese follows a different logic than English. For example, we place adjectives after nouns (not before), and many commands must still be spoken in English (like Play and Stop). That is easy to solve, since I can use phonetic phrases for those cases.

Back to the subject, I'm gradually rebuilding some phrases, adjusting the payloads, etc.
Before I put in too much effort only to discover later that I should have done it differently, I'd like to ask whether there is a recommended way of constructing phrases when similar parts may appear in many different commands. For example, I suppose it would be better to have a different beginning keyword for each device that I want to control, like "TV turn on", "Receiver volume up", "Kodi play movie", "Light living room off", and so on.
Likewise, if I want to have different commands with similar phrases like "What album is this" and "What movie is this", won't the words common to both commands make it harder for VC to correctly understand which command I want to execute?

Which of the examples below is more likely to ease speech recognition in VC?

Code:

<?xml version="1.0" encoding="utf-16"?>
<!--VoxCommando 2.1.3.8-->
<command id="491" name="What album" enabled="true" alwaysOn="False" confirm="False" requiredConfidence="0" loop="False" loopDelay="0" loopMax="0" description="">
  <action>
    <cmdType>XJson.SoftMute</cmdType>
    <params>
      <param>60</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>VC.Pause</cmdType>
    <params>
      <param>300</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>XJson.Raw</cmdType>
    <params>
      <param>Player.GetItem</param>
      <param>"playerid":0, "properties": ["album"]</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>XJson.ParseTokens</cmdType>
    <params>
      <param>This is from the album: {item.album}</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>OSD.ShowText</cmdType>
    <params>
      <param>{LastResult}</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>TTS.SpeakSync</cmdType>
    <params>
      <param>{LastResult}</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>XJson.SoftUnMute</cmdType>
    <params />
    <cmdRepeat>1</cmdRepeat>
  </action>
  <phrase>What,Which</phrase>
  <phrase>album, C D, disc, recording</phrase>
  <phrase optional="true">is this, is playing, I'm hearing</phrase>
</command>

or

Code:

<?xml version="1.0" encoding="utf-16"?>
<!--VoxCommando 2.1.3.8-->
<command id="540" name="What movie is this" enabled="true" alwaysOn="False" confirm="False" requiredConfidence="0" loop="False" loopDelay="0" loopMax="0" description="">
  <action>
    <cmdType>XJson.Raw</cmdType>
    <params>
      <param>Player.GetItem</param>
      <param>"playerid":1,  "properties": ["title","cast"]</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>Results.RegEx</cmdType>
    <params>
      <param>"name".*?"(.*)"</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>Results.MatchConcat</cmdType>
    <params>
      <param>, </param>
      <param>3</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>XJson.ParseTokens</cmdType>
    <params>
      <param>Now playing {item.title} starring {lastresult}</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>XJson.SoftMute</cmdType>
    <params>
      <param>60</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>TTS.SpeakSync</cmdType>
    <params>
      <param>{LastResult}</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>XJson.SoftUnMute</cmdType>
    <params />
    <cmdRepeat>1</cmdRepeat>
  </action>
  <phrase>What movie is this, which movie is this, what video is this, which video is this, what film is this, which film is this, what movie I'm watching, which movie I'm watching</phrase>
</command>

Compared to the one below, is the previous one likely to make it harder or easier for VC to differentiate between the commands?

Code:

<?xml version="1.0" encoding="utf-16"?>
<!--VoxCommando 2.1.3.8-->
<command id="540" name="What movie is this" enabled="true" alwaysOn="False" confirm="False" requiredConfidence="0" loop="False" loopDelay="0" loopMax="0" description="">
  <action>
    <cmdType>XJson.Raw</cmdType>
    <params>
      <param>Player.GetItem</param>
      <param>"playerid":1,  "properties": ["title","cast"]</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>Results.RegEx</cmdType>
    <params>
      <param>"name".*?"(.*)"</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>Results.MatchConcat</cmdType>
    <params>
      <param>, </param>
      <param>3</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>XJson.ParseTokens</cmdType>
    <params>
      <param>Now playing {item.title} starring {lastresult}</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>XJson.SoftMute</cmdType>
    <params>
      <param>60</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>TTS.SpeakSync</cmdType>
    <params>
      <param>{LastResult}</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>XJson.SoftUnMute</cmdType>
    <params />
    <cmdRepeat>1</cmdRepeat>
  </action>
  <phrase>What,Which</phrase>
  <phrase>movie, film, video</phrase>
  <phrase optional="true">is this, I'm watching</phrase>

</command>

jitterjames · « **Reply #1 on:** April 08, 2015, 12:34:00 PM »

Quote from: marcusvdt on April 08, 2015, 01:10:52 AM

... plus many commands must still be used in English (like Play and Stop)...

Why do you say that? You should be able to translate anything.

If it is because the actual macro uses the payload word in an action, then you can just switch to using a payload XML instead of a payload List. Use the Portuguese word as the phrase, and the English word as the value.
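
The phrase/value idea can be sketched in a few lines of Python. This is not VoxCommando's payload XML format, only an illustration of the mapping it provides: the recognizer listens for the Portuguese phrase, and the macro receives the English value. The table contents and the `payload_value` helper are assumptions made up for this example.

```python
# Hypothetical phrase -> value table (not a real VoxCommando file).
# Key: the phrase the user speaks in Portuguese.
# Value: the English word that the macro's action expects.
PAYLOAD = {
    "reproduzir": "Play",
    "parar": "Stop",
    "pausar": "Pause",
}


def payload_value(spoken_phrase: str) -> str:
    """Return the value handed to the action for a recognized phrase."""
    return PAYLOAD[spoken_phrase.strip().lower()]


print(payload_value("Parar"))  # Stop
```

The spoken word can then be changed freely without touching the action that consumes the value.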

jitterjames · « **Reply #2 on:** April 08, 2015, 12:55:28 PM »

Quote from: marcusvdt on April 08, 2015, 01:10:52 AM

Which of the examples below is more likely to ease speech recognition in VC?

The most important thing is that if you have all other words equal in two different commands, then the words that are different must not sound similar to each other. And of course the part of the phrase that is unique must not be optional.

I'm a bit confused because you gave us three commands to compare but one was an album and two were movies. So I will ignore the album and just look at the two methods for movie:

It seems you are really asking if it is better to put complete phrases all into one phrase in the tree, or to break them up. The answer is that the second method (break them up) is usually better for a number of reasons.
1) Easier to edit
2) Less likely to run out of space in the tree field
3) Probably uses less memory
4) Assuming the total possible phrases you can say comes out identically for both commands, then the recognition engine will probably get equally good results using either method.
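
To check point 4 for the two movie commands posted above, the broken-up phrases can be expanded into every complete sentence they allow and compared with the single long phrase list. This is a rough Python sketch: the `expand` helper is made up for illustration, and everything is lowercased for comparison.

```python
from itertools import product


def expand(parts):
    """Expand phrase parts into the set of complete phrases they allow.

    Each part is a pair (alternatives, optional); an optional part may be
    skipped entirely, which is modelled by adding an empty alternative.
    """
    choices = [alts + [""] if optional else alts for alts, optional in parts]
    return {" ".join(w for w in combo if w) for combo in product(*choices)}


# The single-phrase version of "What movie is this" (one long alias list).
single = {
    "what movie is this", "which movie is this",
    "what video is this", "which video is this",
    "what film is this", "which film is this",
    "what movie i'm watching", "which movie i'm watching",
}

# The broken-up version: three phrase parts, the last one optional.
split = expand([
    (["what", "which"], False),
    (["movie", "film", "video"], False),
    (["is this", "i'm watching"], True),
])

print(len(single), len(split))   # 8 18
print(single <= split)           # True: every original phrase is still allowed
print(sorted(split - single))    # the extra phrases the broken-up form accepts
```

In this case the two commands do not come out identical: the broken-up version accepts everything the long list does, plus ten more combinations such as "what film" on its own.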

Sometimes you can't break phrases up this way because you would need to change your sentence structure a lot to maintain proper grammar. In this case you can use the single phrase method with lots of aliases or you can create two commands. Just don't create multiple commands if you are using large payloads.

Now let's compare the first and last commands with each other: ("What album" / "What movie #2")
Using these two commands together makes sense. They are identical to each other except for the second phrase. The most important thing here is that none of the words (album, C D, disc, recording) should sound too similar to any of the words (movie, film, video).

And of course this rule applies to any other commands in your tree as well.
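
One way to find the word pairs this rule applies to is to list every pair of phrases from different commands that differ in exactly one word. The Python sketch below only compares spelling, so it cannot judge how similar two words sound; it just produces the pairs worth listening to. The `single_word_differences` helper and the sample phrases are assumptions for illustration.

```python
from itertools import combinations


def single_word_differences(commands):
    """Yield (cmd_a, cmd_b, word_a, word_b) for phrases that belong to
    different commands, have the same length, and differ in one word."""
    flat = [
        (name, phrase.lower().split())
        for name, phrases in commands.items()
        for phrase in phrases
    ]
    for (name_a, a), (name_b, b) in combinations(flat, 2):
        if name_a == name_b or len(a) != len(b):
            continue
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        if len(diffs) == 1:
            yield name_a, name_b, diffs[0][0], diffs[0][1]


commands = {
    "What album": ["what album is this", "what disc is this"],
    "What movie": ["what movie is this", "what film is this"],
}

pairs = sorted({(wa, wb) for _, _, wa, wb in single_word_differences(commands)})
print(pairs)
# [('album', 'film'), ('album', 'movie'), ('disc', 'film'), ('disc', 'movie')]
```

Each pair in the output carries the whole burden of telling two commands apart, so those are the words to test with your own voice and microphone.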

Sometimes you will have words that the engine cannot differentiate very well depending on your accent and the quality of your recording. For example, sometimes the words "on" and "off" are too similar. If that is the case then "turn the TV on" and "turn the TV off" may not work for you. It won't matter how your tree looks, if your end phrases end up allowing for you to say either "turn the TV on" or "turn the TV off" then it will only matter if the engine can differentiate between the sound of "on" and the sound of "off".

One solution is to use different words like: "enable/disable" "activate/deactivate" "fire up/kill" etc. instead of "on/off"

Or you can use completely different phrasing for each command: "Turn the TV on" / "Turn off the TV". This will be easy for the engine to differentiate but it will be hard for the user to remember and if they say "Turn the TV off" then the computer is almost certain to think they said "Turn the TV on".

One note about the order of phrases and payloads. I understand that sentence structure will be different from one language to another but there may be a situation where you will want to break your language rules a bit.

For example (the best example really) the command "Play song X" where X is a payload XML with a very large number of items. In this case, I think it is really best if you don't reverse this. It is OK to say "song play X" but I think it would cause problems to say "X song play" or "X play song". I think it is easier for the engine if there are phrases before the payloadXML that give it an idea of what the general command will be. This will be especially true if you have a lot of songs, and if you are using subset matching in your payloadXML. I could be wrong about this, but even if it doesn't affect the end accuracy it will cause your "guessed text" to be all over the place while you are speaking.

marcusvdt · « **Reply #3 on:** April 08, 2015, 01:46:40 PM »

Quote from: jitterjames on April 08, 2015, 12:34:00 PM

Why do you say that? You should be able to translate anything.

If it is because the actual macro uses the payload word in an action, then you can just switch to using a payload XML instead of a payload List. Use the Portuguese word as the phrase, and the English word as the value.

Yes I know, but I was referring to the cases where we use English words anyway. For example, here in Brazil we are used to saying play and stop in English. But since the recognizer culture will be set to PT-BR, I will need to use phonetic spellings and will end up with phrases like
plêi instead of play
istóp instead of stop

Otherwise, VC would be looking for the phrase plei or plêi when I actually said play (in English).

These are just two examples of many cases that I will need to deal with.

Just as a matter of curiosity, it is common here in Brazil to use international words in their original form, especially for tech-related things.
Other examples of this behavior are the words:
game: should be translated to jogo
ENTER (from the keyboard): should be translated to ENTRA
notebook: should be translated to computador portátil or literally livro de notas
PC: should be CP standing for Computador Pessoal
Mouse: should be translated to rato
LAN: should be translated to RAL standing for rede de área local
receiver: should be translated to recebedor

By contrast, in Portugal 99% of international terms are translated, and it is natural for people there to use the translated words day to day. This sounds very weird to us Brazilians.

marcusvdt · « **Reply #4 on:** April 08, 2015, 02:07:40 PM »

Quote from: jitterjames on April 08, 2015, 12:55:28 PM

The most important thing is that if you have all other words equal in two different commands, then the words that are different must not sound similar to each other. And of course the part of the phrase that is unique must not be optional.

I'm a bit confused because you gave us three commands to compare but one was an album and two were movies. So I will ignore the album and just look at the two methods for movie:

It seems you are really asking if it is better to put complete phrases all into one phrase in the tree, or to break them up. The answer is that the second method (break them up) is usually better for a number of reasons.
1) Easier to edit
2) Less likely to run out of space in the tree field
3) Probably uses less memory
4) Assuming the total possible phrases you can say comes out identically for both commands, then the recognition engine will probably get equally good results using either method.

Sometimes you can't break phrases up this way because you would need to change your sentence structure a lot to maintain proper grammar. In this case you can use the single phrase method with lots of aliases or you can create two commands. Just don't create multiple commands if you are using large payloads.

Now let's compare the first and last commands with each other: ("What album" / "What movie #2")
Using these two commands together makes sense. They are identical to each other except for the second phrase. The most important thing here is that none of the words (album, C D, disc, recording) should sound too similar to any of the words (movie, film, video).

And of course this rule applies to any other commands in your tree as well.

Sometimes you will have words that the engine cannot differentiate very well depending on your accent and the quality of your recording. For example, sometimes the words "on" and "off" are too similar. If that is the case then "turn the TV on" and "turn the TV off" may not work for you. It won't matter how your tree looks, if your end phrases end up allowing for you to say either "turn the TV on" or "turn the TV off" then it will only matter if the engine can differentiate between the sound of "on" and the sound of "off".

One solution is to use different words like: "enable/disable" "activate/deactivate" "fire up/kill" etc. instead of "on/off"

Or you can use completely different phrasing for each command: "Turn the TV on" / "Turn off the TV". This will be easy for the engine to differentiate but it will be hard for the user to remember and if they say "Turn the TV off" then the computer is almost certain to think they said "Turn the TV on".

One note about the order of phrases and payloads. I understand that sentence structure will be different from one language to another but there may be a situation where you will want to break your language rules a bit.

For example (the best example really) the command "Play song X" where X is a payload XML with a very large number of items. In this case, I think it is really best if you don't reverse this. It is OK to say "song play X" but I think it would cause problems to say "X song play" or "X play song". I think it is easier for the engine if there are phrases before the payloadXML that give it an idea of what the general command will be. This will be especially true if you have a lot of songs, and if you are using subset matching in your payloadXML. I could be wrong about this, but even if it doesn't affect the end accuracy it will cause your "guessed text" to be all over the place while you are speaking.

You answered exactly what I was asking, thanks!

So if I understand correctly, if the entire set of Kodi commands starts with broken-up phrases like the two below, having the word Kodi at the beginning of the phrases will help the recognizer discard similar commands that I may have for other devices, because the word Kodi differentiates them clearly. But since the words movie and music may sound similar, the recognizer will still be susceptible to errors. In that case it would be better to replace music with song and movie with film, which sound much more different.

command: play music
phrase: Kodi
phrase: play
phrase: music
payload: x

command: play movie
phrase: Kodi
phrase: play
phrase: movie
payload: x

Supposing the speech recognition worked well and my command phrase has been converted to text perfectly, is there a difference in speed or accuracy in how VC reacts to my command if the phrases are very similar or very different?
I'm asking because I suppose VC uses the speech recognition API to transform the spoken audio into text and later finds a match in the command tree, so I'm wondering whether this process would be more accurate and/or more likely to find the corresponding phrase if I only have very different phrases in my tree.

The whole purpose of my questions is that I'm more inclined to sacrifice natural speech for the sake of accuracy.

jitterjames · « **Reply #5 on:** April 08, 2015, 06:05:47 PM »

Quote from: marcusvdt on April 08, 2015, 02:07:40 PM

So if I understand correctly, if the entire set of Kodi commands starts with broken-up phrases like the two below, having the word Kodi at the beginning of the phrases will help the recognizer discard similar commands that I may have for other devices, because the word Kodi differentiates them clearly. But since the words movie and music may sound similar, the recognizer will still be susceptible to errors. In that case it would be better to replace music with song and movie with film, which sound much more different.

Yes it can sometimes help, especially if you notice that there are commands in other groups that get confused with your Kodi commands. There are two other things you can do here though.
1) you can set your Kodi groups to use a prefix override, and set it to Kodi. This means that saying Kodi before the commands in these groups is optional (unless you are in standby, in which case you must say it).
2) you can enable and disable groups of commands depending on context. You can, for example, have your Kodi groups enabled only when Kodi is focused. This not only helps to avoid similar commands, it can allow you to use identical commands in different groups. For example, I use "next track" as a command for both Kodi and MediaMonkey. There are a few ways to turn groups on and off depending on your needs. You can use group properties, or events and actions to turn multiple groups on and off in one shot.
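
The effect of option 2 can be shown with a toy model in Python. It does not use any VoxCommando API. The `Group` class, the `set_context` function, and the rule that a "General" group stays enabled everywhere are all assumptions invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    phrases: set
    enabled: bool = True


def set_context(groups, focused_app):
    """Enable only the group matching the focused application.

    Assumption of this sketch: a group named "General" stays enabled
    regardless of which application has focus.
    """
    for group in groups:
        group.enabled = group.name in (focused_app, "General")


def active_owners(groups, phrase):
    """Return the names of enabled groups that contain the phrase."""
    return [g.name for g in groups if g.enabled and phrase in g.phrases]


groups = [
    Group("Kodi", {"next track", "play movie"}),
    Group("MediaMonkey", {"next track"}),
    Group("General", {"what time is it"}),
]

print(active_owners(groups, "next track"))  # ['Kodi', 'MediaMonkey'] (ambiguous)
set_context(groups, "Kodi")
print(active_owners(groups, "next track"))  # ['Kodi']
```

With every group enabled, "next track" belongs to two groups. Once the context is narrowed to the focused program, only one candidate is left.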

Quote from: marcusvdt on April 08, 2015, 02:07:40 PM

Supposing the speech recognition worked well and my command phrase has been converted to text perfectly, is there a difference in speed or accuracy in how VC reacts to my command if the phrases are very similar or very different?
I'm asking because I suppose VC uses the speech recognition API to transform the spoken audio into text and later finds a match in the command tree, so I'm wondering whether this process would be more accurate and/or more likely to find the corresponding phrase if I only have very different phrases in my tree.

NO! That is not how VC works at all. VC does not use a recognizer that blindly converts audio into text. That is how most recognizers work on mobile devices (Android, iPhone etc). The main reason VC works so well is that it does not do this. The recognizer already knows all the voice commands that are possible in advance and it analyzes your audio directly in the context of these possible commands. So even if you have an artist with a crazy name, it will still get it right as long as it does not sound very similar to another artist name. This is why normal commands are much more accurate than ones that use dictation payloads. Dictation is dumb, and is only based on whatever data Microsoft implemented when it created the first recognizer for Vista. I don't think they have really updated it since then.

Quote from: marcusvdt on April 08, 2015, 02:07:40 PM

The whole purpose of my questions is that I'm more inclined to sacrifice natural speech for the sake of accuracy.

This is about finding your own balance. I agree that accuracy is important, except that it should be natural feeling to say a command. "Natural language" where you can just say anything you want is not going to work without making a lot of mistakes, but you should still construct phrases in a way that is easy to remember and easy to say. If you find it's hard to say a command, or you find it's hard to remember it, or you find that VC does not recognize you very well, then you can tweak that command, and after a while you will get a feel for what works best.

This is why I have dedicated so much time and energy to making it as easy as possible to create and adjust your own voice commands to suit your own needs, language, and style. It comes at the cost of a steeper learning curve for new users but if we could not customize VC this way, I don't see what the point would be. It would become like so many "assistants" that are fun for about 10 minutes and then you never use them again.

jitterjames · « **Reply #6 on:** April 08, 2015, 06:16:22 PM »

One note about speed. Sometimes if you have a terrible sounding mic or a lot of noise it can cause some delays, but usually the main thing that affects speed is this:

After hearing each word you say, depending on whether or not there are other possible commands that exist that start with that same phrase, it will wait a certain amount of time. If it doesn't think there are any other likely phrases that start with what you said so far then it will not take long.

Here is a simple example.

If you have 2 simple commands with the phrases:
- "I'm going to sleep now"
- "I'm going to sleep later"

When you say "I'm going to sleep..." it will wait for a while to see if you are going to say "now" or "later". But as soon as you say one of these words it will register the command almost instantly, as long as you don't have any other commands that could start with this or sound very similar. If you wait too long it will give up and usually not trigger the command.

Obviously, as you add more complex sets of commands, and especially when you use payloads with subset matching or a lot of optional phrases, the results can become slightly less predictable, but generally it comes down to this. I have actually set two delays in the engine: one very short delay for when it has determined the command and there are no other likely ones that it could turn into if you said more words, and a longer delay for when it's still not sure which command you are saying.
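
The two-delay behaviour can be modelled roughly in Python. The sketch works on exact text prefixes, whereas the real engine works on audio and on how likely each candidate is. The delay values and the `wait_after` helper are made-up assumptions, not VC's actual settings.

```python
SHORT_DELAY = 0.2  # seconds; hypothetical value, used when the command is settled
LONG_DELAY = 1.5   # seconds; hypothetical value, used while more words may follow


def wait_after(heard, commands):
    """Return how long to keep listening after `heard`.

    Returns None when no command starts with what was heard.
    """
    heard_words = heard.lower().split()
    n = len(heard_words)
    candidates = [c.lower().split() for c in commands]
    matching = [c for c in candidates if c[:n] == heard_words]
    if not matching:
        return None
    complete = [c for c in matching if len(c) == n]
    longer = [c for c in matching if len(c) > n]
    # Short wait only if a command is complete and nothing longer could follow.
    if complete and not longer:
        return SHORT_DELAY
    return LONG_DELAY


commands = ["I'm going to sleep now", "I'm going to sleep later"]

print(wait_after("I'm going to sleep", commands))      # 1.5 (waiting for now/later)
print(wait_after("I'm going to sleep now", commands))  # 0.2 (nothing more can follow)
```

Adding a third command such as "I'm going to sleep now please" would push the second call back to the long delay, because "I'm going to sleep now" could then still grow into another command.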

Once you understand how this works you can try to test some of your commands and you will see how the speed works with these two types of situations.

None of this applies to VoxWav by the way. VoxWav will analyse the audio, even if you give it several commands in a row, in a fraction of a second, because it knows the contents of the entire sound file, and is not listening to see if you are going to say more.
