I haven't found the size of payloads or of the voicecommands.xml file to be an issue for performance. Very large payloads of artists, albums, etc. will obviously take longer to load, especially the first time after an update.
One thing that is important as your commands grow, and especially important if you have multiple users, is to choose your command syntax carefully. I have noticed that the way you select channels, simply by saying the channel name, is a very poor choice. You are creating a large number of single-word commands. In general, you should try to have no single-word commands at all. By using a simple keyword like "channel", you group all those choices together and isolate them from all your other commands. Also, when the computer hears the first word "channel", it is then listening specifically for the list of channels you have provided, so it is more likely to have a higher confidence when it hears a word like "golf". "channel golf" is MUCH easier to identify than just "golf", especially if you have a command like "go left". Slur your words a bit and it sounds a bit like "golf", right?
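To make the "golf" vs "go left" point a bit more concrete, here is a rough sketch in Python. It is only an illustration: it compares the spelling of phrases with the standard library's difflib, which is a crude stand-in I am assuming for how similar two phrases might sound. It is not how the speech engine or VoxCommando actually scores anything, and the command list is made up.

```python
from difflib import SequenceMatcher

# Made-up list of other commands that share the grammar with the channel names.
OTHER_COMMANDS = ["go left", "go right", "play artist", "volume up"]


def similarity(a, b):
    """Text similarity from 0.0 to 1.0 (assumption: a rough proxy for sounding alike)."""
    return SequenceMatcher(None, a, b).ratio()


def closest_rival(phrase, rivals=OTHER_COMMANDS):
    """Return the rival command that looks most like the given phrase."""
    return max(rivals, key=lambda rival: similarity(phrase, rival))


if __name__ == "__main__":
    for phrase in ("golf", "channel golf"):
        rival = closest_rival(phrase)
        print(f"{phrase!r}: closest rival {rival!r}, "
              f"similarity {similarity(phrase, rival):.2f}")
```

By this rough measure the bare word "golf" sits much closer to "go left" than "channel golf" does, which is the same effect the keyword has on the recognizer's confidence.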
Most systems that claim you don't need to train them are making this claim on the basis of a carefully chosen set of commands.
We use commands like "play artist <name>" and "play album <name>" for the same reason. We could just use "play <name>" for both and let the computer sort it out, but that would increase the chances of something going wrong. We could never create a command where you just say the artist's name. That would lead to disaster!
Using the keyword at the beginning of your commands also allows you to create just one command with a payload, instead of having to create a separate command for each channel as you have done.
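Here is a small Python sketch of the "one command with a payload" idea. Everything in it is hypothetical (the channel names, the numbers, the function name) and it is not the voicecommands.xml format. It just shows the shape: one keyword, one lookup table, instead of one command per channel.

```python
# Hypothetical payload: spoken channel name -> value passed on to the TV software.
CHANNEL_PAYLOAD = {
    "golf": 401,
    "news": 102,
    "discovery": 250,
}


def match_channel_command(heard, payload=CHANNEL_PAYLOAD, keyword="channel"):
    """Return the payload value for 'channel <name>', or None if it does not match."""
    words = heard.lower().split()
    # Without the leading keyword, this command is not a candidate at all.
    if not words or words[0] != keyword:
        return None
    name = " ".join(words[1:])
    return payload.get(name)


if __name__ == "__main__":
    for heard in ("channel golf", "golf", "go left"):
        print(f"{heard!r} -> {match_channel_command(heard)}")
```

Adding a channel then means adding one line to the payload rather than writing a whole new command.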
The other issues you are dealing with are:
1) multiple users
2) microphone
They are both pretty big topics, but everything I said above will address them both to some extent. I can say that a busy home with a lot of people around is probably not a great place to try to use a very sensitive conference microphone that picks up everything. In that case, something like the Amulet remote will probably work much better. The conference mic would be better suited to a bachelor. A headset obviously doesn't work for multiple users, but you do want something that is designed to be close to your mouth so that it doesn't pick up background sounds.
In terms of user profiles, you can try 3 different things.
1 - Use a single profile and don't train it at all. If you train it to your voice, it won't like your wife's voice or your kids'.
2 - Use a single profile and let everyone train it. This might work, or it might end in tears...
3 - Set up separate profiles, let people train their own, and switch between them. It is possible to create a command like "This is John" and have it switch to your profile. It will take some time to reload each time you change to a new profile, but for me it is pretty quick (a couple of seconds).
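For the third option, the logic of a "This is John" style command is roughly the following. This is a Python sketch under my own assumptions, not VoxCommando's actual mechanism: the phrases other than "This is John", the profile names, and the ProfileSwitcher class are all invented for illustration, and the reload counter just stands in for the couple of seconds it takes to load a profile.

```python
# Hypothetical mapping of spoken phrases to speech profile names.
PROFILE_PHRASES = {
    "this is john": "John",
    "this is mary": "Mary",
    "kids mode": "Children",
}


class ProfileSwitcher:
    """Tracks the active profile and how many reloads have been triggered."""

    def __init__(self, default="Shared"):
        self.active = default
        self.reloads = 0

    def handle(self, heard):
        """Switch profile if the phrase names a different one; return True on a switch."""
        target = PROFILE_PHRASES.get(heard.strip().lower())
        if target is None or target == self.active:
            return False  # unknown phrase, or already on that profile: no reload
        self.active = target
        self.reloads += 1  # stands in for the reload delay
        return True


if __name__ == "__main__":
    switcher = ProfileSwitcher()
    for heard in ("This is John", "This is John", "kids mode"):
        switched = switcher.handle(heard)
        print(f"{heard!r}: switched={switched}, active={switcher.active}")
```

Skipping the reload when the requested profile is already active is a design choice worth keeping, since repeated identifications would otherwise cost a reload each time.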
I don't know if you are all native English speakers. Obviously accents are a problem. In my testing, I have found that most English-speaking men and women are understood quite well on my profile (which I have only trained a little). My friend's wife (Russian) speaks perfect English but with a strong accent, and the computer misunderstood most of what she said when she tried it. I have also found that very young children are usually not well understood. I'm not sure if it is the high pitch or something else. Anyway, I only mention it because you might be able to use one profile for adults and one for children.
Ah, the joy of experimentation! Just remember that with great power comes great opportunity to really mess things up. I want people to be able to do whatever they want with VoxCommando, but if you don't choose your commands carefully you will no doubt run into problems. Also, VoxCommando is a good way to remind ourselves to articulate!