Author Topic: Scrape and How to use REGEX  (Read 5099 times)


deco123411

  • Jr. Member
  • **
  • Posts: 27
  • Karma: 1
    • View Profile
Scrape and How to use REGEX
« on: September 24, 2016, 05:00:14 PM »
I have a question regarding regex.
I scraped feeds from my local news site, and when I do that I get
<title>NEWS 24 | New story uncovered </title>
<title>NEWS 24 | New car uncovered </title>
when I use the regex helper. It's definitely the XML, as all the titles look like this. The problem is that the headlines read back as
match 1 news 24
match 2 news 24 etc.

It's the same with the BBC feed:
http://feeds.bbci.co.uk/news/rss.xml#

I get    <title><![CDATA[Junior doctors call off five-day strikes over contracts]]></title>.
Regex then gives me an error.

Question
A) How do I use regex to get the second part of the title?
B) How do I get the description that is in the XML? Regex shows "no match found" when I copy and paste <description>...</description> into the regex helper.

I saw the videos, but they have no info on omitting certain strings.
« Last Edit: September 24, 2016, 05:26:06 PM by deco123411 »

PegLegTV

  • $upporter
  • Sr. Member
  • *****
  • Posts: 498
  • Karma: 43
    • View Profile
Re: Scrape and How to use REGEX
« Reply #1 on: September 24, 2016, 05:27:47 PM »
Based off the little bit of XML you shared this Regex

Code: [Select]
<title>NEWS\s24\s.(.*?)</
should retrieve:

Quote
New story uncovered
New car uncovered

as for the description part you will have to play with it as each site is different. If you still need help, you will need to post the XML in its entirety so we aren't making guesses.
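
A quick way to check a pattern like this outside the RegEx helper is a few lines of Python. This is only a sketch: it runs on the two sample titles from the first post (not the real feed) and uses Python's `re` module rather than VoxCommando's own regex engine, so small differences are possible.

```python
import re

# The two sample lines from the first post (not the real feed).
feed = """<title>NEWS 24 | New story uncovered </title>
<title>NEWS 24 | New car uncovered </title>"""

# Same pattern as in the reply: match "NEWS 24 ", let "." consume the "|",
# then capture everything up to the start of the closing tag.
pattern = r"<title>NEWS\s24\s.(.*?)</"

# findall() returns the captured group for every match;
# strip() trims the spaces left around each headline.
headlines = [m.strip() for m in re.findall(pattern, feed)]
print(headlines)
```

Each captured headline is what VoxCommando would expose as a match; the surrounding spaces are part of the capture, which is why the sketch trims them.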

deco123411

Re: Scrape and How to use REGEX
« Reply #2 on: September 24, 2016, 05:56:48 PM »
Yes please.
I've been racking my brain trying to understand why they are not working.
I even tried your regex.
Is there a certain pattern I must follow, like creating an XML to hold the news in?
And can Scrape only get data from certain feeds?

http://feeds.news24.com/articles/news24/TopStories/rss
http://feeds.bbci.co.uk/news/rss.xml#
These are the ones in question

PegLegTV

Re: Scrape and How to use REGEX
« Reply #3 on: September 25, 2016, 12:33:59 AM »
Quote
as for the description part you will have to play with it as each site is different

What have you tried? I'm willing to help those who show they have tried to help themselves.

...I have a life too 8)
« Last Edit: September 25, 2016, 12:37:52 AM by PegLegTV »

deco123411

Re: Scrape and How to use REGEX
« Reply #4 on: September 25, 2016, 01:50:28 AM »
 Hahaha yes of course. ;D

Please, I don't want handouts, only explanations, as I am still new but getting far quite fast.
I solved the first problem of
<title>NEWS 24 | New story uncovered </title>
and getting rid of the first part of the title by using the regex
      <title>.*?\s(.*?)</title>
This worked, with the exception that I am still getting the | before the headline, so it won't read the headline.

My other issue is if I want a headline and a description, which is normally the next match.
How do I get both? Do they need a special payload, especially if, say, I were to use this to scrape info on movies that are out on circuit?

so say headline is <title>...</
description is <desc>....</
or
 name of movie <title>....</

length of that movie <running time>...</

Last thing is that James uses {match.1} etc.,
but in the forums I see a lot of them are actually more specific:
{i}. {Match.{i}.1}.
What would I change to make it match 3.2 or 4.2?
This creates an issue: if I want info on a specific headline or movie, I have to enter e.g. match 42.1 in the logical command builder (where movie ... is match 42.1). I cannot simply ask for info about any movie or headline and get that info.
This is where my confusion lies.


I know this would help others as well. Please Guys
Thank you in advance


« Last Edit: September 25, 2016, 01:53:54 AM by deco123411 »

PegLegTV

Re: Scrape and How to use REGEX
« Reply #5 on: September 25, 2016, 03:00:35 AM »
I'll do my best to explain what I can. I'm not the RegEx guru on here.

When creating a regex pattern, it will most likely only work with the file (XML) you create it for, which means you will need to create a new RegEx pattern for each XML.

Quote
so say headline is <title>...</
description is <desc>....</
or
 name of movie <title>....</

length of that movie <running time>...</

Looking at what you posted, the tag names between the angle brackets (title, desc, running time) are an example of why it won't work on every site. If you create a pattern for your news site, it cannot work with your movie site because the XML is not the same, so you will need to create a new pattern for every XML you scrape.

this is why I said
Quote
as for the description part you will have to play with it as each site is different

For Regex to work at its best, it needs a pattern to follow. To make sure your matches are exactly what you want, you will need to include the words that are between the angle brackets. The more precise you are with your Regex patterns, the less likely you are to get incorrect matches. Attached is an example of needing to create a new Regex for each site:

Code: [Select]
<?xml version="1.0" encoding="utf-16"?>
<!--VoxCommando 2.2.2.8-->
<commandGroup open="True" name="example why you need to create regex pattern for each site" enabled="True" prefix="" priority="0" requiredProcess="" description="">
  <command id="375" name="News24 Example (Results.RegExSingle)" enabled="true" alwaysOn="False" confirm="False" requiredConfidence="0" loop="False" loopDelay="0" loopMax="0" description="">
    <action>
      <cmdType>Scrape</cmdType>
      <params>
        <param>http://feeds.news24.com/articles/news24/TopStories/rss</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>Results.RegExSingle</cmdType>
      <params>
        <param>&lt;title&gt;.*?\s\|(.*?)&lt;/title&gt;.*?&lt;description&gt;(.*?)&lt;/</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>OSD.ShowText</cmdType>
      <params>
        <param>{Match.1.1}</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>OSD.AddText</cmdType>
      <params>
        <param>{Match.1.2}</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
  </command>
  <command id="381" name="bbci - Does not work - only an example " enabled="true" alwaysOn="False" confirm="False" requiredConfidence="0" loop="False" loopDelay="0" loopMax="0" description="">
    <action>
      <cmdType>Scrape</cmdType>
      <params>
        <param>http://feeds.bbci.co.uk/news/rss.xml#</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>Results.RegExSingle</cmdType>
      <params>
        <param>&lt;title&gt;.*?\s\|(.*?)&lt;/title&gt;.*?&lt;description&gt;(.*?)&lt;/</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>OSD.ShowText</cmdType>
      <params>
        <param>{Match.1.1}</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>OSD.AddText</cmdType>
      <params>
        <param>{Match.1.2}</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
  </command>
</commandGroup>

The Results.RegExSingle action in the "News24 Example (Results.RegExSingle)" command is the regex you were looking for for your news24 site.

The second command, "bbci - Does not work - only an example", will fail because it uses the same regex as the news24 site. You will need to give it a try for yourself and see what you can come up with.

"News24 Example (Results.RegExSingle)" - Results.RegExSingle pattern breakdown

Results.RegExSingle is needed because in this case we want to make sure that the title and the description are in the same match set. If we were to do two different Results.RegEx actions (one to get the titles, another to get the descriptions), they may not match up correctly and could give the wrong description for a title.

Code: [Select]
<title>.*?\s\|(.*?)</title>.*?<description>(.*?)</
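Adding a \ (backslash) before the | will escape it, so it is matched as a literal character rather than treated as the regex "or" operator.

>.*?<
is used to span the line break between the closing title tag and the description section.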
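
To see both points in action (the escaped pipe, and `.*?` carrying the match from the title down to the description), here is a rough Python sketch. The two-item feed is hypothetical: the titles come from the first post, but the `<item>` wrapper and the description text are made up. It also assumes that Results.RegExSingle behaves roughly like Python's `re.DOTALL` flag, where `.` is allowed to match newlines; that has not been checked against VoxCommando itself.

```python
import re

# Hypothetical feed: titles from the first post, descriptions invented.
feed = """<item>
<title>NEWS 24 | New story uncovered </title>
<description>Story details here</description>
</item>
<item>
<title>NEWS 24 | New car uncovered </title>
<description>Car details here</description>
</item>"""

# \| matches a literal pipe; the two (...) groups capture title and description.
pattern = r"<title>.*?\s\|(.*?)</title>.*?<description>(.*?)</"

# With DOTALL, "." also matches newlines, so ".*?" can cross from
# </title> on one line to <description> on the next.
pairs = [(t.strip(), d.strip()) for t, d in re.findall(pattern, feed, re.DOTALL)]

# Without DOTALL, "." stops at the line break and nothing matches here.
no_flag = re.findall(pattern, feed)

print(pairs)
print(no_flag)
```

Because both groups sit in one pattern, each title stays paired with its own description, which is the point of using a single action instead of two.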

When I first learned Regex, it took me several attempts to get what I was after, and I still run into sites where it is a pain to figure out the right pattern. It takes a lot of tinkering.



deco123411

Re: Scrape and How to use REGEX
« Reply #6 on: September 25, 2016, 04:33:13 AM »
Thank you so much for your help

I completely forgot about the \| and .*?, although adding .*? didn't work.

As for the tinkering, I can believe you.
the bbc regex from
http://feeds.bbci.co.uk/news/rss.xml# is
 <title>.*?\!\[\w*\[(.*?)\]\]></
to get the title
but when i read it or have osd show it only said match.1.1.

Not to mention IMDB and XE exchange are such a pain that I still have not gotten them right.

My main question is this:
if I search for, say, movies and want to ask follow-up questions, like
I ask what movies are playing,
the OSD shows the movies playing, 1-10,
now I ask what the running time is for ..... (which is movie 3), i.e. match 3.4
(usually the next match shows the length of the movie).
How do I get VC to fetch the info without actually entering movie ..... (match 3.4) in the LCB,
so that I don't have to do it for all the movies?

This applies to all regex as well.
So if I get headlines 1-10,
then I want the description of headline 3 afterwards.
I saw what nime5ter did, but I'm trying to understand how it works.
THIS IS MY MAIN STRUGGLE AS WELL
 :-[ :bonk
« Last Edit: September 25, 2016, 04:45:53 AM by deco123411 »

nime5ter

  • Administrator
  • Hero Member
  • *****
  • Posts: 2012
  • Karma: 61
    • View Profile
    • Getting Started with VoxCommando
Re: Scrape and How to use REGEX
« Reply #7 on: September 25, 2016, 09:52:43 AM »
1. It sounds like you're trying to do a lot of things at once. Since every line of 'code' has to be exactly right when programming, this approach is likely to lead to a lot of user error and confusion.

2. When asking for help on the forum, please start by posting the command xml you are trying to use and having problems with, rather than explaining things in the abstract. This is something we mention in our "getting support" guidelines. It saves a lot of time, and it is much easier for others to understand and test the problems you're encountering.

For example, in your initial post here you wrote that the web page you were scraping had headlines of the form:
Code: [Select]
<title>NEWS 24 | New story uncovered </title>
PegLegTV accordingly provided a regular expression that would capture title string patterns that appear after "NEWS 24 | ", but in fact, the information initially provided was not at all accurate, so of course the solution provided is not going to work either.

It's also difficult to explain a problem in the abstract when concepts are still not fully learned, because it's easy to think a problem is one thing when it's another.

As another example, you say that:
Quote
the bbc regex from
http://feeds.bbci.co.uk/news/rss.xml# is
 <title>.*?\!\[\w*\[(.*?)\]\]></
to get the title
but when i read it or have osd show it only said match.1.1.

I have run the command that you posted above, using the regular expression you use here in this quote.

If I "save and execute" the command that you posted with that regular expression, the correct OSD list of 4 headlines is displayed, no problem.
Code: [Select]
<?xml version="1.0" encoding="utf-16"?>
<!--VoxCommando 2.2.2.6-->
<command id="277" name="BBC" enabled="true" alwaysOn="False" confirm="False" requiredConfidence="0" loop="False" loopDelay="0" loopMax="0" description="">
  <action>
    <cmdType>Scrape</cmdType>
    <params>
      <param>http://feeds.bbci.co.uk/news/rss.xml#</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>Results.RegEx</cmdType>
    <params>
      <param>&lt;title&gt;.*?\!\[\w*\[(.*?)\]\]&gt;&lt;/</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>OSD.ShowText</cmdType>
    <params>
      <param>Today's BBC headlines ({#M} total):</param>
      <param>10000</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>OSD.AddText</cmdType>
    <params>
      <param>{i}. {Match.{i}.1}.</param>
    </params>
    <cmdRepeat>4</cmdRepeat>
  </action>
  <action>
    <cmdType>TTS.SpeakSync</cmdType>
    <params>
      <param>Here are the top  {#M} BBC headlines| These are the latest head lines!</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>TTS.Speak</cmdType>
    <params>
      <param>{i}. {Match.{i}.1}.</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
</command>

However, only *one* headline -- the first one -- is read aloud. This has nothing to do with the regular expression. This has to do with the fact that the TTS.Speak action is not set to repeat (iterate) 4 times, whereas the OSD.AddText line is.

If you change the repeat value of the TTS.Speak action from 1 to 4, it will read 4 headlines aloud, just as 4 headlines are displayed.

If you want all headlines (matches) found, then the iteration value should be changed from the static 4 to {#M}.
http://voxcommando.com/mediawiki/index.php?title=Variables#Matches
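
The difference between a fixed repeat count and "once per match" can be sketched in Python. This is an illustration only: the first headline is the one quoted earlier in the thread, the other two are made up, and the list comprehensions merely stand in for VoxCommando's cmdRepeat and {i} mechanics rather than reproducing them.

```python
import re

# First headline is from the thread; the other two are invented.
feed = """<title><![CDATA[Junior doctors call off five-day strikes over contracts]]></title>
<title><![CDATA[Second headline (made up)]]></title>
<title><![CDATA[Third headline (made up)]]></title>"""

# The BBC pattern from the thread: step over "<![CDATA[" and capture up to "]]>".
pattern = r"<title>.*?\!\[\w*\[(.*?)\]\]></"
matches = re.findall(pattern, feed)

# A repeat count of 1 only ever handles the first match,
# like the TTS.Speak action in the posted command.
first_only = [f"{i}. {matches[i - 1]}." for i in range(1, 2)]

# Repeating once per match (the role {#M} plays) handles all of them.
all_lines = [f"{i}. {h}." for i, h in enumerate(matches, start=1)]

print(first_only)
print(all_lines)
```

The regex is identical in both cases; only the number of iterations changes what gets shown or spoken.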

You are saying that OSD is not displaying multiple headlines for you (I think -- though originally I thought you were saying that it was not working at all; it's not quite clear). That doesn't make sense based on the command you posted above, so either you are testing a different command on your end, or you are not accurately explaining what's happening.

My suggestion is that rather than trying to scrape many different websites and figure out the regular expressions needed for each, you concentrate on creating a working solution for one website, and one objective. Within that, work on understanding what each line of a command is doing, what information is passed to the command (if any), etc.
« Last Edit: September 25, 2016, 10:55:43 AM by nime5ter »
TIPS: POST VC VERSION #. Explain what you want VC to do. Say what you've tried & what happened, or post a video demo. Attach VC log. Link to instructions followed.  Post your command (xml)

deco123411

Re: Scrape and How to use REGEX
« Reply #8 on: September 25, 2016, 11:25:47 AM »
Thank you for getting back to me Nime5ter. Your feedback is always appreciated.
As for that regex I used that never worked: I only had to purge the cache and restart, and then it did.

Please let me ask my main question so as not to confuse anyone any more.
I will focus only on, say, a movie topic to be more concise. The problems I run into are all the same: even though I work out the regex pattern and eventually get the info, I still fall into the same hole.

If I use regex for a local movie circuit, I get the list of movies playing, from 1-10.

1-titanic
2-bad moms
3- ghostbusters({match3.1})
4....etc
But now I want info on movie 3 only (what the running time is {Match3.2}, what cinema it's playing at {match 3.4}, etc.).

A)
So how would VC know the info for that specific movie, since the Results.RegEx won't specifically point to {match3.2} or, if it's a longer list of movies, {match 30.2}?

I hope you can understand what I'm trying to achieve.
The same applies to the news feeds. It's essentially the same problem, since I need to store some info or variable to be able to get more info about that movie or news item without having to overpopulate the OSD or TTS with the names of the movies and their running times.

Or, for the news feeds, the headline and the description all together on the OSD. Tried it... didn't look pretty.

I really hope I have pinpointed my issue.

I think I am asking you, nime5ter, because you did something similar with Reuters news. But in there are payloads and XML.

B) Just briefly, if it's a complicated page like a busy HTML, is it better to use RoboB or scrape.HTML?
The XE exchange XML, for example, is insane.
« Last Edit: September 25, 2016, 11:37:19 AM by deco123411 »

nime5ter

Re: Scrape and How to use REGEX
« Reply #9 on: September 25, 2016, 01:25:28 PM »

Quote
but now i want info on movie 3 only.( what is the running time {Match3.2} what cinema its playing at {match 3.4}) etc..

A)
So how would VC know info on that specific movie if since the results.regex wont specefically point to {match3.2} or if its a longer list of movies {match 30.2}

Based on the xml you attached to your command, there is no {Match.3.2} or {Match.3.4} because you're only capturing the name of the movie.

There are myriad ways to skin this cat. But basically, it's probably best to store movie title information so that you can then retrieve it afterward for your follow-up commands. We often use payloadXML for this.

I provide three different solutions in the attached command collection. Explanations of each command are included within the commands themselves. These should be tested by issuing the appropriate voice commands.

Since phrasing is similar or sometimes identical in the different command groups, I've disabled two of the three. You can switch which group is enabled as you go along.

There are other ways this could be handled. Just keep in mind that the only "matches" that exist at a given time are those that you have captured in (brackets). Those will continue to be retrievable as {match.x.x} until overwritten by a subsequent action that returns matches. But that doesn't mean they will be easy to ask for in a 'friendly' way afterward, so it's not always useful that these matches remain in memory.
« Last Edit: September 25, 2016, 01:45:17 PM by nime5ter »

nime5ter

Re: Scrape and How to use REGEX
« Reply #10 on: September 25, 2016, 02:12:43 PM »
Forgot the most flexible option (arguably). This allows you to ask for any possible info of interest. Requires the attached payload xml as well. Variations of this solution would apply to most xml feeds.

Code: [Select]
<?xml version="1.0" encoding="utf-16"?>
<!--VoxCommando 2.2.2.6-->
<commandGroup open="True" name="MOVIES INFO (method 4)" enabled="True" prefix="" priority="0" requiredProcess="" description="">
  <command id="275" name=" movies playing today" enabled="true" alwaysOn="False" confirm="False" requiredConfidence="0" loop="False" loopDelay="0" loopMax="0" description="If you want to use this info in subsequent commands you need to capture it. Here I store the movie name and its match number in a payloadXML file. This is used in the next command.">
    <action>
      <cmdType>PayloadXML.Clear</cmdType>
      <params>
        <param>payloads/todays_movies.xml</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>Scrape</cmdType>
      <params>
        <param>http://www.sterkinekor.com/website/scripts/xml_feed.php?name=movies</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>File.Write</cmdType>
      <params>
        <param>sterinkor.xml</param>
        <param>{LastResult}</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>Results.RegEx</cmdType>
      <params>
        <param>name&gt; (.*?) &lt;/</param>
        <param>\s</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>PayloadXML.AddPair</cmdType>
      <params>
        <param>payloads/todays_movies.xml</param>
        <param>{i}</param>
        <param>{Match.{i}}</param>
        <param>True</param>
      </params>
      <cmdRepeat>10</cmdRepeat>
    </action>
    <action>
      <cmdType>OSD.ShowText</cmdType>
      <params>
        <param>Movies playing are:  </param>
        <param>20000</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>OSD.AddText</cmdType>
      <params>
        <param>{i}. {Match.{i}}.</param>
      </params>
      <cmdRepeat>10</cmdRepeat>
    </action>
    <action>
      <cmdType>TTS.SpeakSync</cmdType>
      <params>
        <param> Movies currently showing are| This weeks movies are| SterrKiinakorr is displaying </param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>TTS.Speak</cmdType>
      <params>
        <param>{i}. {Match.{i}}.</param>
      </params>
      <cmdRepeat>10</cmdRepeat>
    </action>
    <phrase optional="true">Pandora</phrase>
    <phrase>What movies , what is new , what can i watch</phrase>
    <phrase optional="true">are playing, is playing</phrase>
    <phrase>are out, on circuit , on ster kinekor, are out on circuit</phrase>
  </command>
  <command id="282" name="other movie info" enabled="true" alwaysOn="False" confirm="False" requiredConfidence="0" loop="False" loopDelay="0" loopMax="0" description="Scraping a website can take longer than reading a local file. This is why I stored the scrape result locally. It will be overwritten the next time you run the other command.">
    <action>
      <cmdType>File.Read</cmdType>
      <params>
        <param>sterinkor.xml</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>Results.RegEx</cmdType>
      <params>
        <param>{1}&gt;(.*?)&lt;/</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>OSD.ShowText</cmdType>
      <params>
        <param>{PF.1} for the movie {PF.2} is</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <action>
      <cmdType>OSD.AddText</cmdType>
      <params>
        <param>{Match.{2}}</param>
      </params>
      <cmdRepeat>1</cmdRepeat>
    </action>
    <phrase>What is the</phrase>
    <payloadFromXML phraseOnly="False" use2partPhrase="False" phraseConnector="by" Phrase2wildcard="anyone" optional="False">payloads\movie_data.xml</payloadFromXML>
    <phrase>for the movie, for the film</phrase>
    <payloadFromXML phraseOnly="False" use2partPhrase="False" phraseConnector="by" Phrase2wildcard="anyone" optional="False">payloads\todays_movies.xml</payloadFromXML>
  </command>
</commandGroup>
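
The idea behind this command group (remember each movie's match number first, then use it to pick the right value for whatever field is asked about) can be sketched in Python. Everything about the feed here is an assumption: the real Ster-Kinekor XML is not shown in this thread, so the `<movie>` and `<runtime>` tags and the running times are invented for illustration. The dictionary plays the role of the todays_movies.xml payload file. The leading `<` in the field pattern is an addition of the sketch, so that the closing tag is not picked up as a second match.

```python
import re

# Hypothetical feed: only the "<name> ... </name>" shape is inferred from
# the regex in the command; the other tags and values are made up.
feed = """<movie>
<name> Titanic </name>
<runtime>194 min</runtime>
</movie>
<movie>
<name> Bad Moms </name>
<runtime>100 min</runtime>
</movie>
<movie>
<name> Ghostbusters </name>
<runtime>116 min</runtime>
</movie>"""

# Step 1: capture the names and store each one with its match number
# (this stands in for PayloadXML.AddPair writing todays_movies.xml).
names = re.findall(r"name> (.*?) </", feed)
movie_index = {name: i for i, name in enumerate(names, start=1)}


def lookup(field: str, movie: str) -> str:
    """Step 2: capture every value of `field`, then pick the movie's match number."""
    values = re.findall(rf"<{re.escape(field)}>(.*?)</", feed)
    return values[movie_index[movie] - 1]


print(lookup("runtime", "Ghostbusters"))
```

This only works when every movie has the requested field exactly once and in the same order as the names; a feed with missing fields would shift the numbering.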


nime5ter

Re: Scrape and How to use REGEX
« Reply #11 on: September 25, 2016, 04:22:22 PM »
Quote
B) Just briefly, if its a complicated page like a busy HTML is it better to use RoboB or scrape.HTML
Like the XE exchange XML is insane

I don't know what you want to do on the XE exchange site so I can't really help. The command xml you posted isn't trying to do anything.

The short answer is there is no 'one size fits all' rule. We need to decide how we want a specific command structured, and then analyze the data source--whether it's html or xml or json--by trying out some example queries to figure out how to best solve the problem each time.

Generally we would scrape a URL that provides answers to a specific query. This requires knowing what kind of information is sought.

I would not start with "http://www.xe.com/pt/currencyconverter/" but with, for example:
Code: [Select]
http://www.xe.com/pt/currencyconverter/convert/?Amount=1&From=USD&To=EUR
if I wanted to convert USD to Euros, or whatnot.

By looking at a sample source html at this page, I can figure out how to isolate the resulting currency rate.

I would create a command where I say, "How much is X US dollars in Euros?"

I would pass X, USD and EUR as payloads into a scrape of a URL of form:
Code: [Select]
http://www.xe.com/pt/currencyconverter/convert/?Amount={1}&From={2}&To={3}
So, deciding exactly how one wants to structure a voice command and the response one wants for that command is really a more important starting point.
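
Filling the three payload slots of the URL template could look roughly like this in Python. The base URL and the parameter names (Amount, From, To) are taken from the template in this post; the `build_url` helper is a made-up name for the sketch, and nothing is actually fetched.

```python
from urllib.parse import urlencode

# Base URL copied from the template in the post.
BASE = "http://www.xe.com/pt/currencyconverter/convert/"


def build_url(amount, from_code, to_code):
    """Substitute the three values the voice command would pass as {1}, {2}, {3}."""
    query = urlencode({"Amount": amount, "From": from_code, "To": to_code})
    return f"{BASE}?{query}"


print(build_url(1, "USD", "EUR"))
```

The scrape and the regex would then run against whatever page this URL returns, so the voice command decides the query and the pattern only has to isolate one value.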

But if a site's source html looks too unfriendly, as a rule it's best to find another data source. I would only grapple with a problematic site if there were no better options.

For currency exchange, there are several sites that have RESTful APIs that potentially make things much easier and more versatile. I'd probably use one of those. If my needs were basic, an API like this would suffice: http://fixer.io/

e.g., scraping http://api.fixer.io/latest?base=USD gives me results from 1 USD to 30 other currencies.

RoboBrowser is useful when you have no choice but to emulate interacting with a web interface, or if elements are well structured so that you can grab text reliably by an element ID value or name.

In the case of XE.com, the same regular expression is required, I think, whether one is using regular scrape or grabbing the relevant table element using RoboBrowser. So it makes no difference. To capture just the resulting currency amount it would be something like:
Code: [Select]
class="rightCol">(\d.*?)\&nbsp;<span\sclass="uccResCde"
It's ugly, but not that problematic. The question is how often does xe.com change its design, and also why use a website that specifically has disclaimers telling us not to automatically grab its data when there are other sources we can use instead?

« Last Edit: September 25, 2016, 04:35:16 PM by nime5ter »

deco123411

Re: Scrape and How to use REGEX
« Reply #12 on: September 26, 2016, 07:30:13 AM »
Wow it's a lot to take in :o

Guess I'll start practising.
Thank you for taking the time to help me. This was exactly my issue: storing payload XML, then using it later. I just have to see what works best.
I will try to get the most out of your "most flexible option" and learn why it's so flexible, for future commands.

As for the XE exchange I use the app and wanted something I could see for verification as I am not used to the other sites. But you are right. Why make life harder.

You guys are awesome. Keep up the great work and thanks  for all your patience

nime5ter

Re: Scrape and How to use REGEX
« Reply #13 on: September 26, 2016, 06:00:07 PM »
You can use XE.com too. If you wanted to retrieve a list of different exchange rates from your home currency to others, rather than wanting to ask for specific exchange values, this is possible on XE.com if you find the right page on their site.

I provide an example for the Brazilian Real below that you can test and 'deconstruct'. Here is an explanation of how I did this.

First, I explored the site to try to locate the simplest XE.com page that would provide a one-stop list of exchange values for the BRL. It turns out to be: http://www.xe.com/pt/currency/brl-brazilian-real

To find the proper regular expression, I kept that web page open so that I could consult it. Then I scraped the site and used the RegEx tool (with 'WordWrap' selected) to view its source code.

I knew I wanted to capture the 2 currencies and the rate of exchange for each exchange rate.

I compared the page in my web browser to the scraped results in the Regex Tool in order to find the section I needed. (I searched for a specific exchange rate in the Regex Tool to locate the section of the page of interest.)

Then I analysed the pattern. In this case, each exchange rate is presented in a table cell. They follow a consistent pattern, e.g.:
Code: [Select]
rel='GBP,BRL,1,3'>0,23779</a></td>
Code: [Select]
rel='AUD,BRL,1,4'>0,40392</a></td>
There is extra information that I do not need, but it is fairly simple from this pattern to capture the data that I want.

So I developed my regular expression accordingly:
Code: [Select]
rel='(\w{3}),(\w{3}),.*?>(\d.*?)<
Note that there are 3 sets of (brackets) in my regular expression. One for each currency code and the third one captures the exchange rate.
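
As a quick check of the three capture groups, the pattern can be run in Python against the two sample cells quoted above. This is a sketch using Python's `re` module, not VoxCommando's engine; the conversion to a number at the end is an extra step that is not part of the command.

```python
import re

# The two sample table cells quoted from the scraped page.
html = """rel='GBP,BRL,1,3'>0,23779</a></td>
rel='AUD,BRL,1,4'>0,40392</a></td>"""

# Group 1: first currency code, group 2: second currency code,
# group 3: the rate, starting at its first digit and ending before "<".
pattern = r"rel='(\w{3}),(\w{3}),.*?>(\d.*?)<"

# With several groups, findall() returns one tuple per match,
# corresponding to {Match.i.1}, {Match.i.2} and {Match.i.3}.
rows = re.findall(pattern, html)
for src, dst, rate in rows:
    print(f"{src} to {dst} = {rate}")

# The page uses a decimal comma, so a numeric value needs a small conversion.
rates = [float(rate.replace(",", ".")) for _, _, rate in rows]
```

Each tuple lines up with one OSD line in the command, which is why the repeat count there is set to {#M}.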

Code: [Select]
<?xml version="1.0" encoding="utf-16"?>
<!--VoxCommando 2.2.2.6-->
<command id="258" name="Show me current exchange rates" enabled="true" alwaysOn="False" confirm="False" requiredConfidence="0" loop="False" loopDelay="0" loopMax="0" description="">
  <action>
    <cmdType>Scrape</cmdType>
    <params>
      <param>http://www.xe.com/pt/currency/brl-brazilian-real</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>Results.RegEx</cmdType>
    <params>
      <param>rel='(\w{3}),(\w{3}),.*?&gt;(\d.*?)&lt;</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>OSD.ShowText</cmdType>
    <params>
      <param>XE.com From X to Y:</param>
      <param>15000</param>
      <param>-5</param>
    </params>
    <cmdRepeat>1</cmdRepeat>
  </action>
  <action>
    <cmdType>OSD.AddText</cmdType>
    <params>
      <param>{Match.{i}.1} to {Match.{i}.2} = {Match.{i}.3}</param>
    </params>
    <cmdRepeat>{#M}</cmdRepeat>
  </action>
  <phrase>Show me current exchange rates</phrase>
</command>

As you can see, my process for creating this command pretty much duplicates the procedure that James demonstrates in his RegEx video tutorials.

deco123411

Re: Scrape and How to use REGEX
« Reply #14 on: September 27, 2016, 07:24:11 PM »
Yes, thank you for taking the time to show and explain in detail.

I think the most intimidating aspect of regex is not finding the pattern but digging it out of the site's source, especially if things are all over the place. I struggle at times because the divs and <a href tags are so long that I don't know where to start.

Appreciate all the help and support you guys have offered