When crawling with Import.io, there is an advanced option to set a URL Pattern that determines which pages should have data extracted.
I'm used to regex, so I'm having a hard time with the Import.io URL Patterns.
In regex, the pattern would be:
http://www.site.com/.*[0-9]+\.html
How can I do that with an Import.io URL Pattern?
I tried the following, but it didn't work:
www.site.com/{any}{num}.html
Some examples that should be extracted:
- www.site.com/foo/bar/foo234.html
- www.site.com/bla890.html
- www.site.com/bar/bar/bar/bar/bar/bar/aaa123.html
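To make the intended behaviour concrete, here is a minimal Python sketch of the regex check I have in mind. It assumes the regex is meant as "any path, then digits, then .html", and it makes the http:// prefix optional because the example URLs are written without it. The name should_extract is just an illustration, not anything from Import.io.

```python
import re

# Assumed reading of the regex in the question: any path, then one or
# more digits, then a literal ".html". The scheme is optional here only
# because the example URLs above are written without it.
PATTERN = re.compile(r"^(?:http://)?www\.site\.com/.*[0-9]+\.html$")

urls = [
    "www.site.com/foo/bar/foo234.html",
    "www.site.com/bla890.html",
    "www.site.com/bar/bar/bar/bar/bar/bar/aaa123.html",
]


def should_extract(url: str) -> bool:
    """Return True if the URL matches the assumed regex (hypothetical helper)."""
    return PATTERN.match(url) is not None


for url in urls:
    print(url, should_extract(url))
```

A page such as www.site.com/about.html has no digits before .html, so it should not be extracted.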
This is the Import.io notation:
- {any} - anything (including nothing)
- {num} - a number, e.g. 8767
- {alpha} - a-z characters, e.g. Dog
- {alpha-num} - either alpha or num, e.g. 435h5k
- {words-num} - words containing numbers separated by -, _ or +, e.g. this-is_a+2nd example
- {not-slash} - anything apart from a slash
- {uuid} - a UUID, e.g. 439a110f-bba1-46a5-befd-1f32cfb63dc8
- {query-string} - a query string, e.g. ?a=1&b=2%c=3
- {query-params} - a partial query string, e.g. a=1&b=2
- {ref} - a reference, otherwise known as an anchor, e.g. #foo
- $ - match the end of the URL
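For reference, this is roughly how I read a few of these tokens in regex terms. It is only a sketch of my own interpretation based on the descriptions above, not Import.io's actual implementation; the TOKEN_REGEX table and pattern_to_regex helper are hypothetical names.

```python
import re

# Hypothetical regex equivalents, guessed from the token descriptions.
# These are assumptions, not how Import.io actually matches URLs.
TOKEN_REGEX = {
    "{any}": r".*",
    "{num}": r"[0-9]+",
    "{alpha}": r"[A-Za-z]+",
    "{alpha-num}": r"[A-Za-z0-9]+",
}


def pattern_to_regex(pattern: str) -> str:
    """Translate a URL Pattern into a regex string under the assumptions above."""
    # Split on {token} placeholders, keeping the placeholders in the result.
    parts = re.split(r"(\{[a-z-]+\})", pattern)
    out = []
    for part in parts:
        if part in TOKEN_REGEX:
            out.append(TOKEN_REGEX[part])
        elif part.startswith("{"):
            raise ValueError(f"token not covered by this sketch: {part}")
        else:
            # Literal text (dots, slashes, etc.) is escaped.
            out.append(re.escape(part))
    return "".join(out)


regex = pattern_to_regex("www.site.com/{any}{num}.html")
print(re.fullmatch(regex, "www.site.com/bla890.html") is not None)
```

Under this reading, the pattern I tried would match the example URLs, which is why I expected it to work.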
More details: http://support.import.io/knowledgebase/articles/247574-advanced-crawler-options
Thanks!