Import.io - URL Pattern for "where to extract data from" optimization

Time: 2015-07-28 17:13:41

Tags: regex web-crawler import.io

When crawling with Import.io, there is an advanced option to set a URL pattern that determines which pages should have data extracted.

I'm used to regex, so I'm having a hard time with the Import.io URL patterns.

The pattern in regex would be

http://www.site.com/.*[0-9]+.html.*

How can I do that with an Import.io URL pattern?

I tried the following, but it didn't work:

www.site.com/{any}{num}.html

Some examples that should be extracted:

  • www.site.com/foo/bar/foo234.html
  • www.site.com/bla890.html
  • www.site.com/bar/bar/bar/bar/bar/bar/aaa123.html
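
To make the intent concrete, here is a minimal Python sketch that checks the examples against a regex. It assumes the intended rule is "anything, then one or more digits, then .html" and writes it in a slightly stricter form (escaped dots, anchored at both ends, optional http:// prefix). The name `matches` and the two counter-examples are hypothetical and not part of the original list.

```python
import re

# Assumed intent of the regex in the question:
# any path, then one or more digits, then a literal ".html" at the end.
PATTERN = re.compile(r"^(?:http://)?www\.site\.com/.*[0-9]+\.html$")

# URLs from the question that should be extracted.
should_match = [
    "www.site.com/foo/bar/foo234.html",
    "www.site.com/bla890.html",
    "www.site.com/bar/bar/bar/bar/bar/bar/aaa123.html",
]

# Hypothetical counter-examples (not from the question).
should_not_match = [
    "www.site.com/about.html",
    "www.site.com/foo/bar/",
]

def matches(url):
    """Return True if the URL satisfies the assumed rule."""
    return PATTERN.match(url) is not None

if __name__ == "__main__":
    for url in should_match + should_not_match:
        print(url, matches(url))
```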

This is the Import.io notation:

  • {any} - anything (including nothing)
  • {num} - a number, e.g. 8767
  • {alpha} - a-z characters, e.g. Dog
  • {alpha-num} - either alpha or num, e.g. 435h5k
  • {words-num} - words containing numbers separated by -, _ or +, e.g. this-is_a+2nd example
  • {not-slash} - anything apart from a slash
  • {uuid} - a UUID, e.g. 439a110f-bba1-46a5-befd-1f32cfb63dc8
  • {query-string} - a query string, e.g. ?a=1&b=2&c=3
  • {query-params} - a partial query string, e.g. a=1&b=2
  • {ref} - a reference, otherwise known as an anchor, e.g. #foo
  • $ - match the end of the URL
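
For experimenting locally, the notation can be roughly approximated with regex. The sketch below is only a guess based on the descriptions in the list: the `TOKENS` mapping, the `to_regex` helper, and the choice to anchor at the start of the URL are all assumptions, not Import.io's actual implementation. Tokens it does not know (such as {uuid}) are treated as literal text.

```python
import re

# Assumed regex equivalents, inferred only from the token descriptions.
TOKENS = {
    "{any}": r".*",                # anything, including nothing
    "{num}": r"[0-9]+",            # a number
    "{alpha}": r"[A-Za-z]+",       # alphabetic characters
    "{alpha-num}": r"[A-Za-z0-9]+",
    "{not-slash}": r"[^/]+",       # assumption: at least one non-slash character
}

def to_regex(pattern):
    """Translate an Import.io-style pattern into a compiled regex (approximation)."""
    # A trailing "$" means "match the end of the URL".
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Split on {token} groups while keeping the tokens in the result.
    parts = re.split(r"(\{[a-z-]+\})", pattern)
    # Known tokens become regex fragments; everything else is escaped literally.
    body = "".join(TOKENS.get(part, re.escape(part)) for part in parts)
    return re.compile(body + ("$" if anchored else ""))

if __name__ == "__main__":
    rx = to_regex("www.site.com/{any}{num}.html$")
    for url in ["www.site.com/foo/bar/foo234.html", "www.site.com/about.html"]:
        print(url, bool(rx.match(url)))
```

Under this approximation, www.site.com/{any}{num}.html does match the example URLs, so if the same pattern fails inside Import.io, the real matcher presumably interprets some token differently (for example, how {any} treats slashes or whether the scheme must be included). That is a guess, not a confirmed explanation.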

More details: http://support.import.io/knowledgebase/articles/247574-advanced-crawler-options

Thanks!

0 answers:

No answers