Question

I'm running Scrapy 0.20.2.

$ scrapy shell "http://newyork.craigslist.org/ata/"

I want to list all the links to the ad pages, but not the links to the index.html pages.

$ sel.xpath('//a[contains(@href,"html")]')
... 
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atq/4243973984.html">Wicke'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atd/4257230057.html" class'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atd/4257230057.html">Recla'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/ata/index100.html" class="butt'>]

I want to use the XPath matches function to match links of the form [0-9]+.html with a regular expression.

$ sel.xpath('//a[matches(@href,"[0-9]+.html")]')
...
ValueError: Invalid XPath: //a[matches(@href,"[0-9]+.html")]

What's wrong? Thanks.

Answer 1

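matches is an XPath 2.0 function, and Scrapy only supports XPath 1.0, which has no built-in regular expression support. You have to extract all the links with a Scrapy selector and then do the regular expression filtering at the Python level rather than in XPath.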
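The Python-level filtering could look roughly like the sketch below. The names AD_LINK and filter_ad_links are made up for illustration, and the hard-coded list stands in for whatever the selector returns in the shell. The dot is escaped here because an unescaped . in a regular expression matches any character.

```python
import re

# Matches hrefs whose file name is digits only, followed by ".html".
# The dot is escaped so it matches a literal "." rather than any character.
AD_LINK = re.compile(r'/[0-9]+\.html$')


def filter_ad_links(hrefs):
    """Keep only the hrefs that end in <digits>.html."""
    return [h for h in hrefs if AD_LINK.search(h)]


# In the Scrapy shell the list would come from something like:
#   hrefs = sel.xpath('//a/@href').extract()
# A few hrefs from the question's output stand in for it here.
hrefs = [
    "/mnh/atq/4243973984.html",
    "/mnh/atd/4257230057.html",
    "/ata/index100.html",
]

ad_links = filter_ad_links(hrefs)
print(ad_links)
# ['/mnh/atq/4243973984.html', '/mnh/atd/4257230057.html']
```

The index page /ata/index100.html is dropped because its file name is not made up of digits only.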

Answer 2

For this particular use case there is an XPath 1.0 workaround using translate(...):

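//a[
  translate(substring-before(@href, '.html'), '0123456789', '') = ''
  and substring-before(@href, '.html') != ''
  and substring-after(@href, '.html') = '']

The translate(...) call removes all digits from the part of the name before the .html extension, so the first check passes only if that part contained nothing but digits. The second line requires that part to be non-empty, which excludes a bare .html and also any href that does not contain .html at all (substring-before returns an empty string in that case). The last line makes sure that .html really is the file extension, with nothing after it. Note that the test applies to the whole @href value, so it matches hrefs like 4243973984.html but not ones with a leading path such as /mnh/atq/4243973984.html.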
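To see which hrefs an expression of this kind accepts, the checks can be replayed in plain Python. This is only a sketch of the idea (a non-empty, digits-only part before .html and nothing after it); the helper names are hypothetical and simply imitate the XPath 1.0 string functions.

```python
def substring_before(s, sep):
    # Like XPath 1.0 substring-before: "" when sep does not occur in s.
    head, found, _ = s.partition(sep)
    return head if found else ""


def substring_after(s, sep):
    # Like XPath 1.0 substring-after: "" when sep does not occur in s.
    _, found, tail = s.partition(sep)
    return tail if found else ""


def strip_digits(s):
    # Plays the role of translate(s, '0123456789', '').
    return "".join(ch for ch in s if ch not in "0123456789")


def matches_predicate(href):
    stem = substring_before(href, ".html")
    return (
        strip_digits(stem) == ""                    # only digits before .html
        and stem != ""                              # and at least one of them
        and substring_after(href, ".html") == ""    # nothing after .html
    )


for href in ["4243973984.html", "index100.html", ".html",
             "/about", "/mnh/atq/4243973984.html"]:
    print(href, matches_predicate(href))
```

Only the first href passes. The last one fails because the leading path contains characters other than digits, which is why the test works on bare file names only.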

scrapy and XPath function 'matches' syntax