I have the following case:
...
...
<tr>
<td class="company-info">Phone:</td>
<td> "020 641512" <span class="provider">ABC</span></td>
</tr>
....
I would like to check whether the value of a <td> is Phone: and, if it is, get 020 641512 from the next <td>.
I imagined something like this:
phone = hxs.xpath("//td/text()[contains('Phone:')]", "Not available")
Answer 0 (score: 1)
I think you need:
//td[contains(., 'Phone:')]/following-sibling::td/substring-before(substring-after(normalize-space(text()[1]), '"'), '"')
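The expression above works in XQuery (it needs XPath 2.0, because it ends the path with a function call). If it does not work, try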
//td[contains(., 'Phone:')]/following-sibling::td/text()[1]
Output: [space]"020 641512"
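The simpler expression still returns the leading space and the double quotes, so the value needs a little cleanup afterwards. Below is a minimal sketch of the same "find the label cell, then read the next cell" idea. It uses Python's standard-library xml.etree.ElementTree instead of Scrapy so that it runs on its own. The function name value_after_label, the SAMPLE string, and the wrapping <table> element are assumptions made for illustration; they do not come from the question.

```python
import xml.etree.ElementTree as ET

# Sample row from the question, wrapped in a <table> (an assumption)
# so that the fragment parses as well-formed XML.
SAMPLE = """<table>
<tr>
<td class="company-info">Phone:</td>
<td> "020 641512" <span class="provider">ABC</span></td>
</tr>
</table>"""


def value_after_label(markup, label, default="Not available"):
    """Return the cleaned leading text of the <td> that follows the label cell."""
    root = ET.fromstring(markup)
    for row in root.iter("tr"):
        cells = row.findall("td")
        # Compare each cell with its right-hand neighbour.
        for label_cell, value_cell in zip(cells, cells[1:]):
            if label in (label_cell.text or ""):
                # .text is only the text before the first child element,
                # which mirrors text()[1] in the XPath above.
                raw = value_cell.text or ""
                return raw.strip().strip('"')
    return default


print(value_after_label(SAMPLE, "Phone:"))  # 020 641512
```

Real pages are often not well-formed XML, so in a spider the XPath above with the same strip() calls applied to the extracted string is probably the more practical route.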
Answer 1 (score: 1)
With Scrapy's Selector and SelectorList, you can use regular expressions via their .re() method:
>>> hxs.xpath('//td[contains(., "Phone")]/following-sibling::td[1]').re(r'(\d[\d ]+\d)')
[u'020 641512']
>>>
Alternatively, using the new CSS selectors:
>>> from scrapy.selector import Selector
>>> selector = Selector(response)
>>> selector.css('td:contains("Phone") + td').re(r'(\d[\d ]+\d)')
[u'020 641512']
>>>
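To see what the pattern itself does, here is a small standalone sketch using only Python's re module. The cell_html string is a hand-written stand-in for the markup of the matched <td>; it is an assumption for illustration, not something produced by Scrapy here.

```python
import re

# A digit, then one or more digits or spaces, then a final digit.
# The shortest possible match is therefore three characters long.
PHONE_RE = re.compile(r'(\d[\d ]+\d)')

# Hand-written stand-in for the markup of the selected <td> (assumption).
cell_html = '<td> "020 641512" <span class="provider">ABC</span></td>'

matches = PHONE_RE.findall(cell_html)
print(matches)  # ['020 641512']
```

Because the pattern requires a digit at both ends, the surrounding quotes and spaces are dropped without any extra strip() call.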
Answer 2 (score: -1)
There is also a very useful Firefox add-on called Firebug for finding XPaths; see these instructions.