Question

I have a fragment of HTML like this:

<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>

I want to get the string "The Keyword: The text".

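I know I can use Chrome's inspector or Firefox's Firebug to get the XPath of the HTML above, then call hxs.select(xpath).extract() and strip the HTML tags to get the string. However, the XPath is not consistent across pages, so this approach is not general enough.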

So I am considering the following approach. First, search for "The Keyword:" using:

from scrapy.selector import HtmlXPathSelector

hxs = HtmlXPathSelector(response)
hxs.select('//*[contains(text(), "The Keyword:")]')

When I pprint the result, I get:

>>> pprint( hxs.select('//*[contains(text(), "The Keyword:")]') )
<HtmlXPathSelector xpath='//*[contains(text(), "The Keyword:")]' data=u'<label>The Keyword:</label>'>

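My question is how to get the desired string, "The Keyword: The text". I am trying to work out how to determine the XPath; if the XPath were known, I could of course get the string.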

I am open to any solution other than Scrapy's HtmlXPathSelector (for example, lxml.html may have more features, but I am very new to it).

Thanks.

Answer 1

If I understand your question correctly, the "following-sibling" axis is what you are after.

 //*[contains(text(), "The Keyword:")]/following-sibling::span/a/text()

Reference: XPath Axes
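Since the question mentions lxml.html, here is a minimal sketch of the same idea with lxml rather than HtmlXPathSelector. The helper name `keyword_line`, the `SAMPLE` markup (the question's fragment wrapped in a `<ul>`), and the choice to join the two parts with a single space are assumptions made for illustration, not something taken from the question.

```python
import lxml.html

# The fragment from the question, wrapped in a <ul> so it parses as one tree.
SAMPLE = (
    '<ul><li><label>The Keyword:</label>'
    '<span><a href="../../..">The text</a></span></li></ul>'
)


def keyword_line(doc, keyword):
    """Return 'keyword value' for the first element whose text contains
    keyword, or None if the keyword or its sibling link is not found."""
    # Passing the keyword as an XPath variable avoids quoting problems.
    matches = doc.xpath('//*[contains(text(), $kw)]', kw=keyword)
    if not matches:
        return None
    label = matches[0]
    # Step from the matched element to its sibling <span> and the link text.
    values = label.xpath('following-sibling::span/a/text()')
    if not values:
        return None
    return label.text_content().strip() + ' ' + values[0].strip()


doc = lxml.html.fromstring(SAMPLE)
print(keyword_line(doc, 'The Keyword:'))  # The Keyword: The text
```

Because the search starts from the keyword and only then moves along the sibling axis, it does not depend on the absolute position of the `<li>` in the page. It does assume that the value always sits in a `span/a` next to the matched element.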
