My XPath query is finding elements that are not even inside the node I am querying. For example (from my code below), business_div contains this HTML:
<div class="foo">
  <div>
    <table>
      ...
      <a class="bar" href="A">link</a>
    </table>
  </div>
</div>
When I run the XPath query business_div.xpath("//a[@class='bar']/@href").extract(), it returns:
["A", "B", "D"] # should just be ["A"]
How do I query only within business_div, so that the result is just ["A"]?
Full HTML:
<div class="foo">
  <div>
    <table>
      ...
      <a class="bar" href="A">link</a>
    </table>
  </div>
</div>
<div class="foo">
  <div>
    <table>
      ...
      <a class="bar" href="B">link</a>
    </table>
  </div>
</div>
<div class="foo">
  <div>
    <table>
      ...
      <!-- Some divs will not contain a link, so I can't do a simple query like "//div[contains(@class, 'foo')]//a[contains(@class, 'bar')]/@href" -->
    </table>
  </div>
</div>
<div class="foo">
  <div>
    <table>
      ...
      <a class="bar" href="D">link</a>
    </table>
  </div>
</div>
My code:
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = "MySpider"
    ...

    def parse(self, response):
        businesses = []
        business_divs = response.xpath("//div[contains(@class, 'foo')]")
        for business_div in business_divs:
            business = MyItem()
            business["link"] = business_div.xpath("//a[@class='bar']/@href").extract()
            # business["link"] is ["A", "B", "D"]
            # I am expecting business["link"] to simply be ["A"]
            # in the first loop, then ["B"], and so on
Answer 0 (score: 1)
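A small change to the XPath fixes this. An expression that starts with // is absolute: it searches the whole document, even when you call it on business_div. Prefix it with a dot to make it relative to business_div: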
business["link"] = business_div.xpath(".//a[@class='bar']/@href").extract()
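If you want to see the relative lookup on its own, without running a spider, here is a minimal sketch. Note the assumptions: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, the HTML constant is a cut-down copy of the markup from the question wrapped in a made-up <root> element so it parses as XML, and links_per_div is just an illustrative helper name. ElementTree only supports a subset of XPath (no contains(), no /@href), so the class is matched exactly and the attribute is read with .get().

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the question's markup; <root> is added only so the
# snippet is a single well-formed XML document (an assumption of this sketch).
HTML = """
<root>
  <div class="foo"><div><table><a class="bar" href="A">link</a></table></div></div>
  <div class="foo"><div><table><a class="bar" href="B">link</a></table></div></div>
  <div class="foo"><div><table></table></div></div>
  <div class="foo"><div><table><a class="bar" href="D">link</a></table></div></div>
</root>
"""


def links_per_div(markup):
    """Return one list of hrefs per div.foo, searching only inside that div."""
    root = ET.fromstring(markup)
    results = []
    for div in root.findall("./div[@class='foo']"):
        # The leading "." anchors the search at this div instead of the document.
        anchors = div.findall(".//a[@class='bar']")
        results.append([a.get("href") for a in anchors])
    return results


if __name__ == "__main__":
    print(links_per_div(HTML))  # [['A'], ['B'], [], ['D']]
```

Each div gets only its own link, and the div without a link gives an empty list, which is the same behaviour you should get from business_div.xpath(".//a[@class='bar']/@href") in the spider.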