XPath query finds elements that are not inside the selector

Asked: 2015-05-17 01:52:37

Tags: python xpath scrapy

My XPath query finds elements that are not even inside the selector I run it on. For example (from my code below), business_div contains the HTML:

<div class="foo">
    <div>
       <table>
          ...
          <a class="bar" href="A">link</a>
       </table>
    </div>
</div>

When I run the XPath query business_div.xpath("//a[@class='bar']/@href").extract(), it returns:

["A", "B", "D"] # should just be ["A"]

How can I restrict the query to business_div so that it returns only "A"? The full HTML of the page is:

<div class="foo">
    <div>
       <table>
          ...
          <a class="bar" href="A">link</a>
       </table>
    </div>
</div>

<div class="foo">
    <div>
       <table>
          ...
          <a class="bar" href="B">link</a>
       </table>
    </div>
</div>

<div class="foo">
    <div>
       <table>
          ...
          <!-- Some divs will not contain a link, so I can't use a simple query like "//div[contains(@class, "foo")]//a[contains(@class, "bar")]/@href" -->
       </table>
    </div>
</div>

<div class="foo">
    <div>
       <table>
          ...
          <a class="bar" href="D">link</a>
       </table>
    </div>
</div>

My code:

class MySpider(CrawlSpider):

    name = "MySpider"
    ...

    def parse(self, response):
        businesses = []
        business_divs = response.xpath("//div[contains(@class, 'foo')]")

        for business_div in business_divs:
            business = MyItem()
            business["link"] = business_div.xpath("//a[@class='bar']/@href").extract()

            # business["link"] is ["A", "B", "D"]
            # I am expecting business["link"] to simply be ["A"] 
            # in the first loop then ["B"] and so on

1 answer:

Answer 0 (score: 1)

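A small change to the XPath solves the problem. An expression that starts with // is absolute: it searches the whole document from the root, even when it is called on a nested selector such as business_div. Prefix it with a dot to make it relative to the current element: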

business["link"] = business_div.xpath(".//a[@class='bar']/@href").extract()