Question

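I'm relatively new to XPath, and I'm having trouble getting the exact syntax I need for a particular application. The scraper I built works fine when I use a less specific path. As soon as I try to make the path more specific, it stops returning the correct values.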

A simplified model of the document structure I'm trying to work with is:

<table class="rightLinks">
  <tbody>
    <tr>
      <td>
        <a href="http://wwww.example.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
    <tr>
      <td>
        <a href="http://wwww.example2.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
    <tr>
      <td>
        <a href="http://wwww.example3.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
    <tr>
      <td>
        <a href="http://wwww.example4.com">Text That I want to Grab</a>
      </td>
      <td>Some</td>
      <td>Text</td>
    </tr>
  </tbody>
</table>

Basically, I want to grab the href value and the text of each link.

Here is part of my scraper and what I have tried so far:

  import scrapy
  from scrapy.selector import HtmlXPathSelector
  from scrapy.http import HtmlResponse

  def parse(self, response):
    for sel in response.xpath('//table[@class="rightLinks"]/tbody/tr/*[1]/a'):
      item = DanishItem()
      item['company_name'] = sel.xpath('/text()').extract()
      item['website'] = sel.xpath('/@href').extract()
      yield item

Edit: the new path I am using

def parse(self, response):
  for sel in response.xpath('//table[@class="rightLinks"]/tr/*[1]/a'):
    item = DanishItem()
    item['company_name'] = sel.text
    item['website'] = sel.attrib['href']
    yield item

Final edit: working code (thanks, everyone!)

def parse(self, response):
  for sel in response.xpath('//table[@class="rightLinks"]/tr/*[1]/a'):
    item = DanishItem()
    item['company_name'] = sel.xpath('./text()').extract()
    item['website'] = sel.xpath('./@href').extract()
    yield item

Any advice or tips would be greatly appreciated!

Joey

Answer 1

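sel.xpath('/text()') and sel.xpath('/@href') are both absolute paths: they are evaluated from the document root, not from sel, so they do not find the link's text or attribute. If you want relative paths, use ./text() or ./@href.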

If this is lxml, so that sel is an lxml Element object, just use sel.text or sel.attrib['href']; no XPath is needed.
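
For the plain lxml case, a rough sketch might look like the following. It assumes the lxml package is installed and that the HTML is parsed with lxml.html rather than through Scrapy. The two-row HTML sample (again without tbody) and the names doc, links, and pairs are made up for illustration.

```python
import lxml.html

# Two-row sample modelled on the table from the question (no tbody).
HTML = """
<table class="rightLinks">
  <tr>
    <td><a href="http://wwww.example.com">Text That I want to Grab</a></td>
    <td>Some</td>
  </tr>
  <tr>
    <td><a href="http://wwww.example2.com">Text That I want to Grab</a></td>
    <td>Some</td>
  </tr>
</table>
"""

doc = lxml.html.document_fromstring(HTML)

# Each result is an lxml Element, so its text and attributes are plain
# Python attributes; no further XPath call is needed.
links = doc.xpath('//table[@class="rightLinks"]/tr/*[1]/a')
pairs = [(a.text, a.attrib['href']) for a in links]

for text, href in pairs:
    print(text, href)
```

Note that .text holds only the text that comes before the first child element, which is all of it for simple links like these.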

Extracting text nodes or elements with relative XPath in Scrapy
