Concatenate text from nested siblings with the text in the parent node, if available

Time: 2019-05-02 01:04:52

Tags: xpath scrapy

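I am trying to extract text from sibling elements (if any) and concatenate it with the text in the parent node. How can I do this in XPath? The HTML shown below has a few instances of <sup> and <sub>.

My expected output: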

['2','1/2']

The pieces should be concatenated like this: ['<sup>' + '/' + '<sub>']

<li data-ingredient="dry+white+wine">
 <span class="qty">2 </span>
 <span class="food">
     "cups"  
     <a href="https://www.test.com">dry white wine</a>
 </span>
</li>
<li data-ingredient="salt">
 <span class="qty">
     <sup>1</sup>
     "⁄"
     <sub>2</sub>
 </span>
 <span class="food"> teaspoon  <a href="https://www.test.com">salt</a>
 </span>
</li>

I tried the following commands and consulted the Scrapy documentation several times, but could not extract the required information.

response.xpath('//span[@class="qty"][sup and sub]/text()').extract()
response.xpath('//span[@class="qty"]//sub/text()').extract()

2 Answers:

Answer 0 (score: 1)

My idea is to iterate over each span.qty, extract its text nodes, and join them. Like this:

>>> txt = """<li data-ingredient="dry+white+wine">
...  <span class="qty">2 </span>
...  <span class="food">
...      "cups"  
...      <a href="https://www.test.com">dry white wine</a>
...  </span>
... </li>
... <li data-ingredient="salt">
...  <span class="qty">
...      <sup>1</sup>
...      "⁄"
...      <sub>2</sub>
...  </span>
...  <span class="food"> teaspoon  <a href="https://www.test.com">salt</a>
...  </span>
... </li>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> for qty in sel.css('span.qty'):
...     print(''.join([i.replace('"', '').strip() for i in qty.css('::text').extract()]))
... 
2
1⁄2
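
Note that the joined result still contains the fraction slash character (U+2044) from the HTML, not the ASCII "/" shown in the expected output. Below is a minimal Python 3 sketch of the cleanup step as a standalone helper. The name join_qty_fragments is hypothetical, and the sample lists are only assumed to be shaped like what qty.css('::text').extract() returns for the two spans.

```python
def join_qty_fragments(fragments):
    """Join the text nodes of one span.qty into a single string."""
    cleaned = []
    for fragment in fragments:
        # Drop the literal double quotes and surrounding whitespace,
        # the same cleanup the loop above applies to each text node.
        text = fragment.replace('"', '').strip()
        if text:
            cleaned.append(text)
    # Map FRACTION SLASH (U+2044) to the ASCII slash of the expected output.
    return ''.join(cleaned).replace('\u2044', '/')


# Assumed fragment lists, shaped like the ::text results for the sample spans.
whole = ['2 ']
fraction = ['\n     ', '1', '\n     "\u2044"\n     ', '2', '\n ']
print([join_qty_fragments(whole), join_qty_fragments(fraction)])
# prints ['2', '1/2']
```

In a spider, the fragments for each quantity would come from the same qty.css('::text').extract() call used in the loop above.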

Answer 1 (score: 0)

Try using bs4 for this kind of task:

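from bs4 import BeautifulSoup

# extract() returns a list; BeautifulSoup needs a single HTML string
html = response.xpath("//li[@data-ingredient='salt']/span[@class='qty']").extract_first()
# strip=True trims each text node before joining; the quotes are removed as in answer 0
text = BeautifulSoup(html, "html.parser").get_text(strip=True).replace('"', '')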