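I am trying to extract the text of the child elements (if there are any) and concatenate it with the text of the parent node. How can I do this in XPath?
The HTML shown below contains a few instances of <sup> and <sub>.
My expected output:
['2','1/2']
The second item should be built like this: ['<sup>' + '/' + '<sub>']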
<li data-ingredient="dry+white+wine">
<span class="qty">2 </span>
<span class="food">
"cups"
<a href="https://www.test.com">dry white wine</a>
</span>
</li>
<li data-ingredient="salt">
<span class="qty">
<sup>1</sup>
"⁄"
<sub>2</sub>
</span>
<span class="food"> teaspoon <a href="https://www.test.com">salt</a>
</span>
</li>
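I tried the following commands and went through several pages of the Scrapy documentation, but I could not extract the information I need.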
response.xpath('//span[@class="qty"][sup and sub]/text()').extract()
response.xpath('//span[@class="qty"]//sub/text()').extract()
Answer 0: (score: 1)
My idea is to iterate over each span.qty, extract all of the text inside it, and join the pieces together. Like this:
>>> txt = """<li data-ingredient="dry+white+wine">
... <span class="qty">2 </span>
... <span class="food">
... "cups"
... <a href="https://www.test.com">dry white wine</a>
... </span>
... </li>
... <li data-ingredient="salt">
... <span class="qty">
... <sup>1</sup>
... "⁄"
... <sub>2</sub>
... </span>
... <span class="food"> teaspoon <a href="https://www.test.com">salt</a>
... </span>
... </li>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> for qty in sel.css('span.qty'):
...     # Strip the literal quotes and surrounding whitespace from every text node, then join.
...     print(''.join(i.replace('"', '').strip() for i in qty.css('::text').extract()))
...
2
1⁄2
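The same iterate-and-join idea can be checked without a running spider. The sketch below is only an illustration: it swaps Scrapy's Selector for the standard library's xml.etree.ElementTree, wraps the question's markup in a <ul> so the fragment has a single root, and uses a hypothetical helper name, extract_quantities. It assumes the markup is well-formed enough to parse as XML, which holds for this sample but not for arbitrary HTML.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in <ul> (an assumption) to give it one root.
HTML = """<ul>
<li data-ingredient="dry+white+wine">
<span class="qty">2 </span>
<span class="food">
"cups"
<a href="https://www.test.com">dry white wine</a>
</span>
</li>
<li data-ingredient="salt">
<span class="qty">
<sup>1</sup>
"⁄"
<sub>2</sub>
</span>
<span class="food"> teaspoon <a href="https://www.test.com">salt</a>
</span>
</li>
</ul>"""


def extract_quantities(markup):
    """Return the joined text of every span.qty, child elements included."""
    root = ET.fromstring(markup)
    quantities = []
    for span in root.findall(".//span[@class='qty']"):
        # itertext() yields the span's own text plus the text of <sup>/<sub>.
        pieces = (t.replace('"', '').strip() for t in span.itertext())
        quantities.append(''.join(pieces))
    return quantities


print(extract_quantities(HTML))
```

This returns one string per span.qty, so a quantity without <sup>/<sub> children and a fraction built from them come back in the same list.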
Answer 1: (score: 0)
Try using Bs4 (BeautifulSoup) for this kind of task:
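from bs4 import BeautifulSoup
# extract() returns a list; BeautifulSoup needs a single markup string, so take the first match.
html = response.xpath("//li[@data-ingredient='salt']/span[@class='qty']").extract_first()
# get_text() concatenates the text of the span and of its <sup>/<sub> children.
text = BeautifulSoup(html, "html.parser").get_text()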
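The text returned by get_text() still carries the line breaks and literal quotes from the markup, and the separator in the sample is the fraction slash (U+2044), not the plain "/" shown in the expected output. A minimal cleanup sketch follows; clean_quantity is a hypothetical helper name, and the raw string is roughly what get_text() would return for the salt span, written out by hand here as an assumption.

```python
def clean_quantity(raw):
    """Drop quotes and whitespace, and map the fraction slash to a plain '/'."""
    text = raw.replace('"', '').replace('\u2044', '/')
    # split() with no arguments discards every run of whitespace, newlines included.
    return ''.join(text.split())


# Hand-written stand-in for the get_text() result of the salt span (assumption).
raw = '\n1\n"\u2044"\n2\n'
print(clean_quantity(raw))
```

Applied to each span.qty in turn, this would give '2' for the first item and '1/2' for the second, matching the list the question asks for.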