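What exactly am I doing wrong? Parsing HTML with lxml

Time: 2014-12-20 18:02:39

Tags: python html lxml

I am trying to parse a web page with lxml, but I am having trouble retrieving all of the text elements within a div. Here is what I have so far: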

import requests
from lxml import html
page = requests.get("https://www.goodeggs.com/sfbay/missionheirloom/seasonal-chicken-stew-16oz/53c68de974e06f020000073f",verify=False)
tree = html.fromstring(page.text)
foo = tree.xpath('//section[@class="product-description"]/div[@class="description-body"]/text()')

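As of now, "foo" comes back as an empty list []. Other pages return some content, but not everything contained in the tags nested inside the <div>. Still other pages return all of the content, because it sits at the top level of the div.

How can I retrieve all of the text content in that div? Thanks!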

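2 answers:

Answer 0 (score: 3)

The text is inside two <p> tags, so part of the text is in each p.text rather than in div.text. However, you can pull out all of the text in all of the children of the <div> by calling the text_content method instead of using the XPath text():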

import requests
import lxml.html as LH
url = ("https://www.goodeggs.com/sfbay/missionheirloom/" 
       "seasonal-chicken-stew-16oz/53c68de974e06f020000073f")
page = requests.get(url, verify=False)
root = LH.fromstring(page.text)

path = '//section[@class="product-description"]/div[@class="description-body"]'
for div in root.xpath(path):
    print(div.text_content())

yields

We’re super excited about the changing seasons! Because the new season brings wonderful new ingredients, we’ll be changing the flavor profile of our stews. Starting with deliveries on Thursday October 9th, the Chicken and Wild Rice stew will be replaced with a Classic Chicken Stew. We’re sure you’ll love it!Mission: Heirloom is a food company based in Berkeley. All of our food is sourced as locally as possible and 100% organic or biodynamic. We never cook with refined oils, and our food is always gluten-free, grain-free, soy-free, peanut-free, legume-free, and added sugar-free.
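To see why the original XPath returns an empty list without fetching the live page, here is a minimal offline sketch. The markup, the id values, and the variable names are made up for illustration and are not taken from the real page; they only assume the structure described above (a div whose text lives in child tags, plus a div that mixes top-level text with a nested tag).

```python
import lxml.html as LH

# Hypothetical markup, not the real page: one div holds only <p> children,
# the other mixes top-level text with a nested tag.
SAMPLE = (
    '<section>'
    '<div id="nested"><p>First paragraph.</p><p>Second paragraph.</p></div>'
    '<div id="mixed">Top-level text <b>bold</b> tail</div>'
    '</section>'
)
root = LH.fromstring(SAMPLE)

# /text() selects only text nodes that are direct children of the div.
nested_direct = root.xpath('//div[@id="nested"]/text()')
mixed_direct = root.xpath('//div[@id="mixed"]/text()')

# text_content() concatenates the text of the div and all of its descendants.
nested_all = root.xpath('//div[@id="nested"]')[0].text_content()
mixed_all = root.xpath('//div[@id="mixed"]')[0].text_content()

print(nested_direct)
print(mixed_direct)
print(nested_all)
print(mixed_all)
```

In this sketch the "nested" div has no direct text nodes, so /text() has nothing to select, while the "mixed" div loses the text inside <b>. That matches the three behaviours described in the question. Note that text_content() inserts no separator between the paragraphs, which is also why the output above runs "love it!Mission:" together.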

PS. dfsq has already suggested using the XPath ...//text(). That works too, but in contrast to text_content, the text pieces are returned as separate items:

In [256]: root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')

In [257]: root.xpath('//a//text()')
Out[257]: ['FOO ', 'BAR ', 'QUX', ' ', ' BAZ']

In [258]: [a.text_content() for a in root.xpath('//a')]
Out[258]: ['FOO BAR QUX  BAZ']

Answer 1 (score: 2)

I think the XPath expression should be:

//section[@class="product-description"]/div[@class="description-body"]//text()

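UPD. As @unutbu pointed out above, this expression fetches the text nodes as a list, so you will have to loop over them. If you need the whole text content as a single string, check unutbu's answer for other options.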
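For completeness, a small sketch of turning the list returned by //text() into one string. It reuses the toy <a> fragment from the other answer; the variable names are only illustrative, and the whitespace clean-up step is an optional extra rather than something either answer prescribes.

```python
import lxml.html as LH

root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')

# //text() returns each text node as a separate list item.
fragments = root.xpath('//a//text()')

# Joining the fragments gives the same string as text_content().
joined = ''.join(fragments)

# Optional: collapse runs of whitespace into single spaces.
normalized = ' '.join(joined.split())

# The XPath 1.0 string() function also returns the concatenated text
# of the first matching node as a single value.
as_string = root.xpath('string(//a)')

print(repr(joined))
print(repr(normalized))
print(repr(as_string))
```

Whether to join with an empty string or normalize whitespace depends on how the page separates its paragraphs, so treat this as a starting point rather than a fixed recipe.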