然而,在python中使用lxml.html这样做是行不通的:
import requests
import lxml.html
s = requests.session()
page= s.get('http://lxml.de/')
html = lxml.html.fromstring(page.text)
p=html.xpath('p')
此处p
是一个空列表。
我需要使用p=html.xpath('//p')
。
任何人都知道为什么?
答案 0 :(得分:3)
该页面可能不在<p>
(即根),而是<html>
,您假定该xpath表达式。
使用双斜杠//p
来检索所有<p>
元素,或者使用绝对引用向下走去特定<p>
。下面演示第一段内容:
p = html.xpath('/html/body/div/p')
print(p[0].text)
# lxml is the most feature-rich
# and easy-to-use library
# for processing XML and HTML
# in the Python language.
等效地:
p = html.xpath('//p')
print(p[0].text)
# lxml is the most feature-rich
# and easy-to-use library
# for processing XML and HTML
# in the Python language.
解析<p>
没有正斜杠,这需要使用搜索路径斜杠的前一个xpath:
div = p = html.xpath('/html/body/div')[0]
p = div.xpath('p')
print(p[0].text)
# lxml is the most feature-rich
# and easy-to-use library
# for processing XML and HTML
# in the Python language.