Question

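I'm trying to create a Python application that uses lxml to scrape HTML from a website and collect countries and their corresponding capitals. I'm scraping the HTML from http://www.manythings.org/vocabulary/lists/2/words.php?f=countries_and_capitals, and I can't figure out how to get all of the countries so that I can put them into a list. This is what I have so far: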

from lxml import html
import requests

page = requests.get("http://www.manythings.org/vocabulary/lists/2/words.php?f=countries_and_capitals")
tree = html.fromstring(page.content)

countries = tree.xpath('//*[@id="yui-main"]/div/div[2]/div/div[1]/ul/li[1]/b')
capitals = tree.xpath('//*[@id="yui-main"]/div/div[2]/div/div[1]/ul/li[1]/i')

# print() as a function so the script runs on Python 3
print('Countries: ', countries)
print('Capitals: ', capitals)

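Right now the output is two empty lists. I'm fairly sure that's because the XPath is incorrect, but I don't know XPath or HTML well enough to correct it. I would rather be guided toward the answer than simply be given it.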

Answer 1

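This is an interesting problem. It turns out that your XPath and the HTML are correct: running the expressions in the Chrome developer tools selects the appropriate elements. However, when debugging in the Python interactive shell, the problem becomes apparent: in the HTML that requests downloads, the yui-main div doesn't actually exist.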

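The web page is updated dynamically with JavaScript; the content is loaded into the yui-main div at runtime. The lxml parser doesn't execute JavaScript, so your parse tree will never contain the yui-main div.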

I confirmed this by turning off JavaScript in my browser and visiting the page.
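The same check can be made from Python by asking the parsed tree whether an element with a given id exists. This is only a minimal sketch: RAW_HTML is a made-up stand-in for what the server might send before any JavaScript runs (the real markup is an assumption, not copied from the site), and has_element_with_id is a hypothetical helper name.

```python
from lxml import html

# Hypothetical stand-in for the markup received before any JavaScript runs.
# The real page's structure is an assumption here.
RAW_HTML = """
<html><body>
  <div id="content">
    <ul>
      <li><b>Afghanistan</b> - <i>Kabul</i></li>
    </ul>
  </div>
</body></html>
"""


def has_element_with_id(tree, element_id):
    """Return True if the parsed tree contains an element with this id."""
    # $wanted is an XPath variable; lxml fills it from the keyword argument.
    return bool(tree.xpath('//*[@id=$wanted]', wanted=element_id))


tree = html.fromstring(RAW_HTML)
print(has_element_with_id(tree, "yui-main"))
print(has_element_with_id(tree, "content"))
```

With the real page you would build the tree from page.content instead of RAW_HTML. If the id from your browser-generated XPath is missing, the expression was written against the JavaScript-modified DOM rather than the raw HTML.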

After that, coming up with an XPath selector was trivial:

countries = tree.xpath('//li/b/text()')
capitals = tree.xpath('//li/i/text()')
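If the goal is a list of (country, capital) pairs, one option is to walk the li elements and read the b and i children of each, so a country is never paired with the wrong capital when an item is missing one of them. The sketch below runs on a small made-up sample. SAMPLE_HTML and extract_pairs are illustrative names, and the sample markup only assumes the li/b/i layout implied by the selectors above.

```python
from lxml import html

# Made-up sample that mirrors the li/b/i layout the selectors above rely on.
SAMPLE_HTML = """<html><body><ul>
<li><b>Afghanistan</b> - <i>Kabul</i></li>
<li><b>Albania</b> - <i>Tirana</i></li>
<li><b>Algeria</b> - <i>Algiers</i></li>
<li>An item without a country or capital</li>
</ul></body></html>"""


def extract_pairs(tree):
    """Collect (country, capital) tuples from every li that has both."""
    pairs = []
    for li in tree.xpath('//li'):
        # string(b) is the text of the first <b> child, or '' if there is none.
        country = li.xpath('string(b)').strip()
        capital = li.xpath('string(i)').strip()
        if country and capital:
            pairs.append((country, capital))
    return pairs


tree = html.fromstring(SAMPLE_HTML)
pairs = extract_pairs(tree)
for country, capital in pairs:
    print(country, '-', capital)
```

On the real page, the tree would come from html.fromstring(page.content) as in the question.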
