Question

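I'm a newbie wrangling with lxml and am currently working through the O'Reilly book. After importing html from lxml, calling html.parse returns the following error message:

Error reading file 'http://www.emoji-cheat-sheet.com/': failed to load external entity "http://www.emoji-cheat-sheet.com/"

Here is the code: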

from lxml import html
page = html.parse('http://www.emoji-cheat-sheet.com/')

It can also be found in the book's companion repository:

https://github.com/jackiekazil/data-wrangling/blob/master/code/chp11-scraping/lxml_emoji_xpath.py


Answer 1

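The problem is that since the book was published, emoji-cheat-sheet.com has moved to https://www.webpagefx.com/tools/emoji-cheat-sheet/, so the old address redirects you there. A plain html.parse cannot handle that redirect, and it may also have trouble with the encryption, because the site now uses an https (secure, encrypted) connection, as most professional sites do these days.

I was able to fetch the page using the requests library: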

import requests

# requests follows the redirect and handles the https connection
page = requests.get('https://www.webpagefx.com/tools/emoji-cheat-sheet')
content = page.content  # raw response body as bytes
print(content)
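
To get from those downloaded bytes back to an lxml tree, something along these lines should work. This is only a minimal sketch, assuming both requests and lxml are installed; the helper names fetch_html and parse_html are made up for illustration, and the XPath is a placeholder rather than the query used in the book.

```python
import requests
from lxml import html


def fetch_html(url, timeout=10):
    """Download a page; requests follows redirects and handles https."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx responses
    return response.content      # raw bytes of the response body


def parse_html(content):
    """Build an lxml element tree from HTML that is already in memory."""
    return html.fromstring(content)


# Offline check with a tiny hand-written document (no network needed).
sample = b'<html><head><title>Demo</title></head><body><p>hi</p></body></html>'
tree = parse_html(sample)
print(tree.xpath('//title/text()'))
```

With network access, parse_html(fetch_html('https://www.webpagefx.com/tools/emoji-cheat-sheet')) would stand in for the failing html.parse call, assuming the site is still reachable at that address.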

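If you try to make an insecure http request to that particular site, the server will still redirect you to the https page. Secure pages like this are hard to parse with the bare library alone.

http://dictionary.com does not automatically redirect you to an https site, and the same html.parse call works fine there. (I also tried it with your emoji site, but it did not work.)

If you have to parse that particular site, I would suggest using BeautifulSoup. I'll check whether that works and report back.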
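
As a rough idea of what the BeautifulSoup route might look like: BeautifulSoup only parses markup it is given and does not download anything itself, so the page would still have to be fetched first (for example with requests). The sketch below assumes the bs4 package is installed; extract_title and the sample markup are hypothetical and only there to show the call pattern.

```python
from bs4 import BeautifulSoup


def extract_title(markup):
    """Return the text of the <title> element, or None if there is none."""
    # 'html.parser' is the parser from the Python standard library.
    soup = BeautifulSoup(markup, 'html.parser')
    return soup.title.get_text(strip=True) if soup.title else None


# Hand-written sample markup; in practice this would be the fetched page.
sample = '<html><head><title>Emoji cheat sheet</title></head><body></body></html>'
print(extract_title(sample))
```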
