Question

I'm glad to join Stack Overflow :) This is the first time I couldn't find an answer to my question :)

I want to scrape the "meta description" from a list of URLs (stored in an SQL database).

When I start my script, it gets "Killed" without any error. It is killed while reading the 11th URL.

I ran some tests and identified one URL: "http://www.les-calories.com/famille-4.html"

So I ran this test, with my code reduced to a minimum:

# encoding=utf8 
from bs4 import BeautifulSoup
import urllib
html = urllib.urlopen(" http://www.les-calories.com/famille-4.html").read()
soup = BeautifulSoup(html)

This code gets "Killed" by the shell.


I don't understand why...

Thanks for your help :)

Answer 1

It may be that you haven't specified a parser, in which case do the following:

soup = BeautifulSoup(html, "html.parser")

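However, I think it's more likely that the HTML page simply contains too much information. What I would do is use the python-requests package and set stream to True in the GET request. Like this: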

>>> import requests
>>> resp = requests.get("http://www.les-calories.com/famille-4.html", stream=True)
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(resp.text, "html.parser")
>>> soup.find("a")
<a href="http://www.fitadium.com/79-seche-et-definition-musculaire" target="_blank"><img border="0" height="60px" src="http://www.les-calories.com/images/234x60_pack-minceur-brule-graisses.gif" width="234px"/></a>
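
If the page really is too large, one thing to keep in mind is that accessing resp.text still reads the whole body, even with stream=True. A minimal sketch of one way to bound memory is to read the streamed response in chunks and stop at a size limit before parsing. The helper names read_capped and fetch_meta_description and the 1 MB limit are my own assumptions for illustration, not something from the question, and I have not confirmed that memory is the actual cause here:

```python
def read_capped(chunks, max_bytes):
    """Join byte chunks, stopping once max_bytes have been collected.

    Assumes max_bytes is a positive integer.
    """
    buf = bytearray()
    for chunk in chunks:
        # Take only as much of this chunk as still fits under the cap.
        buf.extend(chunk[: max_bytes - len(buf)])
        if len(buf) >= max_bytes:
            break
    return bytes(buf)


def fetch_meta_description(url, max_bytes=1_000_000):
    """Hypothetical helper: return the meta description of url, or None."""
    # Third-party imports are local so read_capped can be used without them.
    import requests
    from bs4 import BeautifulSoup

    # stream=True defers downloading the body until it is iterated over.
    with requests.get(url, stream=True, timeout=10) as resp:
        resp.raise_for_status()
        html = read_capped(resp.iter_content(chunk_size=8192), max_bytes)

    # The meta description lives in <head>, so a truncated page is usually enough.
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": "description"})
    return tag.get("content") if tag else None
```

You would then call fetch_meta_description("http://www.les-calories.com/famille-4.html") for each URL from your database. Since the description tag normally sits near the top of the document, cutting the download off early should not lose it, but adjust the limit if your pages have very large headers.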

Python BeautifulSoup doesn't work on a URL
