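Can you suggest a fix? The script downloads almost every image from imgur pages, but it fails on one. I'm not sure why it doesn't work in this case or how to fix it.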
elif 'imgur.com' in submission.url and not (submission.url.endswith('gif')
        or submission.url.endswith('webm')
        or submission.url.endswith('mp4')
        or 'all' in submission.url
        or '#' in submission.url
        or '/a/' in submission.url):
    html_source = requests.get(submission.url).text  # download the image's page
    soup = BeautifulSoup(html_source, "lxml")
    image_url = soup.select('img')[0]['src']
    if image_url.startswith('//'):
        image_url = 'http:' + image_url
    image_id = image_url[image_url.rfind('/') + 1:image_url.rfind('.')]
    try:
        image_file = urllib2.urlopen(image_url, timeout=5)
        with open('/home/mona/computer_vision/image_retrieval/images/' + category + '/' + 'imgur_' + datetime.datetime.now().strftime('%y-%m-%d-%s') + image_url[-9:], 'wb') as output_image:
            output_image.write(image_file.read())
    except urllib2.URLError as e:
        print(e)
        continue
The error is:
[LOG] Done Getting http://i.imgur.com/FoCjtI7.jpg
submission id is: 1alffm
[LOG] Getting url: http://sphotos-a.ak.fbcdn.net/hphotos-ak-ash4/217834_10151246341237704_484810759_n.jpg
HTTP Error 403: Forbidden
[LOG] Getting url: http://imgur.com/xp386
Traceback (most recent call last):
  File "download_images.py", line 67, in <module>
    soup = BeautifulSoup(html_source, "lxml")
  File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 155, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Answer 0 (score: 2)
Open a Python shell and try the following:
from bs4 import BeautifulSoup
myHTML = "<html><head></head><body><strong>Hi</strong></body></html>"
soup = BeautifulSoup(myHTML, "lxml")
Does this work, or do you get the same error? If you get the same error, lxml is missing. Install it:
pip install lxml
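I'm walking through these steps because you indicated that the script ran for quite a while before crashing, in which case the parser can't simply be missing, can it?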
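If installing lxml is not an option right away, one possible workaround is to choose the parser at runtime. The following is only a minimal sketch: `pick_parser` is a hypothetical helper name, not part of bs4, and it assumes that falling back to Python's built-in `html.parser` is acceptable for these pages.

```python
def pick_parser():
    """Return "lxml" if the lxml package can be imported, else "html.parser"."""
    try:
        import lxml  # only checking that the package is importable
        return "lxml"
    except ImportError:
        # html.parser ships with Python, so it is always available to bs4.
        return "html.parser"


parser_name = pick_parser()
print("BeautifulSoup parser to use: %s" % parser_name)
```

The result can then be passed as the second argument in the script, for example `BeautifulSoup(html_source, pick_parser())`. The two parsers can build slightly different trees from malformed HTML, so results may not be identical.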
Added by OP:
If you are using Python 2.7 on Ubuntu/Debian, this worked for me:
$ sudo apt-get build-dep python-lxml
$ sudo pip install lxml
Test it like this:
mona@pascal:~/computer_vision/image_retrieval$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml
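
The same check can also be run as a small script. This is a rough sketch rather than part of the original answer: `lxml_status` is a hypothetical helper name, and printing the interpreter path is there on the assumption that `pip` may have installed lxml into a different Python than the one running `download_images.py`.

```python
import sys


def lxml_status():
    """Return (interpreter_path, lxml_version_or_None) for the running Python."""
    try:
        from lxml import etree
        # LXML_VERSION is a tuple of integers; join it into a dotted string.
        version = ".".join(str(part) for part in etree.LXML_VERSION)
    except ImportError:
        version = None
    return sys.executable, version


interpreter, version = lxml_status()
print("interpreter: %s" % interpreter)
print("lxml: %s" % (version if version else "not installed"))
```

If the second line reports "not installed" even after `pip install lxml` succeeded, the package most likely went to another interpreter than the one printed on the first line.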