Question

I am trying to use urllib.urlretrieve to fetch this image.

>>> import urllib
>>> urllib.urlretrieve('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg', 
        path) # path was previously defined

This code successfully saves a file at the given path. However, when I try to open the file, I get:

Could not load image 'imagename.jpg':
    Error interpreting JPEG image file (Not a JPEG file: starts with 0x3c 0x21)

When I run file imagename.jpg in a bash terminal, I get imagefile.jpg: HTML document, ASCII text.

So how do I fetch this image as a JPEG file?

Answer 1

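This happens because the owner of the server hosting that image is deliberately blocking access from Python's urllib. That is why it works with requests, which identifies itself with a different User-Agent. You can also do it in pure Python, but you have to supply an HTTP User-Agent header that makes the request look like it comes from something other than urllib. For example: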

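import urllib2

# Build the request explicitly so a custom header can be attached.
req = urllib2.Request('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg')
# Any User-Agent other than urllib's default gets past the block.
req.add_header('User-Agent', 'Feneric Was Here')
resp = urllib2.urlopen(req)
imgdata = resp.read()
# Write in binary mode; path is the same variable as in the question.
with open(path, 'wb') as outfile:
    outfile.write(imgdata)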

So it is a little more involved, but still not too bad.
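The urllib2 module used above exists only in Python 2. On Python 3 the same idea might look roughly like the sketch below, which uses urllib.request instead. The helper names (build_request, looks_like_jpeg, fetch_jpeg) and the default User-Agent string are made up for illustration and are not part of any library. The signature check is an extra safeguard suggested by the error in the question: 0x3c 0x21 is "<!", the start of an HTML page, whereas a real JPEG starts with 0xff 0xd8.

```python
import urllib.request

# Every JPEG file begins with the two-byte start-of-image marker.
JPEG_MAGIC = b"\xff\xd8"


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-fetcher)"):
    """Return a Request carrying a non-default User-Agent header.

    The default user_agent value is an arbitrary illustrative string.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def looks_like_jpeg(data):
    """Return True if the payload starts with the JPEG signature."""
    return data[:2] == JPEG_MAGIC


def fetch_jpeg(url, path, timeout=30):
    """Download url to path, refusing to save anything that is not a JPEG."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        data = resp.read()
    if not looks_like_jpeg(data):
        # Most likely an HTML block page, as in the question.
        raise ValueError("server did not return a JPEG; first bytes: %r" % data[:16])
    with open(path, "wb") as outfile:
        outfile.write(data)
```

Calling fetch_jpeg(url, path) with the URL and path from the question would then either write the image or raise an error instead of silently saving an HTML page under a .jpg name.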

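Note that the site owner probably did this because some people had gotten abusive. Please don't be one of them! With great power comes great responsibility, and all that.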

Using urlretrieve fetches an image as an HTML page
