urlretrieve seems to corrupt image files

Posted: 2015-11-04 19:54:17

Tags: python image urllib

I'm using urlretrieve to grab images from websites. This works fine, except for one not-so-subtle detail: the files are unreadable. I've tried several sites, but the result is always the same. I wondered whether I should somehow indicate that it is a binary download, but I couldn't find any hint of that in the documentation. I searched the web and found the alternative requests library, but got the same result. Windows Photo Viewer, Paint and Gimp all report that the files are corrupted or unreadable. I'm pretty sure I'm making some silly mistake. Any help would be greatly appreciated!

import os
import urllib
import urlparse

def get_images(url, soup):
    #this makes a list of bs4 element tags
    print 'URL: ', url
    n = 0
    images = [img for img in soup.findAll('img')]

    #compile our unicode list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        n = n + 1
        path = urlparse.urlparse(each).path
        fn = (os.path.split(path)[1]).strip()
        ext = (os.path.splitext(fn)[1]).strip().lower()
        if (fn == '' or ext == ''):
            continue

        fn = os.path.join ("images", fn)

#        print 'From: ', url
        print 'Each> ', each
#        print 'File< ', fn
#        avatar = open(fn, 'wb')
#        avatar.write(requests.get(url).content)
#        avatar.close()
        result = urllib.urlretrieve(url, fn)
        print result

    return n

UPDATE

Jephron pointed me in the right direction: I wasn't combining the page URL with the image path correctly. His solution works even better with urlparse.urljoin(url, each) in place of the os.path.join he used, which on Windows systems can suddenly put backslashes into the URL. Really annoying. I added a test for relative versus absolute image URLs, and the final code is shown below (a short demo of the join difference follows it).

def get_images(url, soup):
    #this makes a list of bs4 element tags
    print ' '
    print 'URL: ', url
    n = 0
    images = [img for img in soup.findAll('img')]

    #compile our unicode list of image links
    image_links = [each.get('src') for each in images]

    for each in image_links:
        path = urlparse.urlparse(each).path
        fn = (os.path.split(path)[1]).strip()
        ext = (os.path.splitext(fn)[1]).strip().lower()
        if (fn == '' or ext == ''):
            continue

        fn = os.path.join ("images", fn)
        if (not (each.startswith ('http:') or each.startswith('https:'))):
            image_link = urlparse.urljoin(url, each)
        else:
            image_link = each

        print 'Found: ', fn

        try:
            urllib.urlretrieve(image_link, fn)
            n = n + 1
        except:
            # skip any image that fails to download
            continue

    return n
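
For reference, here is a quick sketch of why the two join functions behave differently (the URLs are made-up examples):

import os
import urlparse

page = 'http://example.com/gallery/index.html'  # hypothetical page URL
src = 'thumbs/cat.png'                          # hypothetical relative img src

# urljoin resolves the relative src against the page URL:
print urlparse.urljoin(page, src)
# -> http://example.com/gallery/thumbs/cat.png

# os.path.join uses the OS path separator, so on Windows the join
# point becomes a backslash, which has no place in a URL:
print os.path.join(page, src)
# -> http://example.com/gallery/index.html\thumbs/cat.png  (on Windows)

As an aside, urljoin already returns an absolute src unchanged, so the startswith test above is strictly a belt-and-braces check.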

Note, however, that 3 out of 4 .png files are still unreadable. I still have to figure out why, so there may be another hidden obstacle.
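
One way to check what actually landed on disk is to look at the first few bytes of a downloaded file; a real PNG always starts with a fixed 8-byte signature, while an HTML error page starts with markup (the filename below is just an example):

with open('images/example.png', 'rb') as f:  # hypothetical downloaded file
    head = f.read(8)
print repr(head)
# A genuine PNG prints '\x89PNG\r\n\x1a\n'; something like
# '<!DOCTYP' or '<html>' means the server sent HTML instead.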

1 Answer:

Answer 0 (score: 2)

I ran your code and looked at the "images" it downloaded. It turns out that the content of the files you are saving is actually the entire HTML of the website. Try opening one in a text editor and see for yourself.

To fix this, note that the argument you are passing to urlretrieve is actually the URL of the page you scraped. If you join the image URL onto the page URL, you get the correct address:

def get_images(url, soup):
    #this makes a list of bs4 element tags
    print 'URL: ', url
    n = 0
    images = [img for img in soup.findAll('img')]

    #compile our unicode list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        print "maybe an image"
        print each
        n = n + 1
        path = urlparse.urlparse(each).path
        fn = (os.path.split(path)[1]).strip()
        ext = (os.path.splitext(fn)[1]).strip().lower()
        if (fn == '' or ext == ''):
            continue

        fn = os.path.join ("images", fn)

        print 'Each> ', each

        result = urllib.urlretrieve(os.path.join(url, each), fn)
        print result

    return n
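
For completeness, a minimal sketch of how get_images might be driven, assuming bs4 and the standard library are available (the page URL is a placeholder):

import os
import urllib2
from bs4 import BeautifulSoup

url = 'http://example.com/gallery.html'  # placeholder page URL
if not os.path.isdir('images'):
    os.mkdir('images')  # urlretrieve will not create the target directory
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
print 'Downloaded %d images' % get_images(url, soup)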