Question

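So I am building a Python script to download images from a list of URLs. The script works to an extent. I don't want it to download images from URLs that no longer exist. I filter some of them out by checking the status code, but I still get a lot of bad, unwanted images, like this: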

Here is my code:

import os
import requests
import shutil
import random
import urllib.request

def sendRequest(url):
    try:
        page = requests.get(url, stream = True, timeout = 1)

    except Exception:
        print('error exception')
        pass

    else:
        #HERE IS WHERE I DO THE STATUS CODE
        print(page.status_code)
        if (page.status_code == 200):
            return page

    return False

def downloadImage(imageUrl: str, filePath: str):
    img = sendRequest(imageUrl)

    if (img == False):
        return False

    with open(filePath, "wb") as f:
        img.raw.decode_content = True

        try:
            shutil.copyfileobj(img.raw, f)
        except Exception:
            return False

    return True

os.chdir('/Users/nikolasioannou/Desktop')
os.mkdir('folder')

fileURL = 'http://www.image-net.org/api/text/imagenet.synset.geturls?wnid=n04122825'
data = urllib.request.urlopen(fileURL)

output_directory = '/Users/nikolasioannou/Desktop/folder'

line_count = 0

for line in data:
    img_name = str(random.randrange(0, 10000)) + '.jpg'
    image_path = os.path.join(output_directory, img_name)
    # each line read from the response ends with a newline, so strip it from the URL
    downloadImage(line.decode('utf-8').strip(), image_path)
    line_count = line_count + 1
#print(line_count)

Thank you for your time. Any ideas are appreciated.

Regards, Nikolas

Answer 1

You could check the JPEG or PNG header for its magic sequence, which is always a good indicator of a valid image. Take a look at this SO question.

You can also look up file signatures (a.k.a. magic numbers) here. Then you just need to check the first n bytes of response.raw.
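As a rough illustration of that idea, here is a minimal sketch that compares the leading bytes of a payload against a small table of signatures. The names IMAGE_SIGNATURES and detect_image_type are hypothetical helpers introduced for this example, and the table is deliberately short rather than exhaustive.

```python
from typing import Optional

# Leading bytes ("magic numbers") of a few common image formats.
# This table is an illustrative assumption, not a complete list.
IMAGE_SIGNATURES = {
    "jpeg": (b"\xff\xd8\xff",),          # fourth byte varies (E0 = JFIF, E1 = Exif, ...)
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def detect_image_type(data: bytes) -> Optional[str]:
    """Return the name of the format whose signature starts `data`, or None."""
    for name, signatures in IMAGE_SIGNATURES.items():
        # bytes.startswith accepts a tuple of candidate prefixes
        if data.startswith(signatures):
            return name
    return None
```

With something like this, the download step could skip any response for which detect_image_type(response.content) returns None.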

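I modified your sendRequest / download functions a bit; you should be able to hardcode the magic numbers of more valid image file types, not just the JPG one. I finally tested the code and it is working (on my machine): only valid JPG images were saved. Note that I removed the stream = True flag, because the images are pretty small and you don't need a stream. The saving step is also a bit less cryptic now. Take a look: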

def sendRequest(url):
    try:
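        # the timeout value is an arbitrary choice; it keeps a dead host from hanging the script
        page = requests.get(url, timeout=5)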

    except Exception as e:
        print("error:", e)
        return False

    # check status code
    if (page.status_code != 200):
        return False

    return page

def downloadImage(imageUrl: str, filePath: str):
    img = sendRequest(imageUrl)

    if (img == False):
        return False

    # every JPEG starts with FF D8 FF; the fourth byte varies (E0 for JFIF, E1 for Exif)
    if not img.content.startswith(b'\xff\xd8\xff'):
        return False

    with open(filePath, "wb") as f:
        f.write(img.content)

    return True

You could also try opening the image with Pillow and BytesIO

>>> from PIL import Image
>>> from io import BytesIO

>>> i = Image.open(BytesIO(img.content))

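and see whether it throws an error. But the first solution seems more lightweight, and you shouldn't get any false positives there. You could also check for the string "<html>" in img.content and abort if it is found; this is very simple and probably very effective too.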
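The two fallback checks mentioned above could be sketched roughly as follows. Both function names (looks_like_html, is_decodable_image) and the probe_size parameter are assumptions made for this example; the second function needs Pillow installed and imports it lazily, so the HTML check works without it.

```python
from io import BytesIO


def looks_like_html(content: bytes, probe_size: int = 512) -> bool:
    """Heuristic: does the start of the payload look like an HTML page?"""
    # Only the first probe_size bytes are inspected (an arbitrary choice),
    # lower-cased so that <HTML> and <html> are treated alike.
    head = content[:probe_size].lstrip().lower()
    return head.startswith(b"<!doctype html") or b"<html" in head


def is_decodable_image(content: bytes) -> bool:
    """Ask Pillow whether the bytes parse as an image (requires Pillow)."""
    from PIL import Image  # imported here so the HTML check has no dependency

    try:
        with Image.open(BytesIO(content)) as im:
            im.verify()  # integrity check without decoding the full image
    except Exception:
        return False
    return True
```

A downloader might call looks_like_html first, since it is cheap, and fall back to is_decodable_image only for payloads that pass it.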

Check whether an image URL leads to a real image in Python
