Question

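I'm currently writing a Python automation script for a website. There are 50 to 100 images hosted in the cloud, and all of them have the following structure: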

<img style="width:80px;height:60px;"
     src="http://someimagehostingsite.net/somefolder/some_random_url_with_timestamp">

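The URL has no suffix such as .jpg or .png that I could use to get the information directly. I can get the size by downloading the images one by one and checking the size of each file, but I need to automate this by just visiting each URL and reading the file size. Is that possible?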

Answer 1

If you just want the content length of a file from its URL, you can request only the HTTP headers and check the Content-Length field:

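import requests

# Replace with the direct URL of the image you want to measure
url = 'https://commons.wikimedia.org/wiki/File:Leptocorisa_chinensis_(20566589316).jpg'

# HEAD asks for the headers only, so the body is never downloaded
http_response = requests.head(url, allow_redirects=True)

print(f"Size of image {url} = {http_response.headers['Content-Length']} bytes")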

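Note, however, that if the server compresses the image before sending it, the Content-Length field holds the compressed size (the amount of data that will actually be transferred), not the uncompressed image size.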
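
One way to make that caveat visible in code is to read Content-Length together with Content-Encoding. The sketch below is only illustrative: interpret_length is a hypothetical helper name, and it assumes the headers arrive as a simple mapping of header names to string values.

```python
def interpret_length(headers):
    """Return (size_in_bytes_or_None, is_compressed) for a header mapping.

    interpret_length is a hypothetical helper, not part of any library.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    raw = lowered.get("content-length")
    size = int(raw) if raw is not None and raw.strip().isdigit() else None

    # A missing Content-Encoding header means the body is sent as-is.
    encoding = lowered.get("content-encoding", "identity").strip().lower()
    is_compressed = encoding not in ("", "identity")
    return size, is_compressed
```

With requests, the mapping could come from requests.head(url, headers={"Accept-Encoding": "identity"}, allow_redirects=True).headers. The Accept-Encoding: identity request header asks the server not to compress the body, although a server is free to ignore it.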

To do this for every image on a given page, you can use the BeautifulSoup HTML processing library to extract a list of the image URLs on the page and check each file size as follows:

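from time import sleep
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup as Soup

url = 'https://en.wikipedia.org/wiki/Agent_Orange'

# Name the parser explicitly so bs4 does not have to guess one
html = Soup(requests.get(url).text, 'html.parser')

# Resolve each href against the page URL instead of concatenating strings
image_links = [urljoin(url, a['href']) for a in html.find_all('a', {'class': 'image'})]

for img_url in image_links:
    # HEAD fetches only the headers, not the body
    response = requests.head(img_url, allow_redirects=True)
    try:
        print(f"Size of image {img_url} = {response.headers['Content-Length']} bytes")
    except KeyError:
        print(f"Server didn't specify content length in headers for {img_url}")
    sleep(0.5)  # short pause between requests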

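You will have to adapt this to your specific problem, and you may need to pass other parameters to find_all() to narrow the results down to the images you care about, but something along these lines will do what you are trying to do.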
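
As a rough illustration of narrowing the selection to the markup from the question, the sketch below collects the src of every img tag whose inline style contains a given fragment. It uses the standard library's html.parser instead of BeautifulSoup so that it has no dependencies; ImgSrcCollector, extract_image_urls and the style_fragment parameter are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect absolute src URLs of <img> tags whose inline style matches."""

    def __init__(self, base_url, style_fragment=""):
        super().__init__()
        self.base_url = base_url
        self.style_fragment = style_fragment
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        # Ignore spaces so "width: 80px" and "width:80px" both match.
        style = (attributes.get("style") or "").replace(" ", "")
        if src and self.style_fragment in style:
            # Resolve relative src values against the page URL.
            self.urls.append(urljoin(self.base_url, src))


def extract_image_urls(page_html, base_url, style_fragment=""):
    """Return the image URLs found in page_html, in document order."""
    collector = ImgSrcCollector(base_url, style_fragment)
    collector.feed(page_html)
    return collector.urls
```

For the thumbnails in the question, a call such as extract_image_urls(page_html, page_url, "width:80px") would be one possible filter. With BeautifulSoup, html.find_all("img", src=True) is a comparable starting point, with the style check applied to each tag afterwards.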

Answer 2

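You could try sending a HEAD request for each image from the browser (see "HTTP HEAD Request in Javascript/Ajax?"). Whether this works depends on whether the HTTP server supports HEAD properly. I'm also not sure how to read the Content-Length header that way, but it sounds like what you want.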
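
Because HEAD support varies between servers, a script may want a fallback. The sketch below is in Python rather than browser JavaScript and is only one possible approach: remote_size is a hypothetical helper, and its session argument is assumed to behave like requests.Session (head, get, headers, status_code, iter_content, close).

```python
def remote_size(session, url, timeout=10):
    """Return the size of the resource at url in bytes.

    Tries HEAD first and only falls back to GET when HEAD is not
    usable. `session` is assumed to behave like requests.Session.
    """
    head = session.head(url, allow_redirects=True, timeout=timeout)
    length = head.headers.get("Content-Length")
    if head.status_code == 200 and length is not None:
        return int(length)

    # HEAD was rejected or gave no length: open a streaming GET so the
    # body is only read if the headers still do not state a length.
    response = session.get(url, stream=True, timeout=timeout)
    try:
        response.raise_for_status()
        length = response.headers.get("Content-Length")
        if length is not None:
            return int(length)
        # Last resort: count the bytes chunk by chunk without keeping
        # the whole body in memory.
        return sum(len(chunk) for chunk in response.iter_content(chunk_size=8192))
    finally:
        response.close()
```

A call would look like remote_size(requests.Session(), img_url). The last branch does download the image, so it costs as much bandwidth as the manual approach in the question, but it only runs when the server gives no length at all.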

Getting the file size of a specific image hosted on a server
