Question

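I'm downloading an entire directory from a web server. It works fine, but I can't figure out how to get the file size before downloading, so I can check whether the file has been updated on the server. Can this be done the same way as when downloading files from an FTP server?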

import urllib
import re

url = "http://www.someurl.com"

# Download the page locally
f = urllib.urlopen(url)
html = f.read()
f.close()

f = open ("temp.htm", "w")
f.write (html)
f.close()

# List only the .TXT / .ZIP files
fnames = re.findall('^.*<a href="(\w+(?:\.txt|.zip)?)".*$', html, re.MULTILINE)

for fname in fnames:
    print fname, "..."

    f = urllib.urlopen(url + "/" + fname)

    #### Here I want to check the filesize to download or not #### 
    file = f.read()
    f.close()

    f = open (fname, "w")
    f.write (file)
    f.close()

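@Jon: Thanks for the quick answer. It works, but the file size reported by the web server is slightly smaller than the size of the downloaded file.

Examples: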

Local Size  Server Size
 2.223.533  2.115.516
   664.603    662.121

Does it have something to do with CR/LF conversion?

Answer 1

I have reproduced what you are seeing:

import urllib, os
link = "http://python.org"
print "opening url:", link
site = urllib.urlopen(link)
meta = site.info()
print "Content-Length:", meta.getheaders("Content-Length")[0]

f = open("out.txt", "r")
print "File on disk:",len(f.read())
f.close()


f = open("out.txt", "w")
f.write(site.read())
site.close()
f.close()

f = open("out.txt", "r")
print "File on disk after download:",len(f.read())
f.close()

print "os.stat().st_size returns:", os.stat("out.txt").st_size

Output:

opening url: http://python.org
Content-Length: 16535
File on disk: 16535
File on disk after download: 16535
os.stat().st_size returns: 16861

What am I doing wrong here? Is os.stat().st_size not returning the correct size?

Edit: OK, I figured out what the problem was:

import urllib, os
link = "http://python.org"
print "opening url:", link
site = urllib.urlopen(link)
meta = site.info()
print "Content-Length:", meta.getheaders("Content-Length")[0]

f = open("out.txt", "rb")
print "File on disk:",len(f.read())
f.close()


f = open("out.txt", "wb")
f.write(site.read())
site.close()
f.close()

f = open("out.txt", "rb")
print "File on disk after download:",len(f.read())
f.close()

print "os.stat().st_size returns:", os.stat("out.txt").st_size

This outputs:

$ python test.py
opening url: http://python.org
Content-Length: 16535
File on disk: 16535
File on disk after download: 16535
os.stat().st_size returns: 16535

Make sure you open both files in binary mode for reading and writing.

# open for binary write
open(filename, "wb")
# open for binary read
open(filename, "rb")
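The effect of text mode can be reproduced on any platform. The following is a minimal Python 3 sketch, not the answerer's code: the helper name `written_size` is made up for illustration, and `newline="\r\n"` is used only to force the LF to CRLF translation that text mode applies by default on Windows.

```python
import os
import tempfile

def written_size(payload: bytes, binary: bool) -> int:
    """Write payload to a temporary file and return its size on disk in bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        if binary:
            # Binary mode writes the bytes exactly as received.
            with open(path, "wb") as f:
                f.write(payload)
        else:
            # Text mode with CRLF translation: every "\n" becomes "\r\n".
            with open(path, "w", newline="\r\n", encoding="ascii") as f:
                f.write(payload.decode("ascii"))
        return os.stat(path).st_size
    finally:
        os.remove(path)

payload = b"line one\nline two\n"
print("payload bytes:", len(payload))
print("binary write: ", written_size(payload, binary=True))
print("text write:   ", written_size(payload, binary=False))
```

The text-mode file grows by one byte per line ending, which matches the pattern in the question: the local copy is slightly larger than the Content-Length reported by the server.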

Answer 2

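Using the info() method of the object returned by urllib.urlopen(), you can get various information about the retrieved document. Example of grabbing the current Google logo: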

>>> import urllib
>>> d = urllib.urlopen("http://www.google.co.uk/logos/olympics08_opening.gif")
>>> print d.info()

Content-Type: image/gif
Last-Modified: Thu, 07 Aug 2008 16:20:19 GMT  
Expires: Sun, 17 Jan 2038 19:14:07 GMT 
Cache-Control: public 
Date: Fri, 08 Aug 2008 13:40:41 GMT 
Server: gws 
Content-Length: 20172 
Connection: Close

It behaves like a dict, so to get the size of the file you use urllibobject.info()['Content-Length']. Note that the value is returned as a string.

print d.info()['Content-Length']

To get the size of the local file (for comparison), you can use os.stat():

os.stat("/the/local/file.zip").st_size
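Putting the two pieces together, the comparison might look like the sketch below. It is written for Python 3 and is only one possible approach: `needs_download` is a hypothetical helper name, and it accepts any mapping with a `get` method, so a plain dict works as well as a real response's headers.

```python
import os

def needs_download(headers, local_path):
    """Return True when the local copy is missing or its size differs
    from the Content-Length value reported by the server."""
    remote_size = headers.get("Content-Length")
    if remote_size is None:
        # The server did not report a size; download to be safe.
        return True
    if not os.path.exists(local_path):
        return True
    # Header values are strings, so convert before comparing.
    return int(remote_size) != os.stat(local_path).st_size
```

With `urllib.request`, the `headers` attribute of the object returned by `urlopen()` can be passed in directly. To avoid transferring the body just to read the headers, a request built with `urllib.request.Request(url, method="HEAD")` can be used, assuming the server answers HEAD requests. Size alone will not detect an edit that leaves the byte count unchanged.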

Answer 3

The file size is sent as the Content-Length header. Here is how to get it with urllib:

>>> site = urllib.urlopen("http://python.org")
>>> meta = site.info()
>>> print meta.getheaders("Content-Length")
['16535']
>>>

Answer 4

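Also, if the server you are connecting to supports them, look at ETags and the If-Modified-Since and If-None-Match headers.

Using these takes advantage of the web server's caching rules: the server returns a 304 Not Modified status code if the content has not changed.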
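A conditional request could be sketched as follows in Python 3. The function names are hypothetical, and the ETag and Last-Modified values are assumed to have been saved from an earlier response for the same URL. `urllib.request.urlopen()` reports a 304 by raising `HTTPError`, so the sketch catches it and returns `None`.

```python
import urllib.error
import urllib.request

def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying whichever validators are available."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)

def fetch_if_changed(url, etag=None, last_modified=None):
    """Return the body as bytes, or None when the server answers 304."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # unchanged on the server; keep the local copy
        raise
```

This only helps when the server actually sends ETag or Last-Modified headers and honors the conditional headers; otherwise every request returns the full body.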

Answer 5

In Python 3:

>>> import urllib.request
>>> site = urllib.request.urlopen("http://python.org")
>>> print("FileSize: ", site.length)

Answer 6

For Python 3 (tested on 3.5), I would suggest:

from urllib.request import urlopen

with urlopen(file_url) as in_file, open(local_file_address, 'wb') as out_file:
    print(in_file.getheader('Content-Length'))
    out_file.write(in_file.read())

Answer 7

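@PabloG Regarding the local/server file size difference

Here is a high-level, illustrative explanation of why it may occur:

The size on disk sometimes differs from the actual size of the data. It depends on the underlying file system and how it operates on data. As you may have seen in Windows when formatting a flash drive, you are asked to choose a "block/cluster size", which varies [512 B to 8 KB]. When a file is written to disk, it is stored in a "sort-of linked list" of disk blocks. When a block is used to store part of a file, no other file's contents are stored in that same block, so even if the chunk does not fill the whole block, the block is unavailable to other files.

Example: when the file system is divided into 512 B blocks and we need to store a 600 B file, two blocks are occupied. The first block is fully used, while the second holds only 88 B; the remaining (512-88) B are unusable, resulting in a "size on disk" of 1024 B. This is why Windows shows separate values for "file size" and "size on disk".

Note: smaller and larger FS blocks have different pros and cons, so do some research before choosing one for your file system.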
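The arithmetic in that example can be written out as a short sketch. The function name `size_on_disk` is made up for illustration, and the model assumes a simple file system that allocates whole fixed-size blocks per file with no compression or other optimizations.

```python
def size_on_disk(file_size: int, block_size: int) -> int:
    """Round file_size up to a whole number of blocks (sizes in bytes)."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size

file_size, block_size = 600, 512
allocated = size_on_disk(file_size, block_size)
print("blocks used: ", allocated // block_size)
print("size on disk:", allocated)
print("wasted bytes:", allocated - file_size)
```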

Get the size of a file before downloading it in Python
