Question

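I have a task to download gigabytes of data from a website. The data is in the form of .gz files, each 45 MB in size.

The easy way to get the files is to use "wget -r -np -A files url". This downloads the data recursively and mirrors the website. The download rate is very high: 4mb/sec.

But, just to play around, I also used Python to build my own URL parser.

Downloading via Python's urlretrieve is very slow, possibly 4 times as slow as wget. The download rate is 500kb/sec. I use HTMLParser to parse the href tags.

I am not sure why this is happening. Are there any settings for this?

Thanks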

Answer 1

This is probably a unit math error on your part.

Note that 500 KB/s (kilobytes) is equal to 4 Mb/s (megabits).
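
The conversion can be checked with a couple of lines of arithmetic. This is a minimal sketch, assuming decimal prefixes (1 KB = 1000 bytes, 1 Mb = 1,000,000 bits); the function name is made up for illustration.

```python
def kilobytes_to_megabits(kb_per_sec):
    """Convert a rate in kilobytes/s to megabits/s (decimal prefixes assumed)."""
    bits_per_sec = kb_per_sec * 1000 * 8  # 8 bits per byte
    return bits_per_sec / 1_000_000


print(kilobytes_to_megabits(500))  # 4.0
```

So if wget reported megabits and the Python figure was in kilobytes, the two rates would be the same.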

Answer 2

urllib works as fast as wget for me. Try this code. It shows the progress as a percentage, just like wget does.

import sys, urllib

def reporthook(a, b, c):
    # a = blocks transferred so far, b = block size in bytes, c = total size
    # ',' at the end of the line is important!
    print "% 3.1f%% of %d bytes\r" % (min(100, float(a * b) / c * 100), c),
    # you can also use sys.stdout.write
    # sys.stdout.write("\r% 3.1f%% of %d bytes"
    #                  % (min(100, float(a * b) / c * 100), c))
    sys.stdout.flush()

for url in sys.argv[1:]:
    i = url.rfind('/')
    file = url[i+1:]
    print url, "->", file
    urllib.urlretrieve(url, file, reporthook)
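
The snippet above is Python 2. On Python 3 the same function is available as urllib.request.urlretrieve, and the hook receives the same three arguments (block count, block size, total size). What follows is a hedged sketch of a port, not the answerer's code: percent_done and download_with_progress are names introduced here, and the guard for an unknown total size is an addition.

```python
import sys
import urllib.request


def percent_done(block_num, block_size, total_size):
    """Return progress as a percentage, or None when the size is unknown."""
    if total_size <= 0:  # no usable size reported by the server
        return None
    return min(100.0, block_num * block_size / total_size * 100)


def reporthook(block_num, block_size, total_size):
    """Progress callback for urlretrieve; rewrites a single status line."""
    pct = percent_done(block_num, block_size, total_size)
    if pct is None:
        sys.stdout.write("\r%d bytes so far" % (block_num * block_size))
    else:
        sys.stdout.write("\r%5.1f%% of %d bytes" % (pct, total_size))
    sys.stdout.flush()


def download_with_progress(url):
    """Save url under its last path component, reporting progress."""
    filename = url.rsplit("/", 1)[-1] or "index.html"
    print(url, "->", filename)
    urllib.request.urlretrieve(url, filename, reporthook)
    print()  # finish the progress line
    return filename
```

Calling download_with_progress(url) for each entry in sys.argv[1:] reproduces the loop from the original snippet.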

Answer 3

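As for the HTML parsing, the fastest and easiest option you will probably find is lxml. As for the HTTP requests themselves: httplib2 is very easy to use, and could possibly speed up downloads because it supports HTTP 1.1 keep-alive connections and gzip compression. There is also pycURL, which claims to be very fast (but is more difficult to use) and is built on libcurl, but I have never used it.

You could also try to download different files concurrently, but keep in mind that optimizing your download times too aggressively may not be very polite towards the website in question.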
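
Concurrent downloading could look roughly like the sketch below. It is written for Python 3 with only the standard library, which is an assumption (the answer names no specific tool for this); fetch, download_all, and the worker count of 2 are illustrative choices, and the small pool is meant to stay polite to the server.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def fetch(url):
    """Download one URL into the current directory and return the file name."""
    filename = url.rsplit("/", 1)[-1] or "index.html"
    urllib.request.urlretrieve(url, filename)
    return filename


def download_all(urls, fetch_one=fetch, max_workers=2):
    """Run fetch_one over urls using a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input urls.
        return list(pool.map(fetch_one, urls))
```

Threads are a reasonable fit here because the work is dominated by waiting on the network rather than by CPU time.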

Sorry for the lack of hyperlinks, but SO tells me "sorry, new users can only post a maximum of one hyperlink".

Answer 4

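Transfer speeds can easily be misleading. Could you try the following script, which simply downloads the same URL with both wget and urllib.urlretrieve? Run it a few times in case you are behind a proxy that caches the URL on the second attempt.

For small files, wget will take slightly longer because of the external process's startup time, but for larger files that should become irrelevant.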

from time import time
import urllib
import subprocess

target = "http://example.com" # change this to a more useful URL

wget_start = time()

proc = subprocess.Popen(["wget", target])
proc.communicate()

wget_end = time()


url_start = time()
urllib.urlretrieve(target)
url_end = time()

print "wget -> %s" % (wget_end - wget_start)
print "urllib.urlretrieve -> %s"  % (url_end - url_start)

Answer 5

Maybe you can use wget and then inspect the data in Python?

Answer 6

import subprocess

myurl = 'http://some_server/data/'
subprocess.call(["wget", "-r", "-np", "-A", "files", myurl])

Answer 7

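Since Python suggests using urllib2 instead of urllib, I ran a test comparing urllib2.urlopen and wget.

The result is that both take nearly the same time to download the same file. Sometimes urllib2 even performs better.

The advantage of wget lies in its dynamic progress bar, which shows the percentage finished and the current download speed during the transfer.

The file size in my test was 5 MB. I did not use any cache module in Python, and I do not know how wget behaves when downloading large files.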
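
The answer does not include its test code. A rough way to time a urlopen-based download might look like the following sketch, which assumes Python 3, where urllib2.urlopen became urllib.request.urlopen; timed_download and the 64 KiB chunk size are illustrative choices, not the answerer's setup.

```python
import shutil
import time
import urllib.request


def timed_download(url, dest):
    """Stream url to dest in chunks; return (bytes_written, seconds)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, 64 * 1024)  # copy in 64 KiB chunks
        size = out.tell()
    return size, time.perf_counter() - start
```

Dividing the returned byte count by the elapsed seconds gives a throughput figure in bytes per second that can be compared with what wget reports, keeping the units in mind.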

Answer 8

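There really shouldn't be a difference. All urlretrieve does is make a simple HTTP GET request. Have you taken out your data processing code and done a straight throughput comparison of wget vs. pure Python?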

Answer 9

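Please show us some code. I'm pretty sure the problem is in your code and not in urlretrieve.

I've worked with it in the past and never had any speed-related issues.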

Answer 10

You can use wget -k to convert the links in all downloaded pages to relative ones.

wget vs. Python's urlretrieve
