HTML Python requests library - too slow

Time: 2015-07-13 18:50:50

Tags: python python-requests

I am using the Python requests library to fetch the source of a URL and apply a regular expression to extract some data, using the following code:

=IIF(DateDiff("d", First(Fields!OrderDate.Value, "Invoice"), Now())<=0, RunningValue(Fields!LineAmount.Value, Sum, "Invoice"), 0)
=IIF(DateDiff("d", First(Fields!OrderDate.Value, "Invoice"), now())>=1 and DateDiff("d", First(Fields!OrderDate.Value, "Invoice"),Now())<=30, RunningValue(Fields!LineAmount.Value, Sum, "Invoice"), 0)
=IIF(DateDiff("d", First(Fields!OrderDate.Value, "Invoice"), now())>=31 and DateDiff("d", First(Fields!OrderDate.Value, "Invoice"),Now())<=60, RunningValue(Fields!LineAmount.Value, Sum, "Invoice"), 0)
=IIF(DateDiff("d", First(Fields!OrderDate.Value, "Invoice"), now())>=61 and DateDiff("d", First(Fields!OrderDate.Value, "Invoice"),Now())<=90, RunningValue(Fields!LineAmount.Value, Sum, "Invoice"), 0)
=IIF(DateDiff("d", First(Fields!OrderDate.Value, "Invoice"), now())>=91, RunningValue(Fields!LineAmount.Value, Sum, "Invoice"), 0)

This code works, but it is too slow; each request takes more than 5 seconds. Any suggestions for making it faster?

Also, should I add any try/except code for robustness?
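On the robustness point, here is a minimal sketch of what such error handling could look like. It assumes that a failed request should be reported and skipped, and that a 10-second timeout is acceptable. The names `extract`, `fetch_matches`, and `PATTERN` are hypothetical, and the regular expression is the one that appears in the first answer.

```python
import re
import requests

# Pattern taken from the first answer; adjust it to the page being scraped.
PATTERN = re.compile(r'btn btn-primary font-bold">\s*<span>([^<]*)')


def extract(html):
    """Return every captured <span> text found in the HTML source."""
    return PATTERN.findall(html)


def fetch_matches(url, timeout=10):
    """Fetch url and return the matches, or None if the request failed."""
    try:
        # Without a timeout, requests can wait indefinitely on a stalled server.
        page = requests.get(url, timeout=timeout)
        page.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as exc:
        # Base class covering connection errors, timeouts, HTTP errors, bad URLs
        print("request failed for %s: %s" % (url, exc))
        return None
    return extract(page.text)
```

Catching `requests.exceptions.RequestException` rather than a bare `except` keeps programming errors visible while still covering the network failures that requests raises.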

2 answers:

Answer 0: (score: 0)

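I agree with the comments above that profiling is a good way to see what is slowing you down. If it is an option, one obvious way to speed up the code is to parallelize it. Here is a simple suggestion: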

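from multiprocessing.dummy import Pool as Threadpool  # thread-based Pool
import re
import requests

# Placeholder list (not in the original post); replace with your own URLs
urlList = ["https://www.google.com"] * 10


def parallelURL(url):
    print(url)
    page = requests.get(url)
    # Raw string so that \s reaches the regex engine unchanged
    matches = re.findall(r'btn btn-primary font-bold">\s*<span>([^<]*)', page.text)
    for match in matches:
        print(match)

# Tune this number; for I/O-bound requests the best value depends on the machine and the network
pool = Threadpool(6)

pool.map(parallelURL, urlList)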

On my machine, this speeds up 10 requests to Google from 1.9 seconds to 0.3 seconds.
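To check a speedup like this on your own machine, a rough timing sketch along the following lines may help. So that it runs without network access, `fake_fetch` is a hypothetical stand-in that sleeps for 50 ms instead of calling `requests.get`. The helper names and the 50 ms delay are assumptions, not measurements from the post.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def fake_fetch(url):
    """Stand-in for requests.get(url): sleep to imitate network latency."""
    time.sleep(0.05)
    return len(url)


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def run_sequential(urls):
    """Fetch the URLs one after another."""
    return [fake_fetch(u) for u in urls]


def run_threaded(urls, workers=6):
    """Fetch the URLs with a pool of worker threads."""
    with ThreadPool(workers) as pool:
        return pool.map(fake_fetch, urls)


urls = ["https://www.google.com"] * 10
seq_result, seq_time = timed(run_sequential, urls)
par_result, par_time = timed(run_threaded, urls)
print("sequential: %.2fs, threaded: %.2fs" % (seq_time, par_time))
```

Swapping `fake_fetch` for a real request function gives numbers for the actual site. Threads help here because the work is mostly waiting on the network rather than computing.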

Answer 1: (score: 0)

I found that, for larger file downloads, reading the body in chunks is much faster. By default, I believe get(uri, stream=False) uses a chunk size of 1.

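import io
import requests

uri = "https://www.example.com/"  # placeholder (not in the original post)

# Get the HTTP headers; stream=True defers downloading the body
r = requests.get(uri, stream=True)
# Read the body in 1 KB chunks
http_body_buf = io.BytesIO()
for chunk in r.iter_content(chunk_size=1024):
    http_body_buf.write(chunk)
http_body = http_body_buf.getvalue()
http_body_buf.close()

The body arrives as binary data, so the buffer is an io.BytesIO; on Python 2, StringIO.StringIO also works.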
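If the download is only going to be saved to disk, one variation is to write each chunk straight to a file instead of collecting it in memory. This is a sketch under assumptions: the names `write_body` and `download`, the 8 KB chunk size, and the 10-second timeout are illustrative choices, not values from the answer.

```python
import requests


def write_body(response, out, chunk_size=8192):
    """Copy the response body to the binary file object `out`, chunk by chunk."""
    total = 0
    for chunk in response.iter_content(chunk_size=chunk_size):
        out.write(chunk)
        total += len(chunk)
    return total


def download(url, path, timeout=10):
    """Stream url to path and return the number of bytes written."""
    # The with-block releases the connection even if writing fails.
    with requests.get(url, stream=True, timeout=timeout) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            return write_body(r, f)
```

Keeping the copy loop in its own function means it can be exercised with any object that provides `iter_content`, without touching the network.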