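I am using the Python requests library to fetch the source of a URL, and applying a regular expression to extract some data with the following code: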
=IIF(DateDiff("d", First(Fields!OrderDate.Value, "Invoice"), Now())<=0, RunningValue(Fields!LineAmount.Value, Sum, "Invoice"), 0)
=IIF(DateDiff("d", First(Fields!OrderDate.Value, "Invoice"), now())>=1 and DateDiff("d", First(Fields!OrderDate.Value, "Invoice"),Now())<=30, RunningValue(Fields!LineAmount.Value, Sum, "Invoice"), 0)
=IIF(DateDiff("d", First(Fields!OrderDate.Value, "Invoice"), now())>=31 and DateDiff("d", First(Fields!OrderDate.Value, "Invoice"),Now())<=60, RunningValue(Fields!LineAmount.Value, Sum, "Invoice"), 0)
=IIF(DateDiff("d", First(Fields!OrderDate.Value, "Invoice"), now())>=61 and DateDiff("d", First(Fields!OrderDate.Value, "Invoice"),Now())<=90, RunningValue(Fields!LineAmount.Value, Sum, "Invoice"), 0)
=IIF(DateDiff("d", First(Fields!OrderDate.Value, "Invoice"), now())>=91, RunningValue(Fields!LineAmount.Value, Sum, "Invoice"), 0)
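This code works, but it is too slow; each request takes more than 5 seconds. Any suggestions for making it faster?
Also, should I add any try/except code for robustness?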
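Answer 0: (score: 0)
I agree with the comment above that profiling is a good way to see what is slowing you down. If it is an option, one obvious way to speed up the code is to parallelize it. Here is a simple suggestion: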
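from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool, not processes
import re
import requests

def parallelURL(url):
    print(url)
    page = requests.get(url)
    # Raw string so that \s reaches the regex engine unchanged
    matches = re.findall(r'btn btn-primary font-bold">\s*<span>([^<]*)', page.text)
    for match in matches:
        print(match)

# urlList is assumed to be an existing list of URL strings
pool = ThreadPool(6)  # play around with this number; the work is I/O-bound, so more threads than cores can help
pool.map(parallelURL, urlList)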
On my machine, this cuts the time for 10 requests to Google from 1.9 seconds to 0.3 seconds.
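The question also asks about try/except for robustness, which the suggestion above leaves out. The following is only a sketch of one way to combine the thread pool with a timeout and per-URL error capture. The names `extract_matches`, `fetch_text`, `safe_extract` and `extract_all`, the 10-second timeout, and the injectable `fetch` parameter are all assumptions made for illustration; they do not come from the original code.

```python
import re
from multiprocessing.dummy import Pool as ThreadPool

# Same pattern as in the answer above, compiled once and reused.
PATTERN = re.compile(r'btn btn-primary font-bold">\s*<span>([^<]*)')


def extract_matches(html):
    """Return every captured string found in the page source."""
    return PATTERN.findall(html)


def fetch_text(url, timeout=10):
    """Download one page; the 10 s timeout is an assumed value."""
    import requests  # third-party; imported here so the helpers run without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn HTTP 4xx/5xx into an exception
    return response.text


def safe_extract(url, fetch=fetch_text):
    """Return (url, matches, error); error is None on success."""
    try:
        return url, extract_matches(fetch(url)), None
    except Exception as exc:  # in real code, narrow to requests.RequestException
        return url, [], exc


def extract_all(urls, workers=6, fetch=fetch_text):
    """Process all URLs in a thread pool; results keep the input order."""
    with ThreadPool(workers) as pool:
        return pool.map(lambda u: safe_extract(u, fetch), urls)
```

Calling `extract_all(urlList)` would then return one tuple per URL, so a single failed or timed-out request is recorded in its tuple instead of aborting the whole `map` call.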
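Answer 1: (score: 0)
I found that, for larger file downloads, reading the body in chunks is much faster. By default, I think get(uri, stream=False) ends up using a chunk size of 1; 1 is at least the documented default chunk_size of iter_content, so it is worth passing a larger value explicitly.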
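import io
import requests

# Send the request; with stream=True the body is not downloaded up front
# (uri is assumed to be defined by the caller)
r = requests.get(uri, stream=True)
# Read the body in 1 KB chunks
http_body_buf = io.BytesIO()
for chunk in r.iter_content(chunk_size=1024):
    http_body_buf.write(chunk)
http_body = http_body_buf.getvalue()
http_body_buf.close()
iter_content yields bytes, so this version uses io.BytesIO as the buffer; on Python 2 the StringIO module works as well.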
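The chunked read in this answer can also be wrapped in a small helper so that different chunk sizes are easy to try. This is a sketch only: `read_body` and the 64 KB default are assumptions chosen for illustration, and the helper relies on nothing but an `iter_content(chunk_size=...)` method that yields bytes, which is what a streamed requests response provides.

```python
def read_body(response, chunk_size=64 * 1024):
    """Collect a streamed response body into a single bytes object.

    `response` is any object with an iter_content(chunk_size=...) method
    that yields bytes. The 64 KB default is an assumed starting point.
    """
    # b"".join builds the result in one pass instead of growing a buffer.
    return b"".join(response.iter_content(chunk_size=chunk_size))
```

With requests this would be called as `read_body(requests.get(uri, stream=True))`. Timing a few values of `chunk_size` on a representative download is the simplest way to see whether the chunk size matters for a given workload.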