python: urlopen & threading needlessly slow? Is there a faster way?

Date: 2012-06-14 18:31:32

Tags: python multithreading performance http io

I'm writing a program that loads & parses many pages at once & sends data from them to a server. If I run only one page processor at a time, things look reasonably good:

********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.98s (1.60s load html, 0.24s parse, 0.00s on queue, 0.14s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.87s (1.59s load html, 0.25s parse, 0.00s on queue, 0.03s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 2.79s (1.78s load html, 0.28s parse, 0.00s on queue, 0.72s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 2.18s (1.70s load html, 0.34s parse, 0.00s on queue, 0.15s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 1.91s (1.47s load html, 0.21s parse, 0.00s on queue, 0.23s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 1.84s (1.59s load html, 0.22s parse, 0.00s on queue, 0.03s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.90s (1.67s load html, 0.21s parse, 0.00s on queue, 0.02s to process) **********

But when running ~20 at once (each in its own thread), the HTTP traffic gets incredibly slow:

********** Round-trip (with 2 sends/7 loads) for (+0/.0/-0) was total 23.37s (16.39s load html, 0.30s parse, 0.00s on queue, 6.67s to process) **********
********** Round-trip (with 2 sends/5 loads) for (+0/.0/-0) was total 20.99s (14.00s load html, 1.99s parse, 0.00s on queue, 5.00s to process) **********
********** Round-trip (with 4 sends/4 loads) for (+0/.0/-0) was total 17.89s (9.17s load html, 0.30s parse, 0.12s on queue, 8.31s to process) **********
********** Round-trip (with 3 sends/5 loads) for (+0/.0/-0) was total 26.22s (15.34s load html, 1.63s parse, 0.01s on queue, 9.24s to process) **********

The load html bit is the time it takes to read the HTML of the web page I'm processing (resp = self.mech.open(url); resp.read(); resp.close()). The to process bit is the time the round trip from this client to the server that processes the data takes (fp = urllib2.urlopen(...); fp.read(); fp.close()). The X sends/Y loads bit is the number of simultaneous sends to the server and web-page loads that were in flight when I made the request to the server.
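For orientation, here is a rough sketch of what one worker's round trip looks like, based on the snippets above (the function name, SERVER_URL, and the fake parse step are placeholders, not my actual code):

    import time
    import urllib2
    import mechanize

    SERVER_URL = 'http://example.com/process'   # hypothetical endpoint

    def handle_page(url):
        br = mechanize.Browser()

        # "load html": fetch the page's HTML via mechanize
        t0 = time.time()
        resp = br.open(url)
        html = resp.read()
        resp.close()
        load_time = time.time() - t0

        # parsing elided; pretend it yields ~400 bytes to send
        data = html[:400]

        # "to process": round-trip the parsed data to the server via urllib2
        t1 = time.time()
        fp = urllib2.urlopen(SERVER_URL, data)
        fp.read()
        fp.close()
        process_time = time.time() - t1

        return load_time, process_time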

It's the to process bit I'm most concerned about. The actual processing on the server only takes about 0.2s. Only ~400 bytes are sent, so it's not a matter of bandwidth. Interestingly, if, while all of that simultaneous sending/loading is going on, I run a separate program that opens 5 threads and just repeatedly does the to process bit, it is very fast:

1 took 0.04s
1 took 1.41s in total
0 took 0.03s
0 took 1.43s in total
4 took 0.33s
2 took 0.49s
2 took 0.08s
2 took 0.01s
2 took 1.74s in total
3 took 0.62s
4 took 0.40s
3 took 0.31s
4 took 0.33s
3 took 0.05s
3 took 2.18s in total
4 took 0.07s
4 took 2.22s in total

Each to process in this standalone program takes only 0.01s-0.50s, far below the 6-10 seconds in the full version, and it isn't using any fewer sending threads (it uses 5, and the full version is capped at 5).

That is, while the full version is running, a separate instance sending the same (+0/.0/-0) requests of ~400 bytes each takes only about 0.31s per request. So it's not that the machine I'm running on is tapped out; rather, it seems that the multiple simultaneous page loads in other threads are slowing down what should be fast (and, in another program running on the same machine, actually are fast) sends in other threads.

The sending is done with urllib2.urlopen, while the reading is done with mechanize (which ultimately uses a fork of urllib2.urlopen).

Is there a way to make the full program run as fast as this mini standalone version, at least when they're sending the same things? I'm considering writing another program that just receives what to send over a named pipe or similar, so the sends happen in a separate process, but that seems silly somehow. Any suggestions are welcome.
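For what it's worth, the "send from another process" idea doesn't strictly need a named pipe; a multiprocessing.Queue feeding a dedicated sender process would do the same job. A minimal sketch of that variant (SERVER_URL and the payload are placeholders):

    import multiprocessing
    import urllib2

    SERVER_URL = 'http://example.com/process'   # placeholder endpoint

    def sender(queue):
        # Runs in its own process, so sends don't compete with the
        # page-loading/parsing threads for the GIL.
        while True:
            data = queue.get()
            if data is None:          # sentinel: shut down
                break
            fp = urllib2.urlopen(SERVER_URL, data)
            fp.read()
            fp.close()

    if __name__ == '__main__':
        send_queue = multiprocessing.Queue()
        proc = multiprocessing.Process(target=sender, args=(send_queue,))
        proc.start()

        # The main process keeps loading/parsing and just enqueues payloads.
        send_queue.put('field=value')   # placeholder ~400-byte payload

        send_queue.put(None)            # tell the sender to exit
        proc.join()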

Any suggestions on how to make the multiple simultaneous page loads faster (so the times look more like 1-3 seconds instead of 10-20 seconds) are also welcome.


Edit: An additional note: I depend on mechanize's cookie-handling features, so any answer would ideally also offer a way of dealing with that...
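In case the eventual fix involves separate processes, one plausible way to keep mechanize's cookie handling is to persist the cookie jar to a file and reload it in each worker; a sketch assuming an LWP-format cookie file (the path and helper names are arbitrary):

    import cookielib
    import mechanize

    COOKIE_FILE = 'cookies.lwp'   # arbitrary path shared by the workers

    def make_browser():
        # Each worker builds its own Browser but starts from the saved cookies.
        cj = cookielib.LWPCookieJar(COOKIE_FILE)
        try:
            cj.load(ignore_discard=True)
        except IOError:
            pass                  # no cookie file yet
        br = mechanize.Browser()
        br.set_cookiejar(cj)
        return br, cj

    def save_cookies(cj):
        # Call after a page load so other workers pick up any new cookies.
        cj.save(ignore_discard=True)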


Edit: I have the same setup running with a different configuration, where only one page is open at a time while ~10-20 items are added to the queue at once. Those get processed like a knife through butter; for example, here is the tail end of adding a big batch:

********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.17s (1.14s wait, 0.04s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.19s (1.16s wait, 0.03s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.26s (0.80s wait, 0.46s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.35s (0.77s wait, 0.58s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+2/.4/-0) was total 1.44s (0.24s wait, 1.20s to process) **********

(I added the wait time, which is how long the data sat on the queue before being sent.) Note that to process is just as fast as in the standalone program; the problem only appears in the version that is constantly reading and parsing web pages. (Note that the parsing itself takes a lot of CPU.)


Edit: Some preliminary testing suggests I should use a separate process for each web-page load... will post an update once that's up and running.

1 Answer:

Answer 0 (score: 1):

It may be the Global Interpreter Lock (GIL). Have you tried the multiprocessing module (a mostly drop-in replacement for threading, IIRC)?

See also Python code performance decreases with threading
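A minimal sketch of what that swap might look like here: move the load + parse work (the CPU-heavy, GIL-bound part) into a multiprocessing.Pool and keep the small sends in the parent process. The URLs, endpoint, and fake parse step below are placeholders, not the asker's actual code:

    import multiprocessing
    import urllib2
    import mechanize

    def load_and_parse(url):
        # Runs in a worker process, so parsing no longer competes with
        # other threads for the GIL.
        br = mechanize.Browser()
        resp = br.open(url)
        html = resp.read()
        resp.close()
        return html[:400]         # stand-in for the real parser output

    if __name__ == '__main__':
        urls = ['http://example.com/page%d' % i for i in range(20)]  # placeholders
        pool = multiprocessing.Pool(processes=5)
        try:
            for payload in pool.imap_unordered(load_and_parse, urls):
                # Sends stay in the parent process; they're only ~400 bytes each.
                fp = urllib2.urlopen('http://example.com/process', payload)
                fp.read()
                fp.close()
        finally:
            pool.close()
            pool.join()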