I'm writing a program that loads and parses many pages at once and sends the data from them to a server. If I run only one page processor at a time, things go reasonably well:
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.98s (1.60s load html, 0.24s parse, 0.00s on queue, 0.14s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.87s (1.59s load html, 0.25s parse, 0.00s on queue, 0.03s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 2.79s (1.78s load html, 0.28s parse, 0.00s on queue, 0.72s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 2.18s (1.70s load html, 0.34s parse, 0.00s on queue, 0.15s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 1.91s (1.47s load html, 0.21s parse, 0.00s on queue, 0.23s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 1.84s (1.59s load html, 0.22s parse, 0.00s on queue, 0.03s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.90s (1.67s load html, 0.21s parse, 0.00s on queue, 0.02s to process) **********
But if I run ~20 simultaneously (each in its own thread), the HTTP traffic gets incredibly slow:
********** Round-trip (with 2 sends/7 loads) for (+0/.0/-0) was total 23.37s (16.39s load html, 0.30s parse, 0.00s on queue, 6.67s to process) **********
********** Round-trip (with 2 sends/5 loads) for (+0/.0/-0) was total 20.99s (14.00s load html, 1.99s parse, 0.00s on queue, 5.00s to process) **********
********** Round-trip (with 4 sends/4 loads) for (+0/.0/-0) was total 17.89s (9.17s load html, 0.30s parse, 0.12s on queue, 8.31s to process) **********
********** Round-trip (with 3 sends/5 loads) for (+0/.0/-0) was total 26.22s (15.34s load html, 1.63s parse, 0.01s on queue, 9.24s to process) **********
The `load html` bit is the time it takes to read the HTML of the web page I'm processing (from `resp = self.mech.open(url)` through `resp.read(); resp.close()`). The `to process` bit is the time the round-trip from this client to the server that processes it takes (`fp = urllib2.urlopen(...); fp.read(); fp.close()`). The `X sends/Y loads` bit is the number of simultaneous sends to the server, and loads of web pages, that were running when I made the request to the server.
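For reference, the per-phase numbers in the log lines above can be collected with a small helper along these lines (a minimal sketch; the `timings` dict and `timed` helper are hypothetical plumbing, not the question's actual code):

```python
import time

# Hypothetical helper: run one phase of the round-trip and record how long
# it took, mirroring the per-phase breakdown in the log lines above
# (load html, parse, on queue, to process).
timings = {}

def timed(name, fn, *args, **kwargs):
    start = time.time()
    result = fn(*args, **kwargs)
    timings[name] = time.time() - start
    return result

# In the real program the wrapped calls would be things like
#   resp = timed("load html", self.mech.open, url)
# Here a stand-in keeps the sketch self-contained:
html = timed("load html", lambda: "<html>stand-in page</html>")
print("load html took %.2fs" % timings["load html"])
```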
I'm most concerned about the `to process` bit. The actual processing on the server only takes about 0.2s. Only 400 bytes are sent, so it isn't eating much bandwidth. Interestingly, if (while all this simultaneous sending/loading is going on) I run a separate program that opens 5 threads and repeatedly does just the `to process` bit, it goes very fast:
1 took 0.04s
1 took 1.41s in total
0 took 0.03s
0 took 1.43s in total
4 took 0.33s
2 took 0.49s
2 took 0.08s
2 took 0.01s
2 took 1.74s in total
3 took 0.62s
4 took 0.40s
3 took 0.31s
4 took 0.33s
3 took 0.05s
3 took 2.18s in total
4 took 0.07s
4 took 2.22s in total
Each `to process` in this standalone program takes only 0.01s to 0.50s, far below the 6-10 seconds in the full version, and it isn't using any fewer sending threads (it uses 5, and the full version caps out at 5). That is, while the full version is running, a separate version sending the same 400 bytes per `(+0/.0/-0)` request takes only 0.31s per request. So it's not that the machine I'm running on is getting hammered... it seems instead that the multiple simultaneous loads in other threads are slowing down what should be (and, in another program running on the same machine, actually is) fast sending in the other threads.
The sending is done with `urllib2.urlopen`, while the reading is done with mechanize (which ultimately uses a fork of `urllib2.urlopen`).
Is there a way to get the full program to run as fast as this mini standalone version, at least while they're sending the same things? I'm considering writing another program that just takes what to send over a named pipe or something, so the sends happen in a separate process, but that seems silly somehow. Any comments are welcome.
Any suggestions on how to make multiple simultaneous page loads faster (so the times look more like 1-3s instead of 10-20s) would also be welcome.
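The "sends in another process" idea need not involve a named pipe; a `multiprocessing.Queue` feeding a dedicated sender process does the same job with less plumbing. A minimal sketch (the `sender_loop` function and its echo "send" are hypothetical; in the real program the body would be the `urllib2.urlopen` round-trip):

```python
import multiprocessing

def sender_loop(inbox, outbox):
    # Dedicated sender: pull payloads off the inbox queue and "send" each
    # one. In the real program the body would be the round-trip
    #   fp = urllib2.urlopen(server_url, payload); fp.read(); fp.close()
    # Here it just echoes the payload size so the sketch is self-contained.
    while True:
        payload = inbox.get()
        if payload is None:  # sentinel: shut down
            break
        outbox.put(len(payload))

if __name__ == "__main__":
    inbox, outbox = multiprocessing.Queue(), multiprocessing.Queue()
    proc = multiprocessing.Process(target=sender_loop, args=(inbox, outbox))
    proc.start()
    inbox.put(b"x" * 400)  # the ~400-byte payload from the question
    inbox.put(None)
    proc.join()
    print(outbox.get())  # -> 400
```

Because the sender lives in its own process, the CPU-heavy parsing in the main program's threads can no longer starve it.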
Edit: An additional note: I rely on mechanize's cookie-handling features, so any answer would ideally also offer a way to deal with that, as well as...
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.17s (1.14s wait, 0.04s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.19s (1.16s wait, 0.03s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.26s (0.80s wait, 0.46s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.35s (0.77s wait, 0.58s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+2/.4/-0) was total 1.44s (0.24s wait, 1.20s to process) **********
(I've added the `wait` time, which is how long the information sat on the queue before being sent.) Note that `to process` is now as fast as in the standalone program. The problem only appears when web pages are being constantly read and parsed. (Note that the parsing itself takes a lot of CPU.)
Edit: Some preliminary testing suggests I should use a separate process for each web page load... will post an update once that's up and running.
Answer 0 (score: 1)
It could be the Global Interpreter Lock (GIL). Have you tried the multiprocessing module (mostly a drop-in replacement for threading, IIRC)?
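For completeness, moving the CPU-heavy parse into worker processes with `multiprocessing.Pool` looks roughly like this (a sketch; `parse` here is a hypothetical stand-in for the real mechanize parsing, which is the part the GIL serializes across threads):

```python
import multiprocessing

def parse(html):
    # Stand-in for the CPU-heavy parse step. Under threading this work
    # serializes on the GIL; in a Pool each worker is a separate process
    # with its own interpreter and its own GIL, so parses run in parallel.
    return len(html.split())

if __name__ == "__main__":
    pages = ["one two three", "four five", "six"]
    pool = multiprocessing.Pool(processes=4)
    try:
        print(pool.map(parse, pages))  # -> [3, 2, 1]
    finally:
        pool.close()
        pool.join()
```

One caveat for this question specifically: mechanize's cookie state does not automatically cross process boundaries, so the `CookieJar` would need to be shared or serialized between the loader processes, or the loads for one session kept in a single process.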