Threading slows down response time - python

Question

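I'm currently writing a Python program that checks whether proxies respond and measures how long they take. The URL I fetch is a public API (ipify.org) that allows millions of requests per second, so it should not be the bottleneck. Testing hundreds or even thousands of proxies one after another is slow when I set timeout = 15 seconds (e.g. 100 * 15 s = 25 minutes), so I introduced threading into my program. I then see the following behaviour:

When I start 256 threads to process a list of 5000 proxies, of which about 10% respond, the response times increase...

When I start only 16 threads, the response times vary, i.e. proxies later in the list sometimes respond faster than ones tested before them (which is how it should be).

I'm more or less a networking beginner, so my question is: what is the limit on threads/requests per second that I can run without distorting the measurements?

The code I use: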

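from collections import deque
from threading import Thread
import time

q = deque()        # proxies to test; filled elsewhere (not shown in the question)
proxy_list = []    # collected results

def consumer(id):
    while True:
        try:
            # popleft() is atomic; catching IndexError avoids the race between
            # a separate len(q) check and the pop when many threads drain q
            proxy = q.popleft()
        except IndexError:
            break

        # Give them a different and only small overhead to avoid simultaneous tcp/ip bombing... (maybe ??)
        time.sleep(id*0.01)

        s_t = time.time()
        state = check_proxy(proxy)
        response_time = time.time()-s_t

        proxy_list.append({
            'proxy_ip': proxy,
            'working': state[0],
            'resp_time': response_time if state[0] else None
        })

threads = []

# 256 Threads
for i in range(256):
    t = Thread(target=consumer, args=(i,))
    t.daemon = True
    t.start()
    threads.append(t)

# Wait until every worker has drained the queue
for thr in threads:
    thr.join()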

The check_proxy function:

import requests

def check_proxy(proxy, conn_type='http', site='http://api.ipify.org', timeout=15):
    # Format to i.e. { "http": "http://183.207.232.119:8080" }
    proxy_req = {conn_type: "%s://%s" % (conn_type, proxy.rstrip())}

    try:
        r = requests.get(site, proxies=proxy_req, timeout=timeout)
        return True, r
    except requests.exceptions.RequestException as e:
        # Any connection error, timeout or bad response counts as "not working"
        return False, e

Test results with 1000 threads and requests:

[758 rows x 3 columns]
                 proxy_ip working  resp_time
26      212.66.42.98:8080    True   1.417061
60     50.97.212.199:3128    True   2.986519
62      23.88.238.46:8081    True   2.002400
63     183.207.229.202:80    True   2.452403
64     183.207.229.194:80    True   2.283683
65     183.207.229.195:80    True   2.501426
66       60.194.100.51:80    True   2.108991
67    83.222.221.137:8080    True   3.075372
68        37.239.46.26:80    True   2.776244
69       80.94.114.197:80    True   1.707185
71     41.75.201.146:8080    True   3.287514
72     42.202.146.58:8080    True   3.874238
75     222.45.196.19:8118    True   3.375033
76     120.202.249.230:80    True   2.778418
77   222.124.198.136:3129    True   2.638542
78       61.184.192.42:80    True   3.474871
79   101.251.238.123:8080    True   2.216384
80      222.87.129.218:80    True   2.541614
81      113.6.252.139:808    True   4.340471
82      218.240.156.82:80    True   3.737869
83       221.176.14.72:80    True   2.408369
84      58.253.238.242:80    True   4.351352
86    219.239.236.49:8888    True   4.693788
87      222.88.236.236:83    True   5.213140
88        119.6.144.73:82    True   3.002683
..                    ...     ...        ...
256     36.85.88.179:8080    True  10.218517
257       117.21.192.9:80    True  10.322229
258     120.193.146.95:83    True   6.408998
259    91.241.18.129:3129    True   7.596714
260    58.213.19.134:2311    True   6.430531
261    27.131.190.66:8080    True   8.047689
262     222.88.236.236:82    True   8.649196
263       119.6.144.73:83    True   8.205048
265     176.31.138.187:80    True  11.444282
266   195.88.192.144:8080    True   6.716996
267    91.188.39.232:8888    True   7.986101
268    202.95.149.62:8080    True  12.453279
269     121.31.5.188:8080    True   6.956209
271      5.53.16.183:8080    True  10.354440
272    37.187.101.28:3128    True  10.922564
273    60.207.63.124:8118    True   9.908007
274   223.195.87.101:8081    True  13.230916
275   89.251.103.130:8080    True  13.350009
276      121.14.138.56:81    True  12.367794
277    118.244.213.6:3128    True   9.533521
278  218.92.227.170:13669    True  12.410708
280       212.68.51.58:80    True  10.599926
446  190.121.148.229:8080    True  15.064356
450  220.132.214.103:9064    True  17.016748
451  164.138.237.251:8080    True  16.171984
454   222.124.28.188:8080    True  15.233777
455     62.176.13.22:8088    True  17.180487
456      82.146.44.39:443    True  15.448998
755     85.9.209.244:8080    True  26.002548
757    201.86.94.166:8080    True  25.771388

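Proxies that are checked later clearly have longer response times. I tried shuffling the queue at the start to verify that the proxies further down the list are not simply slower. They are not, and the results shown here are reproducible.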

Answer 1

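More processes for more work

If you only have one process, you only get one slice of the CPU. That slice is divided among 256 threads, which can cause a lot of context switching.

Use more processes to get more slices (there is a nice multiprocessing module)
Use fewer threads
Your check_proxy implementation will be the bottleneck (is it based on the socket select function or on some blocking implementation?)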
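As a rough illustration of the "use fewer threads" suggestion, here is a minimal sketch that caps the number of workers with concurrent.futures.ThreadPoolExecutor and times only the check call itself. The names timed_check, check_all and fake_check are hypothetical and not from the question; fake_check is a stand-in so that the sketch runs without touching the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_check(proxy, check):
    """Run check(proxy) and time only that call."""
    start = time.perf_counter()
    ok = check(proxy)
    elapsed = time.perf_counter() - start
    return {"proxy_ip": proxy, "working": ok,
            "resp_time": elapsed if ok else None}


def check_all(proxies, check, max_workers=16):
    """Check every proxy with at most max_workers threads running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input list
        return list(pool.map(lambda p: timed_check(p, check), proxies))


def fake_check(proxy):
    """Hypothetical stand-in for check_proxy: no network access."""
    time.sleep(0.01)                 # pretend the request took a moment
    return not proxy.endswith(":0")  # treat port 0 as a dead proxy


demo = check_all(["10.0.0.1:8080", "10.0.0.2:0"], fake_check, max_workers=2)
for row in demo:
    print(row)
```

With the question's own function you would pass something like `lambda p: check_proxy(p)[0]` as `check`. Which value of max_workers keeps the timings undistorted depends on the machine and the network, so treat 16 as a starting point to experiment with, not as a recommendation.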

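With this many threads, and assuming you are on a regular desktop machine (most have 8 cores these days?), that is a lot of context switching. Using the requests library hides a lot of the boilerplate you would otherwise need, but you may not be using connection pooling properly.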

Low-level implementation

Use select.select() for more efficient I/O handling; it also works with sockets, through their socket.fileno().
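To make the select.select() idea concrete, here is a small self-contained sketch. It uses socket.socketpair() instead of real proxy connections, so it runs without network access; wait_readable is a hypothetical helper name. The point is that one thread can wait on many sockets at once and is told which of them have data.

```python
import select
import socket


def wait_readable(socks, timeout):
    """Return the sockets from socks that have data ready within timeout seconds."""
    # select() accepts socket objects directly because they provide fileno()
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable


# Two connected pairs; data is sent on only one of them.
a, b = socket.socketpair()
c, d = socket.socketpair()
b.sendall(b"ping")

ready = wait_readable([a, c], timeout=1.0)  # only `a` has data waiting
data = ready[0].recv(4)
print(data)

for s in (a, b, c, d):
    s.close()
```

In a proxy checker the same call would be given the sockets of all in-flight connections, and the response time would be taken when a socket first shows up as readable.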

`requests` uses blocking IO

Here is the documentation: http://docs.python-requests.org/en/latest/user/advanced/#blocking-or-non-blocking

By default you are using blocking IO. See the documentation for alternatives.
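For comparison, here is a hedged sketch of what non-blocking I/O can look like using only the standard library (selectors plus non-blocking sockets), not one of the alternatives listed in the requests documentation. It times only the TCP connect to each address, not a full HTTP request through the proxy, and time_connects is a hypothetical name. The demo connects to a throwaway local listener so that no real proxy is contacted.

```python
import errno
import selectors
import socket
import time


def time_connects(addresses, timeout=15.0):
    """Time TCP connects to many (host, port) pairs from a single thread.

    Returns {address: seconds, or None if the connect failed or timed out}.
    """
    sel = selectors.DefaultSelector()
    results = {addr: None for addr in addresses}
    for addr in addresses:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        start = time.perf_counter()
        rc = sock.connect_ex(addr)  # returns at once instead of blocking
        if rc in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
            sel.register(sock, selectors.EVENT_WRITE, (addr, start))
        else:
            sock.close()  # immediate failure, e.g. unreachable network

    deadline = time.perf_counter() + timeout
    while sel.get_map():
        remaining = deadline - time.perf_counter()
        if remaining <= 0:
            break
        for key, _ in sel.select(remaining):
            addr, start = key.data
            # A socket becomes writable once the connect has finished either way;
            # SO_ERROR tells whether it succeeded (0) or failed.
            if key.fileobj.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0:
                results[addr] = time.perf_counter() - start
            sel.unregister(key.fileobj)
            key.fileobj.close()

    for key in list(sel.get_map().values()):  # connects that timed out
        sel.unregister(key.fileobj)
        key.fileobj.close()
    sel.close()
    return results


# Demo against a local listener (an assumption made for the example only).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen()
target = listener.getsockname()
timings = time_connects([target], timeout=2.0)
listener.close()
print(timings)
```

Because all connects are started before any of them is waited on, the measured time for one address does not include time spent waiting for the others, which is the effect the blocking version with many threads struggles to avoid.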
