So I have this code (from BeautifulSoup : how to show the inside of a div that won't show?), but I was wondering if anyone knows how to speed up processing of the results?
It takes the glossary entries from a website and uses them to create text files, but since I will be doing the same thing across multiple websites in several languages, it is a bit too slow right now.
If anyone has any ideas or insights, I would be glad to read them!
import requests
import json

# the key endpoint appears to return a JSON-encoded string, hence the second decode
r = requests.get('http://winevibe.com/wp-json/glossary/key/?l=en').json()
data = json.loads(r)
result = [(item['key'], item['id']) for item in data]

text = []
for item in result:
    try:
        r = requests.get(
            f"http://winevibe.com/wp-json/glossary/text/?id={item[1]}").json()
        data = json.loads(r)
        print(f"Getting Text For: {item[0]}")
        text.append(data[0]['text'])
    except KeyboardInterrupt:
        print('Good Bye')
        break

with open('result.txt', 'w+') as f:
    for a, b in zip(result, text):
        line = ', '.join([a[0], b.replace('\n', '')]) + '\n'
        f.write(line)
Answer 0 (score: 1)
If you are looking for an easy way to improve performance without much overhead, the threading library is easy to get started with. Here is a usage example (although not a very practical one):
import time
import threading
import requests as r

# store the response into the results list; do any altering of the response here
def get_url_content(url, idx, results):
    results[idx] = str(r.get(url).content)

urls = ['https://tasty.co/compilation/10-supreme-lasagna-recipes' for i in range(1000)]
results = [None for ele in urls]
num_threads = 20
start = time.time()
threads = []
i = 0
while len(urls) > 0:
    # take the next batch of at most num_threads urls
    if len(urls) > num_threads:
        url_sub_li = urls[:num_threads]
        urls = urls[num_threads:]
    else:
        url_sub_li = urls
        urls = []
    # create a thread for each url to scrape
    for url in url_sub_li:
        t = threading.Thread(target=get_url_content, args=(url, i, results))
        threads.append(t)
        i += 1
    # start each thread
    for t in threads:
        t.start()
    # wait for each thread to finish before starting the next batch
    for t in threads:
        t.join()
    threads = []
Here are the results as num_threads was incremented from 5 to 95 in steps of 5:
5 threads took 15.603618860244751 seconds
10 threads took 12.467495679855347 seconds
15 threads took 12.416464805603027 seconds
20 threads took 12.120754957199097 seconds
25 threads took 11.872958421707153 seconds
30 threads took 11.743015766143799 seconds
35 threads took 11.87484860420227 seconds
40 threads took 11.65029239654541 seconds
45 threads took 11.6738121509552 seconds
50 threads took 11.400196313858032 seconds
55 threads took 11.399579286575317 seconds
60 threads took 11.302385807037354 seconds
65 threads took 11.301892280578613 seconds
70 threads took 11.088538885116577 seconds
75 threads took 11.60099172592163 seconds
80 threads took 11.280904531478882 seconds
85 threads took 11.361995935440063 seconds
90 threads took 11.376339435577393 seconds
95 threads took 11.090314388275146 seconds
If the same program is run serially:
urls = ['https://tasty.co/compilation/10-supreme-lasagna-recipes' for i in range(1000)]
results = [str(r.get(url).content) for url in urls]
The time was: 51.39667201042175 seconds
Here the threading library improved performance roughly 5x, and it is easy to integrate into your code. As mentioned in the comments, there are other libraries that can offer better performance, but this one is very useful for a simple integration.
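A minimal sketch of the same batching idea using the standard library's concurrent.futures.ThreadPoolExecutor, which manages the worker pool and batching for you. Note that fetch_text below is a hypothetical stand-in that simulates network latency with time.sleep instead of hitting the real glossary endpoint, so the sketch runs offline:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_text(entry_id):
    # stand-in for the real call, e.g.
    # requests.get(f"http://winevibe.com/wp-json/glossary/text/?id={entry_id}").json()
    time.sleep(0.05)  # simulate network latency
    return f"text-{entry_id}"

ids = list(range(40))

start = time.time()
with ThreadPoolExecutor(max_workers=20) as pool:
    # Executor.map preserves input order, so results line up with ids
    texts = list(pool.map(fetch_text, ids))
elapsed = time.time() - start

print(f"fetched {len(texts)} entries in {elapsed:.2f}s")
```

With 20 workers, the 40 simulated requests finish in roughly two batches (about 0.1 s) instead of the roughly 2 s a serial loop would take, and because pool.map keeps results in input order, a zip of keys with fetched texts (as in the question's code) still lines up correctly.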