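I'm developing a web scraping tool and I'm using concurrent.futures. I'd like to know whether there is a general rule for how many threads to use. Currently I have it set to 10, but I've noticed that when I raise the number of threads above that, more data values go missing.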
import concurrent.futures
import time
from concurrent.futures import ThreadPoolExecutor

# loadit, load_url, writeit and outName are defined elsewhere in my script
URLs = loadit()
start = time.time()
with ThreadPoolExecutor(max_workers=10) as executor:
    # start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url): url for url in URLs}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
            print(data.values())
            # scraped_data = restaurant_parse(link)
            # time.sleep(random.randrange(3, 5))
            writeit(outName, data.values())
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
end = time.time()
print(end - start)
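To narrow down where the values go missing, something like the sketch below could run the same URL list at several pool sizes and count how many fetches succeed or raise. This is only an illustration: `fake_load_url`, `run_once`, and the `example.invalid` URLs are hypothetical stand-ins I made up, not my real `load_url` or data, and the simulated failure is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_load_url(url):
    """Hypothetical stand-in for load_url: sleeps briefly, fails on some URLs."""
    time.sleep(0.01)
    if url.endswith("7"):
        raise RuntimeError("simulated failure")
    return {"url": url}


def run_once(urls, max_workers, fetch=fake_load_url):
    """Fetch every URL with the given pool size; return (ok, failed, seconds)."""
    ok, failed = [], []
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                ok.append(future.result())
            except Exception:
                failed.append(url)  # keep the URL so it can be retried later
    return ok, failed, time.perf_counter() - start


if __name__ == "__main__":
    urls = ["http://example.invalid/page/%d" % i for i in range(20)]
    for workers in (2, 5, 10, 20):
        ok, failed, seconds = run_once(urls, workers)
        print(workers, len(ok), len(failed), round(seconds, 2))
```

If the failed count grows with the pool size when pointed at the real site, that would suggest the loss comes from the requests themselves (timeouts, rate limiting) rather than from the thread pool.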
Running lscpu on my Ubuntu Linux box shows:
ubuntu-dev@ubuntu:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 3
On-line CPU(s) list: 0-2
Thread(s) per core: 1
Core(s) per socket: 3
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 61
Model name: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Stepping: 4
CPU MHz: 3100.000
BogoMIPS: 6200.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-2
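For comparison, as far as I understand the documentation, since Python 3.8 `ThreadPoolExecutor` uses `min(32, os.cpu_count() + 4)` when `max_workers` is left out. The small sketch below just evaluates that formula; `default_max_workers` is a helper name I made up for illustration, not part of the library.

```python
import os


def default_max_workers(cpu_count):
    """Mirror the documented default pool size: min(32, cpu_count + 4)."""
    return min(32, cpu_count + 4)


# The VM above reports 3 CPUs.
print(default_max_workers(3))  # prints 7

# Same formula for whatever machine this runs on.
print(default_max_workers(os.cpu_count() or 1))
```

So on this machine the default would be lower than the 10 I am using, although I don't know whether that default is a sensible guide for I/O-bound scraping.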
Thanks!