Question

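I'm developing a web scraping tool and I'm using concurrent.futures. Is there a general rule for how many threads to use? I currently have it set to 10, but I've noticed that when I raise the number of threads beyond that, I lose more data values.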

import concurrent.futures
import time
from concurrent.futures import ThreadPoolExecutor

# loadit, load_url, writeit and outName are defined elsewhere in my script
URLs = loadit()
start = time.time()
with ThreadPoolExecutor(max_workers=10) as executor:
    # start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url): url for url in URLs}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
            print(data.values())
            # scraped_data = restaurant_parse(link)
            # time.sleep(random.randrange(3, 5))
            writeit(outName, data.values())
        except Exception as exc:
            # a failed URL is only printed here, so its data is never written
            print('%r generated an exception: %s' % (url, exc))
end = time.time()
print(end - start)
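
To narrow down where the values go missing, the same loop could be wrapped so that failed URLs are collected instead of only printed, and then run with several pool sizes. This is only a rough sketch: `run_with_workers`, `sweep` and the `fetch` argument are hypothetical names (in the script above, `fetch` would be `load_url`), and it assumes the lost values come from requests that raise an exception.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_workers(urls, fetch, max_workers):
    """Fetch every URL with a pool of the given size.

    Returns (results, failures), both dicts keyed by URL, so a request
    that raises is recorded rather than silently dropped.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                failures[url] = exc
    return results, failures


def sweep(urls, fetch, worker_counts=(1, 5, 10, 20)):
    """Run the same URL list once per pool size.

    Returns a list of (workers, successes, failures, seconds) tuples.
    """
    report = []
    for n in worker_counts:
        start = time.perf_counter()
        results, failures = run_with_workers(urls, fetch, n)
        elapsed = time.perf_counter() - start
        report.append((n, len(results), len(failures), elapsed))
    return report
```

Calling `sweep(URLs, load_url)` would then show, for each pool size, how many URLs succeeded and how many failed. If the failure count rises with the worker count, the URLs in `failures` could be retried with a smaller pool.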

lscpu on the Ubuntu Linux box shows:

ubuntu-dev@ubuntu:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                3
On-line CPU(s) list:   0-2
Thread(s) per core:    1
Core(s) per socket:    3
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 61
Model name:            Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Stepping:              4
CPU MHz:               3100.000
BogoMIPS:              6200.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K
NUMA node0 CPU(s):     0-2

Thanks!

concurrent.futures and the right number of threads to use

0 answers: