I have a list of 30 million strings, and I want to run a DNS query against all of them using Python. I don't understand how this operation can get memory-intensive. I assumed that each thread would exit after completing its job, and there is also a 1-second timeout ({'dns_request_timeout': 1}).
Here is a snapshot of the machine's resource usage while the script is running:
My code is as follows:
# -*- coding: utf-8 -*-
import json

import dns.resolver
import concurrent.futures
from pprint import pprint

bucket = json.load(open('30_million_strings.json', 'r'))


def _dns_query(target, **kwargs):
    global bucket
    resolv = dns.resolver.Resolver()
    resolv.timeout = kwargs['function']['dns_request_timeout']
    try:
        resolv.query(target + '.com', kwargs['function']['query_type'])
        with open('out.txt', 'a') as f:
            f.write(target + '\n')
    except Exception:
        pass


def run(**kwargs):
    global bucket
    temp_locals = locals()
    pprint({k: v for k, v in temp_locals.items()})
    with concurrent.futures.ThreadPoolExecutor(max_workers=kwargs['concurrency']['threads']) as executor:
        future_to_element = dict()
        for element in bucket:
            future = executor.submit(kwargs['function']['name'], element, **kwargs)
            future_to_element[future] = element
        for future in concurrent.futures.as_completed(future_to_element):
            result = future_to_element[future]


run(function={'name': _dns_query, 'dns_request_timeout': 1, 'query_type': 'MX'},
    concurrency={'threads': 15})
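The memory growth likely comes from submitting all 30 million tasks up front: every `Future` (plus its bound arguments) stays alive both in the executor's internal work queue and in `future_to_element` until the whole run finishes. A minimal sketch of one common workaround is to feed the executor in bounded batches with `itertools.islice`, so only a small number of futures exist at any moment (the `lookup` worker below is a hypothetical stand-in for the real DNS query):

```python
import concurrent.futures
from itertools import islice


def lookup(name):
    # Hypothetical stand-in for the real DNS query.
    return name.upper()


def run_batched(names, workers=15, batch_size=1000):
    it = iter(names)
    done = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        while True:
            # Pull at most `batch_size` items; an empty batch means we're finished.
            batch = list(islice(it, batch_size))
            if not batch:
                break
            # Only `batch_size` futures are alive at any moment.
            futures = [executor.submit(lookup, n) for n in batch]
            for f in concurrent.futures.as_completed(futures):
                done.append(f.result())
    return done


print(len(run_batched(['a', 'b', 'c'])))
```

Because `names` is consumed through an iterator, this also works when the input is streamed from disk instead of loaded as one giant list.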
Answer 0 (score: 0)
Try this:
import json
import concurrent.futures

import dns.resolver


def sure_ok(future):
    try:
        with open('out.txt', 'a') as f:
            f.write(str(future.result()[0]) + '\n')
    except Exception:
        pass


with concurrent.futures.ThreadPoolExecutor(max_workers=2500) as executor:
    for element in json.load(open('30_million_strings.json', 'r')):
        resolv = dns.resolver.Resolver()
        resolv.timeout = 1
        future = executor.submit(resolv.query, element + '.com', 'MX')
        future.add_done_callback(sure_ok)
Remove `global bucket`, since it is redundant and not needed.
Also drop the dictionary holding references to the 30+ million futures, which is likewise unnecessary.
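The callback style above still queues every task inside the executor at once. One way to also cap the number of in-flight tasks is a semaphore that the submitter acquires and the done-callback releases; here is a sketch under the same assumptions (`lookup` is a placeholder for the real `resolv.query` call, and `MAX_IN_FLIGHT` is an illustrative limit):

```python
import concurrent.futures
import threading

MAX_IN_FLIGHT = 100
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)
results = []


def lookup(name):
    # Placeholder for the real DNS query.
    return name + '.com'


def release_and_record(future):
    # Free the slot first so the submitting loop can continue.
    slots.release()
    try:
        results.append(future.result())
    except Exception:
        pass


with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    for element in ['a', 'b', 'c']:
        slots.acquire()  # blocks once MAX_IN_FLIGHT tasks are pending
        future = executor.submit(lookup, element)
        future.add_done_callback(release_and_record)

print(sorted(results))
```

With this pattern neither a futures dictionary nor the full input list needs to stay in memory: at most `MAX_IN_FLIGHT` futures exist at a time, and completed results are handled immediately in the callback.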
You may also not be using a recent enough version of
concurrent.futures: