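I hope I can be specific enough. I wrote a parser for data from a specific device. I also have archives that can be used to populate a new system or to test the parser. Right now I am testing how to speed up the whole process with multiprocessing, and I ended up with multiprocessing.Pool.
So I have a worker wrapper like this: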
    def worker(packet):
        packet_object = None
        try:
            packet_object = MyParser(packet).parse()
        except Exception as e:
            # *** more code here ***
            pass  # placeholder so the elided block stays syntactically valid
        return packet_object
Then I have a data generator:
    import glob
    import lzma
    import re
    from binascii import unhexlify
    from os import listdir

    def get_data():
        # archive_dir, mask, regex and regex_internal are defined elsewhere
        for company_dir in listdir(archive_dir):
            # 'self' only exists in the method version, get_data(self), shown further down
            self.current_dir = company_dir
            for dir_date in listdir(archive_dir + company_dir + "/"):
                archive_files = glob.glob(archive_dir + company_dir + "/" + dir_date + "/" + mask)
                for file in archive_files:
                    file = file.replace("\\", "/")
                    with lzma.open(file) as f:
                        matches = re.finditer(regex, str(f.read()), re.MULTILINE)
                        for match in matches:
                            # walk every capture group of the outer match
                            for group_num in range(1, len(match.groups()) + 1):
                                out = "{group}".format(group=match.group(group_num))
                                for lines in out.split('),('):
                                    for m in re.finditer(regex_internal, lines):
                                        try:
                                            packet = m.group('packet')
                                        except IndexError:
                                            print('regex error packet not found')
                                            continue  # no packet to yield for this match
                                        yield unhexlify(bytes(packet, 'ascii'))
Then I can run the parallelized code like this:
    from multiprocessing import Pool

    data = get_data()
    p = Pool()
    for result in p.imap(worker, data):
        # look at the result here, store to db, whatever
        pass
Since I cannot see inside the generator or the worker, I decided to wrap everything in a class to introduce some counters, statistics, etc.:
    from datetime import datetime as dt  # assumed import; the post does not show where dt comes from

    class PoolWorker:
        def __init__(self):
            self.start = dt.now()
            self.total_packets = 0
            self.error_packets = 0
            self.total_receipts = 0
            self.error_critical = 0
            self.total_files = 0
            self.db_size = None
            self.current_file = None
            self.current_dir = None

        def worker(self, packet):
            ...  # code above

        def get_data(self):
            ...  # code above
Then I run the code like this:
    if __name__ == '__main__':
        pw = PoolWorker()
        p = Pool()
        directory = None
        for res in p.imap_unordered(pw.worker, pw.get_data()):
            if directory is None:
                directory = pw.current_dir
            elif directory != pw.current_dir:
                # print statistics whenever the generator moves to a new directory
                directory = pw.current_dir
                pw.get_stats()
        pw.get_stats()
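My problem is that the packet counters never move from 0.
I think I know why this happens (there are probably many (child) instances of the PoolWorker class), but I can't figure out how to make these counters work.
I tried adding the following code to the class, but it did not help: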
        def __getstate__(self):
            return self.__dict__.copy()

        def __setstate__(self, state):  # renamed from 'dict' to avoid shadowing the builtin
            self.__dict__ = state
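A minimal sketch of what I suspect is going on, with no real processes involved. As far as I understand, Pool pickles the callable together with each task, and pickling a bound method such as pw.worker also pickles the instance it is bound to. The class CountingWorker and the helper call_like_pool are hypothetical names used only for this illustration; they stand in for PoolWorker and for the round trip that Pool performs.

```python
import pickle


class CountingWorker:
    """Hypothetical stand-in for PoolWorker with a single counter."""

    def __init__(self):
        self.total_packets = 0

    def worker(self, packet):
        # Mutates whichever instance this method happens to be bound to.
        self.total_packets += 1
        return len(packet)


def call_like_pool(bound_method, packet):
    """Round-trip the bound method through pickle before calling it.

    This only imitates the serialization step; it is an assumption about
    what matters here, not a copy of Pool's internals.
    """
    copied_method = pickle.loads(pickle.dumps(bound_method))
    return copied_method(packet), copied_method.__self__


if __name__ == "__main__":
    pw = CountingWorker()
    result, copy = call_like_pool(pw.worker, b"\x01\x02")
    # The copy was incremented; the original instance was not.
    print(result, copy.total_packets, pw.total_packets)  # prints: 2 1 0
```

If this is right, __getstate__ and __setstate__ only control how the copy is built. They cannot send the copy's changes back to the original object in the parent process.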
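This code has to parse roughly 1+ billion archived packets, so any speed-up is welcome. Interestingly, the get_data(self) generator in the PoolWorker class can change self.current_file and self.current_dir correctly, but worker(self, packet) cannot.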
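The difference would make sense if the generator is consumed in the parent process while worker runs in the child processes. Under that assumption, one possible approach is to let the worker return its outcome and do all the counting in the parent. The sketch below is not the real parser: parse is a hypothetical stand-in for MyParser(packet).parse(), tally is a made-up helper, and the chunksize value is arbitrary.

```python
from collections import Counter
from multiprocessing import Pool


def parse(packet):
    """Hypothetical stand-in for MyParser(packet).parse()."""
    if not packet:
        raise ValueError("empty packet")
    return packet.hex()


def worker(packet):
    """Return (parsed, error) so the parent process can do the counting."""
    try:
        return parse(packet), None
    except Exception as exc:
        return None, repr(exc)


def tally(results):
    """Aggregate statistics in the parent from whatever the workers return."""
    stats = Counter()
    for parsed, error in results:
        stats["total_packets"] += 1
        if error is not None:
            stats["error_packets"] += 1
    return stats


if __name__ == "__main__":
    packets = [b"\x01\x02", b"", b"\xff"]
    with Pool() as pool:
        # chunksize batches several packets per task, which may reduce
        # inter-process overhead when each packet is cheap to parse.
        stats = tally(pool.imap_unordered(worker, packets, chunksize=2))
    print(dict(stats))
```

Because worker is a plain module-level function here, no PoolWorker instance has to be pickled for each task, and the counters live in exactly one place.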