Question

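I hope I can be specific enough. I wrote a parser for data from a specific device. I also have archives that can be used to populate a new system with data or to test the parser. I am now testing how to speed up the whole process with multiprocessing, and I ended up with multiprocessing.Pool.

So I have a worker wrapper like this: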

def worker(packet):
    packet_object = None
    try:
        packet_object = MyParser(packet).parse()
    except Exception as e:
        # *** more code here ***
        pass  # placeholder so the except block is valid Python

    return packet_object

Then I have a data generator:

def get_data():
    dir_companies = listdir(archive_dir)
    for company_dir in dir_companies:
        # (the class version further down also sets self.current_dir = company_dir here)
        dir_dates = listdir(archive_dir + company_dir + "/")
        for dir_date in dir_dates:
            archive_files = glob.glob(archive_dir + company_dir + "/" + dir_date + "/" + mask)
            for file in archive_files:
                file = file.replace("\\", "/")
                with lzma.open(file) as f:
                    matches = re.finditer(regex, str(f.read()), re.MULTILINE)
                    for matchNum, match in enumerate(matches, start=1):
                        for groupNum in range(0, len(match.groups())):
                            groupNum = groupNum + 1

                            out = "{group}".format(group=match.group(groupNum))
                            rows = out.split('),(')
                            for lines in rows:
                                for m in re.finditer(regex_internal, lines):
                                    try:
                                        packet = m.group('packet')
                                    except IndexError:
                                        print('regex error: packet not found')
                                        continue  # nothing captured, so there is nothing to yield
                                    yield unhexlify(bytes(packet, 'ascii'))

Then I can run the code like this:

data = get_data()
p = Pool()

for result in p.imap(worker, data):
    # look at the result here, store to db, whatever
    pass

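Since I cannot see what is happening inside the generator or the worker, I decided to wrap them in a class so I could add some counters, statistics, and so on: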

class PoolWorker:
    def __init__(self):
        self.start = dt.now()
        self.total_packets = 0
        self.error_packets = 0
        self.total_receipts = 0
        self.error_critical = 0
        self.total_files = 0
        self.db_size = None
        self.current_file = None
        self.current_dir = None

    def worker(self, packet):
        ...  # same body as worker() above

    def get_data(self):
        ...  # same body as get_data() above

Then I run it with code like this:

if __name__ == '__main__':
    pw = PoolWorker()
    p = Pool()

    directory = None

    for res in p.imap_unordered(pw.worker, pw.get_data()):
        if directory is None:
            directory = pw.current_dir
        elif directory != pw.current_dir:
            directory = pw.current_dir
            pw.get_stats()

    pw.get_stats()

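My problem is that the packet counters never move from 0. I think I know why (there are probably many (child) instances of the PoolWorker class), but I cannot figure out how to make these counters work.

I tried adding the following code to the class, but it did not help: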
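To illustrate what I suspect is going on, here is a minimal sketch that imitates in a single process what Pool does with a bound method: the method is pickled together with its instance, and the other side works on an independent copy. The `Stats` class and its `worker` method are hypothetical stand-ins for PoolWorker, not the real code.

```python
import pickle


class Stats:
    """Hypothetical stand-in for PoolWorker with a single counter."""

    def __init__(self):
        self.total_packets = 0

    def worker(self, packet):
        # Increments the counter on whichever instance the method is bound to.
        self.total_packets += 1
        return len(packet)


parent = Stats()

# Sending a bound method to another process pickles the instance with it.
# A pickle round trip in one process shows the same effect: a separate copy.
child_worker = pickle.loads(pickle.dumps(parent.worker))
child_worker(b"\x01\x02")

child_copy = child_worker.__self__
print("copy:", child_copy.total_packets, "original:", parent.total_packets)
```

If this is what happens, it would also explain why `__getstate__` and `__setstate__` make no difference: they control what gets copied into the child, but nothing is ever copied back to the parent.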

    def __getstate__(self):
        return self.__dict__.copy()

    def __setstate__(self, state):
        self.__dict__ = state

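This code has to parse roughly a billion or more archived packets, so any speedup is welcome. Interestingly, the get_data(self) generator in the PoolWorker class can correctly change self.current_file and self.current_dir, but worker(self, packet) cannot.

How to synchronize variables initialized in a class when using Pool imap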
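The difference is probably that Pool consumes the generator in the parent process, while worker runs in the child processes. One direction that follows from this is to let the worker return its outcome and keep all counting in the parent. The sketch below assumes that approach; `parse_packet`, `collect` and `run` are hypothetical names, and `parse_packet` only stands in for `MyParser(packet).parse()`.

```python
from collections import Counter
from multiprocessing import Pool


def parse_packet(packet):
    """Hypothetical stand-in for MyParser(packet).parse()."""
    if not packet:
        raise ValueError("empty packet")
    return packet.hex()


def worker(packet):
    """Runs in a child process: report the outcome instead of mutating state."""
    try:
        return "ok", parse_packet(packet)
    except Exception as exc:
        return "error", repr(exc)


def collect(results):
    """Runs in the parent process: the only place where counters are updated."""
    stats = Counter()
    parsed = []
    for status, payload in results:
        stats["total_packets"] += 1
        if status == "ok":
            parsed.append(payload)
        else:
            stats["error_packets"] += 1
    return stats, parsed


def run(packets, processes=None, chunksize=256):
    """Feed packets through a Pool and tally the results in the parent."""
    with Pool(processes) as pool:
        results = pool.imap_unordered(worker, packets, chunksize=chunksize)
        return collect(results)
```

With this layout, `run(pw.get_data())` would be called under `if __name__ == '__main__':`. The `chunksize` value is a guess; with this many small packets a larger chunk size usually reduces inter-process overhead, but the right number needs measuring.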

0 answers: