Question

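I am setting up a custom deep file scanner for very large files (in the 10 GB+ range). I have to optimize the search across these files for 'n' keywords, computing values such as line numbers and counts. The usual application of the algorithms therefore has to be modified to fit these special requirements.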

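I am also using multiprocessing to split the scanning of each file into a separate process. Within each such process, since file sizes vary from a few KB to several hundred MB, I decided to split the file into lines and use multithreading inside each spawned process to scan the set of lines for the 'n' keywords. Each thread therefore gets a single line of a file, scans that line for the 'n' keywords, and collects the necessary details (multithreading because the file contents and other data are shared).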
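The per-line work described above might be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the names `scan_lines` and `scan_file` are hypothetical, and plain substring matching stands in for the real scan condition, which is not shown here. It reads the file line by line instead of loading it all at once, which matters for files in the multi-GB range.

```python
from collections import defaultdict


def scan_lines(lines, keywords):
    """Return {keyword: {"count": int, "lines": [line numbers]}} for an iterable of lines."""
    hits = defaultdict(lambda: {"count": 0, "lines": []})
    for line_no, line in enumerate(lines, start=1):
        for keyword in keywords:
            # Assumption: a keyword "matches" when it occurs as a plain substring.
            occurrences = line.count(keyword)
            if occurrences:
                hits[keyword]["count"] += occurrences
                hits[keyword]["lines"].append(line_no)
    # Convert to a plain dict so the result can be pickled and returned from a worker.
    return dict(hits)


def scan_file(path, keywords):
    """Scan one file without reading it into memory all at once."""
    # Iterating over the file object yields one line at a time.
    with open(path, "r", errors="replace") as handle:
        return scan_lines(handle, keywords)
```

Each line number is appended at most once per keyword, even when the keyword occurs several times on that line, so the line lists contain no duplicates.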

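First: is this a good design?

Second: if not, how can a deep keyword scanner be implemented efficiently?

Third: I would like to use the Pool.map function to handle both the multithreading and the multiprocessing. How can shared variables best be used with it?

I have tried Queue, Value, and finally a Manager server process to pass values between the processes and threads. I am confused about how to get the results back.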

from multiprocessing import cpu_count, Pool, Manager
from collections import defaultdict
from contextlib import contextmanager
from multiprocessing.dummy import Pool as ThreadPool
from itertools import count, product

key_list = []

@contextmanager
def poolContext(*args, **kwargs):
    pool = Pool(*args, **kwargs)
    try:
        yield pool
    finally:
        pool.terminate()  # always release the workers, even if the body raises

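class FileDetails:
    listvar = []

    def __init__(self, file_name, file_path, key_list=None):
        self.file_name = file_name
        self.file_path = file_path
        # Keywords to look for in this file; threadingFunct reads file_obj.key_list.
        self.key_list = key_list if key_list is not None else []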


class KeyWord:
    # Note: class-level attributes are shared by every instance.
    dictvar1 = defaultdict()
    dictvar2 = defaultdict()

    def __init__(self, key_name, lang_name):
        self.name = key_name
        self.lang_name = lang_name

    def __call__(self, *args):
        # Update dictvar1 and dictvar2 with the arguments passed.
        pass

def scanEngine(line_no, line, key, filename, key_obj_list):
    # ... scan algorithm ...
    # If the condition is satisfied:
    for keyobj in key_obj_list:
        if keyobj.name == key:
            # ... update the object's parameters ...
            pass

def unpacker2(args):
    return scanEngine(*args)

def threadingFunct(file_obj, key_obj_list):
    args_list_thread = []
    file_name = file_obj.file_name

    with open(file_obj.file_path, 'r') as content:
        file_content = content.readlines()

    for line_no, line in enumerate(file_content):
        for keyword in file_obj.key_list:
            # key_obj_list is passed through to the main scan engine function so that
            # the details can be updated there for each keyword scanned.
            args_list_thread.append((line_no + 1, line, keyword, file_name, key_obj_list))

    with ThreadPool(4) as tpool:
        results = tpool.map(unpacker2, args_list_thread)
    return results

def unpacker(args):
    return threadingFunct(*args)

def multiprocessingFunct(filtered_file_lst):
    args_list = []
    manager = Manager()
    manager_key_list = manager.list(key_list)  # so that I can access it in the processes
    for obj in filtered_file_lst:
        args_list.append((obj, key_list))  # pool arguments: each file object gets a copy of the keyword object list
    with poolContext(processes=cpu_count()) as p:
        results = p.map(unpacker, args_list)
    return results


def main(*args):
    global key_list  # populate the module-level list that multiprocessingFunct reads
    file_list = getObject1List(*args)  # function defined to create file objects
    key_list = getObject2List(*args)   # function defined to create keyword objects
    filtered_file_list = []
    for obj in file_list:
        if condition:  # placeholder for the actual filter condition
            filtered_file_list.append(obj)

    multiprocessingFunct(filtered_file_list)

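Expected result: after each successful thread execution, the Keyword object for the corresponding keyword should have the values appended to it, including the line numbers, without duplicates.

How can class objects/variables be shared and updated across multiple processes that each have multiple threads?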
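For comparison, one way to avoid sharing mutable objects at all is to let each worker return its findings and merge them in the parent once Pool.map finishes. The following is a minimal sketch under that assumption, not a drop-in replacement for the code above: `scan_one_file`, `merge_results`, and `scan_files` are hypothetical names, substring matching stands in for the real scan condition, and results are assumed to be needed only after a file has been fully scanned.

```python
from collections import defaultdict
from multiprocessing import Pool, cpu_count


def scan_one_file(args):
    """Worker: scan one file and return (path, {keyword: [line numbers]})."""
    path, keywords = args
    found = defaultdict(list)
    with open(path, "r", errors="replace") as handle:
        for line_no, line in enumerate(handle, start=1):
            for keyword in keywords:
                if keyword in line:  # assumption: plain substring match
                    found[keyword].append(line_no)
    return path, dict(found)


def merge_results(per_file_results):
    """Parent: combine worker outputs into {keyword: {path: [line numbers]}}."""
    merged = defaultdict(dict)
    for path, found in per_file_results:
        for keyword, line_numbers in found.items():
            merged[keyword][path] = line_numbers
    return dict(merged)


def scan_files(paths, keywords, processes=None):
    """Scan files in parallel; only return values cross the process boundary."""
    work = [(path, keywords) for path in paths]
    with Pool(processes or cpu_count()) as pool:
        return merge_results(pool.map(scan_one_file, work))
```

A call such as `scan_files(["a.log", "b.log"], ["error", "timeout"])` (file names are placeholders) should be made under an `if __name__ == "__main__":` guard on platforms that start worker processes with spawn. Because the workers only return plain dictionaries, no Queue, Value, or Manager proxy is needed in this variant.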

0 Answers: