Multiprocessing a large dataset in Python (finding duplicates)

Asked: 2018-12-18 02:36:41

Tags: python multithreading multiprocessing large-data

I have a JSON file from which I want to remove duplicate lines, but it is too big to fit into memory. I found a way to get it done, but I suspect it is not the best way.

My problem is that the code runs in 8 minutes for a 12 GB dataset, but the requirement is to scale it so that it can run on a 100 GB dataset. Any pointers on how to do this? Should I use multithreading or multiprocessing in Python to achieve this, or some other method?

Here is the code:

import json
import time

""" This class contains the business logic for identifying the duplicates and creating an output file for further processing """

class BusinessService:

    """ The method identifies the duplicates """
    def service(ipPath, opPath):
        start_time = time.time()    # Start the timer to see how much time the method takes to run #
        uniqueHandleSet = set()     # Create a set to store the unique values #
        try:
            duplicateHandles = open(opPath, 'w+', encoding='utf-8')    # Open/create an output file to catch the duplicate handles #
            with open(ipPath, buffering=200000000, encoding='utf-8') as infile:    # Read the JSON file with a 200 MB buffer as it is too big to read at once #
                for line in infile:
                    tweetJsonObject = json.loads(line)
                    if tweetJsonObject["name"] not in uniqueHandleSet:
                        uniqueHandleSet.add(tweetJsonObject["name"])
                    else:
                        duplicateHandles.write(line)
            print("--- %s seconds --- memory 200mb while buffering" % (time.time() - start_time))    # Print the total time required to execute #
        except:
            print("Error")
        finally:
            duplicateHandles.close()

2 Answers:

Answer 0 (score: 0)

To scale this, you will need a queue feeding multiple processes and two shared lists to keep track of the results. The main idea is to feed the file line by line into a queue that is then processed by some consumer processes. These processes, however, share two lists to store the intermediate results. The Manager is responsible for the synchronization between the processes.

The following code is only a rough guideline and has not actually been tested:

from multiprocessing import Process, Manager, Queue

def findDuplicate(inputQueue, uniqueValues, duplicates):
    for line in iter(inputQueue.get, 'STOP'): #get line from Queue, stop if 'STOP' is received
        if line not in uniqueValues: # check if duplicate
            uniqueValues.append(line)
        else:
            duplicates.append(line) # store it

manager = Manager() # get a new SyncManager
uniqueValues = manager.list() # handle for shared list
duplicates = manager.list() # a 2nd handle for a shared list
inputQueue = Queue() # a queue to provide tasks to the processes

# setup workers, provide shared lists and tasks
numProc = 4
process = [Process(target=findDuplicate,
                      args=(inputQueue, uniqueValues, duplicates)) for x in range(numProc)]

# start processes, they will idle if nothing is in queue
for p in process:
    p.start()

with open(ipPath) as f:
    for line in f:
        inputQueue.put(line, block=True) # put line in queue, only if a free slot is available
for p in process:
    inputQueue.put('STOP') # signal workers to stop as there is no further input

# wait for processes to finish
for p in process:
    p.join()
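
If the duplicate lines themselves need to end up in a file, one option (a sketch added here, not part of the original answer; it assumes the opPath output path from the question) is to let the parent process write the shared list out once the workers have joined:

# continues the snippet above: after p.join(), the shared list holds all duplicates
with open(opPath, 'w', encoding='utf-8') as outfile:
    for line in duplicates:
        outfile.write(line)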

Answer 1 (score: 0)

from multiprocessing import Process, Manager, Queue
import json

output = open ('output', 'w+', encoding='utf-8')

def findDuplicate(inputQueue, uniqueValues, output):
    for line in iter(inputQueue.get, 'STOP'): #get line from Queue, stop if 'STOP' is received
        if line['name'] not in uniqueValues: # check if duplicate
            uniqueValues.append(line)
        else:
            output.write(line) # store it

manager = Manager() # get a new SyncManager
uniqueValues = manager.list() # handle for shared list
duplicates = manager.list() # a 2nd handle for a shared list
inputQueue = Queue() # a queue to provide tasks to the processes

# setup workers, provide shared lists and tasks
numProc = 4
process = [Process(target=findDuplicate,
                      args=(inputQueue, uniqueValues, output)) for x in range(numProc)]

# start processes, they will idle if nothing is in queue
for p in process:
    p.start()

with open('username_sample.jsonrows', buffering= 20000000, encoding='utf-8') as f:
    for line in f:
        inputQueue = json.loads(line, block=True) # put line in queue, only if free slot avaible
for p in process:
    inputQueue.put('STOP') # signal workers to stop as no further input

    # wait for processes to finish
for p in process:
    p.join()


output.close()

I tried doing it this way, but I get the error: TypeError: cannot serialize '_io.TextIOWrapper' object
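
That TypeError typically means a file object is being pickled, here most likely the open output handle passed in args to the worker processes, since open file handles cannot be pickled. One way around it (only a rough sketch, untested, and not from the original answers: it reuses the duplicates list from the first answer instead of the file handle, parses each line inside the worker, and lets the parent process do all the writing after the workers have joined):

from multiprocessing import Process, Manager, Queue
import json

def findDuplicate(inputQueue, uniqueValues, duplicates):
    # workers receive only picklable objects: the queue and the Manager list proxies
    for line in iter(inputQueue.get, 'STOP'):
        name = json.loads(line)["name"]   # parse inside the worker, not while filling the queue
        if name not in uniqueValues:      # check whether this handle was seen before
            uniqueValues.append(name)
        else:
            duplicates.append(line)       # remember the duplicate line itself

if __name__ == '__main__':
    manager = Manager()
    uniqueValues = manager.list()
    duplicates = manager.list()
    inputQueue = Queue()

    numProc = 4
    process = [Process(target=findDuplicate,
                       args=(inputQueue, uniqueValues, duplicates)) for x in range(numProc)]
    for p in process:
        p.start()

    with open('username_sample.jsonrows', buffering=20000000, encoding='utf-8') as f:
        for line in f:
            inputQueue.put(line, block=True)   # feed the raw line, not a parsed object
    for p in process:
        inputQueue.put('STOP')
    for p in process:
        p.join()

    # only the parent process touches the output file, after the workers have joined
    with open('output', 'w+', encoding='utf-8') as output:
        for line in duplicates:
            output.write(line)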