I have a JSON file from which I want to remove duplicate lines, but it is too big to fit into memory. I found a way to get it done, but my guess is that it is not the best way.
My problem is that it runs in 8 minutes for a 12 GB dataset, but the requirement is to scale the code so that it can run on a 100 GB dataset. Any pointers on how to do this? Should I use multithreading or multiprocessing in Python to achieve this? Or any other method?
Here is the code:
import json
import time

""" This class contains the business logic for identifying the duplicates and creating an output file for further processing """
class BusinessService:

    """ The method identifies the duplicates """
    @staticmethod
    def service(ipPath, opPath):
        start_time = time.time()  # Start the timer to see how much time the method takes
        uniqueHandleSet = set()   # Create a set to store the unique values
        try:
            duplicateHandles = open(opPath, 'w+', encoding='utf-8')  # Open/create an output file to catch the duplicate handles
            # Read the JSON file line by line with a 200 MB buffer, as it is too big to read at once
            with open(ipPath, buffering=200000000, encoding='utf-8') as infile:
                for line in infile:
                    tweetJsonObject = json.loads(line)
                    if tweetJsonObject["name"] not in uniqueHandleSet:
                        uniqueHandleSet.add(tweetJsonObject["name"])
                    else:
                        duplicateHandles.write(line)
            print("--- %s seconds --- memory 200mb while buffering" % (time.time() - start_time))  # Print the total time required to execute
        except:
            print("Error")
        finally:
            duplicateHandles.close()
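For reference, this is how the method above would be invoked; the file paths here are hypothetical placeholders:

if __name__ == '__main__':
    # hypothetical input/output paths; substitute your actual files
    BusinessService.service('tweets.jsonl', 'duplicate_handles.jsonl')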
Answer 0 (score: 0)
To scale this, you will need a queue to feed multiple processes, plus two shared lists to keep track of the results. The main idea is to feed the file line by line into a queue that is then processed by several consumer processes. These processes, however, share two lists to store the intermediate results. The Manager is responsible for the synchronization between the processes.
The following code is just a rough guideline and has not actually been tested:
from multiprocessing import Process, Manager, Queue

def findDuplicate(inputQueue, uniqueValues, duplicates):
    for line in iter(inputQueue.get, 'STOP'):  # get a line from the queue, stop if 'STOP' is received
        if line not in uniqueValues:  # check if it is a duplicate
            uniqueValues.append(line)
        else:
            duplicates.append(line)  # store it

if __name__ == '__main__':
    manager = Manager()            # get a new SyncManager
    uniqueValues = manager.list()  # handle for a shared list
    duplicates = manager.list()    # a 2nd handle for a shared list
    inputQueue = Queue()           # a queue to provide tasks to the processes

    # set up the workers, providing the shared lists and the task queue
    numProc = 4
    process = [Process(target=findDuplicate,
                       args=(inputQueue, uniqueValues, duplicates)) for x in range(numProc)]

    # start the processes; they will idle if nothing is in the queue
    for p in process:
        p.start()

    with open(ipPath) as f:  # ipPath as in the question
        for line in f:
            inputQueue.put(line, block=True)  # put the line in the queue, blocking while no free slot is available

    for p in process:
        inputQueue.put('STOP')  # signal the workers to stop, as there is no further input

    # wait for the processes to finish
    for p in process:
        p.join()
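Once the workers have joined, the collected duplicates still need to reach the output file. A minimal way to finish the job in the parent process, assuming opPath from the question, would be:

with open(opPath, 'w+', encoding='utf-8') as duplicateHandles:
    for line in duplicates:
        duplicateHandles.write(line)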
Answer 1 (score: 0)
from multiprocessing import Process, Manager, Queue
import json

output = open('output', 'w+', encoding='utf-8')

def findDuplicate(inputQueue, uniqueValues, output):
    for line in iter(inputQueue.get, 'STOP'):  # get a line from the queue, stop if 'STOP' is received
        if line['name'] not in uniqueValues:  # check if it is a duplicate
            uniqueValues.append(line)
        else:
            output.write(line)  # store it

manager = Manager()            # get a new SyncManager
uniqueValues = manager.list()  # handle for a shared list
duplicates = manager.list()    # a 2nd handle for a shared list
inputQueue = Queue()           # a queue to provide tasks to the processes

# set up the workers, providing the shared lists and the task queue
numProc = 4
process = [Process(target=findDuplicate,
                   args=(inputQueue, uniqueValues, output)) for x in range(numProc)]

# start the processes; they will idle if nothing is in the queue
for p in process:
    p.start()

with open('username_sample.jsonrows', buffering=20000000, encoding='utf-8') as f:
    for line in f:
        inputQueue.put(json.loads(line), block=True)  # put the parsed line in the queue, blocking while no free slot is available

for p in process:
    inputQueue.put('STOP')  # signal the workers to stop, as there is no further input

# wait for the processes to finish
for p in process:
    p.join()

output.close()
I tried this, but I get the error TypeError: cannot serialize '_io.TextIOWrapper' object.
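For what it's worth, that TypeError happens because the open file handle in output cannot be pickled when it is passed as an argument to a child process. One way around this, sketched below and untested against the original data, is to keep all file I/O in the parent: the workers push the duplicate lines onto a second queue, and the parent drains that queue and writes the file. The file names and the queue size here are assumptions carried over from the attempt above.

from multiprocessing import Process, Manager, Queue
import json

def findDuplicate(inputQueue, uniqueValues, duplicateQueue):
    for line in iter(inputQueue.get, 'STOP'):  # stop when the 'STOP' sentinel arrives
        name = json.loads(line)["name"]
        # NOTE: this membership test plus append is not atomic across processes,
        # so in rare interleavings a duplicate can slip through (same as above)
        if name not in uniqueValues:
            uniqueValues.append(name)
        else:
            duplicateQueue.put(line)  # hand the raw line back to the parent for writing
    duplicateQueue.put('STOP')  # tell the parent this worker is done

if __name__ == '__main__':
    manager = Manager()
    uniqueValues = manager.list()      # shared list of names seen so far
    inputQueue = Queue(maxsize=10000)  # bounded, so the reader cannot run far ahead
    duplicateQueue = Queue()           # unbounded, so the workers never block on it

    numProc = 4
    workers = [Process(target=findDuplicate,
                       args=(inputQueue, uniqueValues, duplicateQueue)) for _ in range(numProc)]
    for p in workers:
        p.start()

    with open('username_sample.jsonrows', encoding='utf-8') as f:
        for line in f:
            inputQueue.put(line, block=True)
    for p in workers:
        inputQueue.put('STOP')

    # drain the duplicates in the parent, where the file handle lives
    finished = 0
    with open('output', 'w+', encoding='utf-8') as out:
        while finished < numProc:
            item = duplicateQueue.get()
            if item == 'STOP':
                finished += 1
            else:
                out.write(item)

    for p in workers:
        p.join()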