Finding duplicates in a large dataset using Python

Asked: 2018-12-23 02:36:38

Tags: python optimization duplicates largenumber

I want to find duplicates in a large JSON file (12 GB) in Python. Right now I read the whole file, use a set() as a lookup table to store the unique values, and write the duplicates to a file.

However, if the input were 100 GB, the set() would no longer be able to hold all the unique values in memory, so my code is not scalable.

Is there any alternative I could use?

Using Python:

import json
import time


class BusinessService:
    """This class contains the business logic for identifying the duplicates
    and creating an output file for further processing."""

    @staticmethod
    def service(ipPath, opPath):
        """Identifies the duplicates."""
        start_time = time.time()   # Start the timer
        uniqueHandleSet = set()    # Set used as a lookup table for unique values

        try:
            # Open an output file to collect the duplicate handles, and read the
            # input JSON file with a large buffer (~200 MB) because it is too big
            # to read at once.
            with open(opPath, 'w+', encoding='utf-8') as duplicateHandles, \
                 open(ipPath, buffering=200000000, encoding='utf-8') as infile:
                for line in infile:
                    tweetJsonObject = json.loads(line)
                    if tweetJsonObject["name"] not in uniqueHandleSet:
                        uniqueHandleSet.add(tweetJsonObject["name"])
                    else:
                        duplicateHandles.write(line)
            # Print the total time required to execute.
            print("--- %s seconds --- memory 200mb while buffering" % (time.time() - start_time))
        except Exception as e:
            print("Error:", e)
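For reference, the method above would presumably be invoked along these lines (the file paths here are placeholders, not from the original post):

BusinessService.service("tweets.json", "duplicate_handles.txt")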

I need an alternative to set().
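One direction that keeps memory bounded regardless of input size is to first split the input into hash buckets on disk, then deduplicate each bucket with its own small set(). Below is a minimal sketch of that idea; it assumes the same one-JSON-object-per-line input and "name" field as the code above, and the function name, bucket count, and temporary directory are illustrative placeholders rather than anything from the original post.

import json
import os

def find_duplicates_partitioned(ipPath, opPath, numBuckets=64, tmpDir="buckets"):
    """Two-pass, disk-backed deduplication: memory use is bounded by the
    largest bucket instead of by the whole input."""
    os.makedirs(tmpDir, exist_ok=True)

    # Pass 1: route every line into a bucket file chosen by the hash of its
    # "name", so all occurrences of the same name land in the same bucket.
    buckets = [open(os.path.join(tmpDir, "bucket_%d.jsonl" % i), "w", encoding="utf-8")
               for i in range(numBuckets)]
    try:
        with open(ipPath, encoding="utf-8") as infile:
            for line in infile:
                name = json.loads(line)["name"]
                buckets[hash(name) % numBuckets].write(line)
    finally:
        for bucket in buckets:
            bucket.close()

    # Pass 2: duplicates can only occur within a single bucket, so each bucket
    # gets its own small set() that fits comfortably in memory.
    with open(opPath, "w", encoding="utf-8") as duplicateHandles:
        for i in range(numBuckets):
            seen = set()
            with open(os.path.join(tmpDir, "bucket_%d.jsonl" % i), encoding="utf-8") as bucketFile:
                for line in bucketFile:
                    name = json.loads(line)["name"]
                    if name in seen:
                        duplicateHandles.write(line)
                    else:
                        seen.add(name)

The same idea also works with a disk-backed key store (for example an sqlite3 table with a UNIQUE index) instead of bucket files; either way, only a fraction of the names has to be held in memory at any one time.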

0 Answers:

No answers yet.