Finding duplicates in a large dataset using Python

Asked: 2018-12-23 02:36:38

Tags: python optimization duplicates largenumber

I want to find duplicates in a large JSON file (12 GB) in Python. Right now I read the whole file, use a set() as a lookup table to store the unique values, and write the duplicates to a file.

However, if the input were 100 GB, the set() would no longer be able to hold all the unique values in memory, so my code is not scalable.

Is there any alternative I could use?

Using Python:

import json
import time


class BusinessService:
    """This class contains the business logic for identifying the duplicates
    and creating an output file for further processing."""

    @staticmethod
    def service(ipPath, opPath):
        """Identifies the duplicates."""
        start_time = time.time()   # Start the timer
        uniqueHandleSet = set()    # Set used as a lookup table for unique values

        try:
            # Open an output file to collect the duplicate handles, and read the
            # input JSON file with a large buffer (~200 MB) because it is too big
            # to read at once.
            with open(opPath, 'w+', encoding='utf-8') as duplicateHandles, \
                 open(ipPath, buffering=200000000, encoding='utf-8') as infile:
                for line in infile:
                    tweetJsonObject = json.loads(line)
                    if tweetJsonObject["name"] not in uniqueHandleSet:
                        uniqueHandleSet.add(tweetJsonObject["name"])
                    else:
                        duplicateHandles.write(line)
            # Print the total time required to execute.
            print("--- %s seconds --- memory 200mb while buffering" % (time.time() - start_time))
        except Exception as e:
            print("Error:", e)
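For reference, the method above would presumably be invoked along these lines (the file paths here are placeholders, not from the original post):

BusinessService.service("tweets.json", "duplicate_handles.txt")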

I need an alternative to set().
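One direction that keeps memory bounded regardless of input size is to first split the input into hash buckets on disk, then deduplicate each bucket with its own small set(). Below is a minimal sketch of that idea; it assumes the same one-JSON-object-per-line input and "name" field as the code above, and the function name, bucket count, and temporary directory are illustrative placeholders rather than anything from the original post.

import json
import os

def find_duplicates_partitioned(ipPath, opPath, numBuckets=64, tmpDir="buckets"):
    """Two-pass, disk-backed deduplication: memory use is bounded by the
    largest bucket instead of by the whole input."""
    os.makedirs(tmpDir, exist_ok=True)

    # Pass 1: route every line into a bucket file chosen by the hash of its
    # "name", so all occurrences of the same name land in the same bucket.
    buckets = [open(os.path.join(tmpDir, "bucket_%d.jsonl" % i), "w", encoding="utf-8")
               for i in range(numBuckets)]
    try:
        with open(ipPath, encoding="utf-8") as infile:
            for line in infile:
                name = json.loads(line)["name"]
                buckets[hash(name) % numBuckets].write(line)
    finally:
        for bucket in buckets:
            bucket.close()

    # Pass 2: duplicates can only occur within a single bucket, so each bucket
    # gets its own small set() that fits comfortably in memory.
    with open(opPath, "w", encoding="utf-8") as duplicateHandles:
        for i in range(numBuckets):
            seen = set()
            with open(os.path.join(tmpDir, "bucket_%d.jsonl" % i), encoding="utf-8") as bucketFile:
                for line in bucketFile:
                    name = json.loads(line)["name"]
                    if name in seen:
                        duplicateHandles.write(line)
                    else:
                        seen.add(name)

The same idea also works with a disk-backed key store (for example an sqlite3 table with a UNIQUE index) instead of bucket files; either way, only a fraction of the names has to be held in memory at any one time.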

0 Answers:

No answers yet.