I want to find duplicates in a large JSON file (12 GB) in Python. Right now I read the whole file line by line, use a set() as a lookup table to hold the unique values, and write the duplicates to a file.
However, if the input grows to 100 GB, the set() will no longer fit the unique values in memory, so my code is not scalable.
Is there any alternative?
Using Python:
import json
import time


class BusinessService:
    """This class contains the business logic for identifying the duplicates
    and creating an output file for further processing."""

    @staticmethod
    def service(ipPath, opPath):
        """Identifies the duplicates."""
        start_time = time.time()  # Start the timer
        uniqueHandleSet = set()  # Set to store the unique values
        try:
            # Open an output file to catch the duplicate handles, and read
            # the JSON file with a 200 MB buffer as it is too big to read
            # at once
            with open(opPath, 'w+', encoding='utf-8') as duplicateHandles, \
                 open(ipPath, buffering=200000000, encoding='utf-8') as infile:
                for line in infile:
                    tweetJsonObject = json.loads(line)
                    if tweetJsonObject["name"] not in uniqueHandleSet:
                        uniqueHandleSet.add(tweetJsonObject["name"])
                    else:
                        duplicateHandles.write(line)
            # Print the total time required to execute
            print("--- %s seconds --- memory 200mb while buffering" % (time.time() - start_time))
        except Exception as e:
            print("Error:", e)
I need an alternative to set().
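One standard alternative (a sketch I am adding, not from the code above) is external hash partitioning: a first pass routes each line into one of N temporary bucket files based on the hash of its "name", so every occurrence of a given name lands in the same bucket; a second pass then deduplicates each bucket independently with an ordinary set() that stays small. A minimal sketch under the same assumptions as the code above (line-delimited JSON with a "name" field); NUM_BUCKETS and the bucket directory are hypothetical tuning choices:

import json
import os

# Hypothetical tuning knob: pick NUM_BUCKETS so that one bucket's
# unique names fit comfortably in memory.
NUM_BUCKETS = 64

def find_duplicates(ipPath, opPath, tmpDir="buckets"):
    os.makedirs(tmpDir, exist_ok=True)
    buckets = [open(os.path.join(tmpDir, "bucket_%d.jsonl" % i), "w", encoding="utf-8")
               for i in range(NUM_BUCKETS)]
    try:
        # Pass 1: route each line to a bucket file by the hash of its name.
        # hash() is stable within one process run, which is all we need here.
        with open(ipPath, encoding="utf-8") as infile:
            for line in infile:
                name = json.loads(line)["name"]
                buckets[hash(name) % NUM_BUCKETS].write(line)
    finally:
        for b in buckets:
            b.close()
    # Pass 2: each bucket holds only a fraction of the names, so a plain
    # in-memory set() per bucket is enough to spot the duplicates.
    with open(opPath, "w", encoding="utf-8") as duplicates:
        for i in range(NUM_BUCKETS):
            seen = set()
            with open(os.path.join(tmpDir, "bucket_%d.jsonl" % i), encoding="utf-8") as bucket:
                for line in bucket:
                    name = json.loads(line)["name"]
                    if name in seen:
                        duplicates.write(line)
                    else:
                        seen.add(name)

Because duplicates of one name can never be split across buckets, each pass-2 set() only has to hold roughly 1/NUM_BUCKETS of the unique names, so memory use scales with the bucket count rather than with the total input size.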