I have a very large CSV file containing roughly 70,000 tweets, including duplicated versions that I have to remove. The file has three columns (ID, Creation_Date, Text).
A sample of the CSV file looks like this:
ID Date Text
"745828866334269441" "Thu Jun 23 04:05:33 +0000 2017" "Any TEXT"
"745828863334269434" "Thu Jun 23 04:06:33 +0000 2017" "Any TEXT"
"745828343334269425" "Thu Jun 23 04:07:33 +0000 2017" "Any TEXT"
................ and so on
I am using SequenceMatcher from difflib in Python. The script works perfectly well; it is listed below:
import csv
from difflib import SequenceMatcher

csvInputFile = open('inputFileWithDups.csv', 'r', encoding="utf-8", newline='')       # input file name, with duplicates
csvOutputFile = open('outputFileWithoutDups.csv', 'w', encoding="utf-8", newline='')  # output file name, without duplicates
csvReader = csv.reader(csvInputFile)
csvWriter = csv.writer(csvOutputFile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)

cleanData = set()  # holds the text of the tweets kept so far, used for the duplicate comparison
for row in csvReader:  # read the input file row by row
    add = True
    a = row[2]  # the third CSV column: the tweet text that we have to compare for duplicates
    for cleantweet in cleanData:  # compare against every tweet text kept so far
        f = SequenceMatcher(None, cleantweet, a).ratio()  # cleantweet vs. row[2], which is the text
        if f > 0.73:
            print(f)
            add = False
    if add:  # keep only tweets whose similarity to every kept tweet is at most 0.73 (1.0 means 100 percent similarity)
        cleanData.add(row[2])
        csvWriter.writerow(row)  # write the tweet, considered unique, to the new CSV file
csvOutputFile.close()
csvInputFile.close()
But on a PC with only 4GB of RAM it takes far too long to run: a file with just 5,000 tweets took almost 7 hours to process. The next file I have to compare contains 50,000 tweets, which could mean about 3 days of work :(
I would be very grateful if anyone could help me speed this process up.
Thanks
Answer 0 (score: 0)
On a Linux system, you can remove exactly duplicated lines from a text file with the following command:
awk '!seen[$0]++' duplicate.csv > uniqlines.csv
On a file with 3,700,000 lines it took 49 seconds. My PC has 16 GB of RAM, but memory usage never reached 4.3 GB; it started working at around 4.1 GB.
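For reference, here is a minimal Python sketch of the same idea as the awk one-liner (keep only the first occurrence), keyed on the text column so that rows with identical tweet text but different IDs are also dropped. The file names and the assumption that the tweet text is in the third column are taken from the question; this is not part of the original answer:

import csv

seen = set()  # tweet texts already written, playing the role of awk's seen[] array

with open('inputFileWithDups.csv', 'r', encoding='utf-8', newline='') as infile, \
     open('outputFileWithoutDups.csv', 'w', encoding='utf-8', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in reader:
        text = row[2]          # third column: the tweet text (assumed, as in the question)
        if text not in seen:   # O(1) set lookup instead of pairwise SequenceMatcher calls
            seen.add(text)
            writer.writerow(row)

This removes only exact duplicates, in roughly linear time; if you still need the fuzzy SequenceMatcher comparison afterwards, it then has far fewer rows to work on.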