I have a very large CSV file containing roughly 70,000 tweets, including duplicated versions that I have to remove. The file has three columns (ID, Creation_Date, Text).
A sample of the CSV file looks like this:
ID Date Text
"745828866334269441" "Thu Jun 23 04:05:33 +0000 2017" "Any TEXT"
"745828863334269434" "Thu Jun 23 04:06:33 +0000 2017" "Any TEXT"
"745828343334269425" "Thu Jun 23 04:07:33 +0000 2017" "Any TEXT"
................ and so on
I am using SequenceMatcher from difflib in Python. The script works perfectly well; it is listed below:
import csv
from difflib import SequenceMatcher

csvInputFile = open('inputFileWithDups.csv', 'r', encoding="utf-8", newline='')       # input file name, with duplicates
csvOutputFile = open('outputFileWithoutDups.csv', 'w', encoding="utf-8", newline='')  # output file name, without duplicates
csvReader = csv.reader(csvInputFile)
csvWriter = csv.writer(csvOutputFile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)

cleanData = set()  # holds the text of the tweets kept so far, used for the duplicate comparison
for row in csvReader:  # read the input file row by row
    add = True
    a = row[2]  # the third CSV column: the tweet text that we have to compare for duplicates
    for cleantweet in cleanData:  # compare against every tweet text kept so far
        f = SequenceMatcher(None, cleantweet, a).ratio()  # cleantweet vs. row[2], which is the text
        if f > 0.73:
            print(f)
            add = False
    if add:  # keep only tweets whose similarity to every kept tweet is at most 0.73 (1.0 means 100 percent similarity)
        cleanData.add(row[2])
        csvWriter.writerow(row)  # write the tweet, considered unique, to the new CSV file
csvOutputFile.close()
csvInputFile.close()
But on a PC with only 4GB of RAM it takes far too long to run: a file with just 5,000 tweets took almost 7 hours to process. The next file I have to compare contains 50,000 tweets, which could mean about 3 days of work :(
I would be very grateful if anyone could help me speed this process up.
Thanks
Answer 0 (score: 0)
On a Linux system, you can remove exactly duplicated lines from a text file with the following command:
awk '!seen[$0]++' duplicate.csv > uniqlines.csv
On a file with 3,700,000 lines it took 49 seconds. My PC has 16 GB of RAM, but memory usage never reached 4.3 GB; it started working at around 4.1 GB.
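For reference, here is a minimal Python sketch of the same idea as the awk one-liner (keep only the first occurrence), keyed on the text column so that rows with identical tweet text but different IDs are also dropped. The file names and the assumption that the tweet text is in the third column are taken from the question; this is not part of the original answer:

import csv

seen = set()  # tweet texts already written, playing the role of awk's seen[] array

with open('inputFileWithDups.csv', 'r', encoding='utf-8', newline='') as infile, \
     open('outputFileWithoutDups.csv', 'w', encoding='utf-8', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in reader:
        text = row[2]          # third column: the tweet text (assumed, as in the question)
        if text not in seen:   # O(1) set lookup instead of pairwise SequenceMatcher calls
            seen.add(text)
            writer.writerow(row)

This removes only exact duplicates, in roughly linear time; if you still need the fuzzy SequenceMatcher comparison afterwards, it then has far fewer rows to work on.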