In test.txt:
1 a
2 b
3 c
4 a
5 d
6 c
I want to drop every line whose letter is duplicated and save the remaining lines in test2.txt:
2 b
5 d
I tried starting with the code below.
file1 = open('../test.txt').read().split('\n')
#file2 = open('../test2.txt', "w")
word = set()
for line in file1:
    if line:
        sline = line.split('\t')
        if sline[1] not in word:
            print sline[0], sline[1]
            word.add(sline[1])
#file2.close()
The code prints:
1 a
2 b
3 c
5 d
Any suggestions?
Answer 0 (score: 3)
You can use collections.OrderedDict here:
from collections import OrderedDict
with open('abc') as f:
    dic = OrderedDict()
    for line in f:
        v, k = line.split()
        dic.setdefault(k, []).append(v)
Now dic looks like:
OrderedDict([('a', ['1', '4']), ('b', ['2']), ('c', ['3', '6']), ('d', ['5'])])
Now we only need the keys whose list contains exactly one item.
for k, v in dic.iteritems():
    if len(v) == 1:
        print v[0], k
2 b
5 d
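The question also asks for the result to be saved in test2.txt, while the snippet above only prints it. A minimal Python 3 sketch of the same grouping idea that writes the surviving lines to a file might look like the following. The function name write_unique and its path arguments are hypothetical, and the sketch assumes every non-empty line holds exactly two whitespace-separated fields.

```python
from collections import OrderedDict


def write_unique(src_path, dst_path):
    """Copy lines whose second field occurs exactly once in src_path."""
    groups = OrderedDict()  # second field -> list of first fields, in file order
    with open(src_path) as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            number, letter = line.split()
            groups.setdefault(letter, []).append(number)

    with open(dst_path, "w") as dst:
        for letter, numbers in groups.items():
            if len(numbers) == 1:  # the letter was seen on one line only
                dst.write("%s %s\n" % (numbers[0], letter))
```

With the paths from the question, it would be called as write_unique('../test.txt', '../test2.txt').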
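Answer 1 (score: 1)
What your code does is make sure each item (letter) is printed only once. That is clearly not what you want.
You have to split the code into two halves: one that reads the file and counts how often each letter occurs, and one that prints only the entries with count == 1.
Converting your original code (I simplified it a little):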
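words = {}
with open('../test.txt') as file1:
    for line in file1:
        # drop the trailing newline so it does not become part of the letter
        line = line.rstrip('\n')
        if line:
            line_num, letter = line.split('\t')
            if letter not in words:
                words[letter] = [1, line_num]
            else:
                words[letter][0] += 1
# note: a plain dict in Python 2 does not guarantee the original line order
for letter, (count, line_num) in words.iteritems():
    if count == 1:
        print line_num, letter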
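The two-pass idea described in this answer (count first, then filter) can also be sketched with collections.Counter in Python 3. This is only an illustration: unique_lines is a hypothetical helper, it works on a list of lines already read into memory, and it splits on any whitespace rather than strictly on tabs, assuming two fields per line.

```python
from collections import Counter


def unique_lines(lines):
    """Return the lines whose second field appears exactly once."""
    rows = [line.split() for line in lines if line.strip()]
    # Pass 1: count how often each letter (second field) occurs.
    counts = Counter(letter for _, letter in rows)
    # Pass 2: keep only the rows whose letter was counted once.
    return [" ".join(row) for row in rows if counts[row[1]] == 1]


print(unique_lines(["1 a", "2 b", "3 c", "4 a", "5 d", "6 c"]))
# prints ['2 b', '5 d']
```

Because the second pass walks the rows in input order, the result keeps the order of the original file regardless of how the counts are stored.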
Answer 2 (score: 1)
I tried to keep it as close to your style as possible:
file1 = open('../test.txt').read().split('\n')
word = set()
test = []
duplicate = []
sin_duple = []
num_lines = 0
num_duplicates = 0
for line in file1:
    if line:
        sline = line.split(' ')
        test.append(" ".join([sline[0], sline[1]]))
        if sline[1] not in word:
            word.add(sline[1])
            num_lines = num_lines + 1
        else:
            sin_duple.append(sline[1])
            duplicate.append(" ".join([sline[0], sline[1]]))
            num_lines = num_lines + 1
            num_duplicates = num_duplicates + 1
for i in range(0, num_lines + 1):
    # iterate over a copy, because items are removed from test inside the loop
    for item in test[:]:
        for j in range(0, num_duplicates):
            #print((str(i) + " " + str(sin_duple[j])))
            if item == (str(i) + " " + str(sin_duple[j])):
                test.remove(item)
file2 = open("../test2.txt", 'w')
for item in test:
    file2.write("%s\n" % item)
file2.close()
Answer 3 (score: 0)
How about some pandas:
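import pandas as pd
a = pd.read_csv("test_remove_dupl.txt", sep=",")
# keep=False drops every row whose value in column "a" is repeated;
# the default (keep="first") would retain the first occurrence
b = a.drop_duplicates(subset="a", keep=False)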