Remove duplicated entries and save the rest to another file

Date: 2013-07-10 13:31:31

Tags: python

In test.txt:

1   a
2   b
3   c
4   a
5   d
6   c

I want to remove every line whose letter is duplicated and save the rest in test2.txt:

2   b
5   d

I tried starting with the code below.

file1 = open('../test.txt').read().split('\n')
#file2 = open('../test2.txt', "w")
word = set()
for line in file1:
    if line:
        sline = line.split('\t')
        if sline[1] not in word:
            print sline[0], sline[1]              
            word.add(sline[1])
#file2.close()

The code prints:

1   a
2   b
3   c
5   d

Any suggestions?

4 answers:

Answer 0 (score: 3)

You can use collections.OrderedDict here:

from collections import OrderedDict
with open('abc') as f:
    dic = OrderedDict()
    for line in f:
        # v is the line number, k is the letter
        v,k = line.split()
        dic.setdefault(k,[]).append(v)

Now dic looks like:

OrderedDict([('a', ['1', '4']), ('b', ['2']), ('c', ['3', '6']), ('d', ['5'])])

Now we only need the keys whose list contains exactly one item.

for k,v in dic.iteritems():
    if len(v) == 1:
        print v[0],k

Output:

2 b
5 d
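
For readers on Python 3, the same grouping idea could be sketched roughly like this. It is only a sketch: the helper name `unique_rows` is made up for illustration, and it assumes every non-empty line holds exactly two whitespace-separated fields. Plain dicts keep insertion order in Python 3.7+, so OrderedDict is not needed there.

```python
def unique_rows(lines):
    """Return (number, letter) pairs whose letter occurs on exactly one line.

    `unique_rows` is a hypothetical helper; it assumes two
    whitespace-separated fields per non-empty line.
    """
    groups = {}  # letter -> list of line numbers, in first-seen order
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        number, letter = line.split()
        groups.setdefault(letter, []).append(number)
    # Keep only the letters that were seen on a single line.
    return [(nums[0], letter) for letter, nums in groups.items() if len(nums) == 1]


# In-memory stand-in for the contents of test.txt from the question.
sample = ["1\ta", "2\tb", "3\tc", "4\ta", "5\td", "6\tc"]
for number, letter in unique_rows(sample):
    print(number, letter, sep="\t")
```

With real files, an open file object can be passed in place of `sample`, and each pair written to test2.txt instead of printed.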

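Answer 1 (score: 1)

What your code does is make sure each item (letter) is printed only once. That is clearly not what you want.

You have to split the code into two halves: one that reads the file and collects statistics on how often each letter occurs, and one that prints only the entries with count == 1.

Converting your original code (I simplified it a little):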

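file1 = open('../test.txt')
words = {}
for line in file1:
    line = line.rstrip('\n')  # drop the newline so it is not part of the letter
    if line:
        line_num, letter = line.split('\t')
        if letter not in words:
            words[letter] = [1, line_num]  # [count, first line number]
        else:
            words[letter][0] += 1
file1.close()

# Note: a plain dict does not guarantee the original line order here.
for letter, (count, line_num) in words.iteritems():
    if count == 1:
        print line_num, letter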
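
The two-pass idea described in this answer (count first, then filter) might also be written with collections.Counter in Python 3, which keeps the original lines in file order. This is a sketch under assumptions: `drop_repeated` and its `sep` parameter are illustrative names that do not come from the answer, and every non-empty line is assumed to contain the separator.

```python
from collections import Counter


def drop_repeated(lines, sep="\t"):
    """Keep only the lines whose second field appears exactly once.

    `drop_repeated` and `sep` are hypothetical names used for illustration.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    # Pass 1: count how often each letter occurs.
    counts = Counter(row.split(sep)[1] for row in rows)
    # Pass 2: keep rows whose letter was seen once, in file order.
    return [row for row in rows if counts[row.split(sep)[1]] == 1]


# In-memory stand-in for the contents of test.txt from the question.
sample = ["1\ta\n", "2\tb\n", "3\tc\n", "4\ta\n", "5\td\n", "6\tc\n"]
print("\n".join(drop_repeated(sample)))
```

Because the rows are filtered rather than looked up in a dict, the output order matches the input file even on Python versions where dicts are unordered.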

Answer 2 (score: 1)

I tried to keep it as close to your style as possible:

file1 = open('../test.txt').read().split('\n')

word = set()       # letters seen so far
test = []          # every non-empty line, re-joined
sin_duple = []     # letters that occur more than once
for line in file1:
    if line:
        # Splits on three spaces; use '\t' if the file is tab-separated.
        sline = line.split('   ')
        test.append("   ".join([sline[0], sline[1]]))
        if sline[1] not in word:
            word.add(sline[1])
        else:
            sin_duple.append(sline[1])

# Keep only the lines whose letter never repeated. Building a new list
# avoids removing items from a list while iterating over it.
test = [item for item in test if item.split('   ')[1] not in sin_duple]

file2 = open("../test2.txt", 'w')
for item in test:
    file2.write("%s\n" % item)
file2.close()

Answer 3 (score: 0)

How about some pandas:

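import pandas as pd

# Assumes a comma-separated file with a header row and a column named "a".
a = pd.read_csv("test_remove_dupl.txt", sep=",")

# `cols` was renamed to `subset` in later pandas versions; keep=False drops
# every row whose value is duplicated instead of keeping the first one.
b = a.drop_duplicates(subset="a", keep=False)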