Removing duplicates from a text file

Date: 2013-04-05 09:26:23

Tags: python string duplicates

I want to remove duplicate words from a text file.

I have some text files containing content like the following:

None_None

ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624

None_None

ColumnConverter_56963312
ColumnConverter_56963312

PredicatesFactory_56963424
PredicatesFactory_56963424

PredicateConverter_56963648
PredicateConverter_56963648

ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888

The resulting output must be:

None_None

ConfigHandler_56663624

ColumnConverter_56963312

PredicatesFactory_56963424

PredicateConverter_56963648

ConfigHandler_80134888

I only tried this command: `en = set(open('file.txt'))`, but it doesn't work.

Can anyone help me figure out how to extract only the unique set from the file?

Thanks.

7 Answers:

Answer 0: (score: 6)

Here is a simple solution that uses a set to remove the duplicates from a text file.

lines = open('workfile.txt', 'r').readlines()
lines_set = set(lines)
out = open('workfile.txt', 'w')
for line in lines_set:
    out.write(line)
out.close()  # flush the deduplicated lines to disk

Answer 1: (score: 4)

Here is an option that preserves order (unlike a set) but still has the same behavior (note that it deliberately strips the EOL characters and ignores blank lines)...

from collections import OrderedDict

with open('/home/jon/testdata.txt') as fin:
    lines = (line.rstrip() for line in fin)
    unique_lines = OrderedDict.fromkeys( (line for line in lines if line) )

print unique_lines.keys()
# ['None_None', 'ConfigHandler_56663624', 'ColumnConverter_56963312', 'PredicatesFactory_56963424', 'PredicateConverter_56963648', 'ConfigHandler_80134888']

Then you just need to write the above to your output file.
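For completeness, a minimal sketch of that last step (the file names and sample data here are placeholders, not from the question):

```python
from collections import OrderedDict

# Build a small sample input file standing in for the real data.
SAMPLE = ("None_None\n\n"
          "ConfigHandler_56663624\nConfigHandler_56663624\n\n"
          "ColumnConverter_56963312\nColumnConverter_56963312\n")
with open('testdata.txt', 'w') as f:
    f.write(SAMPLE)

# Deduplicate while preserving order, skipping blank lines.
with open('testdata.txt') as fin:
    lines = (line.rstrip() for line in fin)
    unique_lines = OrderedDict.fromkeys(line for line in lines if line)

# Write each unique entry followed by a blank line, matching the
# desired output format shown in the question.
with open('deduped.txt', 'w') as fout:
    for key in unique_lines:
        fout.write(key + '\n\n')
```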

Answer 2: (score: 1)

Here is how to do it using a set (unordered results):

from pprint import pprint

with open('input.txt', 'r') as f:
    pprint(set(f.readlines()))  # pprint prints directly; wrapping it in print would also emit its None return value

Also, you may want to get rid of the newline characters.
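One way to do that, sketched with a set comprehension (the sample data is a stand-in for the real file):

```python
# Build a small sample input file standing in for the real data.
with open('input.txt', 'w') as f:
    f.write('None_None\n\nConfigHandler_56663624\nConfigHandler_56663624\n')

# Strip the trailing newline from each line and skip blank lines
# while deduplicating; the result is an unordered set of strings.
with open('input.txt') as f:
    unique = {line.rstrip('\n') for line in f if line.strip()}
```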

Answer 3: (score: 0)

If you just want output without the duplicates, you can use `sort` and `uniq`:

hvn@lappy: /tmp () $ sort -nr dup | uniq
PredicatesFactory_56963424
PredicateConverter_56963648
None_None
ConfigHandler_80134888
ConfigHandler_56663624
ColumnConverter_56963312

For Python:

In [2]: with open("dup", 'rt') as f:
   ...:     lines = f.readlines()
   ...:

In [3]: lines
Out[3]: 
['None_None\n',
 '\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 '\n',
 'None_None\n',
 '\n',
 'ColumnConverter_56963312\n',
 'ColumnConverter_56963312\n',
 '\n',
 'PredicatesFactory_56963424\n',
 'PredicatesFactory_56963424\n',
 '\n',
 'PredicateConverter_56963648\n',
 'PredicateConverter_56963648\n',
 '\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n']

In [4]: set(lines)
Out[4]: 
set(['ColumnConverter_56963312\n',
     '\n',
     'PredicatesFactory_56963424\n',
     'ConfigHandler_56663624\n',
     'PredicateConverter_56963648\n',
     'ConfigHandler_80134888\n',
     'None_None\n'])

Answer 4: (score: 0)

# the file is plain text, not JSON, so just iterate over its lines
uniq = set()
for p in open('yourfile'):
    if p in uniq:
        print "duplicate : " + p
    else:
        uniq.add(p)
print uniq

Answer 5: (score: 0)

This way, the deduplicated result is written back into the same file:
import os
import uuid

def _remove_duplicates(filePath):
    f = open(filePath, 'r')
    lines = f.readlines()
    f.close()
    lines_set = set(lines)
    tmp_file = str(uuid.uuid4())
    out = open(tmp_file, 'w')
    for line in lines_set:
        out.write(line)
    out.close()  # flush before renaming over the original
    os.rename(tmp_file, filePath)

Answer 6: (score: 0)

def remove_duplicates(infile):
    storehouse = set()
    with open('outfile.txt', 'w+') as out:
        for line in open(infile):
            if line not in storehouse:
                out.write(line)
                storehouse.add(line)

remove_duplicates('infile.txt')