I want to remove duplicate words from a text file.
I have some text files containing content like the following:
None_None
ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624
None_None
ColumnConverter_56963312
ColumnConverter_56963312
PredicatesFactory_56963424
PredicatesFactory_56963424
PredicateConverter_56963648
PredicateConverter_56963648
ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888
The resulting output must be:
None_None
ConfigHandler_56663624
ColumnConverter_56963312
PredicatesFactory_56963424
PredicateConverter_56963648
ConfigHandler_80134888
I tried only this command: en = set(open('file.txt')) but it doesn't work.
Can anyone help me figure out how to extract only the unique set of entries from the file? Thanks.
Answer 0 (score: 6)
Here's a simple solution that uses a set to remove the duplicates from the text file:
# Read all lines, then deduplicate them with a set (order is not preserved)
lines = open('workfile.txt', 'r').readlines()
lines_set = set(lines)

# Overwrite the file with only the unique lines
out = open('workfile.txt', 'w')
for line in lines_set:
    out.write(line)
out.close()
Answer 1 (score: 4)
Here's an option that preserves order (unlike a set) but still has the same behaviour (note that the EOL character is deliberately stripped and blank lines are ignored)...
from collections import OrderedDict

with open('/home/jon/testdata.txt') as fin:
    lines = (line.rstrip() for line in fin)  # strip EOL characters
    # fromkeys keeps only the first occurrence of each line, in order
    unique_lines = OrderedDict.fromkeys(line for line in lines if line)

print unique_lines.keys()
# ['None_None', 'ConfigHandler_56663624', 'ColumnConverter_56963312', 'PredicatesFactory_56963424', 'PredicateConverter_56963648', 'ConfigHandler_80134888']
Then you just need to write the above to your output file.
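For example, a minimal sketch of that last step (the output path out.txt is just a placeholder):

# Write each unique line back out, restoring the EOL that was stripped
with open('out.txt', 'w') as fout:
    for line in unique_lines:
        fout.write(line + '\n')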
Answer 2 (score: 1)
Here's a way to do it using a set (the result is unordered):
from pprint import pprint

with open('input.txt', 'r') as f:
    pprint(set(f.readlines()))
Also, you may want to get rid of the newline characters.
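For instance, a minimal sketch that strips them while building the set:

with open('input.txt', 'r') as f:
    # rstrip('\n') removes the trailing newline from each line
    unique = set(line.rstrip('\n') for line in f)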
Answer 3 (score: 0)
If you just want to get non-duplicated output, you can use uniq and sort:
hvn@lappy: /tmp () $ sort -nr dup | uniq
PredicatesFactory_56963424
PredicateConverter_56963648
None_None
ConfigHandler_80134888
ConfigHandler_56663624
ColumnConverter_56963312
For Python:
In [2]: with open("dup", 'rt') as f:
   ...:     lines = f.readlines()
   ...:
In [3]: lines
Out[3]:
['None_None\n',
'\n',
'ConfigHandler_56663624\n',
'ConfigHandler_56663624\n',
'ConfigHandler_56663624\n',
'ConfigHandler_56663624\n',
'\n',
'None_None\n',
'\n',
'ColumnConverter_56963312\n',
'ColumnConverter_56963312\n',
'\n',
'PredicatesFactory_56963424\n',
'PredicatesFactory_56963424\n',
'\n',
'PredicateConverter_56963648\n',
'PredicateConverter_56963648\n',
'\n',
'ConfigHandler_80134888\n',
'ConfigHandler_80134888\n',
'ConfigHandler_80134888\n',
'ConfigHandler_80134888\n']
In [4]: set(lines)
Out[4]:
set(['ColumnConverter_56963312\n',
'\n',
'PredicatesFactory_56963424\n',
'ConfigHandler_56663624\n',
'PredicateConverter_56963648\n',
'ConfigHandler_80134888\n',
'None_None\n'])
Answer 4 (score: 0)
# Iterate over the file's lines directly (json.load would fail on plain text)
myfile = open('yourfile', 'r')
uniq = set()
for p in myfile:
    p = p.strip()
    if p in uniq:
        print "duplicate : " + p
    else:
        uniq.add(p)
print uniq
Answer 5 (score: 0)
This way you get the same file back in place, with the duplicates removed:

import uuid
import os
def _remove_duplicates(filePath):
    f = open(filePath, 'r')
    lines = f.readlines()
    f.close()
    lines_set = set(lines)        # deduplicate (order is not preserved)
    tmp_file = str(uuid.uuid4())  # random temporary file name
    out = open(tmp_file, 'w')
    for line in lines_set:
        out.write(line)
    out.close()
    os.rename(tmp_file, filePath)  # replace the original file with the deduplicated copy
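Usage is then just, for example (workfile.txt is a placeholder path):

_remove_duplicates('workfile.txt')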
Answer 6 (score: 0)
def remove_duplicates(infile):
    storehouse = set()  # lines seen so far
    with open('outfile.txt', 'w+') as out:
        for line in open(infile):
            if line not in storehouse:  # first occurrence: keep it
                out.write(line)
                storehouse.add(line)

remove_duplicates('infile.txt')