I want to write a function that takes a file name as a string and returns True if the file contains duplicate words, and False otherwise.
So far, I have:
def double(filename):
    infile = open(filename, 'r')
    res = False
    l = infile.split()
    infile.close()
    for line in l:
        #if line is in l twice
        res = True
    return res
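If my file contains: "there is there a same word"
I should get True.
If my file contains: "there is no same word"
I should get False.
How do I determine whether a string contains duplicate words?
P.S. The duplicate words do not have to come one right after the other, i.e. "there is a same word in the sentence over there" should return True, because "there" is also duplicated.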
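Answer 0 (score: 4)
Because of apostrophes and punctuation, the str.split() method does not work well for splitting natural English text into words. You usually need the power of regular expressions: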
>>> import re
>>> text = """I ain't gonna say ain't, because it isn't
... in the dictionary. But my dictionary has it anyways."""
>>> text.lower().split()
['i', "ain't", 'gonna', 'say', "ain't,", 'because', 'it', "isn't", 'in', 'the',
 'dictionary.', 'but', 'my', 'dictionary', 'has', 'it', 'anyways.']
>>> words = re.findall(r"[a-z']+", text.lower())
>>> words
['i', "ain't", 'gonna', 'say', "ain't", 'because', 'it', "isn't", 'in', 'the',
 'dictionary', 'but', 'my', 'dictionary', 'has', 'it', 'anyways']
To find out whether there are any duplicate words, you can use set operations:
>>> len(words) != len(set(words))
True
To list the duplicate words, use the multiset operations in collections.Counter:
>>> from collections import Counter
>>> sorted(Counter(words) - Counter(set(words)))
["ain't", 'dictionary', 'it']
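Putting those pieces together for the file-based function in the question, a minimal sketch might look like this. The name has_duplicate_words and the decision to lowercase the text before matching are assumptions for illustration, not something the question specifies:

```python
import re


def has_duplicate_words(filename):
    """Return True if any word occurs more than once in the file."""
    with open(filename, 'r') as infile:
        text = infile.read().lower()
    # Words are runs of ASCII letters and apostrophes (same regex as above)
    words = re.findall(r"[a-z']+", text)
    # A set drops repeats, so a size mismatch means some word is duplicated
    return len(words) != len(set(words))
```

Because the text is lowercased and punctuation is ignored, "There" and "there," count as the same word here. Drop the .lower() call if case should matter.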
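Answer 1 (score: 3)
def has_duplicates(filename):
    seen = set()  # words encountered so far
    # The with block closes the file even on the early return
    with open(filename) as infile:
        for line in infile:
            for word in line.split():
                if word in seen:
                    return True
                seen.add(word)
    return False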
Answer 2 (score: 0)
Use a set to detect duplicates:
def double(filename):
    seen = set()
    with open(filename, 'r') as infile:
        for line in infile:
            for word in line.split():
                if word in seen:
                    return True
                seen.add(word)
    return False
You can shorten this to:
def double(filename):
    seen = set()
    with open(filename, 'r') as infile:
        # seen.add() returns None, so the `or` branch only records the word
        return any(word in seen or seen.add(word)
                   for line in infile for word in line.split())
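Both versions exit early: as soon as a duplicate word is found, the function returns True. Only when there are no duplicates does it have to read the whole file before returning False.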
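If you also want to know which word triggered the early exit, the same loop can return the word itself. This is only a sketch: first_duplicate is a hypothetical helper, and it takes any iterable of lines (an open file works) rather than a file name, which is a change from the function in the question:

```python
def first_duplicate(lines):
    """Return the first word that is seen a second time, or None."""
    seen = set()
    for line in lines:
        for word in line.split():
            if word in seen:
                return word  # early exit: stop at the first repeat
            seen.add(word)
    return None  # every word was unique, so all lines were read
```

For a file you would call it as first_duplicate(infile) inside a with open(filename) as infile: block. The result is truthy exactly when a duplicate exists, provided words are non-empty strings, which line.split() guarantees.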
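Answer 3 (score: 0)
def double(filename):
    # Read the whole file and split it into words
    with open(filename, 'r') as infile:
        l = infile.read().split()
    a = set()
    for line in l:
        if line in a:
            return True
        a.add(line)
    return False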
Answer 4 (score: 0)
Another general way to detect duplicate words, using collections.Counter:
from itertools import chain
from collections import Counter
with open('test_file.txt') as f:
    # Count every whitespace-separated word across all lines
    x = Counter(chain.from_iterable(line.split() for line in f))
for (key, value) in x.items():
    if value > 1:
        print(key)
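To get the duplicates back as a value instead of printing them, the Counter approach can be wrapped in a function. This is a sketch for Python 3; duplicate_words is a hypothetical name, and it accepts any iterable of lines so that it can be tried without a file on disk:

```python
from collections import Counter
from itertools import chain


def duplicate_words(lines):
    """Return the words that occur more than once, most frequent first."""
    counts = Counter(chain.from_iterable(line.split() for line in lines))
    # most_common() orders by count, highest first
    return [word for word, n in counts.most_common() if n > 1]
```

The True/False answer the question asks for is then bool(duplicate_words(f)). Unlike the set-based versions, this always reads the whole input, so it is the better fit only when you actually need the list of repeated words.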