How to write a function that returns True if a file contains duplicate words

Time: 2014-01-29 22:21:17

Tags: python string file duplicates

Here is the function:

def duplicate(fname):
    'returns true if there are duplicates in the file, false otherwise'
    fn = open(fname, 'r')
    llst = fn.readlines()
    fn.close()

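I don't know where to go from here. I tried splitting the file, sorting it, and then writing a function that checks whether two identical words end up next to each other. But I get an error saying that a list has no split attribute.

Any ideas?

6 answers: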

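Answer 0 (score: 1)

You can add each word to a dictionary as a key. If the key already exists, the word is a duplicate. You can also store the number of times the word was found as the value.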

#!/usr/bin/env python
def duplicate(fname):
    'returns a dict mapping each word in the file to "Duplicate" or "Unique"'
    with open (fname, 'r') as file_handle:
        word_dict = dict()
        for line in file_handle:
            words = line.split()
            for word in words:
                if word in word_dict:
                    word_dict[word] = 'Duplicate'
                else:
                    word_dict[word] = 'Unique'
    return word_dict

results = duplicate('alice.txt')
for key in results:
    print("{}: {}".format(key, results[key]))
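
The counting variant mentioned above could look roughly like this. It is only a sketch, assuming Python 3; the names `count_words` and `has_duplicates` are made up for illustration and are not part of the code above. The dictionary value holds how often each word was seen, and the True/False answer the question asks for is derived from it:

```python
def count_words(fname):
    """Return a dict mapping each whitespace-separated word to its count."""
    counts = {}
    with open(fname, 'r') as file_handle:
        for line in file_handle:
            for word in line.split():
                # dict.get returns 0 for a word that has not been seen yet
                counts[word] = counts.get(word, 0) + 1
    return counts


def has_duplicates(fname):
    """Return True if any word occurs more than once in the file."""
    return any(n > 1 for n in count_words(fname).values())
```

Unlike the early-exit approaches, this always reads the whole file, which is the price of getting the counts.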

Answer 1 (score: 0)

You can use the set data structure:

def has_duplicate_words(filename):
    with open(filename, 'r') as f:
        words = set()
        for line in f:  # iterate line by line instead of loading all lines at once
            lineWords = line.split()
            for word in lineWords:
                if word in words:
                    return True

                words.add(word)
    return False

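Also note that the result depends on your definition of a word. In this solution, a word is any sequence of characters that contains no whitespace, where whitespace is what the split() documentation defines: spaces, tabs, newlines, carriage returns, and form feeds.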
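
If, for example, "The" and "the," should count as the same word, a different definition is needed. The following is a sketch of one possible choice, assuming Python 3: lowercase everything and treat runs of `\w` characters as words. The function name is hypothetical, and this definition has its own quirks (it splits "don't" into "don" and "t"):

```python
import re


def has_duplicate_words_normalized(lines):
    """Return True if any word repeats, ignoring case and punctuation.

    `lines` is any iterable of strings, for example an open text file.
    """
    seen = set()
    for line in lines:
        # \w+ matches runs of letters, digits and underscores
        for word in re.findall(r"\w+", line.lower()):
            if word in seen:
                return True
            seen.add(word)
    return False
```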

If you want to return all the duplicates, you can accumulate them in a list instead of executing return True as soon as one is found.
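
A sketch of that list-accumulating variant, assuming Python 3 (the name `find_duplicate_words` is only illustrative):

```python
def find_duplicate_words(filename):
    """Return the words that occur more than once, in order of first repeat."""
    seen = set()
    duplicates = []
    with open(filename, 'r') as f:
        for line in f:
            for word in line.split():
                if word in seen:
                    # report each duplicated word only once
                    if word not in duplicates:
                        duplicates.append(word)
                else:
                    seen.add(word)
    return duplicates
```

An empty list means there were no duplicates, so `bool(find_duplicate_words(filename))` still gives the True/False answer.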

Also note that this solution is not viable if the file may contain extremely long lines that do not fit in memory.
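
One way around the long-line problem is to read fixed-size chunks instead of lines. The sketch below assumes Python 3 and a file opened in text mode; the function name and the `chunk_size` parameter are hypothetical. A word cut in half at a chunk boundary is carried over to the next chunk. The set of distinct words still has to fit in memory.

```python
def has_duplicate_words_chunked(f, chunk_size=4096):
    """Return True if the text file object `f` contains a repeated word."""
    seen = set()
    tail = ''  # possibly incomplete word carried over from the previous chunk
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        data = tail + chunk
        words = data.split()
        # If the data does not end in whitespace, its last word may
        # continue in the next chunk, so hold it back for now.
        if words and not data[-1].isspace():
            tail = words.pop()
        else:
            tail = ''
        for word in words:
            if word in seen:
                return True
            seen.add(word)
    # At end of file the carried-over word, if any, is complete.
    return bool(tail) and tail in seen
```

It would be called as `with open(filename, 'r') as f: has_duplicate_words_chunked(f)`.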

Answer 2 (score: 0)

Is this what you are looking for?

def duplicate(fname):
    # 'with open' is better than a bare open: the file is closed even on error
    with open(fname, "r") as f:
        seen = {} # empty dictionary recording which lines have already appeared
        for line in f: # go through all lines
            try:
                seen[line] # raises KeyError if this line has not been seen yet
                return True # no error was raised, so this is a duplicated line
            except KeyError:
                seen[line] = 1 # arbitrary value; only the key matters
    return False

Another approach is to read the file, sort the lines, and then check for duplicate lines, which will be adjacent after sorting.
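
A sketch of the sorting approach, assuming Python 3 (the function name is made up). The trailing newline is removed so that a last line without a newline still compares equal to an earlier identical line:

```python
def has_duplicate_lines_sorted(fname):
    """Return True if two lines of the file are identical."""
    with open(fname, 'r') as f:
        # sorting places equal lines next to each other
        lines = sorted(line.rstrip('\n') for line in f)
    # compare each line with its successor
    return any(a == b for a, b in zip(lines, lines[1:]))
```

This needs the whole file in memory and costs O(n log n), so the dictionary version is usually the better choice.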

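Note that if the file contains the lines "foo" and "foo " (with a trailing space), this returns False rather than True, because of the space at the end of the second line.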
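
If lines that differ only in surrounding whitespace should be treated as equal, each line can be stripped before the lookup. This is a sketch under that assumption, in Python 3, with a hypothetical function name. Be aware that all blank lines normalize to the empty string, so two blank lines count as duplicates here:

```python
def has_duplicate_lines_stripped(fname):
    """Return True if two lines are equal after stripping whitespace."""
    seen = set()
    with open(fname, 'r') as f:
        for line in f:
            key = line.strip()  # drops leading/trailing spaces and the newline
            if key in seen:
                return True
            seen.add(key)
    return False
```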

Answer 3 (score: 0)

A simpler approach: compare the length of the list of words in the file with the length of the set of those words:

>>> def HasDuplicates(text):
...    words = text.split()
...    uniqueWords = set(words)
...    return len(words) != len(uniqueWords)
...
>>> str1 = "this is a sentence with two two duplicates"
>>> str2 = "this is a sentence with no duplicates"
>>> HasDuplicates(str1)
True
>>> HasDuplicates(str2)
False

(File I/O is left as an exercise for the reader; it is not closely related to the duplicates problem.)
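
For completeness, the file I/O part could be sketched as follows, assuming Python 3 and a file small enough to read in one go. The names `has_duplicates` and `file_has_duplicates` are illustrative:

```python
def has_duplicates(text):
    """Return True if any whitespace-separated word in text repeats."""
    words = text.split()
    # a set drops repeats, so it is shorter when duplicates exist
    return len(words) != len(set(words))


def file_has_duplicates(fname):
    """Apply has_duplicates to the contents of the file fname."""
    with open(fname, 'r') as f:
        return has_duplicates(f.read())
```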

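Answer 4 (score: 0)

This works:

It returns True if there are duplicates, but it also builds a dictionary with the duplicated words as keys and their frequencies in the text as values, and prints it. I know that is more than you asked for, but it would not take much to change the code so that it only checks for duplicates and returns True/False.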

def duplicate(fname):

    with open(fname, 'r') as f:
        text = f.read() # the with block closes the file afterwards

    split_text = text.split() # list of all whitespace-separated words

    duplicates = {}
    for word in split_text:
        # count whole-word occurrences; text.count(word) would also match substrings
        count = split_text.count(word)
        if count > 1:
            duplicates[word] = count
    if duplicates:
        print(duplicates)
        return True
    return False

Sample output:

{'dear': 2, 'the': 6, 'name': 2}
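
The same dictionary can also be built in a single pass with `collections.Counter`, which counts whole words only (calling `str.count` on the full text would also count "the" inside "there"). This is a sketch assuming Python 3; the function name is hypothetical:

```python
from collections import Counter


def duplicate_counts(fname):
    """Return a dict of the words that occur more than once and their counts."""
    with open(fname, 'r') as f:
        counts = Counter(f.read().split())
    # keep only the words seen at least twice
    return {word: n for word, n in counts.items() if n > 1}
```

`bool(duplicate_counts(fname))` then gives the plain True/False result.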

Answer 5 (score: 0)

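# wrapped in a function (the name is illustrative) so that return is valid
def has_duplicates(filepath):
    with open(filepath, 'r') as f:
        all_words = f.read().split()
    return len(all_words) > len(set(all_words))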