Question

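I'm playing with a function that takes 3 arguments: the name of a text file, substring1, and substring2. It searches the text file and returns the words that contain both substrings: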

def myfunction(filename, substring1, substring2):
    result = ""
    text=open(filename).read().split()
    for word in text:
        if substring1 in word and substring2 in word:
            result+=word+" "
    return result

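This function works, but I'd like to remove duplicate results. For example, with my particular text file, if substring1 is "at" and substring2 is "wh", it returns "what"; however, because there are 3 occurrences of "what" in my text file, it returns all of them. I'm looking for a way to return no duplicates, only unique words, and I also want to preserve ORDER, so does that rule out sets?

I thought maybe doing something to text would work, somehow removing the duplicates before the loop.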

Answer 1

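Here is a solution that uses little memory (it iterates over the lines of the file instead of reading it all at once) and has good time complexity (which matters when the list of returned words is large, for example when substring1 is "a" and substring2 is "e" in English text):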

import collections

def find_words(file_path, substring1, substring2):
    """Return a string with the words from the given file that contain both substrings."""
    matching_words = collections.OrderedDict()
    with open(file_path) as text_file:
        for line in text_file:
            for word in line.split():
                if substring1 in word and substring2 in word:
                    matching_words[word] = True
    return " ".join(matching_words)

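OrderedDict preserves the order in which keys are first inserted, so the words are kept in the order they were found. Since it is a mapping, it holds no duplicate words. The good time complexity comes from the fact that inserting a key into an OrderedDict takes constant time on average (as opposed to the linear time of the if word in result_list test used by many other solutions).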
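As a side note, on Python 3.7 and later a plain dict also preserves insertion order, so the same idea can be sketched without importing anything. The helper name unique_in_order and the sample sentence below are made up for illustration; this is a minimal sketch, not part of the answer's original code.

```python
def unique_in_order(words):
    """Return the distinct items of `words`, in order of first appearance."""
    # dict.fromkeys builds a dict whose keys are the items; duplicates collapse
    # onto the first occurrence, and insertion order is kept (Python 3.7+).
    return list(dict.fromkeys(words))


# Hypothetical sample text standing in for the contents of the file.
sample = "what when what hat whatever what".split()
matches = [w for w in sample if "at" in w and "wh" in w]
print(" ".join(unique_in_order(matches)))  # prints: what whatever
```

On older Python versions, collections.OrderedDict remains the safe choice.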

Answer 2

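No, all you need to do is make result a list instead of a string. Then, before adding each word, you can check if word not in result:. Later you can convert the list to a space-separated string with ' '.join(result).

This preserves the order in which the words were found, whereas a set would not.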

Answer 3

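I think the best way to preserve order is to make result a list and check whether each word is already in the list before adding it. Also, you should use the with context manager to handle files, to make sure they are closed properly: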

def myfunction(filename, substring1, substring2):
    result = []
    with open(filename) as f:
        text = f.read().split()
    for word in text:
        if substring1 in word and substring2 in word and word not in result:
            result.append(word)
    return " ".join(result)

Answer 4

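Please use the with statement so that the file is handled by a context manager. Using a list and testing whether the string is already in the list will do the job for you: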

def myfunction(filename, substring1, substring2):
    result = []
    with open(filename) as f:
        for word in f.read().split():
            if substring1 in word and substring2 in word:
                # Only keep the first occurrence of each matching word.
                if word not in result:
                    result.append(word)
    return result

Consider returning a list rather than a string, since you can always easily convert the list to a string, like this:

r = myfunction(arg1, arg2, arg3)
print(",".join(r))

Edit:

@EOL is absolutely right, so here is a more time-efficient (but slightly less memory-efficient) approach:

from collections import OrderedDict

def myfunction(filename, substring1, substring2):
    result = OrderedDict()
    with open(filename) as f:
        for word in f.read().split():
            if substring1 in word and substring2 in word:
                result[word] = None  # here we don't care about the stored value, only the key
    return list(result)  # the keys, i.e. the unique words in order of first appearance

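OrderedDict is a dictionary that preserves insertion order. The keys of a dictionary are a special case of a set, sharing the property of holding only unique values. So if a key is already in the dict and is inserted a second time, it keeps its original position and only its stored value is overwritten, which here changes nothing. This operation is faster than looking up a value in a list.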
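The same fast membership test can also be had by pairing the result list with a set of words already seen: the list keeps the order, the set answers "have I seen this word?" in constant time on average. The following is a minimal sketch of that idea, assuming the input is any iterable of lines (an open file works); the name find_unique_words and the sample lines are hypothetical.

```python
def find_unique_words(lines, substring1, substring2):
    """Return words containing both substrings, without duplicates, in order found."""
    seen = set()    # fast membership test for words already kept
    ordered = []    # keeps the words in order of first appearance
    for line in lines:
        for word in line.split():
            if substring1 in word and substring2 in word and word not in seen:
                seen.add(word)
                ordered.append(word)
    return ordered


# Hypothetical sample lines standing in for a text file.
sample_lines = ["what is that", "somewhat what", "whatever"]
print(find_unique_words(sample_lines, "at", "wh"))
# prints: ['what', 'somewhat', 'whatever']
```

With a real file, this could be called as find_unique_words(f, "at", "wh") inside a with open(filename) as f: block, which also avoids reading the whole file into memory.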

Remove duplicate words from text file input?
