Question

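I only want to retrieve the unique words in a file. This is what I have so far, but is there a better way to do this in Python in terms of big-O notation? Right now this is n squared.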

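def retHapax():
    myMap = {}      # every word seen so far
    uniqueMap = {}  # words seen exactly once so far
    with open("myfile.txt") as file:
        for i in file:
            myList = i.split(' ')
            for j in myList:
                j = j.rstrip()
                if j in myMap:
                    # pop() with a default avoids a KeyError on a third occurrence
                    uniqueMap.pop(j, None)
                else:
                    myMap[j] = 1
                    uniqueMap[j] = 1
    print(uniqueMap)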

Answer 1

If you want to find all unique words and treat foo the same as foo., you need to strip the punctuation.

from collections import Counter
from string import punctuation

with open("myfile.txt") as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())

print([word for word, count in word_counts.items() if count == 1])

If you also want to ignore case, use line.lower() as well. If you want to identify unique words accurately, there is more to it than splitting lines on whitespace.
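As a rough sketch of what "more than splitting on whitespace" could look like, the snippet below lower-cases each line and extracts words with a regular expression instead of str.split(). The function name find_hapaxes and the pattern [a-z0-9']+ are assumptions made for illustration, not part of the answer above; the pattern only recognizes ASCII letters, digits, and apostrophes.

```python
import re
from collections import Counter

# Hypothetical pattern: a word is a run of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def find_hapaxes(lines):
    """Return the words that occur exactly once, ignoring case and punctuation."""
    counts = Counter(
        word
        for line in lines
        for word in WORD_RE.findall(line.lower())
    )
    return sorted(word for word, count in counts.items() if count == 1)


if __name__ == "__main__":
    sample = ["Foo bar, foo.", "Baz bar!"]
    print(find_hapaxes(sample))  # ['baz']
```

Any iterable of lines works here, so an open file object can be passed in place of the sample list.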

Answer 2

I would go with the collections.Counter approach, but if you only want to use sets, you can do it like this:

with open('myfile.txt') as input_file:
    all_words = set()
    dupes = set() 
    for word in (word for line in input_file for word in line.split()):
        if word in all_words:
            dupes.add(word)
        all_words.add(word)

    unique = all_words - dupes

Given the input:

one two three
two three four
four five six

The output is:

{'five', 'one', 'six'}
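To check that result without creating a file, here is a minimal sketch that wraps the same set logic in a function and feeds it the sample input through io.StringIO. The function name unique_words is a placeholder chosen for this example.

```python
import io


def unique_words(lines):
    """Return the set of words that appear exactly once in an iterable of lines."""
    all_words = set()  # every word seen at least once
    dupes = set()      # words seen more than once
    for line in lines:
        for word in line.split():
            if word in all_words:
                dupes.add(word)
            all_words.add(word)
    return all_words - dupes


sample = io.StringIO("one two three\ntwo three four\nfour five six\n")
print(sorted(unique_words(sample)))  # ['five', 'one', 'six']
```

Set membership tests and insertions take constant time on average, so this is a single pass over the words, roughly linear in the size of the input rather than quadratic.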

Answer 3

Try this to get the unique words in the file, using Counter:

from collections import Counter

with open("myfile.txt") as input_file:
    word_counts = Counter(word for line in input_file for word in line.split())

# List of unique words (words that appear exactly once)
unique_words = [word for word, count in word_counts.items() if count == 1]
print(unique_words)

Answer 4

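You can tweak your logic slightly and move a word out of the unique collection on its second occurrence (for example, using sets instead of dicts):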

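words = set()         # words seen two or more times
unique_words = set()  # words seen exactly once so far
with open("myfile.txt") as f:
    # split() with no argument skips the empty strings that split(' ') would yield
    for w in (word for line in f for word in line.split()):
        if w in words:
            continue
        if w in unique_words:
            # second occurrence: no longer unique
            unique_words.remove(w)
            words.add(w)
        else:
            unique_words.add(w)
print(unique_words)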

Finding words that appear only once
