Finding words that appear only once

Time: 2015-04-02 12:08:41

Tags: python

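I'm retrieving only the unique words in a file. This is what I have so far, but is there a better way to do this in Python in terms of big-O notation? Right now it is n squared.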

def retHapax():
    file = open("myfile.txt")
    myMap = {}
    uniqueMap = {}
    for i in file:
        myList = i.split(' ')
        for j in myList:
            j = j.rstrip()
            if j in myMap:
                # pop() avoids a KeyError on the third and later occurrences
                uniqueMap.pop(j, None)
            else:
                myMap[j] = 1
                uniqueMap[j] = 1
    file.close()
    print(uniqueMap)

4 answers:

Answer 0 (score: 3)

If you want to find all unique words and treat foo the same as foo., you need to strip the punctuation.

from collections import Counter
from string import punctuation

with open("myfile.txt") as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())

# items() works on both Python 2 and 3; iteritems() exists only on Python 2
print([word for word, count in word_counts.items() if count == 1])

If you want to ignore case, you also need to use line.lower(). If you want exactly the unique words, there is more to it than splitting the lines on whitespace.
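As a rough sketch of the case-insensitive variant, the snippet below lowercases each line and pulls words out with a regular expression instead of split(). The function name hapaxes and the pattern [a-z0-9']+ are illustrative assumptions, not part of the answer above; what counts as a word depends on your data.

```python
import re
from collections import Counter

# Hypothetical token pattern: runs of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def hapaxes(lines):
    """Return the words that occur exactly once across all lines, ignoring case."""
    counts = Counter(
        word
        for line in lines
        for word in WORD_RE.findall(line.lower())
    )
    return [word for word, count in counts.items() if count == 1]


if __name__ == "__main__":
    sample = ["Foo bar, foo.", "Baz bar!"]
    print(hapaxes(sample))  # ['baz']
```

Since a file object yields lines when iterated, the same function can be called as hapaxes(f) inside a with open("myfile.txt") as f: block.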

Answer 1 (score: 3)

I would go with the collections.Counter approach, but if you want to use sets, you can do it like this:

with open('myfile.txt') as input_file:
    all_words = set()
    dupes = set() 
    for word in (word for line in input_file for word in line.split()):
        if word in all_words:
            dupes.add(word)
        all_words.add(word)

    unique = all_words - dupes

Given the input:

one two three
two three four
four five six

The output is:

{'five', 'one', 'six'}
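
To try this without creating a file, here is a small sketch that runs the same two-set logic over the sample input through io.StringIO. The function name unique_words is made up for illustration. Because sets are unordered, the result is sorted before printing; the order in which a raw set is displayed can vary.

```python
import io


def unique_words(lines):
    """Return the set of words seen exactly once, using two sets."""
    all_words = set()  # every word seen so far
    dupes = set()      # words seen at least twice
    for line in lines:
        for word in line.split():
            if word in all_words:
                dupes.add(word)
            all_words.add(word)
    return all_words - dupes


if __name__ == "__main__":
    sample = io.StringIO("one two three\ntwo three four\nfour five six\n")
    print(sorted(unique_words(sample)))  # ['five', 'one', 'six']
```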

Answer 2 (score: 2)

Try this to get the unique words in the file, using Counter:

from collections import Counter

with open("myfile.txt") as input_file:
    word_counts = Counter(word for line in input_file for word in line.split())

# List of unique words (words that appear exactly once)
unique_words = [word for word, count in word_counts.items() if count == 1]

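Answer 3 (score: 1)

You can tweak your logic slightly and move a word out of the unique collection on its second occurrence (for example, using sets instead of dicts):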

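words = set()         # words seen more than once
unique_words = set()  # words seen exactly once so far
with open("myfile.txt") as f:
    # split() with no argument drops empty strings and trailing newlines
    for w in (word for line in f for word in line.split()):
        if w in words:
            continue
        if w in unique_words:
            unique_words.remove(w)
            words.add(w)
        else:
            unique_words.add(w)
print(unique_words)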
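
A quick way to sanity-check this single-pass logic is to wrap it in a function and feed it a word that occurs three times, which must stay excluded after its second occurrence. The name once_only is a hypothetical helper, not something from the answer. Each word costs a constant number of set operations on average, so the whole pass is linear in the number of words.

```python
def once_only(words):
    """Return the set of words that occur exactly once in an iterable of words."""
    seen_again = set()  # words seen two or more times
    seen_once = set()   # words seen exactly once so far
    for w in words:
        if w in seen_again:
            continue
        if w in seen_once:
            # Second occurrence: move the word out of the "once" set.
            seen_once.remove(w)
            seen_again.add(w)
        else:
            seen_once.add(w)
    return seen_once


if __name__ == "__main__":
    # "a" occurs three times, "b" twice, "c" and "d" once each.
    print(sorted(once_only("a b a c a b d".split())))  # ['c', 'd']
```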