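How to replace words that occur only once in a sentence in Python

Time: 2014-10-21 19:14:47

Tags: python

I want to replace every word that occurs only once in a sentence with '<unk>'. For example, given the sentence hello hello world my world, I want the output hello hello world <unk> world. How can I do this?

This is what I do now: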

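# Assumed sample input (the sentence from the question); the original
# snippet obtains trainfiles elsewhere.
trainfiles = 'hello hello world my world'

wordlist1 = trainfiles.split(None)
wordlist2 = []
for word1 in wordlist1:
    # Strip trailing punctuation from the word, if present.
    lastchar = word1[-1:]
    if lastchar in [",", ".", "!", "?", ";"]:
        word2 = word1.rstrip(lastchar)
    else:
        word2 = word1
    wordlist2.append(word2)

# Count how often each cleaned word occurs.
freq = {}
for word2 in wordlist2:
    freq[word2] = freq.get(word2, 0) + 1
keylist = sorted(freq)

# Print one line per distinct word, masking words that occur once.
for key2 in keylist:
    if freq[key2] == 1:
        print("%-10s %d" % ('<unk>', freq[key2]))
    else:
        print("%-10s %d" % (key2, freq[key2]))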

This gives me the output:

hello   2
<unk>   1
world   2

But I need output like this:

hello hello world <unk> world

How can I do that?

2 answers:

Answer 0 (score: 5)

Use collections.Counter to count how often each word occurs in the sentence:

from collections import Counter
s = 'hello hello world my world'
counts = Counter(s.split())

Then use a generator expression to replace every word whose count is 1, and join the result with a space character.

replaced = ' '.join(i if counts[i] > 1 else '<unk>' for i in s.split())

Result:

'hello hello world <unk> world'
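The two steps above can be wrapped in a small helper. This is only a minimal sketch: the function name `replace_singletons` and the `unk` parameter are hypothetical names chosen for illustration and are not part of the answer. Like the answer, it splits on whitespace only, so a word with attached punctuation (for example `hello,`) counts as a different token from `hello`.

```python
from collections import Counter


def replace_singletons(sentence, unk='<unk>'):
    """Return the sentence with every word that occurs once replaced by unk.

    Hypothetical helper for illustration; tokens are split on whitespace.
    """
    words = sentence.split()
    counts = Counter(words)  # word -> number of occurrences
    # Keep words seen more than once, mask the rest, preserving word order.
    return ' '.join(w if counts[w] > 1 else unk for w in words)


print(replace_singletons('hello hello world my world'))
# hello hello world <unk> world
```

Because the words are re-joined with single spaces, any original spacing between words is not preserved.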

Answer 1 (score: 2)

As @Cyber pointed out, the key is to use collections.Counter. This version preserves the punctuation and whitespace of the original line.

import re
from collections import Counter
trainfiles = 'hello hello, world my world!'

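# Extract the words (ignoring punctuation) and count them.
wordlist = re.findall(r'\b\w+\b', trainfiles)
wordlist = Counter(wordlist)
# Replace each word that occurs exactly once, leaving the rest of the line untouched.
for word, count in wordlist.items():
    if count == 1:
        trainfiles = re.sub(r'\b{}\b'.format(word), '<unk>', trainfiles)

print(trainfiles)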
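A possible variant of the same idea does the replacement in a single pass by giving re.sub a callback, instead of running one substitution per rare word. This is a sketch under stated assumptions, not the answerer's method: the function name `mask_rare_words` is hypothetical, and a "word" is assumed to be any run of `\w` characters.

```python
import re
from collections import Counter


def mask_rare_words(text, unk='<unk>'):
    """Replace words that occur once in text, keeping punctuation and spacing.

    Hypothetical helper for illustration; words are runs of \\w characters.
    """
    counts = Counter(re.findall(r'\w+', text))

    def repl(match):
        word = match.group(0)
        # Keep repeated words as they are, mask words seen only once.
        return word if counts[word] > 1 else unk

    # Each word in the original text is visited exactly once.
    return re.sub(r'\w+', repl, text)


print(mask_rare_words('hello hello, world my world!'))
# hello hello, world <unk> world!
```

Since the original text is scanned only once, placeholders that have already been inserted are never matched again, which could otherwise happen if the input itself contained the word unk exactly once.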