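I want to replace every word that occurs only once in a sentence with '<unk>'. For example, given the sentence: hello hello world my world
, I want the output to be hello hello world <unk> world
. How can I do this?
This is what I am doing right now: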
wordlist1 = trainfiles.split(None)
wordlist2 = []
# Strip one kind of trailing punctuation character from each word.
for word1 in wordlist1:
    lastchar = word1[-1:]
    if lastchar in [",", ".", "!", "?", ";"]:
        word2 = word1.rstrip(lastchar)
    else:
        word2 = word1
    wordlist2.append(word2)
# Count how often each cleaned word occurs.
freq = {}
for word2 in wordlist2:
    freq[word2] = freq.get(word2, 0) + 1
keylist = sorted(freq)
# Print one line per distinct word, masking words seen only once.
for key2 in keylist:
    if freq[key2] == 1:
        print("%-10s %d" % ('<unk>', freq[key2]))
    else:
        print("%-10s %d" % (key2, freq[key2]))
This gives me the output:
hello 2
<unk> 1
world 2
However, I need output like this:
hello hello world <unk> world
How can I do that?
Answer 0: (score: 5)
Use collections.Counter to count how often each word occurs in the sentence:
from collections import Counter
s = 'hello hello world my world'
counts = Counter(s.split())
Then use a generator expression to replace every word whose count is 1, and join the results with a space character.
replaced = ' '.join(i if counts[i] > 1 else '<unk>' for i in s.split())
Result:
'hello hello world <unk> world'
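For reuse, the two steps above could be wrapped in a small helper. This is only a sketch: the function name replace_singletons and the parameters placeholder and min_count are illustrative choices, not part of the original answer. It splits on whitespace only, so punctuation attached to a word counts as part of that word.

```python
from collections import Counter


def replace_singletons(sentence, placeholder='<unk>', min_count=2):
    """Replace every word seen fewer than min_count times with placeholder.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    words = sentence.split()      # split on any run of whitespace
    counts = Counter(words)       # maps each word to its number of occurrences
    return ' '.join(w if counts[w] >= min_count else placeholder for w in words)


print(replace_singletons('hello hello world my world'))
# prints: hello hello world <unk> world
```

Splitting once and reusing the list avoids calling s.split() twice, which matters only for long inputs.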
Answer 1: (score: 2)
As @Cyber pointed out, the key is to use collections.Counter. This version preserves the punctuation and whitespace of the original line.
import re
from collections import Counter

trainfiles = 'hello hello, world my world!'
# Extract whole words only, so punctuation does not affect the counts.
wordlist = re.findall(r'\b\w+\b', trainfiles)
counts = Counter(wordlist)
for word, count in counts.items():
    if count == 1:
        # \b anchors keep the substitution on whole-word matches.
        trainfiles = re.sub(r'\b{}\b'.format(word), '<unk>', trainfiles)
print(trainfiles)
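The same idea can also be written as a single substitution pass by giving re.sub a replacement function instead of looping over the counts. This is a sketch of a variant, not the answer's own code; the names WORD and replace_singletons_keep_punct are made up for illustration. One possible advantage is that text already replaced is never scanned again, so a lone word "unk" in the input cannot end up wrapped twice.

```python
import re
from collections import Counter

# Whole-word pattern, same as the one used for counting above.
WORD = re.compile(r'\b\w+\b')


def replace_singletons_keep_punct(text, placeholder='<unk>'):
    """Mask words that occur once, leaving punctuation and spacing untouched.

    Hypothetical helper: the name and signature are illustrative only.
    """
    counts = Counter(WORD.findall(text))

    def mask(match):
        word = match.group(0)
        # Keep repeated words; swap out words seen exactly once.
        return word if counts[word] > 1 else placeholder

    return WORD.sub(mask, text)


print(replace_singletons_keep_punct('hello hello, world my world!'))
# prints: hello hello, world <unk> world!
```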