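I have a dictionary of word counts. I want to drop the words whose count is 1. How can I do that? I also want to extract bigrams from the remaining words. How can I do this?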
import codecs
file=codecs.open("Pezeshki339.txt",'r','utf8')
txt = file.read()
txt = txt[1:]
token=txt.split()
count={}
for word in token:
    if word not in count:
        count[word]=1
    else:
        count[word]+=1
for k,v in count.items():
    print(k,v)
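I have since edited my code as shown below. One question remains: how do I build a bigram matrix and smooth it with the add-one method? I would appreciate any suggestions that fit my code.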
import nltk
from collections import Counter
import codecs
token=[]
with codecs.open("Pezeshki339.txt",'r','utf8') as file:
    for line in file:
        token.extend(line.split())# collect tokens from every line, not only the last one
spl = 80*len(token)/100
train = token[:int(spl)]
test = token[int(spl):]
print(len(test))
print(len(train))
cn=Counter(train)
known_words=([word for word,v in cn.items() if v>1])# removes the rare words and puts them in a list
print(known_words)
print(len(known_words))
bigram=nltk.bigrams(known_words)
frequency=nltk.FreqDist(bigram)
for f in frequency:
    print(f,frequency[f])
Answer 0 (score: 3)
Use a Counter dict to count the words, then filter .items() to drop the keys whose value is 1:
from collections import Counter
import codecs
with codecs.open("Pezeshki339.txt",'r','utf8') as f:
    cn = Counter(word for line in f for word in line.split())
print(dict((word,v )for word,v in cn.items() if v > 1 ))
If you only want the words, use a list comprehension:
print([word for word,v in cn.items() if v > 1 ])
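You don't need to call read; you can split each line as you iterate over the file. If you also want to remove punctuation, strip it from each word:
from string import punctuation
# goes inside the with block, in place of the earlier cn = Counter(...) line
cn = Counter(word.strip(punctuation) for line in f for word in line.split())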
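To make the pieces above easier to try out, here is a minimal self-contained sketch. It assumes an in-memory sample (the `sample` text is hypothetical and only stands in for Pezeshki339.txt), and it skips tokens that become empty after stripping. Note that `string.punctuation` only contains ASCII punctuation, so non-ASCII punctuation marks would need to be added to the strip set separately.

```python
from collections import Counter
from string import punctuation
import io

# Hypothetical sample standing in for the contents of Pezeshki339.txt.
sample = io.StringIO("the cat sat.\nthe dog sat, the end\n")

# Strip leading/trailing ASCII punctuation from each token and
# skip tokens that are empty after stripping.
cn = Counter(
    w
    for line in sample
    for w in (tok.strip(punctuation) for tok in line.split())
    if w
)

# Keep only the words that occur more than once.
frequent = {word: n for word, n in cn.items() if n > 1}
print(frequent)
```

The dict comprehension does the same job as the `dict(...)` call over a generator shown earlier; which one to use is a matter of taste.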
Answer 1 (score: 3)
import collections
c = collections.Counter(['a', 'a', 'b']) # Just an example - use your words
[w for (w, n) in c.items() if n > 1] # items() works on Python 2 and 3; iteritems() is Python 2 only
Answer 2 (score: 0)
Padraic's solution works well. But here is a solution you can simply append below your existing code instead of rewriting it completely:
newdictionary = {}
for k,v in count.items():
    if v != 1:
        newdictionary[k] = v
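As for the add-one (Laplace) smoothing part of the question, the following is only a rough sketch under some assumptions, not a definitive implementation. Instead of deleting the rare words, it maps them to a placeholder token (the names `add_one_bigram_probs`, `min_count` and `<UNK>` are made up for illustration), so that bigrams are still taken over adjacent tokens of the text. Calling `nltk.bigrams` on a list of distinct words, as in the question, pairs neighbours in that list rather than neighbours in the text. The smoothed estimate used here is P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V), where V is the vocabulary size.

```python
from collections import Counter

def add_one_bigram_probs(tokens, min_count=2, unk="<UNK>"):
    """Return (vocab, prob), where prob(w1, w2) is the add-one smoothed P(w2 | w1)."""
    unigram = Counter(tokens)
    # Assumption: words seen fewer than min_count times are mapped to one
    # placeholder token instead of being removed from the sequence.
    mapped = [w if unigram[w] >= min_count else unk for w in tokens]
    vocab = sorted(set(mapped))
    # How often each word appears as the first element of a bigram.
    context_counts = Counter(mapped[:-1])
    bigram_counts = Counter(zip(mapped, mapped[1:]))
    vocab_size = len(vocab)

    def prob(w1, w2):
        # Add one to every bigram count and the vocabulary size to the denominator.
        return (bigram_counts[(w1, w2)] + 1) / (context_counts[w1] + vocab_size)

    return vocab, prob

# Tiny hypothetical token list; in the question this would be `train`.
tokens = "a b a b a c".split()
vocab, prob = add_one_bigram_probs(tokens)

# Bigram matrix: row = first word, column = second word, in `vocab` order.
matrix = [[prob(w1, w2) for w2 in vocab] for w1 in vocab]
for w1, row in zip(vocab, matrix):
    print(w1, row)
```

Each row of the matrix sums to 1, which is a quick sanity check that the smoothing is applied consistently. For a large vocabulary a dense matrix may not fit in memory, in which case calling `prob` on demand is probably the more practical option.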