How can I omit the less frequent words from a dictionary in Python?

Asked: 2015-06-08 17:32:02

Tags: python

I have a dictionary. I want to omit the words with a count of 1 from the dictionary. How can I do that? Any help? I then want to extract bigrams from the remaining words. How can I do that?

import codecs

with codecs.open("Pezeshki339.txt", 'r', 'utf8') as f:
    txt = f.read()
txt = txt[1:]  # drop the first character (e.g. a BOM)

token = txt.split()

count = {}
for word in token:
    count[word] = count.get(word, 0) + 1

for k, v in count.items():
    print(k, v)

I was able to edit my code as follows. But one question remains: how can I build a bigram matrix and smooth it with the add-one method? I would appreciate any suggestions that fit my code.

import nltk
from collections import Counter
import codecs

token = []
with codecs.open("Pezeshki339.txt", 'r', 'utf8') as file:
    for line in file:
        token.extend(line.split())  # accumulate tokens; plain assignment would keep only the last line

spl = int(80 * len(token) / 100)
train = token[:spl]
test = token[spl:]
print(len(test))
print(len(train))

cn = Counter(train)
known_words = [word for word, v in cn.items() if v > 1]  # removes the rare words and puts the rest in a list
print(known_words)
print(len(known_words))

# take bigrams over the training tokens themselves (with rare words dropped),
# so that each pair reflects actual adjacency in the text rather than
# adjacency in the vocabulary list
known = set(known_words)
filtered = [w for w in train if w in known]
bigram = nltk.bigrams(filtered)
frequency = nltk.FreqDist(bigram)
for f in frequency:
    print(f, frequency[f])
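Since none of the answers below address the add-one (Laplace) smoothing part of the question, here is a minimal sketch of smoothed bigram probabilities. It uses an inline token list in place of the Pezeshki339.txt data, and the helper name `smoothed_prob` is invented for illustration:

```python
from collections import Counter

# sample tokens standing in for the training split
tokens = "the cat sat on the mat the cat ran".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

def smoothed_prob(w1, w2):
    """Add-one (Laplace) smoothed P(w2 | w1)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(smoothed_prob("the", "cat"))  # a seen bigram
print(smoothed_prob("cat", "mat"))  # an unseen bigram still gets non-zero mass
```

The same probabilities can be laid out as a V-by-V matrix by iterating over all word pairs, but computing them on demand like this avoids building the full matrix.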

3 answers:

Answer 0 (score: 3)

Count the words with a Counter dict, then filter its .items() to remove keys whose value is 1:

from collections import Counter

import codecs
with codecs.open("Pezeshki339.txt",'r','utf8') as f:

    cn = Counter(word for line in f for word in line.split())

    print({word: v for word, v in cn.items() if v > 1})

If you only want the words, use a list comprehension:

print([word for word,v in cn.items() if v > 1 ])

You don't need to call read(); you can split the file line by line as you go. If you want to remove punctuation, strip it from each word:

from string import punctuation

cn = Counter(word.strip(punctuation) for line in f for word in line.split())
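For reference, here is a small self-contained run of this approach, with a couple of sample lines standing in for the file contents:

```python
from collections import Counter
from string import punctuation

lines = ["the cat, the cat!", "a dog."]  # sample input lines

# strip surrounding punctuation from each word before counting
cn = Counter(word.strip(punctuation) for line in lines for word in line.split())

# keep only the words seen more than once
frequent = {word: v for word, v in cn.items() if v > 1}
print(frequent)  # {'the': 2, 'cat': 2}
```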

Answer 1 (score: 3)

import collections

c = collections.Counter(['a', 'a', 'b']) # Just an example - use your words

print([w for (w, n) in c.items() if n > 1])  # .iteritems() is Python 2; use .items() in Python 3

Answer 2 (score: 0)

Padraic's solution works well. But here is a solution that can sit below your existing code, rather than rewriting it completely:

newdictionary = {}
for k,v in count.items():
    if v != 1:
        newdictionary[k] = v
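For what it's worth, the same loop can also be written as a dict comprehension, which is the idiomatic one-liner in Python 3 (sample counts stand in for the real count dict):

```python
count = {'the': 3, 'cat': 2, 'ran': 1}  # sample word counts

# keep every entry whose count is not 1
newdictionary = {k: v for k, v in count.items() if v != 1}
print(newdictionary)  # {'the': 3, 'cat': 2}
```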