Question

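I'm currently learning the naive Bayes classifier using nltk.

In the documentation (http://www.nltk.org/book/ch06.html), section 1.3 "Document Classification", there is an example of feature sets.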

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

So an example of the resulting feature sets is [({'contains(waste)': False, 'contains(lot)': False, ...}, 'neg'), ...].

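But I want to change the entries of the word dictionary from 'contains(waste)': False to 'contains(waste)': 2. I think that form ('contains(waste)': 2) describes a document well, because it records the frequency of the word. The feature sets would then be [({'contains(waste)': 2, 'contains(lot)': 5, ...}, 'neg'), ...].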
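A minimal sketch of the count-valued feature extractor described here, assuming each document is a list of word tokens. The function name `document_count_features` and the explicit `word_features` parameter are hypothetical and are not part of the NLTK book example.

```python
from collections import Counter

def document_count_features(document, word_features):
    """Map each tracked word to the number of times it occurs in the document."""
    # Count lowercase tokens so the counts line up with a lowercased vocabulary.
    counts = Counter(w.lower() for w in document)
    # A Counter returns 0 for words that never occur in the document.
    return {'contains({})'.format(word): counts[word] for word in word_features}

features = document_count_features(['Waste', 'a', 'lot', 'waste'], ['waste', 'lot', 'plot'])
print(features)
```

This only builds the dictionaries. Whether the classifier interprets the integers as quantities is a separate matter.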

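But I'm worried that 'contains(waste)': 2 and 'contains(waste)': 1 are totally different words to the NaiveBayesClassifier. In that case it cannot account for the similarity between 'contains(waste)': 2 and 'contains(waste)': 1.

{'contains(lot)': 1 and 'contains(waste)': 1} and {'contains(waste)': 2 and 'contains(waste)': 1} could look the same to the program.

Can nltk.NaiveBayesClassifier understand the frequency of words?

This is the code I used: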

import random

import konlpy.tag
from nltk.classify import naivebayes

def split_and_count_word(data):
    #belongs_to : Main
    #Role : make featuresets from korean words using konlpy.
    #Parameter : dictionary data(dict of contents ex.{'politic':{'parliament': [content,content]}..})
    #Return : list featuresets([{'word':True',...},'politic'] == featureset + category)
    featuresets = []
    twitter = konlpy.tag.Twitter()  #Korean word splitter
    for big_cat in data:
        for small_cat in data[big_cat]:
            #save category name needed in featuresets
            category = str(big_cat[0:3]) + '/' + str(small_cat)
            count = 0
            print(small_cat)
            for one_news in data[big_cat][small_cat]:
                count += 1
                if count % 100 == 0:
                    print(count, end=' ')
                #one_news is list in list so open it!
                doc = one_news
                #split word as using konlpy
                list_of_splited_word = twitter.morphs(doc[:-63])  #delete useless sentences.
                #get word length is higher than two and get list of splited words
                list_of_up_two_word = [word for word in list_of_splited_word if len(word) > 1]
                dict_of_featuresets = make_featuresets(list_of_up_two_word)
                #save
                featuresets.append((dict_of_featuresets, category))
    return featuresets

def make_featuresets(data):
    #belongs_to : split_and_count_word
    #Role : make featuresets
    #Parameter : list list_of_up_two_word(ex.['비누','떨어','지다']
    #Return : dictionary {word : True for word in data}
    #PROBLEM :(
    #cannot consider the frequency of word
    return {word: True for word in data}

def naive_train(featuresets):
    #belongs_to : Main
    #Role : Learning by naive bayes rule
    #Parameter : list featuresets([{'word':True',...},'pol/pal'])
    #Return : object classifier(nltk naivebayesclassifier object),
    #         list test_set(the featuresets that are randomly selected)
    random.shuffle(featuresets)
    train_set, test_set = featuresets[1000:], featuresets[:1000]
    classifier = naivebayes.NaiveBayesClassifier.train(train_set)
    return classifier, test_set

featuresets = split_and_count_word(data)
classifier, test_set = naive_train(featuresets)

Answer 1

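The nltk naive Bayes classifier treats feature values as logically distinct. Values are not restricted to True and False, but they are never treated as quantities. If you have features f=3 and f=1, they are counted as distinct values. The only way to add quantities to such a model is to sort them into "buckets", for example f="one", f="few" (2-5), f="several" (6-10), f="many" (11 or more). (Note: if you go this route, there are algorithms for choosing good value ranges for the buckets.) Even then, the model will not "know" that "few" lies between "one" and "several". You'd need a different machine learning tool to handle quantities directly.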
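The bucketing idea could be sketched roughly as follows. The helper names `bucket` and `bucketed_features` are hypothetical, the "none" bucket for absent words is an added assumption, and the ranges are simply the illustrative ones given above rather than tuned values.

```python
from collections import Counter

def bucket(count):
    """Map a raw word count to a coarse label (ranges are illustrative only)."""
    if count == 0:
        return 'none'      # assumption: a separate bucket for absent words
    if count == 1:
        return 'one'
    if count <= 5:
        return 'few'       # 2-5
    if count <= 10:
        return 'several'   # 6-10
    return 'many'          # 11 or more

def bucketed_features(document, word_features):
    """Build a feature dict whose values are bucket labels instead of raw counts."""
    counts = Counter(document)
    return {'count({})'.format(word): bucket(counts[word]) for word in word_features}

print(bucketed_features(['waste', 'waste', 'lot'], ['waste', 'lot', 'plot']))
```

Dictionaries like these can be paired with labels and passed to `nltk.NaiveBayesClassifier.train` in the usual way. The classifier still sees 'few' and 'several' as unrelated values, so the labels only reduce the number of distinct values.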
