Nltk Sklearn Unigram + Bigram

Time: 2015-08-27 14:25:44

Tags: python machine-learning nlp scikit-learn nltk

I am building a classifier using NLTK and the nltk.sklearn wrapper.

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC
classifier = SklearnClassifier(LinearSVC(), int, True)
classifier.train(train_set)

When I use only unigrams and build a featureset, e.g.:

{"Cristiano": True, "Ronaldo": True}

everything is fine. But when I want to use collocations, a problem appears. The featureset looks different:

{"Cristiano": True, "Ronaldo": True, ("Cristiano", "Ronaldo"): True}

Then I receive an error.

How can I properly create a featureset for the nltk sklearn wrapper using both unigrams and bigrams?

2 answers:

Answer 0 (score: 2)

You can use CountVectorizer from scikit-learn to generate n-grams.

Demo:

import sklearn.feature_extraction.text

ngram_size = 1
train_set = ['Cristiano plays football', 'Ronaldo like football too']

vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vectorizer.fit(train_set) # build ngram dictionary
ngram = vectorizer.transform(train_set) # get ngram
print('ngram: {0}\n'.format(ngram))
print('ngram.shape: {0}'.format(ngram.shape))
print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))

Output:

ngram:   (0, 0) 1
  (0, 1)    1
  (0, 3)    1
  (1, 1)    1
  (1, 2)    1
  (1, 4)    1
  (1, 5)    1

ngram.shape: (2, 6)
vectorizer.vocabulary_: {u'cristiano': 0, u'plays': 3, u'like': 2, 
                         u'ronaldo': 4, u'football': 1, u'too': 5}
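The demo above sets ngram_size = 1, so it extracts unigrams only. As a minimal sketch of how the same approach could cover the question's case, passing ngram_range=(1, 2) asks CountVectorizer for unigrams and bigrams together. The variable names counts and vocabulary are illustrative and not part of the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ['Cristiano plays football', 'Ronaldo like football too']

# ngram_range=(1, 2) extracts every unigram and every bigram
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(train_set)  # sparse document-term matrix

# vocabulary_ maps each n-gram to its column index; bigrams are
# stored as space-joined strings such as 'cristiano plays'
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(counts.shape)
```

Because the bigrams are plain strings here, unigram and bigram features share one key type, which sidesteps the mixed string/tuple keys in the question's featureset.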

Answer 1 (score: 0)

If you want to keep using the NLTK wrapper, you can do the following before training the classifier:

classifier._vectorizer.sort = False
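SklearnClassifier holds a scikit-learn DictVectorizer in its _vectorizer attribute, and by default that vectorizer sorts the feature names. On Python 3, sorting a mix of str and tuple keys is the likely source of the failure. The sketch below reproduces the idea with DictVectorizer directly, without NLTK; the variable names features and matrix are illustrative assumptions.

```python
from sklearn.feature_extraction import DictVectorizer

# One feature dict mixing unigram (str) keys with a bigram (tuple) key
features = [{"Cristiano": True, "Ronaldo": True, ("Cristiano", "Ronaldo"): True}]

# sort=False keeps feature names in first-seen order,
# so str and tuple keys are never compared with each other
vectorizer = DictVectorizer(sort=False)
matrix = vectorizer.fit_transform(features)

print(matrix.shape)
print(vectorizer.feature_names_)
```

Note that _vectorizer is a private attribute of the wrapper, so this workaround may stop working in a later NLTK release. Joining each bigram into a single string key is another option that avoids the mixed key types entirely.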