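NLTK classifier precision and recall are always None (0)

Time: 2013-12-09 16:51:52

Tags: python nltk

I am using the Python NLTK library and a Naive Bayes classifier to detect whether a string (actually a Stack Overflow question) should be tagged "php", based on training data.

The classifier seems to find interesting features: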

Most Informative Features
     contains-word-isset = True             True : False  =    125.6 : 1.0
      contains-word-echo = True             True : False  =     28.1 : 1.0
       contains-word-php = True             True : False  =     17.1 : 1.0
     contains-word-this- = True             True : False  =     16.0 : 1.0
     contains-word-mysql = True             True : False  =     14.3 : 1.0
      contains-word-_get = True             True : False  =     11.7 : 1.0
   contains-word-foreach = True             True : False  =      7.6 : 1.0

The features are defined as follows:

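def features(question):
    # Build one boolean feature per detector token:
    # "contains-word-<token>" is True when the token occurs in the question.
    features = {}
    for token in detectorTokens:
        featureName = "contains-word-"+token
        features[featureName] = (token in question)
    return features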
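
To make the feature extractor above easier to follow, here is a small self-contained sketch of how it behaves on a single question. The contents of `detectorTokens` are not shown in the post, so the list below is a hypothetical stand-in built from tokens visible in the feature listing. The question is also assumed to be a plain string, in which case `token in question` is a case-sensitive substring test.

```python
# Hypothetical token list; the real detectorTokens is not shown in the post.
detectorTokens = ["isset", "echo", "php", "mysql", "foreach"]


def features(question):
    """Map a question to {"contains-word-<token>": bool} for every token."""
    result = {}
    for token in detectorTokens:
        # With a str argument this is a case-sensitive substring test.
        result["contains-word-" + token] = token in question
    return result


sample = features("Is this a php question?")
upper = features("Is this a PHP question?")

print(sample["contains-word-php"])   # True
print(sample["contains-word-echo"])  # False
print(upper["contains-word-php"])    # False, because the match is case-sensitive
```

If the real questions are passed in as lists of tokens rather than strings, `in` becomes an exact-element test instead, so the same function can behave differently depending on how the input was preprocessed.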

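The classifier seems to have decided never to label a string as a "php" question. Even a simple string such as "is this a php question?" is classified as False.

Can anyone help me understand what is going on?

Here is part of the code (I have 3 or 4 pages of code, so this is only a small portion):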

import collections
import nltk

# Train on the training set and measure accuracy on the held-out set.
classifier = nltk.NaiveBayesClassifier.train(train_set)
cross_valid_accuracy = nltk.classify.accuracy(classifier, cross_valid_set)

# Per-label index sets: gold labels go in refsets, predictions in testsets.
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(cross_valid_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print('Precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
print('Recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
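
One possible explanation, offered as a guess rather than a confirmed diagnosis: the feature listing shows the labels `True` and `False`, while the metrics look up the key `'pos'`. If the labels in `cross_valid_set` really are booleans, then `refsets['pos']` and `testsets['pos']` are empty sets created on the fly by the `defaultdict`. As far as I know, `nltk.metrics.precision` and `nltk.metrics.recall` return `None` when the set they would divide by is empty. The sketch below reproduces that effect with plain-Python stand-ins for the two metrics (written for illustration, not copied from NLTK) and made-up labels and predictions.

```python
import collections


def precision(reference, test):
    """Stand-in metric: share of test items also in reference; None if test is empty."""
    if len(test) == 0:
        return None
    return len(reference & test) / len(test)


def recall(reference, test):
    """Stand-in metric: share of reference items also in test; None if reference is empty."""
    if len(reference) == 0:
        return None
    return len(reference & test) / len(reference)


# Made-up gold labels and predictions, using booleans as in the feature listing.
gold = [True, False, True, False, True]
predicted = [True, False, False, False, True]

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (label, observed) in enumerate(zip(gold, predicted)):
    refsets[label].add(i)
    testsets[observed].add(i)

# Looking up a label that never occurs yields empty sets, hence None.
wrong_precision = precision(refsets['pos'], testsets['pos'])
wrong_recall = recall(refsets['pos'], testsets['pos'])
print(wrong_precision, wrong_recall)  # None None

# Looking up the label that is actually used gives numbers.
right_precision = precision(refsets[True], testsets[True])
right_recall = recall(refsets[True], testsets[True])
print(right_precision, right_recall)  # 1.0 0.6666666666666666
```

If this is what is happening, looking up the same label values that appear in the training tuples (here `True`) should give numeric results. It would not by itself explain why every string is classified as False, and if the classifier never predicts `True`, precision for that label would still come out as `None` because the predicted set is empty.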

0 answers:

No answers