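I'm using the Python NLTK library and a Naive Bayes classifier to detect whether a string should be tagged "php", based on training data (actual Stack Overflow questions).
The classifier seems to find interesting features: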
    Most Informative Features
        contains-word-isset = True      True : False  =  125.6 : 1.0
         contains-word-echo = True      True : False  =   28.1 : 1.0
          contains-word-php = True      True : False  =   17.1 : 1.0
        contains-word-this- = True      True : False  =   16.0 : 1.0
        contains-word-mysql = True      True : False  =   14.3 : 1.0
         contains-word-_get = True      True : False  =   11.7 : 1.0
      contains-word-foreach = True      True : False  =    7.6 : 1.0
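As I understand it, each ratio in that listing compares how likely a feature value is under one label versus the other, and says nothing about how common each label is. Here is a minimal sketch of that reading. The counts and the helper names `smoothed_prob` and `likelihood_ratio` are made up for illustration, and the add-0.5 smoothing is only a simplified stand-in for NLTK's default estimator:

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with additive smoothing."""
    return (count + gamma) / float(total + bins * gamma)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio P(feature=True | label A) : P(feature=True | label B)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Hypothetical counts: the word appears in 50 of 200 "True" questions
# and in 2 of 2000 "False" questions.
ratio = likelihood_ratio(50, 200, 2, 2000)
print("True : False = %.1f : 1.0" % ratio)
```

A large ratio therefore only means the word is strong evidence when it is present; the final label also depends on the class prior and on every other feature.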
The features are defined as follows:
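    def features(question):
        # detectorTokens is the list of candidate words, built elsewhere.
        features = {}
        for token in detectorTokens:
            featureName = "contains-word-" + token
            # True if the token occurs in the question, False otherwise.
            features[featureName] = (token in question)
        return features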
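To show what this extractor returns, here is a small self-contained run. The token list is a made-up stand-in for `detectorTokens`, and the local dictionary is renamed to `feats` only for readability. One thing the sketch makes visible: when `question` is a plain string, `token in question` is a case-sensitive substring check, so the result depends on whether the input was lowercased or tokenized first.

```python
# Hypothetical token list; the real detectorTokens is built elsewhere.
detectorTokens = ["isset", "echo", "php", "mysql"]


def features(question):
    feats = {}
    for token in detectorTokens:
        # For a string, "in" is a case-sensitive substring test;
        # for a list of words, it is an exact membership test.
        feats["contains-word-" + token] = (token in question)
    return feats


lower = features("is this a php question?")
upper = features("Is this a PHP question?")
as_words = features(["calling", "phpinfo", "fails"])

print(lower["contains-word-php"], upper["contains-word-php"],
      as_words["contains-word-php"])
```

If the training questions and the test string are preprocessed differently (for example, one lowercased and the other not), the same word can produce different feature values.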
But the classifier never seems to label any string as a "php" question. Even a simple string such as "is this a php question?" is classified as False.
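One possibility I have not ruled out is that the class prior and the many absent-word features outweigh the single word that is present. The sketch below only illustrates that arithmetic for a Naive Bayes score. Every probability in it is invented, and `log_score` is a hypothetical helper, not an NLTK function:

```python
import math


def log_score(prior, likelihoods):
    """log P(label) + sum of log P(feature value | label)."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)


# Invented numbers: "php" questions make up 5% of the training data.
# One word is present (15x more likely under True), and 200 other
# words are absent, each absence slightly more likely under False.
n_absent = 200
score_true = log_score(0.05, [0.30] + [0.97] * n_absent)
score_false = log_score(0.95, [0.02] + [0.999] * n_absent)

# With the present word alone and equal priors, True would win.
alone_true = log_score(0.5, [0.30])
alone_false = log_score(0.5, [0.02])

print("all features -> True wins:", score_true > score_false)
print("single word  -> True wins:", alone_true > alone_false)
```

With these made-up values the single strong word is not enough once the prior and the absent words are included, which would look exactly like "never labels anything php".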
Can anyone help me understand this behavior?
Here is some of the code (I have 3 or 4 pages of it, so this is only a small part):
    import collections

    import nltk

    # Train on the training set and measure accuracy on the held-out set.
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    cross_valid_accuracy = nltk.classify.accuracy(classifier, cross_valid_set)

    # Group example indices by true label (refsets) and predicted label (testsets).
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(cross_valid_set):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print('Precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
    print('Recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
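For clarity, this is how I understand precision and recall to be computed from those index sets. The two functions below are my own plain-Python sketch of that behavior, not NLTK's source, and the index sets are hypothetical:

```python
def precision(reference, test):
    """Share of predicted items that are correct; None if nothing was predicted."""
    if len(test) == 0:
        return None
    return len(reference & test) / float(len(test))


def recall(reference, test):
    """Share of reference items that were found; None if there are no reference items."""
    if len(reference) == 0:
        return None
    return len(reference & test) / float(len(reference))


# Hypothetical indices of examples whose true label is positive.
ref_pos = {0, 3, 7}
never_pos = set()      # classifier never predicts the positive label
some_pos = {3, 7, 9}   # classifier predicts it three times, twice correctly

print(precision(ref_pos, never_pos), recall(ref_pos, never_pos))
print(precision(ref_pos, some_pos), recall(ref_pos, some_pos))
```

Because `refsets` and `testsets` are `defaultdict(set)`, looking up a key that never occurs as a label returns an empty set. If the labels in the data are `True`/`False` rather than `'pos'`, both lookups in my snippet would be empty, so the keys need to match the labels actually used.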