I am following the NLTK book on using the confusion matrix, but the confusion matrix looks very odd.
# Empirically examine where the tagger is making mistakes
test_tags = [tag for sent in brown.sents(categories='editorial')
             for (word, tag) in t2.tag(sent)]
gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
print nltk.ConfusionMatrix(gold_tags, test_tags)
Can anyone explain how to use the confusion matrix?
Answer 0 (score: 14)
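First, I assume you got the code from chapter 05 of the old NLTK book, https://nltk.googlecode.com/svn/trunk/doc/book/ch05.py, and that you are looking at this section in particular: http://pastebin.com/EC8fFqLU
Now, let's look at the confusion matrix in NLTK. Try: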
from nltk.metrics import ConfusionMatrix
ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
cm = ConfusionMatrix(ref, tagged)
print cm
[OUT]:
    | D         |
    | E I J N V |
    | T N J N B |
----+-----------+
DET |<3>. . . . |
 IN | .<1>. . . |
 JJ | . .<.>1 . |
 NN | . . .<3>1 |
 VB | . . . .<1>|
----+-----------+
(row = reference; col = test)
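The numbers enclosed in <> are the true positives (tp). From the example above, you can see that one of the JJ tags in the reference was wrongly tagged as NN in the tagged output. That instance counts as one false positive for NN and one false negative for JJ.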
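To make the reading of the cells concrete, here is a minimal sketch that rebuilds the same counts with only the standard library. It is written for Python 3 and is not how NLTK implements `ConfusionMatrix`; the names `pair_counts`, `correct` and `errors` are made up for this illustration.

```python
from collections import Counter

ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()

# Each (reference_tag, test_tag) pair corresponds to one cell of the matrix:
# the first element picks the row, the second picks the column.
pair_counts = Counter(zip(ref, tagged))

# Diagonal cells (the ones shown in <>): tagger and reference agree.
correct = sum(n for (r, t), n in pair_counts.items() if r == t)

# Off-diagonal cells: the tagger's mistakes.
errors = {pair: n for pair, n in pair_counts.items() if pair[0] != pair[1]}

print(correct)
print(errors)
```

The `('JJ', 'NN')` entry in `errors` is the single `1` in the JJ row and NN column of the printed matrix.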
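To get numbers out of the confusion matrix (for calculating precision/recall/F-score), you can collect the false negatives, false positives and true positives like this: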
from collections import Counter

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

# cm[i,j] is the number of tokens with reference tag i that were tagged as j.
for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i,j]
        else:
            false_negatives[i] += cm[i,j]
            false_positives[j] += cm[i,j]

print "TP:", sum(true_positives.values()), true_positives
print "FN:", sum(false_negatives.values()), false_negatives
print "FP:", sum(false_positives.values()), false_positives
[OUT]:
TP: 8 Counter({'DET': 3, 'NN': 3, 'VB': 1, 'IN': 1, 'JJ': 0})
FN: 2 Counter({'NN': 1, 'JJ': 1, 'VB': 0, 'DET': 0, 'IN': 0})
FP: 2 Counter({'VB': 1, 'NN': 1, 'DET': 0, 'JJ': 0, 'IN': 0})
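The same counts can be cross-checked without the matrix object: a label's false negatives are its row total minus the diagonal cell, and its false positives are its column total minus the diagonal cell. The following is a rough Python 3 sketch of that check using only the standard library; the variable names `ref_totals`, `test_totals` and `diagonal` are chosen for this illustration.

```python
from collections import Counter

ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()

ref_totals = Counter(ref)      # row sums: occurrences of each tag in the reference
test_totals = Counter(tagged)  # column sums: occurrences of each tag in the output
diagonal = Counter(r for r, t in zip(ref, tagged) if r == t)  # true positives

labels = sorted(set(ref) | set(tagged))
false_negatives = {l: ref_totals[l] - diagonal[l] for l in labels}
false_positives = {l: test_totals[l] - diagonal[l] for l in labels}

print(false_negatives)
print(false_positives)
```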
To calculate the F-score per label:
for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i]+false_positives[i])
        recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
        fscore = 2 * (precision * recall) / float(precision + recall)
    print i, fscore
[OUT]:
DET 1.0
IN 1.0
JJ 0
NN 0.75
VB 0.666666666667
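For reuse, the per-label calculation can be wrapped in a small function. This is only a sketch in Python 3 that works directly on the two tag lists rather than on the NLTK matrix; `per_label_scores` is a hypothetical helper name, not an NLTK function, and it returns 0.0 wherever a denominator would be zero.

```python
from collections import Counter

def per_label_scores(ref, tagged):
    """Return {label: (precision, recall, f1)} for two aligned tag lists."""
    tp = Counter(r for r, t in zip(ref, tagged) if r == t)
    ref_totals = Counter(ref)      # TP + FN per label
    test_totals = Counter(tagged)  # TP + FP per label
    scores = {}
    for label in sorted(set(ref) | set(tagged)):
        precision = tp[label] / test_totals[label] if test_totals[label] else 0.0
        recall = tp[label] / ref_totals[label] if ref_totals[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores

ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
scores = per_label_scores(ref, tagged)
for label, (p, r, f) in scores.items():
    print(label, p, r, f)
```

Precision and recall are returned alongside the F-score because they explain it: VB has perfect recall but only half of its predictions are correct.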
I hope the above clears up how to use the confusion matrix in NLTK. Here is the full code for the example above:
from collections import Counter
from nltk.metrics import ConfusionMatrix

ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
cm = ConfusionMatrix(ref, tagged)

print cm

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i,j]
        else:
            false_negatives[i] += cm[i,j]
            false_positives[j] += cm[i,j]

print "TP:", sum(true_positives.values()), true_positives
print "FN:", sum(false_negatives.values()), false_negatives
print "FP:", sum(false_positives.values()), false_positives
print

for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i]+false_positives[i])
        recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
        fscore = 2 * (precision * recall) / float(precision + recall)
    print i, fscore
Answer 1 (score: 1)
Here is a real-world case for a text classifier, using both sklearn and NLTK:
from collections import defaultdict

import nltk
from sklearn import metrics

# `classifier` (a trained NLTK classifier) and `testset` (a list of
# (features, label) pairs) are assumed to be defined already.
refsets = defaultdict(set)
testsets = defaultdict(set)
labels = []
tests = []
for i, (feats, label) in enumerate(testset):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
    labels.append(label)
    tests.append(observed)

print(metrics.confusion_matrix(labels, tests))
print(nltk.ConfusionMatrix(labels, tests))