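Why does a Naive Bayes text classifier perform poorly when the categories are encoded as labels?

Time: 2015-05-06 15:34:55

Tags: python machine-learning scikit-learn text-classification naivebayes

I am trying to build a text classification model using multinomial Naive Bayes. My data has 10 different categories. During model training, I represent these categories as integers.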

topics = ["gis","security","photo","mathematica","unix","wordpress","scifi","electronics","android","apple"]
topic2label = {topics[i]:i for i in range(len(topics))}
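
The mapping above only goes from topic name to integer. To compare integer predictions against string ground truth, the inverse mapping is needed as well. A minimal sketch, assuming the helper names `label2topic`, `encode`, and `decode` (they are hypothetical and not part of the original code):

```python
topics = ["gis", "security", "photo", "mathematica", "unix",
          "wordpress", "scifi", "electronics", "android", "apple"]

# Forward mapping, equivalent to the dict comprehension above.
topic2label = {name: i for i, name in enumerate(topics)}
# Inverse mapping (hypothetical helper), used to turn integers back into names.
label2topic = {i: name for name, i in topic2label.items()}


def encode(names):
    """Map topic names to their integer labels."""
    return [topic2label[name] for name in names]


def decode(labels):
    """Map integer labels (plain or NumPy ints) back to topic names."""
    return [label2topic[int(label)] for label in labels]


sample = ["electronics", "gis", "apple"]
encoded = encode(sample)
print(encoded)          # [7, 0, 9]
print(decode(encoded))  # ['electronics', 'gis', 'apple']
```

Using one pair of dictionaries for both directions keeps training labels and evaluation labels on the same numbering.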

Training data format:

{"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit .  \n\nWhat is the effective capacitance of this circuit and will the ...\r\n        "}
{"topic":"electronics","question":"Heat sensor with fan cooling","excerpt":"Can I know which component senses heat or acts as heat sensor in the following circuit?\nIn the given diagram, it is said that the 4148 diode acts as the sensor. But basically it is a zener diode and ...\r\n        "}

This is what my code looks like:

# ---------------------------------------- Training -------------------------------------
import json

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

topic, question, excerpt, combo = [], [], [], []

with open('training.json') as f:
    next(f)  # skip the first line of the file
    for line in f:
        data = json.loads(line)
        topic.append(data["topic"])
        que = data["question"]
        question.append(que)
        excer = data["excerpt"]
        excerpt.append(excer)
        combo.append(que + " " + excer)

unique_topics = list(set(topic))
# `new_topic` was not defined in the original snippet; treating it as the raw
# list of topic strings is an assumption.
new_topic = topic
# Numeric variant of the labels. Note that this yields the strings '1'..'10',
# not the integers 0..9 produced by topic2label above.
numeric_topics = [name.replace('gis', '1').replace('security', '2').replace('photo', '3').replace('mathematica', '4').replace('unix', '5').replace('wordpress', '6').replace('scifi', '7').replace('electronics', '8').replace('android', '9').replace('apple', '10') for name in new_topic]
x1 = np.array(question)
x2 = np.array(excerpt)
x3 = np.array(combo)
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1,2),stop_words="english")
X = vectorizer.fit_transform(x3)
Y = np.array(new_topic)
clf = MultinomialNB(alpha=0.1).fit(X, Y)

# ----------------------------   Prediction -----------------------------------------

docs_new = []

n_docs = int(input())  # first input line: number of documents to classify
for _ in range(n_docs):
    data = json.loads(input())  # one JSON object per line
    que = data["question"]
    excer = data["excerpt"]
    docs_new.append(que + " " + excer)

X_new_counts = vectorizer.transform(docs_new)
predicted = clf.predict(X_new_counts)
for label in predicted:
    print(label)

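Now I have noticed a strange behavior: when I use the integer representation of the categories, my model's accuracy is 82%, and when I use the string representation, the accuracy rises to 90%.

My question is: why does the model behave differently (better) in the second case?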

P.S. I am using the sklearn library.
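
One way to narrow this down is to check whether the label encoding alone changes the predictions. scikit-learn classifiers accept both string and integer targets, so a one-to-one relabeling should leave the predictions unchanged apart from exact ties. The sketch below is not the code from the question: it assumes a hypothetical six-document toy corpus and an arbitrary integer mapping, and trains the same `MultinomialNB` twice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (not the data from the question).
docs = [
    "capacitance of this circuit with a zener diode",
    "heat sensor circuit with fan cooling",
    "install custom rom on android phone",
    "android app crashes on startup",
    "wordpress plugin breaks theme layout",
    "wordpress theme custom post type",
]
str_labels = ["electronics", "electronics", "android",
              "android", "wordpress", "wordpress"]

# Arbitrary one-to-one mapping, assumed for illustration only.
topic2label = {"electronics": 8, "android": 9, "wordpress": 6}
label2topic = {i: name for name, i in topic2label.items()}
int_labels = [topic2label[name] for name in str_labels]

vectorizer = TfidfVectorizer(sublinear_tf=True)
X = vectorizer.fit_transform(docs)
X_new = vectorizer.transform(
    ["zener diode circuit", "android phone rom", "wordpress plugin"]
)

# Same features, same hyperparameters, only the label encoding differs.
clf_str = MultinomialNB(alpha=0.1).fit(X, np.array(str_labels))
clf_int = MultinomialNB(alpha=0.1).fit(X, np.array(int_labels))

pred_str = [str(p) for p in clf_str.predict(X_new)]
# Map integer predictions back to names before comparing.
pred_int = [label2topic[int(p)] for p in clf_int.predict(X_new)]

print(pred_str == pred_int)
```

If the two runs agree on a check like this, the gap between 82% and 90% is more likely to come from how the labels are mapped or compared during evaluation than from the classifier itself.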

0 answers:

No answers