Logistic regression classifier for prediction in Python

Time: 2015-09-24 10:19:27

Tags: python machine-learning scikit-learn kaggle

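I am trying to write a script that takes a JSON file (pizza-train.json) from this Kaggle competition. I want to extract the request_text field from each dictionary in the list and build a bag-of-words representation of the text (string to count list).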

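The next step is to train a logistic regression classifier to predict the variable "requester_received_pizza". I want to train on 90% of the data and predict on the remaining 10%. The problem is that I don't know how to predict on that 10%. Any suggestions would be very helpful!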

import json
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer


f_json = json.load(open('pizza-train.json'))
request_text = []
y = []

for item in f_json[:100]:
    request_text.append(item['request_text'])
    y.append(item['requester_received_pizza'])

vectorizer = CountVectorizer(min_df=1, lowercase=True, stop_words='english')

train_data_features = vectorizer.fit_transform(request_text)
train_data_features = train_data_features.toarray()


print 'Shape = '
print train_data_features.shape
vocab = vectorizer.get_feature_names()
print '\n'
print 'Vocab = '
print vocab


x_train, x_test, y_train, y_test = train_test_split(train_data_features, y, test_size=0.10)
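
To make the "string to count list" representation concrete, here is a minimal sketch on two made-up sentences (they are illustrative only and not taken from pizza-train.json). Stop-word removal is left off so that every word shows up in the counts, and the vocabulary is read from the fitted `vocabulary_` mapping so the snippet does not depend on a particular scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for the request_text strings.
docs = ["the cat sat", "the cat ate the fish"]

vectorizer = CountVectorizer(lowercase=True)
counts = vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)            # ['ate', 'cat', 'fish', 'sat', 'the']
print(counts.tolist())  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Each row is one document and each column counts how often the corresponding vocabulary word occurs in it.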

1 answer:

Answer 0 (score: 0)

You can do it like this:

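import sklearn.linear_model
# Fit on the 90% training split, then measure accuracy on the held-out 10%.
alg = sklearn.linear_model.LogisticRegression()
alg.fit(x_train, y_train)
test_score = alg.score(x_test, y_test)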
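
The score only reports accuracy. To get the actual predictions for the held-out 10%, something along these lines should work. This is a sketch under a few assumptions: it uses a recent scikit-learn, where `train_test_split` lives in `sklearn.model_selection` (the `sklearn.cross_validation` module was removed in later releases), and it uses made-up toy texts and labels in place of pizza-train.json.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for request_text and requester_received_pizza.
texts = (["hungry student broke until payday request %d" % i for i in range(10)]
         + ["just craving pizza tonight for fun %d" % i for i in range(10)])
labels = [True] * 10 + [False] * 10

vectorizer = CountVectorizer(lowercase=True, stop_words='english')
features = vectorizer.fit_transform(texts)  # sparse matrix; toarray() is not required

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.10, random_state=0)

alg = LogisticRegression()
alg.fit(x_train, y_train)

predictions = alg.predict(x_test)                # predicted labels for the 10%
probabilities = alg.predict_proba(x_test)[:, 1]  # estimated probability of True
accuracy = alg.score(x_test, y_test)             # fraction predicted correctly

print(predictions)
print(probabilities)
print(accuracy)
```

`predict` returns one label per test row, while `predict_proba` returns one column per class in the order given by `alg.classes_`, so column 1 is the `True` class here.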

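You should read the sklearn docs on logistic regression and cross validation. They are very good and describe more sophisticated ways to validate your model. This tutorial for the Kaggle Titanic competition may also be useful.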
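
As one example of the cross-validation approach mentioned in those docs, here is a hedged sketch using `cross_val_score` with a pipeline. The toy texts and labels are made up, and wrapping the vectorizer and classifier in `make_pipeline` is my own choice here rather than something from the question: it refits the vocabulary on the training part of each fold, so the held-out part does not influence the features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for request_text and requester_received_pizza.
texts = (["hungry student broke until payday request %d" % i for i in range(10)]
         + ["just craving pizza tonight for fun %d" % i for i in range(10)])
labels = [True] * 10 + [False] * 10

# Vectorizer and classifier are fitted together inside each fold.
model = make_pipeline(
    CountVectorizer(lowercase=True, stop_words='english'),
    LogisticRegression(),
)

# 5-fold cross-validation: train on 4/5 of the data, score on the remaining 1/5.
scores = cross_val_score(model, texts, labels, cv=5)

print(scores)
print(scores.mean())
```

Averaging over several folds gives a steadier accuracy estimate than a single 90/10 split, which matters for a small dataset.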