How do I read an MLComp dataset with Python?

Time: 2016-07-23 15:00:52

Tags: python machine-learning dataset scikit-learn

MLComp datasets use a special file format that I am not familiar with. I would like to read them with Python, but have not been able to.

1 answer:

Answer 0: (score: 0)

The first thing to note is that sklearn (v0.17.1, as of 24 July 2016) only supports the DocumentClassification domain of MLComp.

Assuming you have downloaded, for example, the dataset with id=523 from MLComp to /somewhere/on/your/computer, you can use the following sklearn snippet to load the dataset and train a classifier:

    from sklearn.datasets import load_mlcomp
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.naive_bayes import MultinomialNB

    # Load mlcomp data using sklearn
    train_data = load_mlcomp(name_or_id=523, set_='train', mlcomp_root='/somewhere/on/your/computer')
    test_data = load_mlcomp(name_or_id=523, set_='test', mlcomp_root='/somewhere/on/your/computer')
    # if you had the environment variable `MLCOMP_DATASETS_HOME` set,
    # you wouldn't need to explicitly pass anything to `mlcomp_root`

    # `data` is a standard `Bunch` object, so you can now straightforwardly
    # go on and vectorize the dataset,...
    vec = CountVectorizer(decode_error='replace')
    X_train = vec.fit_transform(train_data.data)
    X_test = vec.transform(test_data.data)

    # ...train a classifier...
    mnb = MultinomialNB()
    mnb.fit(X_train, train_data.target)

    # ...and evaluate it.
    print('Accuracy: {}'.format(accuracy_score(test_data.target, mnb.predict(X_test))))
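The snippet's comment about `MLCOMP_DATASETS_HOME` could look roughly like the sketch below. It is only an illustration under stated assumptions: the path is the same placeholder used above, `load_split` is a hypothetical helper name that does not come from sklearn, and it assumes the installed scikit-learn still provides `load_mlcomp` (it does in v0.17.1, the version mentioned above).

```python
import os

# Assumption: the MLComp datasets were unpacked under this placeholder path.
# setdefault leaves an already configured value untouched.
os.environ.setdefault("MLCOMP_DATASETS_HOME", "/somewhere/on/your/computer")


def load_split(dataset_id, split):
    """Hypothetical helper: load one split without passing `mlcomp_root`."""
    # Imported inside the function so that this module can still be imported
    # when the installed scikit-learn does not ship `load_mlcomp`.
    from sklearn.datasets import load_mlcomp

    # With MLCOMP_DATASETS_HOME set, `mlcomp_root` can be omitted.
    return load_mlcomp(name_or_id=dataset_id, set_=split)
```

With the variable set, `load_split(523, 'train')` and `load_split(523, 'test')` would stand in for the two `load_mlcomp` calls in the snippet above.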
