How does one call external datasets into scikit-learn?

Time: 2015-07-31 19:27:28

Tags: dataset python scikit-learn scipy numpy

For example, consider this dataset:

(1) https://archive.ics.uci.edu/ml/machine-learning-databases/annealing/anneal.data

Or

(2) http://data.worldbank.org/topic

How does one call such external datasets into scikit-learn to do anything with them?


The only kind of dataset calling that I have seen in scikit-learn is through a command like:

from sklearn.datasets import load_digits

digits = load_digits()

2 answers:

Answer 0 (score: 1)

You need to learn a little pandas, which is a data frame implementation in Python. Then you can do:

import pandas
my_data_frame = pandas.read_csv("/path/to/my/data")
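
As a rough sketch of how a frame loaded this way reaches an estimator: the snippet below reads a small in-memory CSV (a made-up stand-in for a downloaded file, assumed to be comma-separated with no header row and "?" for missing values), then splits it into a feature matrix and a target vector. The column names f1, f2, label and the choice of DecisionTreeClassifier are illustrative assumptions, not part of the original answer.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for a headerless CSV file on disk or at a URL;
# the values and column names are made up for illustration.
raw = io.StringIO(
    "1.0,2.0,A\n"
    "1.5,1.8,A\n"
    "5.0,8.0,B\n"
    "6.0,9.0,B\n"
)

# header=None: the file has no header row; na_values="?" treats "?" as missing.
df = pd.read_csv(raw, header=None, names=["f1", "f2", "label"], na_values="?")

X = df[["f1", "f2"]].to_numpy()  # feature matrix, shape (n_samples, n_features)
y = df["label"].to_numpy()       # target vector, shape (n_samples,)

# Any scikit-learn estimator accepts plain arrays like these.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = clf.predict([[5.5, 8.5]])
print(pred)
```

pandas.read_csv also accepts a URL or a local path in place of the StringIO object, so the same pattern should apply to a downloaded file once its separator, header, and missing-value marker are known.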

To create model matrices from your data frame, I recommend the patsy library, which implements a model specification language similar to R formulas:

import patsy
# dmatrices accepts a two-sided formula and returns (response, design matrix);
# dmatrix only accepts a formula without a left-hand side
y, X = patsy.dmatrices("my_response ~ my_model_formula", my_data_frame)

Then X and y can be passed to the various sklearn models.
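
A minimal sketch of the patsy route, under made-up data: the frame, the column names y, x1, g, and the formula are assumptions for illustration. It uses patsy.dmatrices, which takes a two-sided formula and returns the response and the design matrix together, and it fits LinearRegression with fit_intercept=False because patsy already adds an intercept column.

```python
import numpy as np
import pandas as pd
import patsy
from sklearn.linear_model import LinearRegression

# Made-up data: y = 1 + x1 exactly; g is a categorical column.
df = pd.DataFrame(
    {
        "y": [1.0, 2.0, 3.0, 4.0],
        "x1": [0.0, 1.0, 2.0, 3.0],
        "g": ["a", "b", "a", "b"],
    }
)

# The left side of "~" becomes the response, the right side the design matrix.
# patsy adds an intercept column and dummy-codes the categorical g.
y_dm, X_dm = patsy.dmatrices("y ~ x1 + g", df)

X = np.asarray(X_dm)          # shape (4, 3): intercept, dummy for g, x1
y = np.asarray(y_dm).ravel()  # flatten the (4, 1) response to shape (4,)

# The intercept is already a column of X, so the estimator should not add one.
model = LinearRegression(fit_intercept=False).fit(X, y)
fitted = model.predict(X)
print(X_dm.design_info.column_names)
```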

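Answer 1 (score: 0)

Just run the following command, replacing the name "EXTERNALDATASETNAME" with the name of your dataset. This works only for datasets that scikit-learn ships a fetch_ function for.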

import sklearn.datasets 
data = sklearn.datasets.fetch_EXTERNALDATASETNAME()
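
The fetch_ functions exist only for a fixed set of datasets, so it can help to check which ones the installed version provides before substituting a name. The sketch below simply lists module attributes whose names start with "fetch_"; it downloads nothing.

```python
import sklearn.datasets

# Collect the names of all fetcher functions exposed by sklearn.datasets.
fetchers = sorted(
    name for name in dir(sklearn.datasets) if name.startswith("fetch_")
)
print(fetchers)
```

For data without a dedicated fetcher, such as the UCI file in the question, reading the file with pandas is probably the simpler route; fetch_openml may also be an option if the dataset happens to be hosted on OpenML.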