Question

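I am trying to use scikit-learn to perform machine learning on a dataset parsed from JSON files. To use the dataset API in scikit-learn, I need a NumPy array of shape (n_samples, n_features).

I have this data encoded as a nested Python list, where the outer list has some large number of samples and each element has the form [int, float, int] (3 features).

Ex: [[int, float, int], [int, float, int], ...]

I need to convert it to a NumPy array that will work properly with the scikit-learn dataset API, but I can't seem to create a NumPy array that supports a different type for each column.

NumPy arrays are fundamentally homogeneous, but I find it hard to believe that features/columns of different types in a dataset are still a limitation of this API, and I have seen examples that use features of different types.

The documentation on loading your own dataset is poor: http://scikit-learn.org/stable/tutorial/basic/tutorial.html. Any help creating the NumPy array and/or using the dataset API would be greatly appreciated.

My code is posted below (although the question is really about what to do next):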

import json

with open('bc_mp_at_blockchain.json') as data_:
    mp_json = json.load(data_)

with open('bc_tv_at_blockchain.json') as data:
    tv_json = json.load(data)

# each JSON object is a dictionary of length 1 whose 'values' key holds the list of data points
list_of_mpdata = mp_json['values']
list_of_tvdata = tv_json['values']

# ensure both sets of data start on the same day
assert list_of_mpdata[0]['x'] == list_of_tvdata[0]['x']

# concatenate lists as necessary
combined_list = []
for mp_dict, tv_dict in zip(list_of_mpdata, list_of_tvdata):
    combined_list.append([mp_dict['x'], mp_dict['y'], tv_dict['y']])

# combined_list is now a list of [int,float,int] lists

Answer 1

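If you have a list of lists, you can convert it to a NumPy array with np.array(combined_list). The resulting shape has the length of the outer list as its first dimension (down the rows), e.g.,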

>>> import numpy as np
>>> a = np.array([[1,2,3],[1,2,3]])
>>> a.shape
(2, 3)

If I understand correctly, that should be the right n_samples * n_features order for scikit-learn, but if not, you can transpose the array with:

>>> a = a.T
>>> a.shape
(3, 2)
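
As a minimal sketch of the orientation point above: a list with one inner list per sample already gives (n_samples, n_features), so a transpose is only needed when the data arrive as one list per feature. The values and the names samples, columns, and X_from_columns are made up for illustration.

```python
import numpy as np

# Hypothetical values; each inner list is one sample with 3 features.
samples = [[1, 2.5, 3], [4, 5.5, 6], [7, 8.5, 9], [10, 11.5, 12]]
X = np.array(samples)
n_samples, n_features = X.shape  # rows are samples, columns are features

# If the data arrived as one list per feature instead, the array would be
# (n_features, n_samples) and would need transposing.
columns = [[1, 4, 7, 10], [2.5, 5.5, 8.5, 11.5], [3, 6, 9, 12]]
X_from_columns = np.array(columns).T

print(X.shape, X_from_columns.shape)
print(np.array_equal(X, X_from_columns))
```

For the combined_list built in the question, each inner list is already one sample, so no transpose should be needed.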

Answer 2

You can create a NumPy array with numpy.array(combined_list); all values will be converted to float. Int-to-float conversion generally does not affect a machine learning analysis.
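
A small sketch of that conversion, assuming rows in the question's [int, float, int] layout. The numbers below are invented, not taken from the JSON files. It also checks that the integer columns survive the round trip through float64 unchanged, which holds for integers of this magnitude.

```python
import numpy as np

# Hypothetical sample rows in the question's [int, float, int] layout.
combined_list = [
    [1230940800, 0.0, 0],
    [1231027200, 4.5, 120],
    [1231113600, 5.25, 300],
]

X = np.array(combined_list)

print(X.dtype)  # float64: the ints are upcast to match the float column
print(X.shape)  # (3, 3): n_samples rows by n_features columns

# The integer columns are represented exactly after the upcast.
first_column = [row[0] for row in combined_list]
print(X[:, 0].astype(np.int64).tolist() == first_column)  # True
```

The resulting X is a plain two-dimensional float array, which is the form scikit-learn estimators generally expect for their input features.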

NumPy array with features of different types for the scikit-learn dataset API in Python
