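I have CSV data whose first column is the 'label'; the remaining 784 columns after it hold the representation of a 28 * 28 image.
I used the following function to create a tuple of numpy arrays.
The next step is to split this dataset 80% / 20% for training and validation. For that I use the loadData()
method shown below. When I run the splitting function, I get the error "could not broadcast input array from shape (5851,784) into shape (5851)".
My question is simply how to split the usable tuple produced by load(filename)
into two datasets. Any help?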
import csv
import numpy as np

filename = dir_path + 'train1.csv'

def load(filename):
    # read file into a list of rows
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile, delimiter=',')
        rows = list(lines)
    # create empty numpy arrays of the required size
    data = np.empty((len(rows), len(rows[0]) - 1), dtype=np.float64)
    expected = np.empty((len(rows),), dtype=np.int64)
    # fill array with data from the csv-rows
    for i, row in enumerate(rows):
        data[i, :] = row[1:]    # pixel values
        expected[i] = row[0]    # label
    training_data = data, expected
    return training_data

print(load(filename))
Result:
(array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]]), array([1, 1, 1, ..., 1, 1, 1]))
Running this function to do the split:
def loadData():
    train_data = load(train_name)
    #test_data = load(test_name)
    training_data, validation_data = np.split(train_data, [int(.8 * len(train_data))])
    return train_data

print(loadData())
Result: could not broadcast input array from shape (5851,784) into shape (5851)
SOLUTION:
import csv
import numpy as np
from sklearn.model_selection import train_test_split

train_name = dir_path + 'train8.csv'
test_name = dir_path + 'test8.csv'

def load(filename):
    # read file into a list of rows
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile, delimiter=',')
        rows = list(lines)
    # create empty numpy arrays of the required size
    data = np.empty((len(rows), len(rows[0]) - 1), dtype=np.float64)
    expected = np.empty((len(rows),), dtype=np.int64)
    # fill array with data from the csv-rows
    for i, row in enumerate(rows):
        data[i, :] = row[1:]
        expected[i] = row[0]
    result_data = data, expected
    return result_data

def loadData():
    # read the training file once and unpack the (data, labels) tuple
    train_data, labels = load(train_name)
    test_data = load(test_name)
    # test_size=0.33 holds out 33% for validation; use 0.2 for an 80% / 20% split
    x_train, x_test, y_train, y_test = train_test_split(train_data, labels, test_size=0.33)
    training_data = (x_train, y_train)
    validation_data = (x_test, y_test)
    return (training_data, validation_data, test_data)

This solution matches the layout of the MNIST dataset.
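Answer 0 (score: 0)
As far as I can tell, you are passing np.split a tuple made up of a matrix and an array with different shapes,
which is why you get the broadcast error: np.split has to treat its input as one array, and the two pieces cannot be combined into one. If you give np.split
a single matrix, it works fine: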
train_data = np.zeros((5000, 784))
labels = np.zeros(5000)
train, test = np.split(train_data, [int(0.8 * len(train_data))])
print("Train: {0}, Test: {1}".format(train.shape, test.shape))
This gives the following output:
Train: (4000, 784), Test: (1000, 784)
If you pass a tuple of the matrix and the array:
train_data = np.zeros((5000, 784))
labels = np.zeros(5000)
train,test = np.split((train_data,labels), [int(0.8 *len(train_data))])
you get the broadcast error:
ValueError: could not broadcast input array from shape (5000,784) into shape (5000)
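If you would rather stay with np.split, one option is to split the matrix and the label array separately at the same row index. The following is only a minimal sketch: split_pair and train_fraction are hypothetical names introduced for illustration, the zero arrays stand in for the output of load(filename), and the rows are not shuffled.

```python
import numpy as np

def split_pair(data, labels, train_fraction=0.8):
    """Split data and labels at the same row index (hypothetical helper, no shuffling)."""
    cut = int(train_fraction * len(data))
    # each np.split call receives a single array, so nothing has to be broadcast
    train_x, valid_x = np.split(data, [cut])
    train_y, valid_y = np.split(labels, [cut])
    return (train_x, train_y), (valid_x, valid_y)

# stand-in arrays with the shapes from the question
data = np.zeros((5851, 784))
labels = np.zeros(5851, dtype=np.int64)

training_data, validation_data = split_pair(data, labels)
print(training_data[0].shape, training_data[1].shape)
print(validation_data[0].shape, validation_data[1].shape)
```

Because both calls use the same cut index, row i of the data still corresponds to label i in each half.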
If you want to split a dataset together with its labels, I suggest using something like scikit-learn's
train_test_split (available via pip install scikit-learn),
which handles both the observations and the labels. The equivalent function:
import numpy as np
from sklearn.model_selection import train_test_split

def loadData():
    train_data = np.zeros((5000, 784))
    labels = np.zeros(5000)
    # test_size is the fraction held out for validation
    x_train, x_test, y_train, y_test = train_test_split(train_data, labels, test_size=0.22)
    print("Training samples: {0}, training labels: {1}".format(x_train.shape, y_train.shape))
    print("Validation samples: {0}, validation labels: {1}".format(x_test.shape, y_test.shape))

if __name__ == "__main__":
    loadData()
Output:
Training samples: (3900, 784), training labels: (3900,)
Validation samples: (1100, 784), validation labels: (1100,)
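If scikit-learn is not available, a shuffled split can also be approximated with NumPy alone by permuting the row indices and slicing. This is a sketch under assumptions, not a replacement for train_test_split: shuffled_split, valid_fraction and seed are hypothetical names, and the small arange arrays exist only to make it easy to check that rows and labels stay paired.

```python
import numpy as np

def shuffled_split(data, labels, valid_fraction=0.2, seed=0):
    """Shuffle row indices once, then slice data and labels with the same indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_valid = int(round(valid_fraction * len(data)))
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return (data[train_idx], labels[train_idx]), (data[valid_idx], labels[valid_idx])

# toy data: row i starts with the value 4 * i and has label i
data = np.arange(5000 * 4, dtype=np.float64).reshape(5000, 4)
labels = np.arange(5000)

(train_x, train_y), (valid_x, valid_y) = shuffled_split(data, labels)
print(train_x.shape, train_y.shape, valid_x.shape, valid_y.shape)
```

Indexing both arrays with the same index array is what keeps each image attached to its label after shuffling.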