Could not broadcast input array

Time: 2017-11-21 11:15:47

Tags: python numpy mnist

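I have CSV data in which the first column is the label and the remaining 784 columns hold the representation of a 28 * 28 image.

I used the following function to create a tuple of NumPy arrays.

The next step is to split this dataset 80%/20% for training and validation. For that I use the loadData() function shown below. When I run it to do the split, I get the error: could not broadcast input array from shape (5851,784) into shape (5851).

My question: I just want to split the tuple produced by load(filename) into two datasets. Any help?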

import csv
import numpy as np

filename=dir_path+'train1.csv'
def load(filename):
    # read file into a list of rows
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile, delimiter=',')
        rows = list(lines)

    # create empty numpy arrays of the required size
    data = np.empty((len(rows), len(rows[0])-1), dtype=np.float64)
    expected = np.empty((len(rows),), dtype=np.int64)

    # fill array with data from the csv-rows
    for i, row in enumerate(rows):
        data[i,:] = row[1:]
        expected[i] = row[0]

    training_data = data, expected
    return training_data

print(load(filename))

Result:

 (array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           ..., 
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.]]), array([1, 1, 1, ..., 1, 1, 1]))

Running this function to do the split:

def loadData():
    train_data= load(train_name)
    #test_data= load(test_name)

    training_data,validation_data =np.split(train_data, [int(.8 * len(train_data))])

    return train_data

print(loadData())

Result: could not broadcast input array from shape (5851,784) into shape (5851)

  

SOLUTION:

import csv

import numpy as np
from sklearn.model_selection import train_test_split

train_name=dir_path+'train8.csv'
test_name=dir_path+'test8.csv'

def load(filename):
    # read file into a list of rows
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile, delimiter=',')
        rows = list(lines)

    # create empty numpy arrays of the required size
    data = np.empty((len(rows), len(rows[0])-1), dtype=np.float64)
    expected = np.empty((len(rows),), dtype=np.int64)

    # fill array with data from the csv-rows
    for i, row in enumerate(rows):
        data[i,:] = row[1:]
        expected[i] = row[0]

    result_data = data, expected
    return result_data

def loadData():
    # load the training file once and unpack the data and the labels
    train_data, labels = load(train_name)
    test_data= load(test_name)

    x_train, x_test, y_train, y_test = train_test_split(train_data, labels, test_size=0.33)

    training_data = (x_train, y_train)
    validation_data=(x_test, y_test)

    return (training_data, validation_data, test_data)

This solution matches the MNIST dataset.
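The load() function above can probably be written more compactly with np.loadtxt. This is only a minimal sketch, assuming the CSV has no header row and the label sits in the first column; the name load_compact and the tiny in-memory sample are hypothetical and used for illustration only.

```python
import io

import numpy as np


def load_compact(source):
    """Read a label-first CSV into a (data, expected) tuple of arrays."""
    # ndmin=2 keeps the table 2-D even if the file has a single row
    table = np.loadtxt(source, delimiter=",", ndmin=2)
    data = table[:, 1:].astype(np.float64)    # every column after the label
    expected = table[:, 0].astype(np.int64)   # first column holds the label
    return data, expected


# Hypothetical sample: 2 rows, each with 1 label and 3 pixel values
sample = io.StringIO("1,0,0,255\n7,0,128,0\n")
data, expected = load_compact(sample)
print(data.shape, expected)
```

np.loadtxt accepts a file name as well as a file-like object, so the same call should work with the path to the real training file.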

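1 Answer:

Answer 0 (score: 0)

As far as I can tell, you are passing np.split a tuple made up of a matrix and an array (with different shapes), which is why you get the broadcast error. If you give np.split a single matrix, it works fine: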

import numpy as np

train_data = np.zeros((5000, 784))
labels = np.zeros(5000)

train,test = np.split(train_data, [int(0.8 * len(train_data))])
print("Train: {0}, Test: {1}".format(train.shape, test.shape))

This gives the following output:

Train: (4000, 784), Test: (1000, 784)

If you pass a tuple of the matrix and the array:

train_data = np.zeros((5000, 784))
labels = np.zeros(5000)

train,test = np.split((train_data,labels), [int(0.8 *len(train_data))])

you get the broadcast error:

ValueError: could not broadcast input array from shape (5000,784) into shape (5000)
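If you would rather stay with plain NumPy, one option is to split each array separately at the same row index, so that np.split only ever sees one array at a time. The following is a minimal sketch: the helper name split_pair is hypothetical, the zero arrays stand in for the tuple returned by load(), and the rows are not shuffled.

```python
import numpy as np


def split_pair(data, labels, train_fraction=0.8):
    """Split data and labels at the same row index (no shuffling)."""
    if len(data) != len(labels):
        raise ValueError("data and labels must have the same number of rows")
    cut = int(train_fraction * len(data))
    # Each np.split call receives a single array, so nothing has to be
    # combined into one array of mixed shapes.
    x_train, x_val = np.split(data, [cut])
    y_train, y_val = np.split(labels, [cut])
    return (x_train, y_train), (x_val, y_val)


# Placeholder arrays standing in for the (data, expected) tuple from load()
train_data = np.zeros((5000, 784))
labels = np.zeros(5000, dtype=np.int64)

training_data, validation_data = split_pair(train_data, labels)
print(training_data[0].shape, training_data[1].shape)
print(validation_data[0].shape, validation_data[1].shape)
```

Because the rows keep their original order, this is only appropriate when the file is not sorted by label; otherwise shuffle first or use a library routine that shuffles for you.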

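If you want to split a dataset together with its labels, I suggest using something like scikit-learn's train_test_split (installable with pip install scikit-learn), which handles both the observations and the labels. The same function using it: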

import numpy as np
from sklearn.model_selection import train_test_split

def loadData():

    train_data = np.zeros((5000, 784))
    labels = np.zeros(5000)
    x_train, x_test, y_train, y_test = train_test_split(train_data, labels, test_size=0.22)

    print("Training samples: {0}, training labels: {1}".format(x_train.shape, y_train.shape))
    print("Validation samples: {0}, validation labels: {1}".format(x_test.shape, y_test.shape))

if __name__ == "__main__":
    loadData()

Output:

Training samples: (3900, 784), training labels: (3900,)
Validation samples: (1100, 784), validation labels: (1100,)
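Since the question asks for an 80%/20% split, the same call can be used with test_size=0.2. This is a minimal sketch: the zero arrays are placeholders for the output of load(), and random_state=0 is an arbitrary value chosen here only to make the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the (data, expected) tuple from load()
train_data = np.zeros((5000, 784))
labels = np.zeros(5000, dtype=np.int64)

# test_size=0.2 holds out 20% of the rows for validation;
# random_state fixes the shuffle so repeated runs give the same split
x_train, x_val, y_train, y_val = train_test_split(
    train_data, labels, test_size=0.2, random_state=0
)

training_data = (x_train, y_train)
validation_data = (x_val, y_val)
print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)
```

train_test_split shuffles the rows by default and keeps each sample aligned with its label, which is what the tuple-based np.split attempt was trying to achieve.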