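I have the arrays x_train and targets_train. I want to shuffle the training data, split it into smaller batches, and use the batches as training data. My original data has 1000 rows, and I try to use 250 rows at a time: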
x_train = np.memmap('/home/usr/train', dtype='float32', mode='r', shape=(1000, 1, 784))
# print(x_train)
targets_train = np.memmap('/home/usr/train_label', dtype='int32', mode='r', shape=(1000, 1))
train_idxs = [i for i in range(x_train.shape[0])]
np.random.shuffle(train_idxs)
num_batches_train = 4
def next_batch(start, train, labels, batch_size=250):
    newstart = start + batch_size
    if newstart > train.shape[0]:
        newstart = 0
    idxs = train_idxs[start:start + batch_size]
    # print(idxs)
    return train[idxs, :], labels[idxs, :], newstart
# x_train_lab = x_train[:200]
# # x_train = np.array(targets_train)
# targets_train_lab = targets_train[:200]
for i in range(num_batches_train):
    x_train, targets_train, newstart = next_batch(i*batch_size, x_train, targets_train, batch_size=250)
The problem is that when I shuffle the training data and try to access the batches, I get this error:
    return train[idxs, :], labels[idxs, :], newstart
IndexError: index 250 is out of bounds for axis 0 with size 250
Does anyone know what I'm doing wrong?
Answer 0: (score: 1)
(Edit: my first guess, about newstart, has been removed.)
In this line:
x_train, targets_train, newstart = next_batch(i*batch_size, x_train, targets_train, batch_size=250)
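x_train changes size on every iteration, yet you keep using the train_idxs array that was created for the full-size array.
Pulling random values out of x_train in batches is one thing, but you have to keep the selection arrays consistent.
Given the lack of a minimal, verifiable example, this question probably should have been closed. It is frustrating to have to guess and build a small testable example in the hope of reproducing the problem.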
https://stackoverflow.com/help/mcve
If my current guess is wrong, a few intermediate print statements would be enough to track the problem down.
========================
Reducing the code to a simple case:
import numpy as np
x_train = np.arange(20).reshape(20,1)
train_idxs = np.arange(x_train.shape[0])
np.random.shuffle(train_idxs)
num_batches_train = 4
batch_size=5
def next_batch(start, train):
    idxs = train_idxs[start:start + batch_size]
    print(train.shape, idxs)
    return train[idxs, :]
for i in range(num_batches_train):
    x_train = next_batch(i*batch_size, x_train)
    print(x_train)
Running it produces:
1658:~/mypy$ python3 stack39919181.py
(20, 1) [ 7 18 3 0 9]
[[ 7]
 [18]
 [ 3]
 [ 0]
 [ 9]]
(5, 1) [13 5 2 15 1]
Traceback (most recent call last):
  File "stack39919181.py", line 14, in <module>
    x_train = next_batch(i*batch_size, x_train)
  File "stack39919181.py", line 11, in next_batch
    return train[idxs, :]
IndexError: index 13 is out of bounds for axis 0 with size 5
I fed the (5, 1) x_train back into next_batch, but then tried to index it as if it were the original.
Changing the iteration to:
for i in range(num_batches_train):
    x_batch = next_batch(i*batch_size, x_train)
    print(x_batch)
lets it run and produce 4 batches of 5 rows.
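The same idea can be packaged so that the full arrays are never rebound by accident. What follows is only a sketch under stated assumptions: iter_batches is a hypothetical helper name, the zero-filled arrays stand in for the question's memmap files, and np.random.default_rng is used for shuffling instead of np.random.shuffle.

```python
import numpy as np

def iter_batches(train, labels, batch_size, rng):
    """Yield shuffled (train, labels) batches without modifying the inputs.

    Hypothetical helper, not part of the original code.
    """
    # One permutation of the row indices, built for the full-size arrays.
    idxs = rng.permutation(train.shape[0])
    for start in range(0, train.shape[0], batch_size):
        batch_idxs = idxs[start:start + batch_size]
        # Always index the full arrays; the caller never overwrites them.
        yield train[batch_idxs], labels[batch_idxs]

# Stand-in data with the same layout as the question (1000 rows).
x_train = np.zeros((1000, 1, 784), dtype='float32')
targets_train = np.arange(1000, dtype='int32').reshape(1000, 1)

rng = np.random.default_rng(0)
batch_shapes = []
seen_labels = []
for x_batch, targets_batch in iter_batches(x_train, targets_train, 250, rng):
    batch_shapes.append((x_batch.shape, targets_batch.shape))
    seen_labels.extend(targets_batch[:, 0].tolist())
```

Because the loop binds each batch to x_batch and targets_batch, x_train and targets_train keep their 1000 rows, and the shuffled indices stay valid for every batch.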
Answer 1: (score: 0)
The problem is in this line of the function definition:
idxs = train_idxs[start:start + batch_size]
Change it to:
idxs = train_idxs[start: newstart]
Then it should work as expected!
Also, change the variable names in the for loop to:
batch_size = 250
for i in range(num_batches_train):
    x_train_split, targets_train_split, newstart = next_batch(i*batch_size,
                                                              x_train,
                                                              targets_train,
                                                              batch_size=250)
    print(x_train_split.shape, targets_train_split.shape, newstart)
Sample output:
(250, 1, 784) (250, 1) 250
(250, 1, 784) (250, 1) 500
(250, 1, 784) (250, 1) 750
(250, 1, 784) (250, 1) 1000
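Since next_batch already returns newstart, the loop could also be driven by that value instead of i*batch_size. The sketch below makes two assumptions that are not in the original code: the wrap test uses >= rather than >, so that a newstart of 1000 wraps to 0 (with >, the following call would slice an empty batch), and zero-filled arrays stand in for the memmap files.

```python
import numpy as np

# Stand-in data with the same layout as the question (assumption).
x_train = np.zeros((1000, 1, 784), dtype='float32')
targets_train = np.zeros((1000, 1), dtype='int32')
train_idxs = np.random.permutation(x_train.shape[0])

def next_batch(start, train, labels, batch_size=250):
    newstart = start + batch_size
    # Take the slice before wrapping, so the last batch is not empty.
    idxs = train_idxs[start:newstart]
    # Assumption: >= instead of the question's >, so 1000 wraps to 0.
    if newstart >= train.shape[0]:
        newstart = 0
    return train[idxs, :], labels[idxs, :], newstart

start = 0
starts = []
for _ in range(8):  # two passes over the data
    starts.append(start)
    # The batches get their own names; x_train and targets_train stay intact.
    x_batch, targets_batch, start = next_batch(start, x_train, targets_train)
```

Here starts records the offset used for each call, so it is easy to check that the second pass begins again at row 0.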