Question

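Background:

I'm feeding data into a neural network. It starts as a document (one long string), gets split into sentences, and each sentence is reduced to a 1 or a 0 depending on whether it has a feature (in this case, a category of word).

The problem is that documents have different numbers of sentences, so there can't be a 1-1 mapping between sentences and input neurons; you have to train with a fixed number of neurons (unless I'm missing something).

So I'm working on an algorithm to map the arrays to a fixed size while preserving, as much as possible, the frequency and position of the 1s in the array (since that's what the NN is basing its decisions on).

Code:

Say we're aiming for a fixed length of 10 sentences, or neurons, and need to handle arrays both smaller and larger than that.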

import math

new_length = 10
short = [1,0,1,0,0,0,0,1]
long  = [1,1,0,0,1,0,0,0,0,1,0,0,1]

def map_to_fixed_length(arr, new_length):
    arr_length = len(arr)
    partition_size = arr_length/new_length
    res = []
    for i in range(new_length):
        slice_start_index = int(math.floor(i * partition_size))
        slice_end_index = int(math.ceil(i * partition_size))
        partition = arr[slice_start_index:slice_end_index]
        val = sum(partition)
        res.append([slice_start_index, slice_end_index, partition])
        if val > 0:
            res.append(1)
        else:
            res.append(0)
    return res

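Probably not very Pythonic. In any case, the problem is that this omits certain index slices. For example, the last index of short is omitted, and because of rounding, various other indices are omitted too.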
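To make the gaps concrete, here is a small check that replays the same floor/ceil slice boundaries and reports which input indices no slice ever touches. It is only a sketch: `missing_indices` is a hypothetical helper, not part of the code above, and it assumes Python 3 (true division).

```python
import math

def missing_indices(arr_length, new_length):
    """Indices never covered by slices floor(i*p):ceil(i*p), p = arr_length/new_length."""
    partition_size = arr_length / new_length
    covered = set()
    for i in range(new_length):
        start = int(math.floor(i * partition_size))
        end = int(math.ceil(i * partition_size))
        covered.update(range(start, end))  # empty when start == end
    return sorted(set(range(arr_length)) - covered)

# 13 is the length of `long`, 10 is new_length
print(missing_indices(13, 10))  # [0, 4, 8, 12]
```

Both boundaries are computed from the same product `i * partition_size`, so each slice is at most one element wide and anything between consecutive slices is skipped.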

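This is a simplified version of what I've been working on, which mostly consists of adding if statements to patch all the gaps that get left. But is there a better way to do this? Something a bit more statistically sound?

I looked through numpy, but all the resize functions just pad with zeros or with something otherwise arbitrary.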

Answer 1

One simple approach might be to use scipy.interpolate.interp1d, like so:

>>> import numpy as np
>>> from scipy.interpolate import interp1d

>>> def resample(data, n):
...     m = len(data)
...     xin, xout = np.arange(n, 2*m*n, 2*n), np.arange(m, 2*m*n, 2*m)
...     return interp1d(xin, data, 'nearest', fill_value='extrapolate')(xout)
... 
>>> resample(short, new_length)
array([1., 0., 0., 1., 0., 0., 0., 0., 0., 1.])
>>> 
>>> resample(long, new_length)
array([1., 1., 0., 1., 0., 0., 0., 1., 0., 1.])
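As a reading aid, the two `np.arange` calls can be looked at on their own. My interpretation (an assumption, the answer does not spell it out) is that both arrays are cell centers on a shared axis of length 2*m*n: m input cells of width 2*n and n output cells of width 2*m, so everything stays in integers. `cell_centers` is a hypothetical helper name used only for this sketch.

```python
import numpy as np

def cell_centers(m, n):
    """Rebuild xin and xout exactly as resample() does for input length m and output length n."""
    xin = np.arange(n, 2 * m * n, 2 * n)   # m centers, spacing 2*n
    xout = np.arange(m, 2 * m * n, 2 * m)  # n centers, spacing 2*m
    return xin, xout

# 8 is the length of `short`, 10 is new_length
xin, xout = cell_centers(8, 10)
print(xin.tolist())   # [10, 30, 50, 70, 90, 110, 130, 150]
print(xout.tolist())  # [8, 24, 40, 56, 72, 88, 104, 120, 136, 152]
```

With 'nearest', each output center simply takes the value of the closest input center, which is why every output is a copy of exactly one input element.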

Answer 2

Follow-up: once the network was up and running, I tested Paul Panzer's answer against the best one I could come up with (below).

import math

def resample(arr, new_length):
    arr_length = len(arr)
    partition_size = arr_length/new_length
    res = []
    last_round_end_slice = 0
    for i in range(new_length):
        slice_start_index = int(math.floor(i * partition_size))
        slice_end_index = int(math.ceil(i * partition_size))
        # close any gap left by rounding: start where the previous slice ended
        if slice_start_index > last_round_end_slice:
            slice_start_index = last_round_end_slice
        # the first slice would otherwise be empty (0:0)
        if i == 0:
            slice_end_index = int(math.ceil(partition_size))
        # the last slice runs to the end of the array
        # (range(new_length) stops at new_length - 1)
        if i == new_length - 1:
            slice_end_index = arr_length
        partition = arr[slice_start_index:slice_end_index]
        # 1 if any sentence in the partition has the feature, else 0
        val = sum(partition)
        res.append(1 if val > 0 else 0)
        last_round_end_slice = slice_end_index
    return res

Ugly, but it works.

Average accuracy after 1000 training epochs (full optimization passes over all batches):

0.9765427094697953 for mine, 0.968362500667572 for scipy

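Interpolation is the correct answer, but in this use case rolling your own looks a little better. The problem with interpolation is that it sometimes produces arrays that are all 0s; that is, it doesn't so much misrepresent the matches as eliminate them.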
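A toy case shows how that can happen. This is a sketch under assumptions, not the code that was benchmarked: `nearest_sample` is a pure-Python stand-in for nearest-neighbour resampling at cell centers (ties resolved toward the lower index), `any_pool` is a hypothetical gap-free pooling variant, and `sparse` is made-up data.

```python
def nearest_sample(arr, n):
    """Pick, for each of n output cells, the input element nearest to the cell center."""
    m = len(arr)
    return [arr[((2 * i + 1) * m - 1) // (2 * n)] for i in range(n)]

def any_pool(arr, n):
    """Output 1 if any element in floor(i*m/n) : ceil((i+1)*m/n) is nonzero."""
    m = len(arr)
    return [int(any(arr[i * m // n : -(-(i + 1) * m // n)])) for i in range(n)]

sparse = [1] + [0] * 29   # a single match at the very start of 30 sentences

print(nearest_sample(sparse, 10))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(any_pool(sparse, 10))        # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Nearest sampling reads only one element per output cell (here indices 1, 4, 7, ...), so a lone 1 that falls between sample points disappears, while pooling over the whole cell keeps it. The trade-off is that `any_pool` slices overlap when the input is shorter than the output, which can duplicate a 1 into neighbouring cells.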

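I think the ultimate requirement is consistency. As long as the inputs are determined consistently, the network can learn from them.

In case anyone ends up stumbling on this: it was an interesting one.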

Map arbitrary length list to fixed length, preserving frequency and position of inner results (as much as possible)
