Question

I have a pyspark DataFrame with 30 observations for each unique ID, as shown below:

id  time  features
 1     0   [1,2,3]
 1     1   [4,5,6]
..    ..        ..
 1    29   [7,8,9]
 2     0   [0,1,2]
 2     1   [3,4,5]
..    ..        ..
 2    29   [6,7,8]
..    ..        ..

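What I need to do is create a set of sequences to feed into a keras neural network. For example, say I have the following smaller dataset for a single id: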

id  time  features
 1     0    [1,2,3]
 1     1    [4,5,6]
 1     2    [7,8,9]

The required data format is:

[[[1,2,3],
  [0,0,0]
  [0,0,0]],
 [[1,2,3],
  [4,5,6],
  [0,0,0]],
 [[1,2,3],
  [4,5,6],
  [7,8,9]]]

I can use the pad_sequences function from the keras package to add the [0,0,0] rows, so all I really need to do is create the following array for every ID.

[[[1,2,3]],
 [[1,2,3],
  [4,5,6]],
 [[1,2,3],
  [4,5,6],
  [7,8,9]]]
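
To make the padding step concrete, here is a minimal pure-Python sketch of what it needs to produce. The helper `pad_post` is hypothetical and is only a stand-in for keras' pad_sequences, not its implementation. Note that pad_sequences pads at the front by default (padding='pre'), so the layout shown above would presumably need padding='post'.

```python
def pad_post(sequences, maxlen, width):
    """Append all-zero rows to each sequence until it has maxlen rows.

    Hypothetical helper: mimics post-padding only, for illustration.
    """
    return [
        list(seq) + [[0] * width for _ in range(maxlen - len(seq))]
        for seq in sequences
    ]

# The ragged array of growing prefixes for one id.
ragged = [
    [[1, 2, 3]],
    [[1, 2, 3], [4, 5, 6]],
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
]

padded = pad_post(ragged, maxlen=3, width=3)
print(padded[0])  # [[1, 2, 3], [0, 0, 0], [0, 0, 0]]
```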

The only way I can think of is to use a loop, like this:

x = []
for i in range(10000):
    user = x_train[i]  # the 30 feature rows of one ID
    arr = []
    for j in range(30):
        # prefix up to and including time step j
        arr.append(user[0:j + 1])
    x.append(arr)

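The loop solution is not feasible, though. I have 904 batches of 10,000 unique IDs, each with 30 observations. I collect one batch at a time into a numpy array, so a numpy solution would be fine. A pyspark solution using RDDs would be great. Maybe something using map?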
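
As a rough illustration of the "something using map" idea, the per-ID work can be written as a plain function and mapped over groups. The function name `cumulative_prefixes` and the `(time, features)` pair layout are assumptions made for this sketch. In Spark, the same function might be applied with something like `rdd.groupByKey().mapValues(cumulative_prefixes)` on an RDD keyed by id, but that is untested here; the code below uses only Python's built-in map.

```python
def cumulative_prefixes(rows):
    """Build the growing prefixes for ONE id.

    rows: iterable of (time, features) pairs, in any order (assumed layout).
    """
    # Sort by time so the prefixes follow the observation order.
    ordered = [features for _, features in sorted(rows, key=lambda r: r[0])]
    return [ordered[:k + 1] for k in range(len(ordered))]

# Toy data already grouped by id; the rows of id 1 are deliberately shuffled.
grouped = {
    1: [(1, [4, 5, 6]), (0, [1, 2, 3]), (2, [7, 8, 9])],
    2: [(0, [0, 1, 2]), (1, [3, 4, 5])],
}

result = dict(zip(grouped, map(cumulative_prefixes, grouped.values())))
print(result[1][1])  # [[1, 2, 3], [4, 5, 6]]
```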

Answer 1

Here is a numpy solution that creates the desired output, including the zeros. It uses triu_indices to create the "cumulative time series structure":

import numpy as np
from timeit import timeit

def time_series(nids, nsteps, features):
    f3d = np.reshape(features, (nids, nsteps, -1))
    f4d = np.zeros((nids, nsteps, nsteps, f3d.shape[-1]), f3d.dtype)
    i, j = np.triu_indices(nsteps)
    f4d[:, j, i, :] = f3d[:, i, :]
    return f4d

nids = 2
nsteps = 4
nfeatures = 3
features = np.random.randint(1, 100, (nids * nsteps, nfeatures))

print('small example', time_series(nids, nsteps, features))

nids = 10000
nsteps = 30
nfeatures = 3
features = np.random.randint(1, 100, (nids * nsteps, nfeatures))

print('time needed for big example {:6.4f} secs'.format(
    timeit(lambda: time_series(nids, nsteps, features), number=10)/10))

Output:

small example [[[[76 53 48]
   [ 0  0  0]
   [ 0  0  0]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [ 0  0  0]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [62 39 17]
   [ 0  0  0]]

  [[76 53 48]
   [46 59 76]
   [62 39 17]
   [61 90 69]]]


 [[[68 32 20]
   [ 0  0  0]
   [ 0  0  0]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [ 0  0  0]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [30  3  9]
   [ 0  0  0]]

  [[68 32 20]
   [47 11 72]
   [30  3  9]
   [28 73 78]]]]
time needed for big example 0.2251 secs
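
To see why the triu_indices assignment gives the cumulative structure, it may help to look at the index pairs for a tiny case and compare the result with an explicit loop. This is only a sanity-check sketch on a single id with 3 steps; the names `expected` and `seq` are introduced here for illustration.

```python
import numpy as np

nsteps = 3
# All (i, j) pairs with i <= j: time step i belongs to every sequence j >= i.
i, j = np.triu_indices(nsteps)
print(i)  # [0 0 0 1 1 2]
print(j)  # [0 1 2 1 2 2]

# One id with rows [1,2,3], [4,5,6], [7,8,9].
f3d = np.arange(1, 10).reshape(1, nsteps, 3)
f4d = np.zeros((1, nsteps, nsteps, 3), f3d.dtype)
# Sequence j receives time step i at position i; everything else stays zero.
f4d[:, j, i, :] = f3d[:, i, :]

# Reference result built with an explicit loop over sequences.
expected = np.zeros_like(f4d)
for seq in range(nsteps):
    expected[0, seq, :seq + 1] = f3d[0, :seq + 1]

print(np.array_equal(f4d, expected))  # True
```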

Answer 2

Why don't you do something along these lines:

dict1 = {}

for tuple1 in your_collection:
    if tuple1['id'] not in dict1:
        # If we've never seen the id, add a list of lists of feature
        # lists as its entry.
        dict1[tuple1['id']] = [[tuple1['features']]]
    else:
        # If we've seen this id, copy the previous (n-1) list of feature
        # lists from the current dictionary entry, add the current list
        # of features to the copy, and append the updated list back to
        # the entry (which is essentially a 3D matrix). So each entry is
        # a 3D list keyed by id.
        prev_list = dict1[tuple1['id']][-1][:]
        prev_list.append(tuple1['features'])
        dict1[tuple1['id']].append(prev_list)

This has fairly poor space complexity, but it may work if you are dealing with a data set of limited size.
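
To put a rough number on the space cost: with n time steps per id, the prefixes hold 1 + 2 + ... + n = n(n+1)/2 row entries. The sketch below works this out for the sizes mentioned in the question (30 steps, 10,000 ids per batch). The helper `rows_stored` is hypothetical. Since the `[:]` copy is shallow, these entries are references to shared feature lists rather than full copies of them.

```python
def rows_stored(nsteps):
    """Row entries held by all prefixes of one id: 1 + 2 + ... + nsteps."""
    return nsteps * (nsteps + 1) // 2

per_id = rows_stored(30)        # 465 entries instead of the original 30 rows
per_batch = per_id * 10_000     # 4,650,000 entries for one batch of ids

print(per_id, per_batch)  # 465 4650000
```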

Converting sequences into arrays of sequences
