I have a PySpark DataFrame with 30 observations for each unique ID, like this:
id time features
1 0 [1,2,3]
1 1 [4,5,6]
.. .. ..
1 29 [7,8,9]
2 0 [0,1,2]
2 1 [3,4,5]
.. .. ..
2 29 [6,7,8]
.. .. ..
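What I need to do is create a series of sequences to feed into a Keras neural network. For example, suppose I have the following smaller dataset for a single id: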
id time features
1 0 [1,2,3]
1 1 [4,5,6]
1 2 [7,8,9]
The desired data format is:
[[[1,2,3],
[0,0,0]
[0,0,0]],
[[1,2,3],
[4,5,6],
[0,0,0]],
[[1,2,3],
[4,5,6],
[7,8,9]]]
I can use the pad_sequences function from the keras package to add the [0,0,0] rows, so all I really need to do is create the following array for every ID.
[[[1,2,3]],
[[1,2,3],
[4,5,6]],
[[1,2,3],
[4,5,6],
[7,8,9]]]
The only way I can think of to do this is with loops, like so:
x = []
for i in range(10000):
    user = x_train[i]
    arr = []
    for j in range(30):
        # j + 1 so each prefix includes the row at time j (user[0:0] would be empty)
        arr.append(user[0:j + 1])
    x.append(arr)
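The loop solution isn't feasible, though. I have 904 batches of 10,000 unique IDs, each with 30 observations. I collect one batch at a time into a numpy array, so a numpy solution is fine. A pyspark solution using RDDs would be fantastic. Maybe something using map?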
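For reference, the padding step mentioned above can be sketched in plain NumPy. This is only an illustration: `pad_post` is a hypothetical helper, not part of keras. If keras' pad_sequences is used instead, note that its `padding` argument defaults to `'pre'`, so `padding='post'` would be needed to put the zero rows at the end as in the desired format.

```python
import numpy as np

def pad_post(sequences, maxlen, n_features):
    """Zero-pad each variable-length sequence at the end, up to maxlen rows.

    Hypothetical helper for illustration; assumes every sequence has at
    most maxlen rows and exactly n_features columns.
    """
    out = np.zeros((len(sequences), maxlen, n_features), dtype=np.int64)
    for k, seq in enumerate(sequences):
        seq = np.asarray(seq)
        out[k, :len(seq), :] = seq
    return out

# Ragged prefixes for the single-id example above
prefixes = [
    [[1, 2, 3]],
    [[1, 2, 3], [4, 5, 6]],
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
]
padded = pad_post(prefixes, maxlen=3, n_features=3)
print(padded)
```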
Answer 0 (score: 1)
Here is a numpy solution that creates the desired output, zeros included. It uses triu_indices to build the "cumulative time series" structure:
import numpy as np
from timeit import timeit

def time_series(nids, nsteps, features):
    # (nids * nsteps, nfeatures) -> (nids, nsteps, nfeatures)
    f3d = np.reshape(features, (nids, nsteps, -1))
    # One zero-filled (nsteps, nfeatures) block per id and per prefix length
    f4d = np.zeros((nids, nsteps, nsteps, f3d.shape[-1]), f3d.dtype)
    # All index pairs with i <= j: prefix j contains time steps 0..j
    i, j = np.triu_indices(nsteps)
    f4d[:, j, i, :] = f3d[:, i, :]
    return f4d

nids = 2
nsteps = 4
nfeatures = 3
features = np.random.randint(1, 100, (nids * nsteps, nfeatures))
print('small example', time_series(nids, nsteps, features))

nids = 10000
nsteps = 30
nfeatures = 3
features = np.random.randint(1, 100, (nids * nsteps, nfeatures))
print('time needed for big example {:6.4f} secs'.format(
    timeit(lambda: time_series(nids, nsteps, features), number=10)/10))
Output:
small example [[[[76 53 48]
[ 0 0 0]
[ 0 0 0]
[ 0 0 0]]
[[76 53 48]
[46 59 76]
[ 0 0 0]
[ 0 0 0]]
[[76 53 48]
[46 59 76]
[62 39 17]
[ 0 0 0]]
[[76 53 48]
[46 59 76]
[62 39 17]
[61 90 69]]]
[[[68 32 20]
[ 0 0 0]
[ 0 0 0]
[ 0 0 0]]
[[68 32 20]
[47 11 72]
[ 0 0 0]
[ 0 0 0]]
[[68 32 20]
[47 11 72]
[30 3 9]
[ 0 0 0]]
[[68 32 20]
[47 11 72]
[30 3 9]
[28 73 78]]]]
time needed for big example 0.2251 secs
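To see why the indexing works, it may help to look at what triu_indices returns for a small case and to compare the result against explicit prefix slices. The following is a minimal check on the question's 3-step example, assuming a single id; the variable names are chosen here for illustration.

```python
import numpy as np

nsteps = 3
i, j = np.triu_indices(nsteps)
# Every (i, j) pair with i <= j: time step i belongs to prefix j
pairs = list(zip(i.tolist(), j.tolist()))
print(pairs)

features = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # one id, 3 steps
f3d = features.reshape(1, nsteps, -1)
f4d = np.zeros((1, nsteps, nsteps, 3), dtype=f3d.dtype)
f4d[:, j, i, :] = f3d[:, i, :]

# Reference: build the same array with explicit prefix slices
expected = np.zeros_like(f4d)
for t in range(nsteps):
    expected[0, t, :t + 1, :] = features[:t + 1]

print(np.array_equal(f4d, expected))
```

Each row of prefix j that has no matching pair (i > j) is never written, which is where the zero padding comes from.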
Answer 1 (score: 0)
Why not do something along these lines:
dict1 = {}
for tuple1 in your_collection:
    if tuple1['id'] not in dict1:
        # First time we see this id: start a list of lists of feature lists.
        dict1[tuple1['id']] = [[tuple1['features']]]
    else:
        # We've seen this id: copy the previous (n-1) list of feature
        # lists from the current dictionary entry, add the current
        # features to the copy, and append the updated list to the
        # entry (which is essentially a 3D matrix). Each entry is
        # therefore a 3D list keyed by id.
        prev_list = dict1[tuple1['id']][-1][:]
        prev_list.append(tuple1['features'])
        dict1[tuple1['id']].append(prev_list)
This has fairly poor space complexity, but it may work if you're dealing with a dataset of limited size.
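To make the space cost concrete, here is a small self-contained variant of the same idea run on the question's 3-row example. It is a sketch only: `cumulative_sequences` is a name made up for illustration, and it assumes the rows arrive sorted by time within each id.

```python
def cumulative_sequences(rows):
    """Group rows by id and build the growing prefixes.

    Assumes rows are dicts with 'id' and 'features' keys and are
    already sorted by time within each id.
    """
    by_id = {}
    for row in rows:
        prefixes = by_id.setdefault(row['id'], [])
        prev = prefixes[-1] if prefixes else []
        # prev + [...] creates a new list, so earlier prefixes are untouched
        prefixes.append(prev + [row['features']])
    return by_id

rows = [
    {'id': 1, 'time': 0, 'features': [1, 2, 3]},
    {'id': 1, 'time': 1, 'features': [4, 5, 6]},
    {'id': 1, 'time': 2, 'features': [7, 8, 9]},
]
result = cumulative_sequences(rows)
# Number of feature-list references stored for id 1
stored = sum(len(prefix) for prefix in result[1])
print(result[1])
print(stored)
```

For n time steps per id this stores n * (n + 1) / 2 feature-list references, so 30 steps means 465 per id instead of 30.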