Separating into pairs with pyspark

Time: 2016-12-19 01:29:33

Tags: python apache-spark pyspark

I have data in an RDD:

[[[1, 3]],
 [[1, 3, 5, 4, 2], [1, 3, 4, 2]],
 [[1, 3, 5], [1, 3, 4, 2, 5]],
 [[1, 3, 5, 4], [1, 3, 4]],
 [[3, 5, 1], [3, 5, 4, 1], [3, 4, 1], [3, 4, 2, 5, 1]]]

How can I get this result:

[(1, 3)], [[(1,3), (3,5), (5,4), (4,2)], [(1,3), (3,4), (4,2)]]

1 answer:

Answer 0 (score: 0)

You need to apply a pairing function to each inner list in your data structure:

def pairs(seq):
  # zip the sequence with itself shifted by one to get consecutive pairs
  return list(zip(seq, seq[1:]))

[[pairs(e2) for e2 in e1] for e1 in data]
# [[[(1, 3)]], [[(1, 3), (3, 5), (5, 4), (4, 2)], [(1, 3), (3, 4), (4, 2)]], [[(1, 3), (3, 5)], [(1, 3), (3, 4), (4, 2), (2, 5)]], [[(1, 3), (3, 5), (5, 4)], [(1, 3), (3, 4)]], [[(3, 5), (5, 1)], [(3, 5), (5, 4), (4, 1)], [(3, 4), (4, 1)], [(3, 4), (4, 2), (2, 5), (5, 1)]]]
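Since the data lives in an RDD, the same per-list transformation can be applied with `map` and materialized with `collect`. A minimal sketch (the pyspark calls are shown in comments because they need a live `SparkContext`; the local comprehension below is the equivalent of what `map`/`collect` would return):

```python
def pairs(seq):
    # consecutive-pair function from the answer above
    return list(zip(seq, seq[1:]))

data = [[[1, 3]],
        [[1, 3, 5, 4, 2], [1, 3, 4, 2]],
        [[1, 3, 5], [1, 3, 4, 2, 5]],
        [[1, 3, 5, 4], [1, 3, 4]],
        [[3, 5, 1], [3, 5, 4, 1], [3, 4, 1], [3, 4, 2, 5, 1]]]

# With pyspark (assumes sc is an existing SparkContext):
#   rdd = sc.parallelize(data)
#   result = rdd.map(lambda group: [pairs(seq) for seq in group]).collect()

# Local equivalent of the map/collect pipeline:
result = [[pairs(seq) for seq in group] for group in data]
print(result[0])  # [[(1, 3)]]
print(result[1])  # [[(1, 3), (3, 5), (5, 4), (4, 2)], [(1, 3), (3, 4), (4, 2)]]
```

`map` keeps the outer structure of each RDD element intact, which is what the expected output asks for; `flatMap` would instead merge the inner lists into one level.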