Converting a Spark RDD of tuples to NumPy arrays

Date: 2015-09-04 01:46:27

Tags: python pyspark

If my RDD looks like this:

(key, (date, value), (date, value), (date, value))

how can I convert it to

(key, (Numpy.Array(date), Numpy.Array(value)))

1 Answer:

Answer 0 (score: 1):

You can use zip to reshape the (date, value) pairs:

>>> xs = (("x1", 1), ("x2", 2), ("x3", 3))
>>> list(zip(*xs))  # in Python 3, zip returns an iterator, hence list()
[('x1', 'x2', 'x3'), (1, 2, 3)]

Adding a map or a comprehension handles the (Numpy.Array(date), Numpy.Array(value)) part, and the rest is straightforward:

import numpy as np
import datetime

rdd = sc.parallelize([
    ("foo",
        (datetime.date(2010, 1, 1),  1.0),
        (datetime.date(2011, 2, 10), 2.0),
        (datetime.date(2012, 3, 10), 3.0)
    ),
    ("bar",
        (datetime.date(2000, 4, 1),  14.0),
        (datetime.date(2001, 5, 10), 15.0),
        (datetime.date(2002, 6, 10), 16.0)
    ),
])

# zip(*x[1:]) separates dates from values; each group becomes a NumPy array
rdd.map(lambda x: (x[0], tuple(np.array(group) for group in zip(*x[1:]))))
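
The lambda itself does not depend on Spark, so it can be checked on a single record without a SparkContext. A minimal sketch (the names `record` and `transform` are illustrative, not from the original answer):

```python
import datetime
import numpy as np

# One record in the same shape as the RDD elements above:
# (key, (date, value), (date, value), (date, value))
record = (
    "foo",
    (datetime.date(2010, 1, 1), 1.0),
    (datetime.date(2011, 2, 10), 2.0),
    (datetime.date(2012, 3, 10), 3.0),
)

# Same function as passed to rdd.map: zip(*x[1:]) regroups the pairs
# into (all dates, all values), and each group is wrapped in an array.
def transform(x):
    return (x[0], tuple(np.array(group) for group in zip(*x[1:])))

key, (dates, values) = transform(record)
print(key)     # foo
print(values)  # [1. 2. 3.]
```

Note that the dates array has dtype `object` (it holds `datetime.date` instances), while the values become a float array.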