How to split an RDD by column into a list of RDDs in Python

Time: 2018-06-19 15:09:00

Tags: python apache-spark rdd

Suppose we have this RDD:

RDDs = sc.parallelize([["panda", 0], ["pink", 3]])

Since the RDD has two columns, I would like to get two RDDs, one per column, like this:

RDDList[0] = (["panda"], ["pink"])
RDDList[1] = ([0], [3])

I could not find any earlier discussion of this topic. Is this possible?

2 Answers:

Answer 0 (score: 2)

You can do the following:

RDDs = sc.parallelize([["panda", 0], ["pink", 3]])

cols = [0, 1]
RDDList = [(RDDs.map(lambda x: [x[col]]).collect()) for col in cols]

which should give you

print(RDDList[0])
# [['panda'], ['pink']]

print(RDDList[1])
# [[0], [3]]
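
Note that collect() returns plain Python lists to the driver, so RDDList above holds lists rather than RDDs. If the columns need to stay as RDDs, a minimal sketch might look like the one below. split_by_column is a hypothetical helper name, not part of PySpark, and it assumes every row has as many fields as the first one. The column index is bound through a default argument because map() is lazy: the lambdas only run once an action is called, and by then a plain closure over the loop variable could see only its final value.

```python
def split_by_column(rdd, num_cols=None):
    """Return one RDD per column; each element is a one-item list.

    Hypothetical helper, not a PySpark API. `rdd` only needs the usual
    RDD methods first() and map().
    """
    if num_cols is None:
        # Assumption: every row has the same length as the first row.
        num_cols = len(rdd.first())
    # `c=col` freezes the current index inside each lambda. Without it,
    # every lambda would look up `col` only when it is finally executed.
    return [rdd.map(lambda row, c=col: [row[c]]) for col in range(num_cols)]

# Usage with the question's data (assumes an existing SparkContext `sc`):
#   RDDs = sc.parallelize([["panda", 0], ["pink", 3]])
#   RDDList = split_by_column(RDDs)
#   RDDList[0].collect()
```

Nothing is computed until an action such as collect() is called on one of the returned RDDs, so this avoids pulling every column back to the driver up front.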

I hope this answer helps.

Answer 1 (score: 1)

This builds on @Ramesh Maharjan's answer to make it work for any RDD, whatever its number of columns (Python 3.x):

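RDDList = []
# Take the column count from the first row instead of hard-coding it
for i in range(len(RDDs.first())):
    # collect() runs inside the loop, so each lambda uses the current i
    RDDList.append(RDDs.map(lambda x: [x[i]]).collect())

print(RDDList[0])
print(RDDList[1])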

Expected output:

[['panda'], ['pink']]
[[0], [3]]
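
Both answers call collect() once per column, and each call is a separate Spark job. When the data is small enough to collect anyway, one possible alternative is to collect it a single time and transpose it on the driver. This is only a sketch: columns_from_rows is a hypothetical helper name, the literal rows list stands in for RDDs.collect(), and the rows are assumed to have equal length (zip would silently truncate to the shortest row).

```python
def columns_from_rows(rows):
    """Transpose a list of rows into per-column lists of one-item lists."""
    # zip(*rows) groups the i-th field of every row together.
    return [[[value] for value in column] for column in zip(*rows)]

# Stand-in for `RDDs.collect()` on the question's data.
rows = [["panda", 0], ["pink", 3]]
RDDList = columns_from_rows(rows)

print(RDDList[0])  # [['panda'], ['pink']]
print(RDDList[1])  # [[0], [3]]
```

The result has the same shape as the output above, but like the answers it is a list of Python lists, not a list of RDDs.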