How do I convert an RDD of lists into an RDD of zipped lists?

Asked: 2019-05-09 06:38:26

Tags: pyspark bigdata

RDD(List(1, 2, 3), List('A', 'B', 'C'), List('a', 'b', 'c'))

I want to convert it to

RDD(List(1, 'A', 'a'), List(2, 'B', 'b'), List(3, 'C', 'c'))

How can I do this in PySpark without using a collect operation?

I tried the following:

lst = [[1, 2, 3], ['A', 'B', 'C'], ['a', 'b', 'c']]
l = sc.parallelize(lst)
lst_new = l.reduce(lambda x, y: zip(x, y))
for i in lst_new:
    print(i)
    
Output:
((1, 'A'), 'a')
((2, 'B'), 'b')
((3, 'C'), 'c')

Required output: RDD(List(1, 'A', 'a'), List(2, 'B', 'b'), List(3, 'C', 'c'))

so that I can convert it into a DataFrame:

+---+---+---+
| A1| A2| A3|
+---+---+---+
|  1|  A|  a|
|  2|  B|  b|
|  3|  C|  c|
+---+---+---+

1 Answer:

Answer 0 (score: 0)

RDDs work with (key, value) pairs. When you zip the first RDD with the second RDD, the values from the first RDD become the keys of the new RDD, and the values from the second RDD become its values.
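As a quick check of that claim (a minimal sketch, assuming a live SparkContext named sc), the keys() and values() of a zipped RDD come from the first and the second source RDD respectively:

left = sc.parallelize([1, 2, 3])
right = sc.parallelize(['A', 'B', 'C'])
zipped = left.zip(right)
print(zipped.keys().collect())    # [1, 2, 3] - the elements of the first RDD
print(zipped.values().collect())  # ['A', 'B', 'C'] - the elements of the second RDD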

Now let's work through example 1:

Create the RDDs

# Python lists
a = [1, 2, 3]
b = ['A', 'B', 'C']
c = ['a', 'b', 'c']

# 3 different RDDs from the Python lists
rdda = sc.parallelize(a)
rddb = sc.parallelize(b)
rddc = sc.parallelize(c)

Zip them one at a time and inspect the (key, value) pairs:

d = rdda.zip(rddb)
print(d.take(1))
[(1, 'A')]  # 1 is the key here and 'A' is the value

d = d.zip(rddc)
print(d.take(1))
[((1, 'A'), 'a')]  # (1, 'A') is the key here and 'a' is the value

print(d.collect())  # This alone doesn't give us the desired output
[((1, 'A'), 'a'), ((2, 'B'), 'b'), ((3, 'C'), 'c')]

# To get the desired output we need to merge the key tuple and the value into one flat tuple using map

print(d.map(lambda x: x[0] + (x[1],)).take(1))
[(1, 'A', 'a')]

# In lambda x: x[0] + (x[1],), x[0] is the tuple of keys (1, 'A') and x[1] is the
# plain string value 'a'. Wrapping x[1] in a one-element tuple (x[1],) lets us
# concatenate the keys and the value into a single flat tuple.
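The same tuple concatenation works in plain Python, independent of Spark, if you want to convince yourself of what the lambda does:

x = ((1, 'A'), 'a')    # one element of the zipped RDD
print(x[0] + (x[1],))  # (1, 'A', 'a')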

Finally, convert to a DataFrame:

d.map(lambda x: x[0] + (x[1],)).toDF().show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  A|  a|
|  2|  B|  b|
|  3|  C|  c|
+---+---+---+
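Since the question's DataFrame uses the column headers A1, A2, and A3, note that toDF also accepts a list of column names, which avoids the default _1, _2, _3:

d.map(lambda x: x[0] + (x[1],)).toDF(['A1', 'A2', 'A3']).show()
+---+---+---+
| A1| A2| A3|
+---+---+---+
|  1|  A|  a|
|  2|  B|  b|
|  3|  C|  c|
+---+---+---+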

Hope this helps you work through your second example as well.
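If you would rather start from a single RDD holding all three lists, as in the original question, rather than three separate RDDs, here is a hypothetical sketch of one way to transpose it without collect, using zipWithIndex, flatMap, and groupByKey (variable names are illustrative; a live SparkContext sc is assumed):

rows = sc.parallelize([[1, 2, 3], ['A', 'B', 'C'], ['a', 'b', 'c']])

transposed = (rows
    .zipWithIndex()                           # (row, row_index)
    .flatMap(lambda ri: [(col, (ri[1], val))  # key each value by its column position
                         for col, val in enumerate(ri[0])])
    .groupByKey()                             # gather all values for each column position
    .sortByKey()
    .map(lambda kv: [val for _, val in sorted(kv[1])]))  # restore row order within each group

print(transposed.collect())  # collect used here only to display the result
# [[1, 'A', 'a'], [2, 'B', 'b'], [3, 'C', 'c']]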