I have an RDD whose elements are tuples of the form:
[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ...
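I want to convert it into a key-value pair RDD, where the first field (the key) is the first string and the second field (the value) is a list of the remaining strings, i.e. I want to transform it into the form: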
[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ...
Answer 0 (score: 7)
>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])
>>> result = rdd.map(lambda x: (x[0], list(x[1:])))
>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]
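Explanation of lambda x: (x[0], list(x[1:])):
x[0] takes the first element of the input tuple and makes it the first element of the output pair (the key).
x[1:] yields every element except the first.
list(x[1:]), the second element of the output pair, forces that slice to be a list, because slicing a tuple returns a tuple by default.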
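
To check the mapping logic without a Spark context, here is a minimal sketch that applies the same function to a plain Python list. The names `to_key_value` and `rows` are made up for illustration, and a list comprehension stands in for `rdd.map`, so this only demonstrates the per-element transformation, not Spark itself.

```python
# Hypothetical helper: same logic as the lambda passed to rdd.map above.
def to_key_value(row):
    # row[0] is the key; row[1:] is a tuple slice, so list() converts it.
    return (row[0], list(row[1:]))

rows = [("a1", "b1", "c1", "d1", "e1"), ("a2", "b2", "c2", "d2", "e2")]

# A list comprehension stands in for rdd.map here; no Spark is needed.
result = [to_key_value(row) for row in rows]
print(result)
# [('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]

# Slicing a tuple yields a tuple, which is why list() is applied.
print(type(rows[0][1:]).__name__)  # tuple
```

If a tuple is acceptable as the value, the `list()` call could be dropped and `x[1:]` used directly.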