PySpark - Convert an RDD into a key-value pair RDD, with the values in a list

Date: 2015-10-16 15:58:21

Tags: apache-spark pyspark rdd key-value

I have an RDD of tuples in the format:

[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ...

What I want is to turn it into a key-value pair RDD, where the first field is the first string (the key) and the second field is a list of strings (the value); that is, I want to transform it into the form:

[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ...

1 Answer:

Answer (score: 7):

>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])

>>> result = rdd.map(lambda x: (x[0], list(x[1:])))

>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]
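
Because the result is now a pair RDD, Spark's key-value operations apply directly. A quick sketch continuing the session above (keys() and mapValues() are standard RDD methods):

>>> result.keys().collect()
['a1', 'a2']
>>> result.mapValues(len).collect()
[('a1', 4), ('a2', 4)]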

Explanation of lambda x: (x[0], list(x[1:])):

  1. x[0] takes the first element of each tuple and makes it the first element (the key) of the output pair
  2. x[1:] slices out everything after the first element, which becomes the second element of the pair
  3. list(x[1:]) forces the slice into a list, because slicing a tuple would otherwise yield a tuple
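
To see why step 3 matters, run the same map without list() (continuing the session above); slicing a tuple yields a tuple, so the values come back as tuples instead of lists:

>>> print(rdd.map(lambda x: (x[0], x[1:])).collect())
[('a1', ('b1', 'c1', 'd1', 'e1')), ('a2', ('b2', 'c2', 'd2', 'e2'))]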