pyspark - Convert a collected list to tuples

Time: 2017-11-27 06:28:58

Tags: python pyspark

My dataframe looks like this:

+------------+---------------------+
| invoice_id | newcolor            |
+------------+---------------------+
|          1 | [red, white, green] |
|          2 | [red, green]        |
+------------+---------------------+

I need a new column containing the following:

[('red', 'color'), ('white', 'color'), ('green', 'color')]
[('red', 'color'), ('green', 'color')]

1 answer:

Answer 0: (score: 1)

You can define a UDF as follows:

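from pyspark.sql import functions as F
from pyspark.sql import types as T

def addColor(x):
    # Pair each color with the literal string 'color'
    return [[color, 'color'] for color in x]

# addColor returns a list of lists, so the return type is an array of string arrays
udfAddColor = F.udf(addColor, T.ArrayType(T.ArrayType(T.StringType())))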

Then apply it with .withColumn:

df.withColumn('newcolor', udfAddColor(df.newcolor)).show(truncate=False)

This should give you the desired output:

+----------+----------------------------------------------+
|invoice_id|newcolor                                      |
+----------+----------------------------------------------+
|1         |[[red, color], [white, color], [green, color]]|
|2         |[[red, color], [green, color]]                |
+----------+----------------------------------------------+
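
The output above holds nested arrays rather than tuples, because Spark has no tuple column type. If pairs closer to the tuples in the question are wanted, one possible variation is to return Python tuples and declare the result as an array of structs. The sketch below is only an illustration under that assumption: the names add_color, with_color_pairs, and the struct field names value and label are hypothetical and do not come from the original answer.

```python
def add_color(colors):
    """Pair every color with the literal label 'color'."""
    # Spark passes a null array to a Python UDF as None, so pass it through.
    if colors is None:
        return None
    return [(c, "color") for c in colors]


def with_color_pairs(df, column="newcolor"):
    """Return df with `column` replaced by an array of (value, label) structs."""
    # Imported here so add_color can be tried without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    # Hypothetical field names; a Python tuple maps onto the struct by position.
    pair_type = T.StructType([
        T.StructField("value", T.StringType()),
        T.StructField("label", T.StringType()),
    ])
    pair_udf = F.udf(add_color, T.ArrayType(pair_type))
    return df.withColumn(column, pair_udf(F.col(column)))


# Plain-Python check of the pairing logic
print(add_color(["red", "white", "green"]))
# → [('red', 'color'), ('white', 'color'), ('green', 'color')]
```

With a dataframe like the one in the question, with_color_pairs(df).show(truncate=False) would be the matching call. Collecting the rows then yields Row objects for each pair, which behave like tuples in Python.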