My DataFrame looks like this:
+------------+---------------------+
| invoice_id | newcolor            |
+------------+---------------------+
| 1          | [red, white, green] |
+------------+---------------------+
| 2          | [red, green]        |
+------------+---------------------+
I need a new column containing the following:
[('red', 'color'), ('white', 'color'), ('green', 'color')]
[('red', 'color'), ('green', 'color')]
Answer 0 (score: 1)
You can define a udf function as
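from pyspark.sql import functions as F
from pyspark.sql import types as T

def addColor(x):
    # Pair every color with the literal label 'color'
    return [[color, 'color'] for color in x]

# Each row yields a list of [color, 'color'] pairs, so the return type is an array of string arrays
udfAddColor = F.udf(addColor, T.ArrayType(T.ArrayType(T.StringType())))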
Then use it with .withColumn as
df.withColumn('newcolor', udfAddColor(df.newcolor)).show(truncate=False)
You should get the desired output:
+----------+----------------------------------------------+
|invoice_id|newcolor |
+----------+----------------------------------------------+
|1 |[[red, color], [white, color], [green, color]]|
|2 |[[red, color], [green, color]] |
+----------+----------------------------------------------+
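If you want to check the pairing logic before wiring it into Spark, the row-level function can be run on plain Python lists. This is only a minimal sketch: the sample rows mirror the question's data, and the None check is my own assumption (a null array would typically reach a Python UDF as None), not something the question asked for.

```python
def addColor(x):
    # Assumption: a null array arrives as None, so pass it through unchanged
    if x is None:
        return None
    # Pair every color with the literal label 'color'
    return [[color, 'color'] for color in x]

# Sample rows mirroring the question's data, plus a hypothetical null row
rows = [['red', 'white', 'green'], ['red', 'green'], None]
results = [addColor(row) for row in rows]

for row, result in zip(rows, results):
    print(row, '->', result)
```

Note that the pairs come back as two-element lists rather than tuples, which matches how Spark displays nested arrays in the output above.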