PySpark: Convert a string column to an array of strings

Time: 2020-09-24 16:27:54

Tags: apache-spark pyspark pyspark-dataframes

I have a DataFrame like this:

data = [(('ID1', "[apples, mangos, eggs, milk, oranges]")),
   (('ID1', "[eggs, milk, cereals, mangos, apples]"))]
df = spark.createDataFrame(data, ['ID', "colval"])
df.show(truncate=False)
df.printSchema()

+---+-------------------------------------+
|ID |colval                               |
+---+-------------------------------------+
|ID1|[apples, mangos, eggs, milk, oranges]|
|ID1|[eggs, milk, cereals, mangos, apples]|
+---+-------------------------------------+

root
 |-- ID: string (nullable = true)
 |-- colval: string (nullable = true)

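I want to convert colval to an array of strings, giving the schema shown below. With my attempt (further down), taking the first element after the split still returns the entire original string. Any help?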

root
 |-- ID: string (nullable = true)
 |-- colval: array (nullable = true)
 |    |-- element: string (containsNull = true)

I tried using split, but ended up with this result:

from pyspark.sql.functions import split
df = df.withColumn('colval', split('colval', "', ?'"))
df.show(truncate = False)
df.printSchema()

+---+---------------------------------------+
|ID |colval                                 |
+---+---------------------------------------+
|ID1|[[apples, mangos, eggs, milk, oranges]]|
|ID1|[[eggs, milk, cereals, mangos, apples]]|
+---+---------------------------------------+

root
 |-- ID: string (nullable = true)
 |-- colval: array (nullable = true)
 |    |-- element: string (containsNull = true)

1 answer:

Answer 0 (score: 2)

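You can remove the [ and ] characters and then split on the comma:

from pyspark.sql import functions as F
df.withColumn("colval",F.split(F.regexp_replace("colval",r"\[|\]",""),",")).show(truncate=False)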

+---+-----------------------------------------+
|ID |colval                                   |
+---+-----------------------------------------+
|ID1|[apples,  mangos,  eggs,  milk,  oranges]|
|ID1|[eggs,  milk,  cereals,  mangos,  apples]|
+---+-----------------------------------------+


root
 |-- ID: string (nullable = true)
 |-- colval: array (nullable = true)
 |    |-- element: string (containsNull = true)
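
As for why the attempt in the question left the string unsplit: the pattern "', ?'" only matches a quote, a comma, an optional space and another quote, and the data contains no quote characters. The sketch below reproduces this on a plain Python string with the standard re module. It is an illustration rather than Spark code: Spark's split uses Java regular expressions, but for patterns this simple the matching behaviour should be the same.

```python
import re

# Sample value copied from the question's colval column.
value = "[apples, mangos, eggs, milk, oranges]"

# The question's pattern: quote, comma, optional space, quote.
# The data has no quote characters, so nothing matches and the
# whole string comes back as a single element.
attempt = re.split(r"', ?'", value)
print(attempt)       # ['[apples, mangos, eggs, milk, oranges]']
print(len(attempt))  # 1

# Stripping the brackets and splitting on a bare comma does match.
stripped = re.sub(r"\[|\]", "", value)
parts = stripped.split(",")
print(parts)  # ['apples', ' mangos', ' eggs', ' milk', ' oranges']
```

This also explains the double brackets in the question's show() output: the column became an array with one element, and that element is the original bracketed string.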

If you also want to trim each element after splitting, you can apply a higher-order function to the resulting array:

(df.withColumn("colval",F.split(F.regexp_replace("colval",r"\[|\]",""),","))
.withColumn("colval",F.expr("transform(colval,x-> trim(x))")))

Verification of, and the difference between, methods 1 and 2 (note the extra leading spaces in the first result):

df.withColumn("colval",F.split(F.regexp_replace("colval",r"\[|\]",""),",")).collect()
[Row(ID='ID1', colval=['apples', ' mangos', ' eggs', ' milk', ' oranges']),
 Row(ID='ID1', colval=['eggs', ' milk', ' cereals', ' mangos', ' apples'])]


(df.withColumn("colval",F.split(F.regexp_replace("colval",r"\[|\]",""),","))
 .withColumn("colval",F.expr("transform(colval,x-> trim(x))"))).collect()

[Row(ID='ID1', colval=['apples', 'mangos', 'eggs', 'milk', 'oranges']),
 Row(ID='ID1', colval=['eggs', 'milk', 'cereals', 'mangos', 'apples'])]
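
A possible alternative to the separate trim step is to let the split pattern consume the whitespace after each comma, for example `,\s*`. The sketch below only checks that pattern on plain Python strings with the standard re module; the helper name to_items is hypothetical. Assuming Spark's Java regex engine treats this simple pattern the same way, it should carry over as the second argument of F.split, but that is worth verifying on your Spark version.

```python
import re

# Sample values copied from the question's colval column.
rows = [
    "[apples, mangos, eggs, milk, oranges]",
    "[eggs, milk, cereals, mangos, apples]",
]

def to_items(value):
    """Strip the enclosing brackets, then split on a comma plus any spaces."""
    inner = re.sub(r"\[|\]", "", value)
    return re.split(r",\s*", inner)

result = [to_items(v) for v in rows]
for items in result:
    print(items)
```

Because the whitespace is part of the delimiter, the elements come out already trimmed and no transform pass is needed.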