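I have a PySpark DataFrame like the input data below. I want to split the values in the productname column on whitespace, then create new columns from the first 3 values. Sample input and output data are shown below. Can someone suggest how to do this with PySpark?
Input data: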
+------+-------------------+
|id |productname |
+------+-------------------+
|235832|EXTREME BERRY Sweet|
|419736|BLUE CHASER SAUCE |
|124513|LAAVA C2L5 |
+------+-------------------+
Output:
+------+-------------------+-------------+-------------+-------------+
|id |productname |product1 |product2 |product3 |
+------+-------------------+-------------+-------------+-------------+
|235832|EXTREME BERRY Sweet|EXTREME |BERRY |Sweet |
|419736|BLUE CHASER SAUCE |BLUE |CHASER |SAUCE |
|124513|LAAVA C2L5 |LAAVA |C2L5 | |
+------+-------------------+-------------+-------------+-------------+
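Answer 0 (score: 2)
Split the productname column, then use element_at (or) .getItem() with an index value to create the new columns. Note that element_at is 1-based, while .getItem() is 0-based.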
from pyspark.sql.functions import coalesce, col, element_at, lit, split
# split on runs of whitespace, then pick items by position (element_at is 1-based)
df.withColumn("tmp", split(col("productname"), r"\s+")).\
    withColumn("product1", element_at(col("tmp"), 1)).\
    withColumn("product2", element_at(col("tmp"), 2)).\
    withColumn("product3", coalesce(element_at(col("tmp"), 3), lit(""))).drop("tmp").show()
# or, with getItem (0-based)
df.withColumn("tmp", split(col("productname"), r"\s+")).\
    withColumn("product1", col("tmp").getItem(0)).\
    withColumn("product2", col("tmp").getItem(1)).\
    withColumn("product3", coalesce(col("tmp").getItem(2), lit(""))).drop("tmp").show()
#+------+-------------------+--------+--------+--------+
#| id| productname|product1|product2|product3|
#+------+-------------------+--------+--------+--------+
#|235832|EXTREME BERRY Sweet| EXTREME| BERRY| Sweet|
#| 4| BLUE CHASER SAUCE| BLUE| CHASER| SAUCE|
#| 1| LAAVA C2L5| LAAVA| C2L5| |
#+------+-------------------+--------+--------+--------+
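To sanity-check the expected values without a Spark session, the split-and-pad logic above can be modelled in plain Python. This is only a sketch: first_n_tokens is a hypothetical helper name, and Python's re.split is assumed to be a close enough stand-in for Spark's split on this sample data.

```python
import re

def first_n_tokens(productname, n=3, fill=""):
    """Split on runs of whitespace and return exactly n values.

    Hypothetical helper: missing positions are padded with `fill`,
    which plays the role of coalesce(..., lit("")) in the Spark code.
    """
    tokens = re.split(r"\s+", productname)
    # Pad generously, then truncate to the first n items.
    return (tokens + [fill] * n)[:n]

# Sample rows taken from the question.
rows = [
    (235832, "EXTREME BERRY Sweet"),
    (419736, "BLUE CHASER SAUCE"),
    (124513, "LAAVA C2L5"),
]

expected = {row_id: first_n_tokens(name) for row_id, name in rows}
for row_id, values in expected.items():
    print(row_id, values)
```

The resulting lists can be compared against the product1..product3 columns collected from the DataFrame.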
To do this in a more dynamic way:
df.show()
#+------+-------------------+
#| id| productname|
#+------+-------------------+
#|235832|EXTREME BERRY Sweet|
#| 4| BLUE CHASER SAUCE|
#| 1| LAAVA C2L5|
#+------+-------------------+
from pyspark.sql.functions import coalesce, col, desc, lit, size, split
# calculate the maximum array size and store it in a variable
arr = df.select(size(split(col("productname"), r"\s+")).alias("size")).orderBy(desc("size")).first()[0]
# build one column per position in range(arr), replacing null with ""
cols = [coalesce(col("temp").getItem(i), lit("")).alias("product{}".format(i + 1)) for i in range(arr)]
df.withColumn("temp", split("productname", r"\s+")).select("*", *cols).drop("temp").show()
#+------+-------------------+--------+--------+--------+
#| id| productname|product1|product2|product3|
#+------+-------------------+--------+--------+--------+
#|235832|EXTREME BERRY Sweet| EXTREME| BERRY| Sweet|
#| 4| BLUE CHASER SAUCE| BLUE| CHASER| SAUCE|
#| 1| LAAVA C2L5| LAAVA| C2L5| |
#+------+-------------------+--------+--------+--------+
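The dynamic variant boils down to two passes: find the widest split, then emit that many padded columns for every row. A minimal plain-Python sketch of the same idea follows, assuming rows are plain dicts; split_to_columns and its parameters are hypothetical names, not part of PySpark.

```python
import re

def split_to_columns(rows, column="productname", prefix="product"):
    """Return copies of `rows` with one extra key per token position.

    Hypothetical helper mirroring the two steps of the dynamic approach:
    1) compute the maximum number of tokens, 2) pad every row to that width.
    """
    tokenized = [re.split(r"\s+", row[column]) for row in rows]
    width = max((len(tokens) for tokens in tokenized), default=0)
    result = []
    for row, tokens in zip(rows, tokenized):
        padded = tokens + [""] * (width - len(tokens))
        new_row = dict(row)
        for i, token in enumerate(padded):
            # Column names are 1-based: product1, product2, ...
            new_row["{}{}".format(prefix, i + 1)] = token
        result.append(new_row)
    return result

rows = [
    {"id": 235832, "productname": "EXTREME BERRY Sweet"},
    {"id": 419736, "productname": "BLUE CHASER SAUCE"},
    {"id": 124513, "productname": "LAAVA C2L5"},
]
result = split_to_columns(rows)
for row in result:
    print(row)
```

As in the Spark version, the number of new columns depends on the data, so a single unusually long product name widens the output for every row.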
Answer 1 (score: 1)
You can use split, element_at, and a when/otherwise clause together with array_union to pad the array with empty strings.
from pyspark.sql import functions as F
# array_union drops duplicates, so the two padding values for a size-1 array must differ (" " and "")
df.withColumn("array", F.split("productname", " "))\
    .withColumn("array", F.when(F.size("array") == 2, F.array_union(F.col("array"), F.array(F.lit(""))))\
    .when(F.size("array") == 1, F.array_union(F.col("array"), F.array(F.lit(" "), F.lit(""))))\
    .otherwise(F.col("array")))\
    .withColumn("product1", F.element_at("array", 1))\
    .withColumn("product2", F.element_at("array", 2))\
    .withColumn("product3", F.element_at("array", 3)).drop("array")\
    .show(truncate=False)
+------+-------------------+--------+--------+--------+
|id |productname |product1|product2|product3|
+------+-------------------+--------+--------+--------+
|235832|EXTREME BERRY Sweet|EXTREME |BERRY |Sweet |
|419736|BLUE CHASER SAUCE |BLUE |CHASER |SAUCE |
|124513|LAAVA C2L5 |LAAVA |C2L5 | |
|123455|LAVA |LAVA | | |
+------+-------------------+--------+--------+--------+
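One detail worth keeping in mind with this approach is that array_union returns the union without duplicates, which is presumably why the single-word case pads with " " and "" rather than two empty strings. The plain-Python sketch below models that behaviour; array_union_model is a hypothetical stand-in, and first-appearance ordering is an assumption about how the result is laid out.

```python
def array_union_model(left, right):
    """Hypothetical model of a duplicate-free union of two lists.

    Keeps the first occurrence of each element, in order of appearance
    (the ordering is an assumption made for this illustration).
    """
    seen = set()
    out = []
    for item in left + right:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

# Padding a one-word name with two identical empty strings loses one of them.
same_padding = array_union_model(["LAVA"], ["", ""])
# Using two distinct padding values keeps the array at length 3.
distinct_padding = array_union_model(["LAVA"], [" ", ""])
# A name with a repeated word is also shortened by the union.
repeated_word = array_union_model(["BLUE", "BLUE"], [""])

print(same_padding, distinct_padding, repeated_word)
```

Under this model, product names that repeat a word would end up with fewer than three elements after padding, so the coalesce-based padding from the other answer may be the safer choice for such data.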