from pyspark.sql import Row, functions as F
row = Row("UK_1","UK_2","Date","Cat")
df = sc.parallelize([
    row(1, 1, '12/10/2016', 'A'),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(3, 3, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A'),
    row(1, 1, '12/10/2016', 'A'),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(None, None, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A')
]).toDF()
pks = ["UK_1","UK_2"]
df1 = (
    df
    .select(df.columns)  # select all columns ('columns' alone was undefined)
    # .withColumn('pk', F.concat(pks))  # fails: concat expects columns, not a list
    .withColumn('pk', F.concat("UK_1", "UK_2"))
)
df1.show()
Is there a way to pass a list of columns to `concat`? I want to reuse this code in scenarios where the columns change, so I would like to pass them in as a list.
Answer (score: 3)
Yes, in Python the syntax is `df.withColumn("pk", F.concat(*pks)).show()`
+----+----+------------+---+----+
|UK_1|UK_2| Date|Cat| pk|
+----+----+------------+---+----+
| 1| 1| 12/10/2016| A| 11|
| 1| 2| null| A| 12|
| 2| 1| 14/10/2016| B| 21|
| 3| 3|!~2016/2/276| B| 33|
|null| 1| 26/09/2016| A|null|
| 1| 1| 12/10/2016| A| 11|
| 1| 2| null| A| 12|
| 2| 1| 14/10/2016| B| 21|
|null|null|!~2016/2/276| B|null|
|null| 1| 26/09/2016| A|null|
+----+----+------------+---+----+
(`F.concat` takes a variable number of arguments, so the list must be unpacked with `*`.)
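The `*` unpacking is plain Python, not something PySpark-specific. A minimal sketch with an ordinary function (the `join_cols` helper below is hypothetical, just to illustrate the calling convention that `F.concat` relies on):

```python
# Hypothetical helper mimicking how F.concat declares *cols:
# it accepts a variable number of positional arguments.
def join_cols(*cols):
    # cols arrives as a tuple of the individual arguments
    return "".join(str(c) for c in cols)

pks = ["UK_1", "UK_2"]

# Passing the list itself would hand the function ONE argument (the list),
# which is the original bug; unpacking with * passes each element
# as a separate argument.
print(join_cols(*pks))  # equivalent to join_cols("UK_1", "UK_2")
```

The same reasoning explains why the commented-out `F.concat(pks)` in the question fails while `F.concat(*pks)` works.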