Spark 2.2.1, PySpark
df = sqlContext.createDataFrame([
("dog", "1", "2", "3"),
("cat", "4", "5", "6"),
("dog", "7", "8", "9"),
("cat", "10", "11", "12"),
("dog", "13", "14", "15"),
("parrot", "16", "17", "18"),
("goldfish", "19", "20", "21"),
], ["pet", "dog_30", "cat_30", "parrot_30"])
Then I have a list of the values from the "pet" column that I care about:
dfvalues = ["dog", "cat", "parrot"]
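I want to write code that gives me the value from dog_30, cat_30, or parrot_30 that corresponds to the value in "pet". For example, in the first row the pet column is dog, so we take the value of dog_30, which is 1.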
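I tried the code below, but it only gives nulls for the stats column. I also haven't figured out how to handle the goldfish case, where I want the value set to 0.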
mycols = [F.when(F.col("pet") == p + "_30", p) for p in dfvalues]
df = df.withColumn("newCol2",F.coalesce(*stats) )
df.show()
Desired output:
+--------+------+------+---------+------+
| pet|dog_30|cat_30|parrot_30|stats |
+--------+------+------+---------+------+
| dog| 1| 2| 3| 1 |
| cat| 4| 5| 6| 5 |
| dog| 7| 8| 9| 7 |
| cat| 10| 11| 12| 11 |
| dog| 13| 14| 15| 13 |
| parrot| 16| 17| 18| 18 |
|goldfish| 19| 20| 21| 0 |
+--------+------+------+---------+------+
Answer 0 (score: 3)
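The logic is off. The pet column holds values like dog, not dog_30, so the comparison never matches, and the result should be the column's value rather than the literal name. You need .when(F.col("pet") == p, F.col(p + '_30')):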
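from pyspark.sql import functions as F
# For each pet name, take the value of its <pet>_30 column when the row matches.
mycols = [F.when(F.col("pet") == p, F.col(p + '_30')) for p in dfvalues]
# The first non-null match wins; pets not in dfvalues (e.g. goldfish) fall back to 0.
df = df.withColumn("newCol2", F.coalesce(F.coalesce(*mycols), F.lit(0)))
df.show()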
+--------+------+------+---------+-------+
| pet|dog_30|cat_30|parrot_30|newCol2|
+--------+------+------+---------+-------+
| dog| 1| 2| 3| 1|
| cat| 4| 5| 6| 5|
| dog| 7| 8| 9| 7|
| cat| 10| 11| 12| 11|
| dog| 13| 14| 15| 13|
| parrot| 16| 17| 18| 18|
|goldfish| 19| 20| 21| 0|
+--------+------+------+---------+-------+
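To see why the original attempt returned only nulls, the row-wise logic can be modeled in plain Python without Spark. This is only a sketch: when_ and coalesce here are hypothetical stand-ins that imitate F.when (without otherwise) and F.coalesce, the rows are a subset of the sample data, and the fallback is the string "0" simply to keep the types in this model uniform.

```python
# Plain-Python model of the when/coalesce logic (no Spark required).
# when_() and coalesce() are hypothetical stand-ins, not PySpark functions.

def when_(condition, value):
    """Return value if condition holds, else None (like a when with no otherwise)."""
    return value if condition else None

def coalesce(*values):
    """Return the first non-None value, or None if every value is None."""
    for v in values:
        if v is not None:
            return v
    return None

# A subset of the sample rows from the question.
rows = [
    {"pet": "dog", "dog_30": "1", "cat_30": "2", "parrot_30": "3"},
    {"pet": "cat", "dog_30": "4", "cat_30": "5", "parrot_30": "6"},
    {"pet": "parrot", "dog_30": "16", "cat_30": "17", "parrot_30": "18"},
    {"pet": "goldfish", "dog_30": "19", "cat_30": "20", "parrot_30": "21"},
]
dfvalues = ["dog", "cat", "parrot"]

def original_attempt(row):
    # Compares "dog" against "dog_30", so no condition is ever true.
    return coalesce(*[when_(row["pet"] == p + "_30", p) for p in dfvalues])

def corrected(row):
    # Compares against the pet name, then reads the matching <pet>_30 value.
    picked = [when_(row["pet"] == p, row[p + "_30"]) for p in dfvalues]
    return coalesce(coalesce(*picked), "0")

broken_results = [original_attempt(r) for r in rows]
fixed_results = [corrected(r) for r in rows]
print(broken_results)  # [None, None, None, None]
print(fixed_results)   # ['1', '5', '18', '0']
```

In the model, every condition in the original attempt is false, so coalesce has nothing but None to choose from. The corrected version matches on the pet name, and the outer fallback covers goldfish.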