Spark: how to create a new column's value based on different columns

Date: 2018-10-04 13:40:54

Tags: pyspark apache-spark-sql

Using Spark 2.2.1 with PySpark, I have the following DataFrame:

from pyspark.sql import functions as F

df = sqlContext.createDataFrame([
    ("dog", "1", "2", "3"),
    ("cat", "4", "5", "6"),
    ("dog", "7", "8", "9"),
    ("cat", "10", "11", "12"),
    ("dog", "13", "14", "15"),
    ("parrot", "16", "17", "18"),
    ("goldfish", "19", "20", "21"),
], ["pet", "dog_30", "cat_30", "parrot_30"])

Then I have a list of the values from the "pet" column that I care about:

dfvalues = ["dog", "cat", "parrot"]

I want to write code that gives me, for each row, the value of dog_30, cat_30, or parrot_30 corresponding to the value in "pet". For example, in the first row the pet column holds dog, so we should take the value of dog_30, which is 1.

I tried the code below to build that column, but it only produces null values for the stats column. I also haven't figured out how to handle the goldfish case, where none of the columns match; there I want the result to be 0.

mycols = [F.when(F.col("pet") == p + "_30", p) for p in dfvalues]
df = df.withColumn("newCol2", F.coalesce(*mycols))
df.show()

Desired output:

+--------+------+------+---------+-----+
|     pet|dog_30|cat_30|parrot_30|stats|
+--------+------+------+---------+-----+
|     dog|     1|     2|        3|    1|
|     cat|     4|     5|        6|    5|
|     dog|     7|     8|        9|    7|
|     cat|    10|    11|       12|   11|
|     dog|    13|    14|       15|   13|
|  parrot|    16|    17|       18|   18|
|goldfish|    19|    20|       21|    0|
+--------+------+------+---------+-----+

1 Answer:

Answer 0 (score: 3)

The logic is off: the condition should compare the pet column to the pet name and return the matching _30 column, i.e. .when(F.col("pet") == p, F.col(p + '_30')). Each when without an otherwise evaluates to null when its condition fails, so coalesce picks out the single matching column; the outer coalesce with F.lit(0) supplies the default for rows such as goldfish where nothing matches.

mycols = [F.when(F.col("pet") == p, F.col(p + '_30')) for p in dfvalues]
df = df.withColumn("newCol2", F.coalesce(F.coalesce(*mycols), F.lit(0)))
df.show()
+--------+------+------+---------+-------+
|     pet|dog_30|cat_30|parrot_30|newCol2|
+--------+------+------+---------+-------+
|     dog|     1|     2|        3|      1|
|     cat|     4|     5|        6|      5|
|     dog|     7|     8|        9|      7|
|     cat|    10|    11|       12|     11|
|     dog|    13|    14|       15|     13|
|  parrot|    16|    17|       18|     18|
|goldfish|    19|    20|       21|      0|
+--------+------+------+---------+-------+
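
As a footnote, the nested coalesce can be flattened, and the same lookup can also be written as one chained when/otherwise expression. A minimal sketch, not part of the original answer, assuming the df, dfvalues, and mycols defined above:

# Single coalesce: the lit(0) default is simply the last candidate tried.
df = df.withColumn("newCol2", F.coalesce(*(mycols + [F.lit(0)])))

# Equivalent chained when/otherwise, with no coalesce at all:
expr = F.when(F.col("pet") == dfvalues[0], F.col(dfvalues[0] + "_30"))
for p in dfvalues[1:]:
    expr = expr.when(F.col("pet") == p, F.col(p + "_30"))
df = df.withColumn("newCol2", expr.otherwise(F.lit(0)))

Both variants produce the same newCol2 column as the answer above.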