How do I conditionally replace values in an Apache Spark Dataset?

Time: 2018-02-19 10:48:16

Tags: apache-spark kotlin apache-spark-dataset

I have this dataset:

+-------+-----+--------+--------------------+
|   Name|Order|Orbiting|           Habitable|
+-------+-----+--------+--------------------+
|Mercury|    1|     Sol|                  No|
|  Venus|    2|     Sol|                  No|
|  Earth|    3|     Sol|                 Yes|
|   Mars|    4|     Sol|Only with terrafo...|
|Jupiter|    5|     Sol|                  No|
| Saturn|    6|     Sol|                  No|
| Uranus|    7|     Sol|                  No|
|Neptune|    8|     Sol|                  No|
|  Pluto|    9|     Sol|                  No|
+-------+-----+--------+--------------------+

I want to replace Sol with Sun in the Orbiting column if Name contains "us" and starts with "Ve".

I tried this:

var col = col("Name")
col = col.contains("us").and(col.startsWith("Ve"))

val result = dataset.withColumn(
        "Orbiting",
        functions.regexp_replace(col,
                "Sol",
                "Sun")) 

But with this, the Orbiting column only shows the result of the boolean filter:

+-------+-----+--------+--------------------+
|   Name|Order|Orbiting|           Habitable|
+-------+-----+--------+--------------------+
|Mercury|    1|   false|                  No|
|  Venus|    2|    true|                  No|
|  Earth|    3|   false|                 Yes|
|   Mars|    4|   false|Only with terrafo...|
|Jupiter|    5|   false|                  No|
| Saturn|    6|   false|                  No|
| Uranus|    7|   false|                  No|
|Neptune|    8|   false|                  No|
|  Pluto|    9|   false|                  No|
+-------+-----+--------+--------------------+

What I want to get is:

+-------+-----+--------+--------------------+
|   Name|Order|Orbiting|           Habitable|
+-------+-----+--------+--------------------+
|Mercury|    1|     Sol|                  No|
|  Venus|    2|     Sun|                  No|
|  Earth|    3|     Sol|                 Yes|
|   Mars|    4|     Sol|Only with terrafo...|
|Jupiter|    5|     Sol|                  No|
| Saturn|    6|     Sol|                  No|
| Uranus|    7|     Sol|                  No|
|Neptune|    8|     Sol|                  No|
|  Pluto|    9|     Sol|                  No|
+-------+-----+--------+--------------------+

But only when the value of Orbiting is Sol. If it is, for example, Proxima Centauri, it should stay as it is.

I also tried this:

var col = col("Name")
col = col.contains("us").and(col.startsWith("Ve"))

val result = dataset.withColumn(
        "Orbiting",
        `when`(col, "Sun").otherwise("Sol"))

This works when Orbiting has Sol as its value, but when the value is Proxima Centauri it no longer works, because I cannot filter on it.

How can I solve this?

2 Answers:

Answer 0 (score: 2)

Try

val result = it.withColumn("Orbiting", 
      when(col("Name").startsWith("Ve") && 
      col("Name").contains("nus"), 
    regexp_replace(col("Orbiting"), "Sol", "Sun"))
    .otherwise(col("Orbiting")))

Or, more directly, the following:

val result = it.withColumn("Orbiting",
    when(col("Name") === "Venus", 
    regexp_replace(col("Orbiting"), "Sol", "Sun")).otherwise(col("Orbiting")))

The following import is of course needed:

import org.apache.spark.sql.functions._
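
The `&&` and `===` operators above are Scala syntax on `Column`. Since the question is tagged kotlin, here is a minimal sketch of the same single-expression approach in Kotlin, using the `and` method on `Column` instead. The function name `renameStar` and the small two-column sample data are made up for illustration, and the sketch assumes Spark is on the classpath and a local session can be started.

```kotlin
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DataTypes

// Hypothetical helper: replace "Sol" with "Sun" in Orbiting, but only on rows
// whose Name starts with "Ve" and contains "us". Every other row keeps its
// current Orbiting value because otherwise() falls back to the column itself.
fun renameStar(planets: Dataset<Row>): Dataset<Row> {
    val name = functions.col("Name")
    val matchesName = name.startsWith("Ve").and(name.contains("us"))
    return planets.withColumn(
        "Orbiting",
        functions.`when`(
            matchesName,
            functions.regexp_replace(functions.col("Orbiting"), "Sol", "Sun"))
            .otherwise(functions.col("Orbiting")))
}

fun main() {
    // Assumption: a local single-threaded session is enough for this demo.
    val spark = SparkSession.builder()
        .master("local[1]")
        .appName("planets")
        .getOrCreate()

    // Illustrative subset of the question's data, reduced to two columns.
    val schema = DataTypes.createStructType(listOf(
        DataTypes.createStructField("Name", DataTypes.StringType, false),
        DataTypes.createStructField("Orbiting", DataTypes.StringType, false)))
    val rows = listOf(
        RowFactory.create("Venus", "Sol"),
        RowFactory.create("Earth", "Sol"),
        RowFactory.create("Proxima Centauri b", "Proxima Centauri"))

    renameStar(spark.createDataFrame(rows, schema)).show()
    spark.stop()
}
```

With this sample, only the Venus row should change to Sun. The Proxima Centauri row is left untouched because `otherwise` returns the existing column value rather than a literal such as "Sol".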

Answer 1 (score: 0)

I ended up creating a temp column and using when with that column:

var result = it.withColumn(
        "temp",
        functions.regexp_replace(col("Orbiting"),
                "Sol",
                "Sun"))

result = result.withColumn("Orbiting",
        `when`(col("Name").startsWith("Ve")
                .and(col("Name").contains("nus")),
                col("temp")).otherwise(col("Orbiting")))

result = result.drop(col("temp"))
result.show()

Result:

+-------------------+-----+----------------+--------------------+
|               Name|Order|        Orbiting|           Habitable|
+-------------------+-----+----------------+--------------------+
|            Mercury|    1|             Sol|                  No|
|              Venus|    2|             Sun|                  No|
|              Earth|    3|             Sol|                 Yes|
|               Mars|    4|             Sol|Only with terrafo...|
|            Jupiter|    5|             Sol|                  No|
|             Saturn|    6|             Sol|                  No|
|             Uranus|    7|             Sol|                  No|
|            Neptune|    8|             Sol|                  No|
|              Pluto|    9|             Sol|                  No|
|Proxima Centauri b |    1|Proxima Centauri|               Maybe|
+-------------------+-----+----------------+--------------------+