I have this dataset:
+-------+-----+--------+--------------------+
|   Name|Order|Orbiting|           Habitable|
+-------+-----+--------+--------------------+
|Mercury|    1|     Sol|                  No|
|  Venus|    2|     Sol|                  No|
|  Earth|    3|     Sol|                 Yes|
|   Mars|    4|     Sol|Only with terrafo...|
|Jupiter|    5|     Sol|                  No|
| Saturn|    6|     Sol|                  No|
| Uranus|    7|     Sol|                  No|
|Neptune|    8|     Sol|                  No|
|  Pluto|    9|     Sol|                  No|
+-------+-----+--------+--------------------+
I want to replace Sol with Sun in the Orbiting column if Name contains "us" and starts with "Ve".
I tried this:
var col = col("Name")
col = col.contains("us").and(col.startsWith("Ve"))
val result = dataset.withColumn(
    "Orbiting",
    functions.regexp_replace(col,
        "Sol",
        "Sun"))
But with this, I only see the result of the boolean filter:
+-------+-----+--------+--------------------+
|   Name|Order|Orbiting|           Habitable|
+-------+-----+--------+--------------------+
|Mercury|    1|   false|                  No|
|  Venus|    2|    true|                  No|
|  Earth|    3|   false|                 Yes|
|   Mars|    4|   false|Only with terrafo...|
|Jupiter|    5|   false|                  No|
| Saturn|    6|   false|                  No|
| Uranus|    7|   false|                  No|
|Neptune|    8|   false|                  No|
|  Pluto|    9|   false|                  No|
+-------+-----+--------+--------------------+
What I want to get is:
+-------+-----+--------+--------------------+
|   Name|Order|Orbiting|           Habitable|
+-------+-----+--------+--------------------+
|Mercury|    1|     Sol|                  No|
|  Venus|    2|     Sun|                  No|
|  Earth|    3|     Sol|                 Yes|
|   Mars|    4|     Sol|Only with terrafo...|
|Jupiter|    5|     Sol|                  No|
| Saturn|    6|     Sol|                  No|
| Uranus|    7|     Sol|                  No|
|Neptune|    8|     Sol|                  No|
|  Pluto|    9|     Sol|                  No|
+-------+-----+--------+--------------------+
But only when the value of Orbiting is Sol. For example, if it is Proxima Centauri, it should stay that way.
I also tried this:
var col = col("Name")
col = col.contains("us").and(col.startsWith("Ve"))
val result = dataset.withColumn(
    "Orbiting",
    `when`(col, "Sun").otherwise("Sol"))
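This works when Orbiting only has Sol as a value, but once a row has Proxima Centauri it no longer works, because every non-matching row is set to Sol and I cannot filter on the original value.
How can I solve this?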
Answer 0 (score: 2)
Try:
val result = it.withColumn("Orbiting",
    when(col("Name").startsWith("Ve") &&
         col("Name").contains("nus"),
        regexp_replace(col("Orbiting"), "Sol", "Sun"))
    .otherwise(col("Orbiting")))
Or, more strictly, the following, which matches the name exactly:
val result = it.withColumn("Orbiting",
    when(col("Name") === "Venus",
        regexp_replace(col("Orbiting"), "Sol", "Sun"))
    .otherwise(col("Orbiting")))
This requires the following import, of course:
import org.apache.spark.sql.functions._
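For anyone who wants to try this end to end, here is a minimal sketch that can be pasted into spark-shell or run as a Scala script. The local SparkSession setup, the reduced sample rows, and the names planets and matches are assumptions made for illustration, not part of the question. It uses "us" as in the question's own condition.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace, when}

// Local session for experimentation only (assumed setup, not from the question).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("orbiting-example")
  .getOrCreate()
import spark.implicits._

// A few sample rows, including one that does not orbit Sol.
val planets = Seq(
  ("Venus", 2, "Sol", "No"),
  ("Earth", 3, "Sol", "Yes"),
  ("Uranus", 7, "Sol", "No"),
  ("Proxima Centauri b", 1, "Proxima Centauri", "Maybe")
).toDF("Name", "Order", "Orbiting", "Habitable")

// Boolean Column: true only for names that start with "Ve" and contain "us".
val matches = col("Name").startsWith("Ve") && col("Name").contains("us")

// Apply the replacement to the Orbiting column only where the condition holds;
// every other row keeps its original Orbiting value.
val result = planets.withColumn(
  "Orbiting",
  when(matches, regexp_replace(col("Orbiting"), "Sol", "Sun"))
    .otherwise(col("Orbiting")))

result.show()
```

The key difference from the first attempt in the question is that regexp_replace receives col("Orbiting") rather than the boolean condition column, and the condition is only used inside when. Because otherwise falls back to col("Orbiting") instead of the literal "Sol", a row orbiting Proxima Centauri keeps its value. A matching name whose Orbiting value contains no "Sol" would also be left as it is, since regexp_replace finds nothing to replace.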
Answer 1 (score: 0)
I ended up creating a temp column and using when on that column:
var result = it.withColumn(
    "temp",
    functions.regexp_replace(col("Orbiting"),
        "Sol",
        "Sun"))
result = result.withColumn("Orbiting",
    `when`(col("Name").startsWith("Ve")
        .and(col("Name").contains("nus")),
        col("temp")).otherwise(col("Orbiting")))
result = result.drop(col("temp"))
result.show()
Result:
+-------------------+-----+----------------+--------------------+
|               Name|Order|        Orbiting|           Habitable|
+-------------------+-----+----------------+--------------------+
|            Mercury|    1|             Sol|                  No|
|              Venus|    2|             Sun|                  No|
|              Earth|    3|             Sol|                 Yes|
|               Mars|    4|             Sol|Only with terrafo...|
|            Jupiter|    5|             Sol|                  No|
|             Saturn|    6|             Sol|                  No|
|             Uranus|    7|             Sol|                  No|
|            Neptune|    8|             Sol|                  No|
|              Pluto|    9|             Sol|                  No|
|Proxima Centauri b |    1|Proxima Centauri|               Maybe|
+-------------------+-----+----------------+--------------------+