I have a dataframe df with two columns like this.
+-----+------------------+
|x | y |
+-----+------------------+
|0.0 |{12,16,17,18,19} |
|0.0 |{18,16,17,18,19} |
|0.0 |{15,16,67,18,19} |
|0.0 |{65,16,17,18,19} |
|0.0 |{9,16,17,18,19} |
|1.0 |{12,16,17,28,39} |
|0.0 |{24,16,17,28,19} |
|0.0 |{90,16,17,18,29} |
|1.0 |{30,16,17,18,19} |
|1.0 |{28,16,17,18,19} |
+-----+------------------+
From this I want to get something like:
+---+---+
|x |y |
+---+---+
|0 |12 |
|0 |18 |
|0 |15 |
|0 |65 |
|0 |9 |
|1 |12 |
|0 |24 |
|0 |90 |
|1 |30 |
|1 |28 |
+---+---+
I tried
println(df.withColumn("y", df("y".replace("{", "").replace("}", "").split(",")(0))).show)
Both columns are of string type, but it prints the same values in the y column. Any help is appreciated.
Answer 0 (score: 2)
Your attempt calls .replace and .split on the string literal "y" rather than on the column's values, so df("y"...) resolves back to the original column unchanged. You need to use Spark's built-in column functions instead. Here is an example:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("0.0", "{12,16,17,18,19}"),
  ("0.0", "{18,16,17,18,19}"),
  ("0.0", "{15,16,67,18,19}"),
  ("0.0", "{65,16,17,18,19}"),
  ("0.0", "{9,16,17,18,19}"),
  ("1.0", "{12,16,17,28,39}"),
  ("0.0", "{24,16,17,28,19}"),
  ("0.0", "{90,16,17,18,29}"),
  ("1.0", "{30,16,17,18,19}"),
  ("1.0", "{28,16,17,18,19}")
).toDF("x", "y")

// Strip the surrounding braces, split the remainder on commas,
// and keep the first element of the resulting array.
def firstItem(column: Column): Column = split(
  regexp_replace(column, "[{}]", ""),
  ","
).getItem(0)

df.withColumn("y", firstItem(df("y"))).show
This results in:
+---+---+
| x| y|
+---+---+
|0.0| 12|
|0.0| 18|
|0.0| 15|
|0.0| 65|
|0.0| 9|
|1.0| 12|
|0.0| 24|
|0.0| 90|
|1.0| 30|
|1.0| 28|
+---+---+
There is more information in the functions package documentation and the Column class documentation (the getItem method).
If you need a more complex transformation and the built-in functions are not enough, you can use a user-defined function (UDF). You can find more information about UDFs here.
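For illustration, here is a minimal sketch of the same extraction written as a UDF (the name firstItemUdf is invented for this example). For a case this simple the built-in functions above are preferable, since UDFs are opaque to Spark's optimizer:

import org.apache.spark.sql.functions.udf

// Hypothetical UDF: strip the braces, split on commas,
// and return the first element, all as a plain Scala function.
val firstItemUdf = udf { (s: String) =>
  s.stripPrefix("{").stripSuffix("}").split(",")(0)
}

df.withColumn("y", firstItemUdf(df("y"))).show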
Answer 1 (score: 1)
You can try regexp_extract, which here pulls out capture group 2, the run of digits immediately after the opening brace:
df.withColumn("y", regexp_extract($"y", "(\\{)([0-9]*)", 2)).show()
+---+---+
| x| y|
+---+---+
|0.0| 12|
|0.0| 18|
|0.0| 15|
|0.0| 65|
|0.0| 9|
|1.0| 12|
|0.0| 24|
|0.0| 90|
|1.0| 30|
|1.0| 28|
+---+---+
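Note that both answers leave x and y as strings. If you want numeric columns as in the desired output, a minimal sketch of a final cast, assuming Spark's standard cast method (x goes through double first, because a direct string-to-int cast of a value like "0.0" produces null):

import org.apache.spark.sql.functions.regexp_extract

// Extract y, then cast both columns to integers.
val typed = df
  .withColumn("y", regexp_extract($"y", "(\\{)([0-9]*)", 2).cast("int"))
  .withColumn("x", $"x".cast("double").cast("int"))
typed.show()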