How to convert a string in a column into another string

Asked: 2017-07-20 11:03:58

Tags: scala hadoop apache-spark

I have a DataFrame df with two columns, like this:

+-----+------------------+
|x    |       y          |
+-----+------------------+
|0.0  |{12,16,17,18,19}  |
|0.0  |{18,16,17,18,19}  |
|0.0  |{15,16,67,18,19}  |
|0.0  |{65,16,17,18,19}  |
|0.0  |{9,16,17,18,19}   |
|1.0  |{12,16,17,28,39}  |
|0.0  |{24,16,17,28,19}  |
|0.0  |{90,16,17,18,29}  |
|1.0  |{30,16,17,18,19}  |
|1.0  |{28,16,17,18,19}  |
+-----+------------------+

From this I want to get something like:
+---+---+
|x  |y  |
+---+---+
|0  |12 |
|0  |18 |
|0  |15 |
|0  |65 |
|0  |9  |
|1  |12 |
|0  |24 |
|0  |90 |
|1  |30 |
|1  |28 |
+---+---+

I have tried:

println(df.withColumn("y", df("y".replace("{", "").replace("}", "").split(",")(0))).show)

Both columns are of string type.

But it prints the same values in column y. Any help is appreciated.
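
(In that attempt, the replace/split calls run on the Scala string literal "y" itself, not on the column values, so the column lookup resolves back to the unchanged y column. A minimal sketch of what the expression actually evaluates to:

// Plain-Scala string methods applied to the literal "y" just return "y":
val name = "y".replace("{", "").replace("}", "").split(",")(0)   // name == "y"
// so the call is equivalent to df.withColumn("y", df("y")), i.e. a no-op.)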

2 answers:

Answer 0 (score: 2):

You need to use Spark's built-in Column functions. Here is an example:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

import spark.implicits._

val df = Seq(
  ("0.0", "{12,16,17,18,19}"),
  ("0.0", "{18,16,17,18,19}"),
  ("0.0", "{15,16,67,18,19}"),
  ("0.0", "{65,16,17,18,19}"),
  ("0.0", "{9,16,17,18,19}"), 
  ("1.0", "{12,16,17,28,39}"),
  ("0.0", "{24,16,17,28,19}"),
  ("0.0", "{90,16,17,18,29}"),
  ("1.0", "{30,16,17,18,19}"),
  ("1.0", "{28,16,17,18,19}")
).toDF("x", "y")

// Removes the braces, splits on ",", and returns the first element as a Column.
def firstItem(column: Column): Column = split(
  regexp_replace(column, "[{}]", ""),
  ","
).getItem(0)

df.withColumn("y", firstItem(df("y"))).show

Resulting in:

+---+---+
|  x|  y|
+---+---+
|0.0| 12|
|0.0| 18|
|0.0| 15|
|0.0| 65|
|0.0|  9|
|1.0| 12|
|0.0| 24|
|0.0| 90|
|1.0| 30|
|1.0| 28|
+---+---+

More information can be found in the documentation for the functions package and for the Column class (the getItem method).

If you need a more complex transformation and the built-in functions are not enough, you can use a user-defined function (UDF). You can find more information about UDFs here.
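
As a sketch (not part of the original answer), the same extraction written as a UDF could look like the following; the built-in functions above are usually preferable, since UDFs are opaque to Spark's optimizer:

import org.apache.spark.sql.functions.udf

// Sketch of a UDF that strips the braces and keeps the first comma-separated value.
val firstItemUdf = udf { (s: String) =>
  s.stripPrefix("{").stripSuffix("}").split(",")(0)
}

df.withColumn("y", firstItemUdf(df("y"))).show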

Answer 1 (score: 1):

You can try:

df.withColumn("y", regexp_extract($"y", "(\\{)([0-9]*)", 2)).show()

+---+---+
|  x|  y|
+---+---+
|0.0| 12|
|0.0| 18|
|0.0| 15|
|0.0| 65|
|0.0|  9|
|1.0| 12|
|0.0| 24|
|0.0| 90|
|1.0| 30|
|1.0| 28|
+---+---+
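
In this pattern, group 1 matches the opening brace and group 2 the digits, which is why group index 2 is passed to regexp_extract. An equivalent sketch that captures the digits directly as group 1:

df.withColumn("y", regexp_extract($"y", "\\{(\\d+)", 1)).show()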