Scala Spark DataFrame: explode a string column into multiple strings

Time: 2019-11-21 20:46:28

Tags: scala dataframe apache-spark

Any pointers on how to achieve the below?

Input df: col1 here is of type string

+----------------------------------+
|                              col1|
+----------------------------------+
|[{a:1,g:2},{b:3,h:4},{c:5,i:6}]   |
|[{d:7,j:8},{e:9,k:10},{f:11,l:12}]|
+----------------------------------+

Expected output (col1 is again of type string):

+-------------+
|        col1 |
+-------------+
|  {a:1,g:2}  |
|  {b:3,h:4}  |
|  {c:5,i:6}  |
|  {d:7,j:8}  |
|  {e:9,k:10} |
|  {f:11,l:12}|
+-------------+

Thanks!

4 answers:

Answer 0 (score: 2)

You can use Spark SQL's explode function together with a UDF:

import spark.implicits._
val df = spark.createDataset(Seq("[{a},{b},{c}]","[{d},{e},{f}]")).toDF("col1")
df.show()

+-------------+
|         col1|
+-------------+
|[{a},{b},{c}]|
|[{d},{e},{f}]|
+-------------+

import org.apache.spark.sql.functions._
// drop the leading "[" and trailing "]", then split the remainder on ","
val stringToSeq = udf{s: String => s.drop(1).dropRight(1).split(",")}
df.withColumn("col1", explode(stringToSeq($"col1"))).show()

+----+
|col1|
+----+
| {a}|
| {b}|
| {c}|
| {d}|
| {e}|
| {f}|
+----+

Edit: for your new input data, the custom UDF can evolve as follows:

val stringToSeq = udf{s: String =>
  // match each key:value group between the braces and wrap it back in "{...}"
  val extractor = "[^{]*:[^}]*".r
  extractor.findAllIn(s).map(m => s"{$m}").toSeq
}

New output:

+-----------+
|       col1|
+-----------+
|  {a:1,g:2}|
|  {b:3,h:4}|
|  {c:5,i:6}|
|  {d:7,j:8}|
| {e:9,k:10}|
|{f:11,l:12}|
+-----------+
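
Since the question requires col1 to remain of type string, a quick sanity check (a minimal sketch, assuming df now holds the question's input) is to print the schema after the explode:

df.withColumn("col1", explode(stringToSeq($"col1"))).printSchema()

// root
//  |-- col1: string (nullable = true)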

Answer 1 (score: 2)

Spark provides a quite rich trim function, which can be used to remove leading and trailing characters, [] in your case. As @LeoC already mentioned, the required functionality can be implemented with built-in functions, which will perform much better:

import spark.implicits._
import org.apache.spark.sql.functions.{trim, explode, split}

val df = Seq(
  ("[{a},{b},{c}]"),
  ("[{d},{e},{f}]")
).toDF("col1")

df.select(
  explode(
    split(
      trim($"col1", "[]"), ","))).show

// +---+
// |col|
// +---+
// |{a}|
// |{b}|
// |{c}|
// |{d}|
// |{e}|
// |{f}|
// +---+
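
Note that the exploded column comes out auto-named col. If you want it back as col1, matching the question's expected output, you can alias it (a minimal variation of the snippet above):

df.select(
  explode(
    split(
      trim($"col1", "[]"), ",")).as("col1")).show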

Edit:

For the new dataset the logic stays the same, except that you need to split on something other than the , character. You can use regexp_replace to replace }, with }| so that you can then split on | instead of ,:

import spark.implicits._
import org.apache.spark.sql.functions.{trim, explode, split, regexp_replace}

val df = Seq(
  ("[{a:1,g:2},{b:3,h:4},{c:5,i:6}]"),
  ("[{d:7,j:8},{e:9,k:10},{f:11,l:12}]")
).toDF("col1")

df.select(
  explode(
    split(
      regexp_replace(trim($"col1", "[]"), "},", "}|"), // gives: {a:1,g:2}|{b:3,h:4}|{c:5,i:6}
    "\\|")
  )
).show(false)

// +-----------+
// |col        |
// +-----------+
// |{a:1,g:2}  |
// |{b:3,h:4}  |
// |{c:5,i:6}  |
// |{d:7,j:8}  |
// |{e:9,k:10} |
// |{f:11,l:12}|
// +-----------+

Note: split(..., "\\|") escapes the |, which is a special regex character.
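
To see why the escape matters, here is a small plain-Scala illustration (no Spark needed): an unescaped | is regex alternation and matches the empty string at every position, while the escaped form matches the literal character:

"a|b".split("|")   // Array(a, |, b) -- "|" alone matches the empty string
"a|b".split("\\|") // Array(a, b)    -- matches the literal | character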

Answer 2 (score: 1)

You can do it like this:

import spark.implicits._

// strip the surrounding brackets, then split each line on "," and flatten
val newDF = df.as[String].flatMap(line =>
  line.replaceAll("\\[", "").replaceAll("\\]", "").split(","))
newDF.show()

Output:

+-----+
|value|
+-----+
|  {a}|
|  {b}|
|  {c}|
|  {d}|
|  {e}|
|  {f}|
+-----+

Just note that this process names the output column value; you can easily rename it if needed, using select, withColumn, etc.
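
For example, a minimal rename could look like this:

newDF.withColumnRenamed("value", "col1").show()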

Answer 3 (score: 1)

What finally worked:

import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.createDataset(Seq(
  "[{a:1,g:2},{b:3,h:4},{c:5,i:6}]",
  "[{d:7,j:8},{e:9,k:10},{f:11,l:12}]")).toDF("col1")
df.show()

// strip the leading "[{" and trailing "}]"
val removeParanthesis = udf((value: String) => value.slice(2, value.length() - 2))
// split the remainder on "},{" into an array of entries
val toStr = udf((value: String) => value.split("},\\{"))
// wrap each exploded entry back in braces
val addParanthesis = udf((value: String) => "{" + value + "}")

df
  .withColumn("col0", removeParanthesis(col("col1")))
  .withColumn("col2", toStr(col("col0")))
  .withColumn("col3", explode(col("col2")))
  .withColumn("col4", addParanthesis(col("col3")))
  .show()

Output:

+--------------------+--------------------+--------------------+---------+-----------+
|                col1|                col0|                col2|     col3|       col4|
+--------------------+--------------------+--------------------+---------+-----------+
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...|  a:1,g:2|  {a:1,g:2}|
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...|  b:3,h:4|  {b:3,h:4}|
|[{a:1,g:2},{b:3,h...|a:1,g:2},{b:3,h:4...|[a:1,g:2, b:3,h:4...|  c:5,i:6|  {c:5,i:6}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...|  d:7,j:8|  {d:7,j:8}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...| e:9,k:10| {e:9,k:10}|
|[{d:7,j:8},{e:9,k...|d:7,j:8},{e:9,k:1...|[d:7,j:8, e:9,k:1...|f:11,l:12|{f:11,l:12}|
+--------------------+--------------------+--------------------+---------+-----------+
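
If only the final column is needed, the intermediate columns can be dropped by selecting just the last step, e.g. (a minimal sketch reusing the UDFs above):

df
  .withColumn("col0", removeParanthesis(col("col1")))
  .withColumn("col2", toStr(col("col0")))
  .withColumn("col3", explode(col("col2")))
  .select(addParanthesis(col("col3")).as("col1"))
  .show()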