Question

How to flatten an array of strings into multiple rows of a DataFrame in Spark 2.2.0?

Input row

["foo", "bar"]

val inputDS = Seq("""["foo", "bar"]""").toDF

inputDS.printSchema()

root
 |-- value: string (nullable = true)

Input dataset inputDS

inputDS.show(false)

value
-----
["foo", "bar"]

Expected output dataset outputDS

value
-------
"foo" |
"bar" |

I tried the explode function as shown below, but it did not do the job:

inputDS.select(explode(from_json(col("value"), ArrayType(StringType))))

I get the following error:

org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(`value`)' due to data type mismatch: Input schema string must be a struct or an array of structs

I also tried the following, which fails with an error as well:

inputDS.select(explode(col("value")))

Answer 1

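The exception is thrown by

from_json(col("value"), ArrayType(StringType))

not by explode. Specifically:

Input schema array must be a struct or an array of structs.

Instead, you can strip the brackets and split the string yourself:

inputDS.selectExpr(
  "split(substring(value, 2, length(value) - 2), ',\\s+') as value")

and then explode the output.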
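To make the "split, then explode" idea concrete, here is a minimal sketch using the Column-based functions rather than a SQL expression string, so the regex is passed directly and no SQL string-literal escaping is involved. The object name `SplitExplodeSketch`, the helper `flatten`, and the use of `regexp_replace` to strip the brackets are my own assumptions, not part of the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, regexp_replace, split}

object SplitExplodeSketch {
  // Hypothetical helper: turns a string column like ["foo", "bar"] into one row per element.
  def flatten(df: DataFrame): DataFrame = {
    // Drop the leading "[" and the trailing "]" from the raw string
    val stripped = regexp_replace(col("value"), "^\\[|\\]$", "")
    // Split on a comma plus optional whitespace, then emit one row per array element
    df.select(explode(split(stripped, ",\\s*")).as("value"))
  }

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("split-explode").getOrCreate()
    import spark.implicits._

    val inputDS = Seq("""["foo", "bar"]""").toDF("value")
    flatten(inputDS).show(false) // the elements keep their double quotes
    spark.stop()
  }
}
```

Since this is plain string splitting rather than JSON parsing, it would break on elements that themselves contain commas.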

Answer 2

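You can simply use flatMap.

import spark.implicits._ // already in scope in spark-shell
val input = sc.parallelize(Array("foo", "bar")).toDS()
// Split each string on commas and flatten the pieces into separate rows
val out = input.flatMap(x => x.split(","))
out.collect.foreach(println)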
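The snippet above starts from strings that are already separate, so as a rough sketch, this is how the same flatMap idea might be applied to the question's single `["foo", "bar"]` string. The object name `FlatMapSketch`, the helper `flatten`, and the bracket stripping with `stripPrefix`/`stripSuffix` are assumptions of mine.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object FlatMapSketch {
  // Hypothetical helper: one output row per comma-separated element of each input string.
  def flatten(ds: Dataset[String]): Dataset[String] = {
    import ds.sparkSession.implicits._ // supplies the Encoder[String] that flatMap needs
    ds.flatMap { (s: String) =>
      // Remove the surrounding brackets, then split on a comma plus optional whitespace
      s.stripPrefix("[").stripSuffix("]").split(",\\s*").toSeq
    }
  }

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("flatmap-sketch").getOrCreate()
    import spark.implicits._

    val inputDS = Seq("""["foo", "bar"]""").toDS()
    flatten(inputDS).show(false) // the elements keep their double quotes
    spark.stop()
  }
}
```

Unlike the built-in column functions, the lambda passed to flatMap is opaque to the query optimizer, which may matter on large datasets.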

Answer 3

The issue above should be fixed in Spark 2.4.0 (https://jira.apache.org/jira/browse/SPARK-24391), so you can use from_json($"column_nm", ArrayType(StringType)) without any problem.
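For completeness, a minimal sketch of what that could look like end to end, assuming Spark 2.4.0 or later. The object name `FromJsonSketch` and the helper `flatten` are hypothetical, and the column is called `value` as in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

object FromJsonSketch {
  // Hypothetical helper: parse the column as a JSON array of strings, one row per element.
  // Assumes a Spark version in which from_json accepts ArrayType(StringType).
  def flatten(df: DataFrame): DataFrame =
    df.select(explode(from_json(col("value"), ArrayType(StringType))).as("value"))

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("from-json-sketch").getOrCreate()
    import spark.implicits._

    val inputDS = Seq("""["foo", "bar"]""").toDF("value")
    flatten(inputDS).show(false)
    spark.stop()
  }
}
```

Because the string is parsed as JSON here, the elements come back as plain strings without the surrounding double quotes, unlike the split-based approaches.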
