I have the following input.
Input:
[level:1,firstFile:one,secondFile:secone,Flag:NA][level:1,firstFile:two,secondFile:sectwo,Flag:NA][level:2,firstFile:three,secondFile:secthree,Flag:NA]
It produces the output below, which is correct:
List(List(one, two), List(three))
List(List(secone, sectwo), List(secthree))
However, when I pass the following input:
[level:1,firstFile:one,four,secondFile:secone,Flag:NA][level:1,firstFile:two,secondFile:sectwo,Flag:NA][level:2,firstFile:three,secondFile:secthree,Flag:NA]
I get:
List(List(), List(two), List(three))
List(List(), List(sectwo), List(secthree))
But the expected output is:
List(List(one, four, two), List(three))
List(List(secone, sectwo), List(secthree))
Code:
val validJsonRdd = sc.parallelize(Seq(input)).flatMap(x => x.replace(",", "\",\"").replace(":", "\":\"").replace("[", "{\"").replace("]", "\"}").replace("}{", "}&{").split("&"))
import org.apache.spark.sql.functions._
val df = spark.read.json(validJsonRdd).orderBy("level").groupBy("level")
.agg(collect_list("firstFile").as("firstFile"), collect_list("secondFile").as("secondFile"))
.select(collect_list("firstFile").as("firstFile"), collect_list("secondFile").as("secondFile"))
val rdd = df.collect().map(row => (row(0).asInstanceOf[Seq[Seq[String]]], row(1).asInstanceOf[Seq[Seq[String]]]))
val first = rdd(0)._1.map(x => x.toList).toList
val second = rdd(0)._2.map(x => x.toList).toList
val firstInputcolumns = first.map(_.filterNot(_ == null))
val secondInputcolumns= second.map(_.filterNot(_ == null))
println(firstInputcolumns)
println(secondInputcolumns)
Please help me correct the code.
Answer 0: (score: 1)
It looks like your replacements do not produce valid JSON. If you run them on the second input, the first record becomes:
{"level":"1","firstFile":"one","four","secondFile":"secone","Flag":"NA"}
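But a JSON object is a collection of key-value pairs, so "four" cannot stand on its own like that. If you want firstFile to map to a list, one and four should be wrapped in square brackets, and the JSON should look like this: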
{"level":"1","firstFile":["one","four"],"secondFile":"secone","Flag":"NA"}
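One way to get JSON of that shape is to parse each record token by token instead of chaining string replacements. The following is only a minimal sketch in plain Scala, not a drop-in fix: the names RecordsToJson, arrayKeys, splitRecords and recordToJson are made up for illustration, and it assumes that a token without a colon is an extra value for the preceding key and that values contain no quotes or backslashes.

```scala
object RecordsToJson {
  // Assumption: keys listed here are always emitted as JSON arrays, so that
  // every record gives the field the same type.
  val arrayKeys: Set[String] = Set("firstFile")

  // Wraps a string in double quotes. No escaping is done (assumes plain values).
  private def quote(s: String): String = "\"" + s + "\""

  // Splits "[a][b][c]" into the record bodies "a", "b", "c".
  def splitRecords(input: String): Seq[String] =
    input.trim.stripPrefix("[").stripSuffix("]").split("\\]\\[").toSeq

  // Converts one record body, e.g. "level:1,firstFile:one,four", to a JSON object.
  def recordToJson(record: String): String = {
    val fields = record.split(",").foldLeft(Vector.empty[(String, Vector[String])]) {
      (acc, token) =>
        token.split(":", 2) match {
          // "key:value" starts a new field.
          case Array(key, value) => acc :+ (key -> Vector(value))
          // A bare token is appended to the values of the most recent key.
          case Array(value) if acc.nonEmpty =>
            val (key, values) = acc.last
            acc.init :+ (key -> (values :+ value))
          case _ => acc
        }
    }
    fields.map { case (key, values) =>
      val rendered =
        if (arrayKeys(key) || values.size > 1) values.map(quote).mkString("[", ",", "]")
        else quote(values.head)
      quote(key) + ":" + rendered
    }.mkString("{", ",", "}")
  }

  def main(args: Array[String]): Unit = {
    val input = "[level:1,firstFile:one,four,secondFile:secone,Flag:NA]" +
      "[level:1,firstFile:two,secondFile:sectwo,Flag:NA]"
    splitRecords(input).map(recordToJson).foreach(println)
    // First line printed:
    // {"level":"1","firstFile":["one","four"],"secondFile":"secone","Flag":"NA"}
  }
}
```

The resulting strings could be passed to spark.read.json in place of the output of the replace chain. Because firstFile would then be an array column, collect_list("firstFile") would produce nested arrays per level, so you would probably need to wrap it in flatten (available in Spark 2.4 and later) to get List(one, four, two).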