I want to write JSON objects in Spark, but when I convert them to an RDD with sc.parallelize, they are turned back into strings.
import scala.util.parsing.json._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.lit
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
val df = Seq(
  (2012, 8, "Batman", 9.8),
  (2012, 9, "Batman", 10.0),
  (2012, 8, "Hero", 8.7),
  (2012, 10, "Hero", 5.7),
  (2012, 2, "Robot", 5.5),
  (2011, 7, "Git", 2.0),
  (2010, 1, "Dom", 2.0),
  (2019, 3, "Sri", 2.0)).toDF("year", "month", "title", "rating")
case class Rating(year:Int, month:Int, title:String, rating:Double)
import scala.collection.JavaConversions._
val ratingList = df.as[Rating].collectAsList
import java.io._
val output = for (c <- ratingList) yield {
  val json = ("record" ->
    ("year" -> c.year) ~
    ("month" -> c.month) ~
    ("title" -> c.title) ~
    ("rating" -> c.rating))
  compact(render(json))
}
output.foreach(println)
At this stage it is a JSON object and everything looks fine. But when I convert it to an RDD, Spark treats it as a string:
val outputDF = sc.parallelize(output).toDF("json")
outputDF.show()
outputDF.write.mode("overwrite").json("s3://location/")
The output is:
{"test":{"json":"{\"record\":{\"year\":2012,\"month\":8,\"title\":\"Batman\",\"rating\":9.8}}"}}
Answer 0 (score: 1)
When you call compact, you create a String from the rendered JSON.
See:
scala> val json = ("name" -> "joe") ~ ("age" -> 35)
scala> compact(render(json))
res2: String = {"name":"joe","age":35}
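This means your output is a collection of strings, so when you parallelize it you get an RDD[String].
You may want to return the result of render instead, which would give you a collection of JSON objects (json4s JValue instances) rather than strings. Check the documentation to be sure, though.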
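To make the difference concrete, here is a minimal sketch using the same json4s imports as the question. The names `tree` and `text` are only illustrative, and the explicit type annotations are there to show what each call returns.

```scala
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Build a JSON object with the json4s DSL.
val json = ("name" -> "joe") ~ ("age" -> 35)

// render keeps the value as a json4s AST node (a JValue), not text.
val tree: JValue = render(json)

// compact serializes that AST into a plain String.
val text: String = compact(tree)

println(text)
```

So a collection built from `compact(...)` results holds String elements, while one built from `render(...)` results holds JValue elements.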
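Of course, Spark does not know how to turn JSON objects from a third-party library into a DataFrame with toDF(). You could probably do something like this instead: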
val anotherPeopleRDD = sc.parallelize(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
So basically, keep an RDD[String] and then read it as JSON.
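Applied to the strings from the question, that idea might look like the sketch below. It assumes Spark 2.2 or later, where `spark.read.json` also accepts a Dataset[String]. The local SparkSession and the two sample strings are stand-ins for illustration; in spark-shell, `spark` already exists and `output` would come from the question's code.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-strings-sketch")
  .getOrCreate()
import spark.implicits._

// Stand-in for the question's `output`: compact JSON strings, one per record.
val output = Seq(
  """{"record":{"year":2012,"month":8,"title":"Batman","rating":9.8}}""",
  """{"record":{"year":2011,"month":7,"title":"Git","rating":2.0}}"""
)

// Parse the strings as JSON; Spark infers a nested schema for "record".
val parsed = spark.read.json(output.toDS())
parsed.printSchema()

// Writing `parsed` as JSON would emit nested objects, not escaped strings:
// parsed.write.mode("overwrite").json("s3://location/")
```

Because `parsed` has a real struct column instead of a string column, the JSON writer has no quotes to escape.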
And by the way, why do you first do this:
val ratingList = df.as[Rating].collectAsList
val output = for (c <- ratingList) yield {
  val json = ("record" ->
    ("year" -> c.year) ~
    ("month" -> c.month) ~
    ("title" -> c.title) ~
    ("rating" -> c.rating))
  compact(render(json))
}
and then:
val outputDF = sc.parallelize(output).toDF("json")
Why not process all the data directly on the cluster, without collecting it to the driver first:
df.as[Rating].map { c =>
  val json = ("record" ->
    ("year" -> c.year) ~
    ("month" -> c.month) ~
    ("title" -> c.title) ~
    ("rating" -> c.rating))
  compact(render(json))
}
That would be more efficient.
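A different route, not the one proposed above, is to skip json4s entirely and let Spark build the nesting with the built-in `struct` function. The sketch below assumes a local SparkSession and a two-row sample of the question's data. The name `nested` is illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("struct-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (2012, 8, "Batman", 9.8),
  (2011, 7, "Git", 2.0)
).toDF("year", "month", "title", "rating")

// Nest the four columns under a single struct column named "record".
val nested = df.select(
  struct(col("year"), col("month"), col("title"), col("rating")).as("record"))

// Preview the JSON lines Spark would write for this DataFrame.
nested.toJSON.collect().foreach(println)

// Each row is written as one nested JSON object, all on the executors:
// nested.write.mode("overwrite").json("s3://location/")
```

This keeps the data distributed end to end and leaves JSON serialization to the DataFrame writer.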