I don't know whether this question is a duplicate, but somehow none of the answers I have come across seem to work for me (maybe I am doing something wrong).
I have a case class defined as:
case class myRec(
  time: String,
  client_title: String,
  made_on_behalf: Double,
  country: String,
  email_address: String,
  phone: String)
and a sample JSON file containing records/objects of the form [{...}{...}{...}...], i.e.
[{"time": "2015-05-01 02:25:47",
"client_title": "Mr.",
"made_on_behalf": 0,
"country": "Brussel",
"email_address": "15e29034@gmail.com"},
{"time": "2015-05-01 04:15:03",
"client_title": "Mr.",
"made_on_behalf": 0,
"country": "Bundesliga",
"email_address": "aae665d95c5d630@aol.com"},
{"time": "2015-05-01 06:29:18",
"client_title": "Mr.",
"made_on_behalf": 0,
"country": "Japan",
"email_address": "fef412c714ff@yahoo.com"}...]
My build.sbt contains
libraryDependencies += "com.owlike" % "genson-scala_2.11" % "1.3"
for scalaVersion := "2.11.7".
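For context, the relevant build.sbt lines look roughly like this; the spark-core dependency below is an assumption about how Spark is pulled in, and its version may differ:

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  // Assumed Spark dependency; match the version to your installation
  "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
  "com.owlike" % "genson-scala_2.11" % "1.3"
)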
and I have a Scala function defined as:

//PS: Other imports already made
import com.owlike.genson.defaultGenson_

//PS: Spark context already defined
def prepData(infile: String): RDD[myRec] = {
  val input = sc.textFile(infile)
  // Read the JSON data into my record case class
  input.mapPartitions(records =>
    records.map(record => fromJson[myRec](record))
  )
}
and I am calling the function with
prepData("file://path/to/abc.json")
Is there a way to do this, or is there another JSON library I can use to convert the file into an RDD? I have also tried this, and neither approach seems to work.
PS: I don't want to go through Spark SQL to process the JSON file.
Thanks!
Answer 0 (score: 3)
Jyd, not using Spark SQL for JSON is an interesting choice, but it is very much doable. There is an example of how to do this in the examples for the Learning Spark book (disclaimer: I am one of the co-authors, so a little biased). The examples are on GitHub at https://github.com/databricks/learning-spark, but here is the relevant code snippet:
import org.apache.spark.SparkContext
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

case class Person(name: String, lovesPandas: Boolean) // Note: must be a top-level class

object BasicParseJsonWithJackson {
  def main(args: Array[String]) {
    if (args.length < 3) {
      println("Usage: [sparkmaster] [inputfile] [outputfile]")
      sys.exit(1)
    }
    val master = args(0)
    val inputFile = args(1)
    val outputFile = args(2)
    val sc = new SparkContext(master, "BasicParseJsonWithJackson", System.getenv("SPARK_HOME"))
    val input = sc.textFile(inputFile)

    // Parse it into a specific case class. We use mapPartitions because:
    // (a) ObjectMapper is not serializable, so we would either have to create a singleton object
    //     encapsulating ObjectMapper on the driver and send data back to the driver to go through
    //     the singleton, or let each node create its own ObjectMapper, which is expensive in a map.
    // (b) To create an ObjectMapper on each node without it being too expensive, we create one per
    //     partition with mapPartitions. This solves both the serialization and the object-creation cost.
    val result = input.mapPartitions(records => {
      // mapper object created on each executor node
      val mapper = new ObjectMapper with ScalaObjectMapper
      mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
      mapper.registerModule(DefaultScalaModule)
      // We use flatMap to handle errors by returning an empty result (None) if we
      // encounter an issue and a single-element result if everything is ok (Some(_)).
      records.flatMap(record => {
        try {
          Some(mapper.readValue(record, classOf[Person]))
        } catch {
          case e: Exception => None
        }
      })
    }, true)

    result.filter(_.lovesPandas).mapPartitions(records => {
      val mapper = new ObjectMapper with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)
      records.map(mapper.writeValueAsString(_))
    })
      .saveAsTextFile(outputFile)
  }
}
Note that this uses Jackson (specifically the "com.fasterxml.jackson.core" % "jackson-databind" % "2.3.3" and "com.fasterxml.jackson.module" % "jackson-module-scala_2.10" % "2.3.3" dependencies).
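Since the question builds against Scala 2.11, here is a hedged sketch of the corresponding build.sbt additions; the dependencies named above target Scala 2.10, and the _2.11 swap noted in the comment is an assumption, so check Maven Central for the exact artifact and version that match your Scala version:

libraryDependencies ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.3.3",
  // As given in the answer (built against Scala 2.10):
  "com.fasterxml.jackson.module" % "jackson-module-scala_2.10" % "2.3.3"
  // Assumption: for the question's Scala 2.11 build you would use the matching
  // jackson-module-scala_2.11 artifact instead, which may require a newer version line.
)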
I just noticed that your question has some sample input, and as @zero323 pointed out, line-by-line parsing isn't going to work. Instead you would do:
val input = sc.wholeTextFiles(inputFile).map(_._2)

// Parse it into a specific case class. We use mapPartitions because:
// (a) ObjectMapper is not serializable, so we would either have to create a singleton object
//     encapsulating ObjectMapper on the driver and send data back to the driver to go through
//     the singleton, or let each node create its own ObjectMapper, which is expensive in a map.
// (b) To create an ObjectMapper on each node without it being too expensive, we create one per
//     partition with mapPartitions. This solves both the serialization and the object-creation cost.
val result = input.mapPartitions(records => {
  // mapper object created on each executor node
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)
  // We use flatMap to handle errors by returning an empty result (None) if we
  // encounter an issue and the parsed list of records if everything is ok.
  records.flatMap(record => {
    try {
      // readValue[List[Person]] (from ScalaObjectMapper) keeps the element type,
      // unlike classOf[List[Person]], which loses it to erasure
      mapper.readValue[List[Person]](record)
    } catch {
      case e: Exception => None
    }
  })
})
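Adapting that second snippet to the question's own myRec case class and prepData signature might look roughly like the sketch below. It is not tested against the sample file; it assumes the Jackson imports from the snippets above, the sc SparkContext from the question, and that you are happy letting Jackson handle the phone field, which is missing from the sample records:

def prepData(infile: String): RDD[myRec] = {
  // Read each whole file so the multi-line JSON array stays intact
  val input = sc.wholeTextFiles(infile).map(_._2)
  input.mapPartitions { records =>
    // One ObjectMapper per partition, as in the answer above
    val mapper = new ObjectMapper with ScalaObjectMapper
    mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
    mapper.registerModule(DefaultScalaModule)
    records.flatMap { record =>
      try {
        // The whole array is parsed in one go and flattened into the RDD
        mapper.readValue[List[myRec]](record)
      } catch {
        case e: Exception => Nil // skip files that fail to parse
      }
    }
  }
}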
Answer 1 (score: 1)
Just for fun, you can try splitting individual documents using a specific delimiter. Although it won't work for complex nested documents, it should handle the sample input without using wholeTextFiles:
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.conf.Configuration
import net.liftweb.json.{parse, JObject, JField, JString, JInt}

case class MyRec(
  time: String,
  client_title: String,
  made_on_behalf: Double,
  country: String,
  email_address: String)

@transient val conf = new Configuration
conf.set("textinputformat.record.delimiter", "},\n{")

def clean(s: String) = {
  val p = "(?s)\\[?\\{?(.*?)\\}?\\]?".r
  s match {
    case p(x) => Some(s"{$x}")
    case _ => None
  }
}

def toRec(os: Option[String]) = {
  os match {
    case Some(s) =>
      for {
        JObject(o) <- parse(s);
        JField("time", JString(time)) <- o;
        JField("client_title", JString(client_title)) <- o;
        JField("made_on_behalf", JInt(made_on_behalf)) <- o;
        JField("country", JString(country)) <- o;
        JField("email_address", JString(email)) <- o
      } yield MyRec(time, client_title, made_on_behalf.toDouble, country, email)
    case _ => Nil
  }
}

val records = sc.newAPIHadoopFile("some.json",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, txt) => clean(txt.toString) }
  .flatMap(toRec)
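A quick sanity check, assuming this runs in spark-shell where sc is available and "some.json" is replaced by the actual path:

// Pull a couple of parsed records back to the driver for inspection
records.take(2).foreach(println)
// Expected to print MyRec(...) lines built from the sample data, for example:
// MyRec(2015-05-01 02:25:47,Mr.,0.0,Brussel,15e29034@gmail.com)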