I am trying to use Spark to process JSON data with a variable structure (nested JSON). The input can be very large, with more than 1000 keys per row, and a single batch can exceed 20 GB. The entire batch is generated from 30 data sources; the "key2" of each JSON identifies its source, and the structure for each source is predefined.
What is the best way to process such data? I have tried from_json as shown below, but it only works with a fixed schema, and to use it I would first have to group the data by source and then apply the schema. Because of the data volume, my preferred approach is to scan the data only once and extract the required values from each source based on its predefined schema.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val data = sc.parallelize(
  """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
    :: Nil)
val df = data.toDF

val schema = (new StructType)
  .add("key1", StringType)
  .add("key2", StringType)
  .add("key3", (new StructType)
    .add("key3_k1", StringType))

df.select(from_json($"value", schema).as("json_str"))
  .select($"json_str.key3.key3_k1").collect

res17: Array[org.apache.spark.sql.Row] = Array([xxx])
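Ideally I would like to do something like the rough sketch below in a single pass, picking the schema per row from key2 (the second source and its schema here are made up just to illustrate the idea):

import org.apache.spark.sql.functions._

// Hypothetical predefined schemas, one per source identified by key2
val source1Schema = (new StructType)
  .add("key1", StringType)
  .add("key3", (new StructType).add("key3_k1", StringType))
val source2Schema = (new StructType)
  .add("key1", StringType)
  .add("key4", (new StructType).add("key4_k1", StringType))

// One scan over the raw strings: route each row to its source's schema;
// rows from other sources simply get null in the unmatched columns
val parsed = df
  .withColumn("source", get_json_object($"value", "$.key2"))
  .withColumn("src1", when($"source" === "source1", from_json($"value", source1Schema)))
  .withColumn("src2", when($"source" === "source2", from_json($"value", source2Schema)))

parsed.select($"source", $"src1.key3.key3_k1", $"src2.key4.key4_k1").show(false)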
Answer 0 (score: 4)
This is just a restatement of @Ramesh Maharjan's answer, but with more modern Spark syntax.
I found this method lurking in DataFrameReader. It lets you parse JSON strings held in a Dataset[String] into an arbitrary DataFrame, taking advantage of the same schema inference Spark gives you with spark.read.json("filepath") when reading directly from a JSON file. The schema of each line can be completely different.
def json(jsonDataset: Dataset[String]): DataFrame
Example usage:
val jsonStringDs = spark.createDataset[String](
  Seq(
    ("""{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}"""),
    ("""{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}""")))
jsonStringDs.show
jsonStringDs:org.apache.spark.sql.Dataset[String] = [value: string]
+----------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------+
|{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}|
|{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"} |
+----------------------------------------------------------------------------------------------------------------------+
val df = spark.read.json(jsonStringDs)
df.show(false)
df:org.apache.spark.sql.DataFrame = [CEO: string, address: struct ... 6 more fields]
+----------+------------------+-------------+---------+--------+------------+------+------------+
|CEO |address |employeeCount|firstname|lastname|marketCap |name |revenue |
+----------+------------------+-------------+---------+--------+------------+------+------------+
|null |[London,Baker,121]|null |Sherlock |Holmes |null |null |null |
|Jeff Bezos|null |500000 |null |null |817117000000|Amazon|177900000000|
+----------+------------------+-------------+---------+--------+------------+------+------------+
This method is available from Spark 2.2.0: http://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader@json(jsonDataset:org.apache.spark.sql.Dataset[String]):org.apache.spark.sql.DataFrame
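If your rows carry a source identifier like key2 from the question, you can still do per-source extraction after this inference step. A hedged sketch (assuming a hypothetical rawJsonDs: Dataset[String] holding the mixed-source JSON lines, and using the question's field names rather than this answer's example data):

// Parse everything in one pass; the inferred schema is the union of every source's fields.
// Fields a given source does not emit come back as null, so per-source extraction
// is just a filter plus a select.
val all = spark.read.json(rawJsonDs)
val source1 = all.filter($"key2" === "source1").select($"key1", $"key3.key3_k1")
source1.show(false)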
Answer 1 (score: 1)
I am not sure whether my suggestion can help you, although I had a similar case and solved it as follows:
1) So my idea is to use json rapture (or some other JSON library) to load the JSON schema dynamically. For instance, you could read the first row of the JSON file to discover the schema (similar to what I do here with jsonSchema).
2) Generate the schema dynamically. First iterate over the dynamic fields (notice that I project the values of key3 as Map[String, String]) and add a StructField to the schema for each of them.
3) Apply the generated schema to the dataframe.
import rapture.json._
import jsonBackends.jackson._
val jsonSchema = """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1", "key3_k2":"key3_v2", "key3_k3":"key3_v3"}}"""
val json = Json.parse(jsonSchema)
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.{StringType, StructType}
val schema = ArrayBuffer[StructField]()
//we could do this dynamic as well with json rapture
schema.appendAll(List(StructField("key1", StringType), StructField("key2", StringType)))
val items = ArrayBuffer[StructField]()
json.key3.as[Map[String, String]].foreach {
  case (k, v) => items.append(StructField(k, StringType))
}
val complexColumn = new StructType(items.toArray)
schema.append(StructField("key3", complexColumn))
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("dynamic-json-schema").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val jsonDF = spark.read.schema(StructType(schema.toList)).json("""your_path\data.json""")
jsonDF.select("key1", "key2", "key3.key3_k1", "key3.key3_k2", "key3.key3_k3").show()
I used the following data as input:
{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v11", "key3_k2":"key3_v21", "key3_k3":"key3_v31"}}
{"key1":"val2","key2":"source2","key3":{"key3_k1":"key3_v12", "key3_k2":"key3_v22", "key3_k3":"key3_v32"}}
{"key1":"val3","key2":"source3","key3":{"key3_k1":"key3_v13", "key3_k2":"key3_v23", "key3_k3":"key3_v33"}}
Output:
+----+-------+--------+--------+--------+
|key1| key2| key3_k1| key3_k2| key3_k3|
+----+-------+--------+--------+--------+
|val1|source1|key3_v11|key3_v21|key3_v31|
|val2|source2|key3_v12|key3_v22|key3_v32|
|val3|source3|key3_v13|key3_v23|key3_v33|
+----+-------+--------+--------+--------+
An advanced alternative, which I have not tested yet, would be to generate a case class, e.g. called JsonRow, from the JSON schema in order to have a strongly typed dataset. Apart from making your code more maintainable, this provides better serialization performance. To make it work you first need to create a JsonRow.scala file, and then implement an sbt pre-build script that modifies the content of JsonRow.scala dynamically based on your source files (you might have more than one, of course). To generate the class JsonRow dynamically you can use the following code:
def generateClass(members: Map[String, String], name: String): Any = {
  val classMembers = for (m <- members) yield {
    s"${m._1}: String"
  }
  val classDef = s"""case class ${name}(${classMembers.mkString(",")});scala.reflect.classTag[${name}].runtimeClass"""
  classDef
}
The method generateClass accepts a map of strings for creating the class members, plus the class name itself. The members of the generated class can again be populated from your json schema:
import org.codehaus.jackson.node.{ObjectNode, TextNode}
import collection.JavaConversions._
val mapping = collection.mutable.Map[String, String]()
val fields = json.$root.value.asInstanceOf[ObjectNode].getFields
for (f <- fields) {
  (f.getKey, f.getValue) match {
    case (k: String, v: TextNode)   => mapping(k) = v.asText
    case (k: String, v: ObjectNode) => v.getFields.foreach(f => mapping(f.getKey) = f.getValue.asText)
    case _ => None
  }
}
val dynClass = generateClass(mapping.toMap, "JsonRow")
println(dynClass)
Which prints out:
case class JsonRow(key3_k2: String,key3_k1: String,key1: String,key2: String,key3_k3: String);scala.reflect.classTag[JsonRow].runtimeClass
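To actually turn that string into a class at runtime you would still have to compile it, for instance with the Scala reflection toolbox. This is an untested sketch, in the same spirit as the untested alternative above (it needs scala-compiler on the classpath):

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

val toolbox = currentMirror.mkToolBox()

// The generated string ends with classTag[JsonRow].runtimeClass, so evaluating
// it returns the runtime Class of the freshly compiled case class
val jsonRowClass = toolbox.eval(toolbox.parse(dynClass.asInstanceOf[String])).asInstanceOf[Class[_]]
println(jsonRowClass)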
Good luck!
Answer 2 (score: 0)
If you have the data as you mentioned in the question,

val data = sc.parallelize(
  """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
    :: Nil)

you don't need to create a schema for the json data. Spark SQL can infer the schema from the json string. You just have to use SQLContext.read.json as below:

val df = sqlContext.read.json(data)

For the rdd data used above, you will get the schema as below:
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: struct (nullable = true)
| |-- key3_k1: string (nullable = true)
You can select key3_k1 as:
df.select("key3.key3_k1").show(false)
//+-------+
//|key3_k1|
//+-------+
//|key3_v1|
//+-------+
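Since key2 identifies each source in the question, you can also work per source directly on the inferred dataframe, for example (just a small sketch; source1 is the only source present in this sample data):

// key2 identifies the source, so splitting or counting per source needs no schema
df.groupBy("key2").count().show(false)

// keep one source's rows, then select the fields that source is known to contain
df.filter($"key2" === "source1").select($"key1", $"key3.key3_k1").show(false)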
You can manipulate the dataframe further as you need. I hope the answer is helpful.