I have a file with data like this:
<1>2019-03-20T20:59:59Z daily_report.txt[102852]: { "ts": "1553115599", "data": {"field1": "value11", "field21": "value12"} }
<2>2019-03-20T20:59:59Z daily_report.txt[102852]: { "ts": "1553115599", "data": {"field1": "value21", "field2": "value22"} }
<3>2019-03-20T20:59:59Z daily_report.txt[102852]: { "ts": "1553115599", "data": {"field1": "value31", "field2": "value32"} }
Normally in Spark I could just do spark.read.json("inputs.json"), but since each line is prefixed with garbage, I can't. Is there a way around this where I either chop off the front, or, even better, include the garbage as a column in the DataFrame?
Answer 0 (score: 2)
You have to read the data in as a Dataset[String] and parse out the columns yourself. Once that's done, create a schema for your json data and use Spark's built-in from_json() function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._  // for $-syntax and toDF; already in scope in spark-shell
val ds = spark.createDataset(Seq(
"<1>2019-03-20T20:59:59Z daily_report.txt[102852]: { \"ts\": \"1553115599\", \"data\": {\"field1\": \"value11\", \"field2\": \"value12\"} }",
"<2>2019-03-20T20:59:59Z daily_report.txt[102852]: { \"ts\": \"1553115599\", \"data\": {\"field1\": \"value21\", \"field2\": \"value22\"} }",
"<3>2019-03-20T20:59:59Z daily_report.txt[102852]: { \"ts\": \"1553115599\", \"data\": {\"field1\": \"value31\", \"field2\": \"value32\"} }"
))
// or read straight from the file:
//val ds = spark.read.text("inputs.txt").as[String]
val schema = StructType(List(
  StructField("ts", StringType),
  StructField("data", StructType(List(
    StructField("field1", StringType),
    StructField("field2", StringType))))))
val df = ds.map(r => {
  // j = index of the character just before the first "{" (the separating space)
  val j = r.indexOf("{") - 1
  // everything before that is the garbage prefix; the rest is the JSON payload
  (r.substring(0, j), r.substring(j, r.length))
}).toDF("garbage", "json")
df.withColumn("data", from_json($"json", schema)).select("garbage", "data").show(false)
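As a side note, the same split can be done without a typed map, using only built-in column functions (a sketch; it assumes every line contains a "{", and uses the standard Spark SQL functions instr() and substring() via expr()):
import org.apache.spark.sql.functions.{expr, from_json}
// slice each line around the first "{" entirely with SQL expressions
val dfAlt = ds.toDF("line")
  .withColumn("garbage", expr("substring(line, 1, instr(line, '{') - 2)"))
  .withColumn("json", expr("substring(line, instr(line, '{'), length(line))"))
  .withColumn("data", from_json($"json", schema))
  .select("garbage", "data")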
Using your sample data (with field21 corrected to field2), you get:
+-------------------------------------------------+------------------------------+
|garbage |data |
+-------------------------------------------------+------------------------------+
|<1>2019-03-20T20:59:59Z daily_report.txt[102852]:|[1553115599,[value11,value12]]|
|<2>2019-03-20T20:59:59Z daily_report.txt[102852]:|[1553115599,[value21,value22]]|
|<3>2019-03-20T20:59:59Z daily_report.txt[102852]:|[1553115599,[value31,value32]]|
+-------------------------------------------------+------------------------------+
with the schema:
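(the tree below is what printSchema() on the same result would print; a sketch of the call:)
df.withColumn("data", from_json($"json", schema)).select("garbage", "data").printSchema()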
root
|-- garbage: string (nullable = true)
|-- data: struct (nullable = true)
| |-- ts: string (nullable = true)
| |-- data: struct (nullable = true)
| | |-- field1: string (nullable = true)
| | |-- field2: string (nullable = true)
If you really don't need the garbage data, you can pass the Dataset[String] to the spark.read.json() you're already used to. This doesn't require defining a schema, since it can be inferred:
val data = spark.read.json(df.select("json").as[String])
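Note that spark.read.json infers top-level fields in alphabetical order, so the inferred result has data before ts. A quick sketch of what you'd see:
data.printSchema()
// root
//  |-- data: struct (nullable = true)
//  |  |-- field1: string (nullable = true)
//  |  |-- field2: string (nullable = true)
//  |-- ts: string (nullable = true)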
Answer 1 (score: 0)
An alternative approach: you can derive the schema dynamically from a sample JSON record, and strip the garbage prefix with the regex function regexp_extract() (the lazy .*? skips over the prefix and group 1 captures everything from the first "{" to the end of the line).
Check this out:
scala> val df = Seq(( """<1>2019-03-20T20:59:59Z daily_report.txt[102852]: { "ts": "1553115599", "data": {"field1": "value11", "field2": "value12"} }"""),
| ("""<2>2019-03-20T20:59:59Z daily_report.txt[102852]: { "ts": "1553115599", "data": {"field1": "value21", "field2": "value22"} }"""),
| ("""<3>2019-03-20T20:59:59Z daily_report.txt[102852]: { "ts": "1553115599", "data": {"field1": "value31", "field2": "value32"} }""")).toDF("data_garb")
df: org.apache.spark.sql.DataFrame = [data_garb: string]
scala> val json_str = """{ "ts": "1553115599", "data": {"field1": "value11", "field2": "value12"} }"""
json_str: String = { "ts": "1553115599", "data": {"field1": "value11", "field2": "value12"} }
scala> val dfj = spark.read.json(Seq(json_str).toDS)
dfj: org.apache.spark.sql.DataFrame = [data: struct<field1: string, field2: string>, ts: string]
scala> dfj.schema
res44: org.apache.spark.sql.types.StructType = StructType(StructField(data,StructType(StructField(field1,StringType,true), StructField(field2,StringType,true)),true), StructField(ts,StringType,true))
scala> val df2=df.withColumn("newc",regexp_extract('data_garb,""".*?(\{.*)""",1)).withColumn("newc",from_json('newc,dfj.schema)).drop("data_garb")
df2: org.apache.spark.sql.DataFrame = [newc: struct<data: struct<field1: string, field2: string>, ts: string>]
scala> df2.show(false)
+--------------------------------+
|newc |
+--------------------------------+
|[[value11, value12], 1553115599]|
|[[value21, value22], 1553115599]|
|[[value31, value32], 1553115599]|
+--------------------------------+
The wildcard lets you select the individual fields:
scala> df2.select($"newc.*").show(false)
+------------------+----------+
|data |ts |
+------------------+----------+
|[value11, value12]|1553115599|
|[value21, value22]|1553115599|
|[value31, value32]|1553115599|
+------------------+----------+
Or you can query the nested fields by naming them explicitly:
scala> df2.select($"newc.ts",$"newc.data.field1",$"newc.data.field2").show(false)
+----------+-------+-------+
|ts |field1 |field2 |
+----------+-------+-------+
|1553115599|value11|value12|
|1553115599|value21|value22|
|1553115599|value31|value32|
+----------+-------+-------+
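As a side note, on Spark 2.4+ the spark.read.json round trip for the schema can be skipped: schema_of_json() derives it from the sample record directly. A sketch under that version assumption, reusing df and json_str from above:
import org.apache.spark.sql.functions.{from_json, lit, regexp_extract, schema_of_json}
// derive the schema from the sample record and parse in one pass
val df3 = df.withColumn("newc",
    from_json(regexp_extract('data_garb, """.*?(\{.*)""", 1), schema_of_json(lit(json_str))))
  .drop("data_garb")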