I'm very new to Spark and am having trouble converting an RDD to a DataFrame. What I'm trying to do is take a log file, convert it to JSON with an existing jar (which returns a string), and then turn the resulting JSON into a DataFrame. Here's what I have so far:
val serverLog = sc.textFile("/Users/Downloads/file1.log")
val jsonRows = serverLog.mapPartitions(partition => {
  val txfm = new JsonParser // jar that parses log lines to JSON
  partition.map(line => {
    Row(txfm.parseLine(line))
  })
})
When I run take(2) on this, I get output like:
[{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]
[{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}]
Here's where my problem comes in. I create a schema and try to build the DataFrame:
val schema = StructType(Array(
  StructField("pwh", StringType, true),
  StructField("sVe", StringType, true), ...))

val jsonDf = sqlSession.createDataFrame(jsonRows, schema)
The error it returns is:
java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true) AS _pwh#0
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
Can someone tell me what I'm doing wrong here? Most of the SO answers I've found say I can use createDataFrame or toDF(), but I've had no luck with either. I also tried converting the RDD to a JavaRDD, but that didn't work either. I'd appreciate any insight you can offer.
Answer 0 (score: 0)
The schema you defined works for an RDD whose data looks like:
{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}
{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}
If you can change your RDD so the data looks like
{"logs": [{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]}
then use this schema:
val schema = StructType(Seq(
  StructField("logs", ArrayType(StructType(Seq(
    StructField("pwh", StringType, true),
    StructField("sVe", StringType, true), ...))
  ))
))

// Note: read.json expects an RDD[String] of JSON documents here,
// so jsonRows must contain raw JSON strings rather than Rows.
sqlContext.read.schema(schema).json(jsonRows)
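Alternatively, if you want to keep the createDataFrame(rowRDD, schema) route from the question: each Row has to carry one value per StructField. The original code put the entire JSON string into a single-field Row, which is why the encoder throws ArrayIndexOutOfBoundsException: 1 when it looks for a second column. A hedged sketch, reusing jsonStrings and flatSchema from the sketch above and assuming the json4s library that ships with Spark (all field values here are strings):

import org.apache.spark.sql.Row
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Parse each JSON object into its fields and build a Row whose arity
// matches the schema, so createDataFrame can encode it column-for-column.
val rows = jsonStrings.mapPartitions { partition =>
  implicit val formats: Formats = DefaultFormats
  partition.map { js =>
    val m = parse(js).extract[Map[String, String]]
    Row(m.get("pwh").orNull, m.get("sVe").orNull,
        m.get("psh").orNull, m.get("udt").orNull)
  }
}

val rowDf = sqlSession.createDataFrame(rows, flatSchema)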