Spark createDataFrame fails with ArrayIndexOutOfBoundsException

Asked: 2017-03-21 18:25:52

Tags: scala apache-spark spark-dataframe rdd

I'm very new to Spark and am having trouble converting an RDD to a DataFrame. What I'm trying to do is take a log file, convert it to JSON using an existing jar (which returns a String), and then turn the resulting JSON into a DataFrame. Here is what I have so far:

import org.apache.spark.sql.Row

val serverLog = sc.textFile("/Users/Downloads/file1.log")
val jsonRows = serverLog.mapPartitions(partition => {
  val txfm = new JsonParser // jar that parses a log line into a JSON string
  partition.map(line => {
    Row(txfm.parseLine(line)) // each Row holds a single String column
  })
})

When I run take(2) on this, I get something like:

[{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]
[{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}]

Here is where my problem comes in. I created a schema and tried to create the df:

import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("pwh", StringType, true),
  StructField("sVe", StringType, true), ...))

val jsonDf = sqlSession.createDataFrame(jsonRows, schema)

The error that comes back is:

java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true) AS _pwh#0
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
:  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
:  :  +- input[0, org.apache.spark.sql.Row, true]
:  +- 0
:- null

Can someone tell me what I'm doing wrong here? Most of the SO answers I've found say I can use createDataFrame or toDF(), but I've had no luck with either. I also tried converting the RDD to a JavaRDD, but that didn't work either. Appreciate any insight you can offer.

1 Answer:

Answer 0 (score: 0)

The schema you defined expects multi-field rows, but each Row in your jsonRows holds just one String column (the entire JSON line), so the encoder fails with ArrayIndexOutOfBoundsException: 1 as soon as it tries to read the second field. The schema you defined would fit an RDD of JSON strings like:

{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}
{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}

If you can change your RDD so that the data looks like

{"logs": [{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]}

then use this schema:

val schema = StructType(Seq(
  StructField("logs", ArrayType(StructType(Seq(
    StructField("pwh", StringType, true),
    StructField("sVe", StringType, true), ...))
  ))
))

sqlContext.read.schema(schema).json(jsonRows)
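One caveat: read.json consumes an RDD[String] (or, in later Spark versions, a Dataset[String]), not the RDD[Row] built in the question, so the wrapping has to happen on plain strings. A hedged end-to-end sketch of this nested variant; the string-interpolation wrapper and the explode step are my additions, not part of the original answer:

import org.apache.spark.sql.functions.{col, explode}

// Wrap each parsed JSON object in a {"logs": [...]} envelope as a plain
// String; read.json cannot consume an RDD[Row].
val wrapped = serverLog.mapPartitions { partition =>
  val txfm = new JsonParser
  partition.map(line => s"""{"logs": [${txfm.parseLine(line)}]}""")
}

val nestedDf = sqlContext.read.schema(schema).json(wrapped)

// If you then want one row per log entry, un-nest the array:
val logsDf = nestedDf.select(explode(col("logs")).as("log")).select("log.*")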