IndexOutOfBounds error in Zeppelin

Date: 2017-07-19 12:39:58

Tags: java scala apache-spark apache-zeppelin

I'm running into a problem in Zeppelin: whenever I try to run a SQL query against a temp table (DataFrame) I created, I always get an IndexOutOfBounds error.

Here is my code:

import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset
import org.apache.spark.sql.SparkSession
//import sqlContext._

val realdata = sc.textFile("/root/application.txt")

case class testClass(date: String, time: String, level: String, unknown1: String, unknownConsumer: String, unknownConsumer2: String, vloer: String, tegel: String, msg: String, sensor1: String, sensor2: String, sensor3: String, sensor4: String, sensor5: String, sensor6: String, sensor7: String, sensor8: String, batchsize: String, troepje1: String, troepje2: String)

val mapData = realdata
  .filter(line => line.contains("data") && line.contains("INFO"))
  .map(s => s.split(" ").toList)
  .map(s =>
    testClass(
      s(0),
      s(1).split(",")(0),
      s(1).split(",")(1),
      s(3),
      s(4),
      s(5),
      s(6),
      s(7),
      s(8),
      s(15),
      s(16),
      s(17),
      s(18),
      s(19),
      s(20),
      s(21),
      s(22),
      "",
      "",
      ""
    )
  ).toDF
//mapData.count()
//mapData.printSchema()
mapData.registerTempTable("temp_carefloor")

Then in the next notebook I tried something simple:

%sql
select * from temp_carefloor limit 10

And I get the following error:

java.lang.IndexOutOfBoundsException: 18
    at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
    at scala.collection.immutable.List.apply(List.scala:84)
    at $line128330188484.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$3.apply(<console>:84)
    at $line128330188484.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$3.apply(<console>:72)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:748)

Now I'm sure it has something to do with the way my data comes out, but I can't figure out what I'm doing wrong, and I'm really stuck here. I really hope someone can help me.

Edit: here is an excerpt of the useful data I'm trying to extract.

2016-03-10 07:18:58,985 INFO [http-nio-8080-exec-1] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, true, false, false, false]
2016-03-10 07:18:58,992 INFO [http-nio-8080-exec-7] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, false, false, false, false]
2016-03-10 07:18:59,907 INFO [http-nio-8080-exec-4] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, false, false, false, false]
2016-03-10 07:19:10,418 INFO [http-nio-8080-exec-9] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [true, true, false, false, false, false, false, false]

You can view the complete flat file here: http://upload.grecom.nl/uploads/jeffrey/application.txt
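
For reference, a minimal sketch (plain Scala, with one sample line from the excerpt above pasted in as a string) that prints how such a line tokenizes under split(" "):

// Tokenize one sample log line exactly the way the notebook code does.
val sample = "2016-03-10 07:18:58,985 INFO [http-nio-8080-exec-1] " +
  "n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] " +
  "Floor 12FR received update from tile: 12G0, data = " +
  "[false, false, false, false, true, false, false, false]"

val tokens = sample.split(" ").toList

// Print each index/token pair to see which positions exist.
tokens.zipWithIndex.foreach { case (tok, i) => println(f"$i%2d -> $tok") }

This sample line happens to yield 23 tokens (indices 0 through 22), just enough for the accesses up to s(22); any matching line in the full file that yields fewer tokens makes an access like s(18) throw exactly the IndexOutOfBoundsException above.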

1 Answer:

Answer 0 (score: 2)

As we discussed in the comments, the problem is in how the data is split: you cannot split this data on a single space (" ").

One solution is to split the data using a regex like " data = |tile: |[|]| |,".

You have to include all the separators in the regex (even the ones you don't want to end up as substrings in the extracted fields, as I did with " data = ").
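
A minimal sketch of that idea (two assumptions added here: the literal [ and ] are escaped as \\[ and \\] for the Java/Scala regex engine, and the empty strings left behind by adjacent delimiters are filtered out):

// Split one sample line with the suggested multi-delimiter regex.
// \\[ and \\] match the literal brackets; String.split keeps internal
// empty tokens where delimiters touch, so they are dropped afterwards.
val sample = "2016-03-10 07:19:10,418 INFO [http-nio-8080-exec-9] " +
  "n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] " +
  "Floor 12FR received update from tile: 12G0, data = " +
  "[true, true, false, false, false, false, false, false]"

val fields = sample.split(" data = |tile: |\\[|\\]| |,").filter(_.nonEmpty)

// Prints clean fields: date, time, millis, level, thread, logger, source,
// the message words, the tile id, and the eight sensor booleans.
fields.zipWithIndex.foreach { case (field, i) => println(f"$i%2d -> $field") }

Note that "," also splits the timestamp "07:19:10,418" into time and milliseconds, so the separate s(1).split(",") step from the question is no longer needed.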

Hope this helps. Best regards.