Question

I created a table in Spark using the following commands:

case class trip(trip_id: String, duration: String, start_date: String, start_station: String, start_terminal: String, end_date: String,
    end_station: String, end_terminal: String, bike: String, subscriber_type: String, zipcode: String)

val trip_data = sc.textFile("/user/sankha087_gmail_com/trip_data.csv")

val tripDF = trip_data.map(x => x.split(",")).filter(x => (x(1) != "Duration")).map(x => trip(x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9),
    x(10))).toDF()

tripDF.registerTempTable("tripdatas")

sqlContext.sql("select * from tripdatas").show()

If I run the query above (i.e. select *), I get the desired result, but if I run the following query, I get the exception below:

sqlContext.sql("select count(1) from tripdatas").show()

18/03/07 17:59:55 ERROR scheduler.TaskSetManager: Task 1 in stage 2.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 6, datanode1-cloudera.mettl.com, executor 1): java.lang.ArrayIndexOutOfBoundsException: 10
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$3.apply(:31)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$3.apply(:31)

Answer 1

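Check your data. If any row in the data splits into fewer than 11 elements, you will see this error, because x(10) then points past the end of the array.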
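One common way a row ends up with fewer than 11 elements is a blank last field: `String.split(",")` discards trailing empty strings, so a row whose final column is empty produces only 10 elements. Whether that is what happens in this particular file is an assumption; the two rows below are made up for illustration and are not taken from trip_data.csv. It would also be consistent with `select *` succeeding, since `show()` displays only the first 20 rows by default and does not have to parse every line, whereas `count(1)` does.

```scala
object SplitDemo {
  // Hypothetical sample rows with 11 columns each (not real trip data).
  val fullRow = "1,2,3,4,5,6,7,8,9,10,11"
  val blankLastRow = "1,2,3,4,5,6,7,8,9,10," // last column is empty

  // split(",") silently drops trailing empty strings.
  def fieldCount(line: String): Int = line.split(",").length

  // A negative limit tells split to keep trailing empty strings.
  def fieldCountKeepEmpty(line: String): Int = line.split(",", -1).length

  def main(args: Array[String]): Unit = {
    println(fieldCount(fullRow))               // 11
    println(fieldCount(blankLastRow))          // 10, so x(10) would throw
    println(fieldCountKeepEmpty(blankLastRow)) // 11
  }
}
```

If this turns out to be the cause, changing `x.split(",")` to `x.split(",", -1)` in the question's code would keep the empty last column instead of dropping it.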

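You can check how many columns Spark detects in the file this way. Note that `spark` is the SparkSession available from Spark 2.0 onward, and that `columns.length` reports the width of the DataFrame's schema, not the smallest number of fields found in any single row.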

val trip_data = spark.read.csv("/user/sankha087_gmail_com/trip_data.csv")
println(trip_data.columns.length)
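To see how many fields each individual row has, one option is to tally the lines by field count. The sketch below does this on a plain `Seq[String]` so it runs without a cluster; the names `fieldCountHistogram` and `malformed` are hypothetical helpers, and the simple comma split assumes no field contains a quoted comma.

```scala
object FieldCountCheck {
  // Tally how many lines have each number of comma-separated fields.
  // The -1 limit keeps trailing empty fields so blank last columns are counted.
  def fieldCountHistogram(lines: Seq[String]): Map[Int, Int] =
    lines.groupBy(_.split(",", -1).length).map { case (n, rows) => n -> rows.size }

  // Return the lines whose field count differs from the expected one.
  def malformed(lines: Seq[String], expected: Int): Seq[String] =
    lines.filter(_.split(",", -1).length != expected)

  def main(args: Array[String]): Unit = {
    // Made-up three-column sample, for illustration only.
    val sample = Seq("a,b,c", "a,b,", "a,b")
    println(fieldCountHistogram(sample))
    println(malformed(sample, 3))
  }
}
```

On the RDD from the question, the same tally could be computed with `trip_data.map(_.split(",", -1).length).countByValue()`, and rows of the wrong width could be dropped by adding `.filter(_.length == 11)` after the split.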
