Apache Spark Streaming Error: Attempted to use a BlockRDD after its blocks have been removed

Time: 2017-02-12 06:09:05

Tags: apache-spark apache-spark-sql

I am trying to run an Apache Spark Streaming program that receives a stream of data, does some processing, saves the result in a cache, and compares the cached data against the next stream. The program executes the first batch successfully, then exits on the next stream with the error below.
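
For context, a minimal sketch of the kind of setup the stack trace implies (a socketTextStream receiver created in StreamExample.java, Spark 1.x API); the host, port, and batch interval below are assumptions, not values from the question:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setAppName("StreamExample");
// Each batch interval produces one BlockRDD of received data.
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
SQLContext sqlContext = new SQLContext(jssc.sparkContext());

// The DStream whose per-batch BlockRDDs back the DataFrames below.
JavaReceiverInputDStream<String> stream = jssc.socketTextStream("localhost", 9999);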

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: TungstenExchange hashpartitioning(
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
        at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: org.apache.spark.SparkException: Attempted to use BlockRDD[1] at socketTextStream at StreamExample.java:56 after its blocks have been removed!
        at org.apache.spark.rdd.BlockRDD.assertValid(BlockRDD.scala:83)
        at org.apache.spark.rdd.BlockRDD.getPreferredLocations(BlockRDD.scala:56)
        at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
        at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1556)

I am joining the incoming streaming data to the existing cached data, performing some operations with join statements, and caching the result again:

static DataFrame active = null;

stream.foreachRDD(rdd -> {
    DataFrame x = sqlContext.read().json(rdd);
    if (active == null) {
        active = x;
    } else {
        DataFrame f = active.join(x);
        // ... other operations using join statements ...
        active.persist();
    }
});
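
The likely cause: sqlContext.read().json(rdd) is lazy, so the cached active DataFrame still references the batch's underlying BlockRDD. Spark Streaming removes a batch's receiver blocks shortly after the batch completes, so when a later batch's join re-evaluates the plan, the BlockRDD is gone. A hedged sketch of one common workaround is to persist and force materialization inside the same batch, before the blocks are cleaned up; unionAll here is only a stand-in for the question's join logic:

stream.foreachRDD(rdd -> {
    DataFrame x = sqlContext.read().json(rdd);
    DataFrame previous = active;
    // Stand-in for the question's join/comparison logic.
    active = (previous == null) ? x : previous.unionAll(x);
    active.persist();
    // An action inside the batch forces the new DataFrame into the cache
    // while this batch's receiver blocks still exist.
    active.count();
    if (previous != null) {
        previous.unpersist();  // drop the stale cached copy after the new one is built
    }
});

If executors later evict cached partitions, Spark would still try to recompute them from the (now removed) BlockRDD; writing each batch out (for example with DataFrame.write().parquet(...)) and reading it back is a sturdier way to cut the lineage entirely.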

0 Answers:

No answers