I am trying to run an Apache Spark Streaming program that receives a stream of data, does some processing, saves it in a cache, and compares the data with the next stream. My program handles the first batch fine, but on the next stream it exits with the error below.
org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
execute, tree: TungstenExchange hashpartitioning(
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task creation failed: org.apache.spark.SparkException:
Attempted to use BlockRDD[1] at socketTextStream at
StreamExample.java:56 after its blocks have been removed!
org.apache.spark.rdd.BlockRDD.assertValid(BlockRDD.scala:83)
org.apache.spark.rdd.BlockRDD.getPreferredLocations(BlockRDD.scala:56)
org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
scala.Option.getOrElse(Option.scala:120)
org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545)
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1556)
I am trying to join the streaming data to the existing cached data, perform some operations using join statements, and cache the result again.
static DataFrame active = null;

// inside the stream's foreachRDD(...) callback:
DataFrame x = sqlContext.read().json(rdd);
if (active == null)
{
    // first batch: keep it as the cached DataFrame
    active = x;
}
else
{
    // join the new batch with the previously cached data
    DataFrame f = active.join(x);
    // other functions using join statements
    active.persist();
}
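For reference, here is a minimal, self-contained sketch of the setup described above, written against the Spark 1.6-era DataFrame/SQLContext API that the stack trace suggests. The class name StreamExample and the socket source come from the trace; the host, port, and batch interval are assumptions for illustration, and the foreachRDD body simply mirrors the snippet above, not the actual program.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamExample {
    // DataFrame kept across batches and joined with each new batch
    static DataFrame active = null;

    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("StreamExample");
        // batch interval is an assumption for illustration
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // socket source, matching the "socketTextStream at StreamExample.java" frame in the trace
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> rdd) {
                SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
                DataFrame x = sqlContext.read().json(rdd);
                if (active == null) {
                    // first batch: keep it as the cached DataFrame
                    active = x;
                } else {
                    // compare/join the new batch with the cached data
                    DataFrame f = active.join(x);
                    // ... other functions using join statements ...
                    active.persist();
                }
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
}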