Occasionally joining two DataFrames throws "java.lang.IllegalArgumentException: spark.sql.execution.id is already set"

Time: 2017-04-20 13:30:35

Tags: scala apache-spark apache-spark-sql spark-dataframe

I am joining two DataFrames, created by reading two very large CSV files, to compute some statistics. The code runs on a web server and is triggered by requests, which is why the Spark session is kept alive at all times and sparkSession.close() is never called.

Sporadically, the code throws java.lang.IllegalArgumentException: spark.sql.execution.id is already set. I have tried to make sure the code is not executed more than once at a time, but the problem has not gone away.
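For clarity, here is a minimal sketch of the kind of guard I mean; the object and method names (StatsService, computeStats) are placeholders, not the real code:

// Placeholder sketch: serialize all Spark actions behind a single lock so
// that only one web request at a time can trigger an execution.
object StatsService {
  private val sparkLock = new Object

  def computeStats(spark: org.apache.spark.sql.SparkSession): Unit =
    sparkLock.synchronized {
      // read the CSVs, join, aggregate, and call show()/collect() here
    }
}

Even with every request funneled through a lock like this, the exception still appears from time to time.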

I am using Spark 2.1.0, and I know there is an issue reported here that will hopefully be fixed in Spark 2.2.0.

Can you suggest any workaround to avoid this problem in the meantime?
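One workaround I have come across (unverified against Spark 2.1.0, so treat it as an assumption) rests on the observation that spark.sql.execution.id is tracked as a SparkContext local property backed by an inheritable thread-local, so a pooled web-server thread can inherit a stale value from an earlier request. Clearing the property before triggering an action is supposed to avoid the stale id; the helper name withClearedExecutionId is made up for illustration:

// Unverified workaround sketch: clear the possibly-stale execution id that a
// pooled thread may have inherited before running any Spark action on it.
def withClearedExecutionId[T](spark: org.apache.spark.sql.SparkSession)(body: => T): T = {
  // setLocalProperty(key, null) removes the local property for the current thread
  spark.sparkContext.setLocalProperty("spark.sql.execution.id", null)
  body
}

// Usage: withClearedExecutionId(spark) { someDataFrame.show() }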

A simplified version of the code that throws the exception:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("application").master("local[*]").getOrCreate()
val itemCountry = spark.read.format("csv")
  .option("header", "true")
  .schema(StructType(Array(
    StructField("itemId", IntegerType, false),
    StructField("countryId", IntegerType, false))))
  .csv("/item_country.csv") // This file matches the schema provided
val itemPerformance = spark.read.format("csv")
  .option("header", "true")
  .schema(StructType(Array(
    StructField("itemId", IntegerType, false),
    StructField("date", TimestampType, false),
    StructField("performance", IntegerType, false))))
  .csv("/item_performance.csv") // This file matches the schema provided

itemCountry.join(itemPerformance, itemCountry("itemId") === itemPerformance("itemId"))
  .groupBy("countryId")
  .agg(sum(when(to_date(itemPerformance("date")) > to_date(lit("2017-01-01")), itemPerformance("performance"))
    .otherwise(0)).alias("performance"))
  .show()

The stack trace of the exception:

java.lang.IllegalArgumentException: spark.sql.execution.id is already set
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2778)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2351)
at .... [Custom caller functions]

Sample CSV files:

item_country.csv

itemId,countryId
1,1
2,1
3,2
4,3

item_performance.csv

itemId,date,performance
1,2017-04-15,10
1,2017-04-16,10
1,2017-04-17,10
2,2017-04-15,15
3,2017-04-20,12
4,2017-04-18,18
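For reference, every date in item_performance.csv falls after 2017-01-01, so when the job does succeed the aggregation sums everything: countryId 1 gets 10+10+10+15 = 45 (items 1 and 2), countryId 2 gets 12 (item 3), and countryId 3 gets 18 (item 4). show() prints roughly the following (row order may vary):

+---------+-----------+
|countryId|performance|
+---------+-----------+
|        1|         45|
|        2|         12|
|        3|         18|
+---------+-----------+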

0 Answers