I am joining two DataFrames, created by reading two very large CSV files, to compute some statistics. The code runs on a web server and is triggered by requests, which is why the Spark session is kept alive at all times and sparkSession.close() is never called.
Sporadically, the code throws java.lang.IllegalArgumentException: spark.sql.execution.id is already set. I have tried to make sure the code is never executed more than once at a time, but the problem persists.
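Roughly, the guard I put in place looks like the following (simplified sketch; JobGuard and runExclusively are just illustrative names, not my real code):

// Simplified sketch of my attempt to serialize Spark executions: every
// request funnels through one JVM-wide lock before running a job.
object JobGuard {
  private val jobLock = new Object
  def runExclusively[T](body: => T): T = jobLock.synchronized {
    body // at most one Spark job runs at a time
  }
}

All request handlers wrap the join/aggregation in JobGuard.runExclusively { ... }, yet the exception still shows up from time to time.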
I am using Spark 2.1.0, and I know there is an issue here that will hopefully be fixed in Spark 2.2.0.
Can you suggest any workaround to avoid this problem in the meantime?
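One workaround I was considering is clearing the thread-local property named in the exception before each execution, though I am unsure whether this is safe:

// Possible workaround I was considering (unverified): remove the execution id
// that may have leaked into this thread before starting a new job.
// Passing null to setLocalProperty removes the property.
spark.sparkContext.setLocalProperty("spark.sql.execution.id", null)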
A simplified version of the code that throws the exception:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("application").master("local[*]").getOrCreate()

// Item-to-country mapping
val itemCountry = spark.read
  .option("header", "true")
  .schema(StructType(Array(
    StructField("itemId", IntegerType, false),
    StructField("countryId", IntegerType, false))))
  .csv("/item_country.csv") // This file matches the schema provided

// Daily performance per item
val itemPerformance = spark.read
  .option("header", "true")
  .schema(StructType(Array(
    StructField("itemId", IntegerType, false),
    StructField("date", TimestampType, false),
    StructField("performance", IntegerType, false))))
  .csv("/item_performance.csv") // This file matches the schema provided

// Sum performance per country, counting only rows dated after 2017-01-01
itemCountry.join(itemPerformance, itemCountry("itemId") === itemPerformance("itemId"))
  .groupBy("countryId")
  .agg(sum(when(to_date(itemPerformance("date")) > to_date(lit("2017-01-01")),
    itemPerformance("performance")).otherwise(0)).alias("performance"))
  .show()
Stack trace of the exception:
java.lang.IllegalArgumentException: spark.sql.execution.id is already set
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2778)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2351)
at .... [Custom caller functions]
Sample CSV files:
item_country.csv
itemId,countryId
1,1
2,1
3,2
4,3
item_performance.csv
itemId,date,performance
1,2017-04-15,10
1,2017-04-16,10
1,2017-04-17,10
2,2017-04-15,15
3,2017-04-20,12
4,2017-04-18,18
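For reference, every date in item_performance.csv is later than 2017-01-01, so all rows are counted, and the query should produce the following on this sample data (country 1: items 1 and 2, 30 + 15 = 45; country 2: item 3; country 3: item 4; row order may vary):

+---------+-----------+
|countryId|performance|
+---------+-----------+
|        1|         45|
|        2|         12|
|        3|         18|
+---------+-----------+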