I have the following Scala snippet, which reflects something I used to do in Spark 2.1.1:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val headers = Seq(StructField("A", StringType), StructField("B", StringType), StructField("C", StringType))
val data = Seq(Seq("A1", "B1", "C1"), Seq("A2", "B2", "C2"), Seq("A3", "B3", "C3"))
val rdd = sc.parallelize(data).map(Row.fromSeq)
sqlContext.createDataFrame(rdd, StructType(headers)).registerTempTable("TEMP_DATA")
val table = sqlContext.table("TEMP_DATA")

// select first, then filter on a column that the projection dropped
table
  .select("A")
  .filter(table("B") === "B1")
  .show()
In 2.3.1, this raises the following error:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) B#1604 missing from A#1603 in operator !Filter (B#1604 = B1).;;
!Filter (B#1604 = B1)
+- AnalysisBarrier
+- Project [A#1603]
+- SubqueryAlias temp_data
+- LogicalRDD [A#1603, B#1604, C#1605], false
Swapping select and filter fixes the problem (see the sketch below). My question is: why did this change? I need an explanation for why this happens, ideally with links to documentation that supports it.
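For reference, a minimal sketch of the reordered version that works for me in 2.3.1 (filter first, then select), using the same table as above:

// filter while column B is still in the plan, then project A
table
  .filter(table("B") === "B1")
  .select("A")
  .show()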
My understanding is that the DataFrame returned by select effectively contains only column A, so you can no longer filter on B. I tried to reproduce the problem in PySpark, but there it seems to work fine.
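That reading is consistent with the schema of the projected DataFrame; a small check (same table as above, expected output shown in comments):

// After select("A"), the resulting plan exposes only column A,
// so B is no longer available to a subsequent filter.
table.select("A").printSchema()
// root
//  |-- A: string (nullable = true)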
Here is the stack trace:
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3301)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)