Spark SQL detects a normal inner join as a cross join

Asked: 2018-10-25 16:36:18

Tags: scala apache-spark join apache-spark-sql

My CSV data looks something like the following, representing questions and answers, where the latter reference the question they answer:

Id,Type,Parent
1,Q,
2,Q,
3,A,1
4,A,2
5,A,1
6,Q,
7,A,6

I am trying to extract the answers and join each of them with its respective question. Here is my code:

val posts = spark.read.option("header", "true").csv("posts.csv")
                 .as[(String, String, String)]
                 .map{case (id, kind, parent) =>
                         (id.toInt, kind, if (parent != null) parent.toInt else -1)
                     }

val questions = posts.filter(_._2 == "Q")
val answers = posts.filter(_._2 == "A")

val joined = answers.joinWith(questions, answers("_3") === questions("_1"))

Unfortunately, this does not work. Spark complains that my INNER JOIN is in fact a CROSS JOIN, but I do not understand why it says so. The join condition is not trivial, and the join should produce one row per answer (four rows for the sample data above), matching the row count of the left operand. It is definitely not a cross product.

Here is the error message:

org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these relations,
or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;
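
A likely cause, judging from how Spark resolves self-joins (this is an assumption on my part, not something the error states): questions and answers both filter the same posts Dataset, so their columns carry the same underlying attribute IDs. When the analyzer deduplicates the right side of the join, answers("_3") and questions("_1") can both end up bound to the left side; the condition then degenerates into a filter on answers alone, gets pushed below the join, and the join itself is left with no condition, which is exactly what the cartesian-product check rejects. The shared lineage can be inspected via the expression IDs (the IDs in the comments are illustrative, not from an actual run):

println(answers("_3").expr)   // e.g. _3#5, an attribute of posts
println(questions("_1").expr) // e.g. _1#3, an attribute that answers also exposes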

Edit: I checked Why does spark think this is a cross/cartesian join, but using alias and cache as shown below did not change the error:

val questions = posts.filter(_._2 == "Q").alias("q").cache
val answers = posts.filter(_._2 == "A").alias("a").cache
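
For reference, here is a minimal sketch of one common workaround, assuming the goal is simply to pair each answer with its question: give each side distinct column names before joining, so the condition can no longer collapse onto a single side of the self-join. The q_/a_ names are made up for illustration:

import spark.implicits._

// Renaming the tuple columns breaks the shared attribute IDs between
// the two sides, so the join condition stays non-trivial after analysis.
val questionsDF = questions.toDF("q_id", "q_kind", "q_parent")
val answersDF   = answers.toDF("a_id", "a_kind", "a_parent")

val joined = answersDF.join(questionsDF, $"a_parent" === $"q_id")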

0 Answers