My CSV data looks something like the following; it represents questions and answers, where the latter reference the question they answer:
Id,Type,Parent
1,Q,
2,Q,
3,A,1
4,A,2
5,A,1
6,Q,
7,A,6
I am trying to extract the answers and join each of them with its corresponding question. Here is my code:
val posts = spark.read.option("header", "true").csv("posts.csv")
  .as[(String, String, String)]
  .map { case (id, kind, parent) =>
    (id.toInt, kind, if (parent != null) parent.toInt else -1)
  }
val questions = posts.filter(_._2 == "Q")
val answers = posts.filter(_._2 == "A")
val joined = answers.joinWith(questions, answers("_3") === questions("_1"))
Unfortunately, this doesn't work. Spark complains that my INNER JOIN is actually a CROSS JOIN, but I don't understand why it says that. The join condition is not trivial: the joined result should have exactly as many rows as the left operand. It is definitely not a Cartesian product.
Here is the error message:
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these relations,
or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;
Edit: I checked Why does spark think this is a cross/cartesian join, but using alias and cache as shown below does not change the error:
val questions = posts.filter(_._2 == "Q").alias("q").cache
val answers = posts.filter(_._2 == "A").alias("a").cache