My CSV data looks something like the following; it represents questions and answers, where the latter reference the question they answer:
Id,Type,Parent
1,Q,
2,Q,
3,A,1
4,A,2
5,A,1
6,Q,
7,A,6
I am trying to extract the answers and join each of them with its corresponding question. Here is my code:
val posts = spark.read.option("header", "true").csv("posts.csv")
  .as[(String, String, String)]
  .map { case (id, kind, parent) =>
    (id.toInt, kind, if (parent != null) parent.toInt else -1)
  }
val questions = posts.filter(_._2 == "Q")
val answers = posts.filter(_._2 == "A")
val joined = answers.joinWith(questions, answers("_3") === questions("_1"))
Unfortunately, this doesn't work. Spark complains that my INNER JOIN is actually a CROSS JOIN, but I don't understand why it says that. The join condition is not trivial: the joined result should have exactly as many rows as the left operand. It is definitely not a Cartesian product.
Here is the error message:
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these relations,
or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;
Edit: I checked Why does spark think this is a cross/cartesian join, but using alias and cache as shown below does not change the error:
val questions = posts.filter(_._2 == "Q").alias("q").cache
val answers = posts.filter(_._2 == "A").alias("a").cache