Why doesn't this inner join work with Spark DataFrames?

Asked: 2019-06-14 05:29:22

Tags: apache-spark apache-spark-sql

We have a posts DataFrame. I derive the questions and answers DataFrames from posts as shown below:

scala> val questions = spark.sql("select * from posts where posts._PostType = 'Question'")

scala> val answers = spark.sql("select * from posts where posts._PostType = 'Answer'")

scala> posts.select("_id", "_postType", "_parentId").show
+---+---------+---------+
|_id|_postType|_parentId|
+---+---------+---------+
|  4| Question|        0|
|  6| Question|        0|
|  7|   Answer|        4|
|  9| Question|        0|
| 11| Question|        0|
| 12|   Answer|       11|
| 13| Question|        0|
| 14| Question|        0|
| 16| Question|        0|
| 17| Question|        0|
| 18|   Answer|       17|
+---+---------+---------+


scala> questions.select("_id", "_postType", "_parentId").show
+---+---------+---------+
|_id|_postType|_parentId|
+---+---------+---------+
|  4| Question|        0|
|  6| Question|        0|
|  9| Question|        0|
| 11| Question|        0|
| 13| Question|        0|
| 14| Question|        0|
| 16| Question|        0|
| 17| Question|        0|
+---+---------+---------+


scala> answers.select("_id", "_postType", "_parentId").show
+---+---------+---------+
|_id|_postType|_parentId|
+---+---------+---------+
|  7|   Answer|        4|
| 12|   Answer|       11|
| 18|   Answer|       17|
+---+---------+---------+

I need to find all the answers for each question (an answer's `_ParentId` column points to a question's `_Id` column), so I wrote the following:

val qanda = questions.join(answers, questions("_Id") === answers("_ParentId"), "inner")

But it gives me zero rows:

scala> val qanda = questions.join(answers, questions("_id") === answers("_parentId"), "inner")
qanda: org.apache.spark.sql.DataFrame = [_Id: bigint, _PostTypeId: tinyint ... 42 more fields]

scala> qanda.count
res8: Long = 0
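A zero-row inner join between frames that visibly share key values usually means the two key columns never actually compare equal, and a common cause is a type mismatch (for example, the join key parsed as a string on one side and as a number on the other). As a Spark-free sketch of that effect, using hypothetical sample rows modeled on the tables above:

```scala
// Hypothetical illustration (not the asker's actual schema): an equality
// join matches nothing when the key types differ, even though the values
// "look" identical in a printed table.
object JoinSketch {
  case class Question(id: Long)
  case class Answer(parentId: String) // key accidentally kept as a string

  def main(args: Array[String]): Unit = {
    val questions = Seq(Question(4L), Question(11L), Question(17L))
    val answers   = Seq(Answer("4"), Answer("11"), Answer("17"))

    // Joining on the raw values compares Long against String: never equal.
    val badJoin = for {
      q <- questions
      a <- answers
      if a.parentId == q.id // String == Long is always false here
    } yield (q.id, a.parentId)

    // Casting the string key to Long first restores the expected matches.
    val goodJoin = for {
      q <- questions
      a <- answers
      if a.parentId.toLong == q.id
    } yield (q.id, a.parentId)

    assert(badJoin.isEmpty)
    assert(goodJoin.size == 3)
    println(s"bad=${badJoin.size} good=${goodJoin.size}")
  }
}
```

In Spark the analogous check is to compare the two sides of the join condition in each frame's printed schema; if the types disagree, a cast in the join condition fixes it.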

0 answers:

There are no answers yet.