We have a posts DataFrame. From it I derive separate DataFrames for questions and answers, as follows:
scala> val questions = spark.sql("select * from posts where posts._PostType = 'Question'")
scala> val answers = spark.sql("select * from posts where posts._PostType = 'Answer'")
scala> posts.select("_id", "_postType", "_parentId").show
+---+---------+---------+
|_id|_postType|_parentId|
+---+---------+---------+
|  4| Question|        0|
|  6| Question|        0|
|  7|   Answer|        4|
|  9| Question|        0|
| 11| Question|        0|
| 12|   Answer|       11|
| 13| Question|        0|
| 14| Question|        0|
| 16| Question|        0|
| 17| Question|        0|
| 18|   Answer|       17|
+---+---------+---------+
scala> questions.select("_id", "_postType", "_parentId").show
+---+---------+---------+
|_id|_postType|_parentId|
+---+---------+---------+
|  4| Question|        0|
|  6| Question|        0|
|  9| Question|        0|
| 11| Question|        0|
| 13| Question|        0|
| 14| Question|        0|
| 16| Question|        0|
| 17| Question|        0|
+---+---------+---------+
scala> answers.select("_id", "_postType", "_parentId").show
+---+---------+---------+
|_id|_postType|_parentId|
+---+---------+---------+
|  7|   Answer|        4|
| 12|   Answer|       11|
| 18|   Answer|       17|
+---+---------+---------+
I need to find all the answers for each post (an answer's _ParentId column points to the question's _Id column). So I wrote the following:
val qanda = questions.join(answers, questions("_Id") === answers("_ParentId"), "inner")
But it gives me a count of zero:
scala> val qanda = questions.join(answers, questions("_id") === answers("_parentId"), "inner")
qanda: org.apache.spark.sql.DataFrame = [_Id: bigint, _PostTypeId: tinyint ... 42 more fields]
scala> qanda.count
res8: Long = 0
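
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: questions and answers are both derived from the same posts table, so questions("_id") and answers("_parentId") might both be resolved against the same underlying columns, in which case the condition would effectively compare a question's _Id with its own _ParentId (always 0 in the rows above). A minimal sketch of the same join with explicit aliases is below. The miniature posts data, the local SparkSession, and the alias names q and a are hypothetical and only mirror the rows shown above; the real posts table has many more columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical local session for the sketch; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("qanda-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical miniature of the posts table, built from the rows shown above.
val posts = Seq(
  (4L, "Question", 0L), (6L, "Question", 0L), (7L, "Answer", 4L),
  (9L, "Question", 0L), (11L, "Question", 0L), (12L, "Answer", 11L),
  (13L, "Question", 0L), (14L, "Question", 0L), (16L, "Question", 0L),
  (17L, "Question", 0L), (18L, "Answer", 17L)
).toDF("_Id", "_PostType", "_ParentId")

// Alias each side so that column references in the join condition are unambiguous.
val questions = posts.filter(col("_PostType") === "Question").as("q")
val answers   = posts.filter(col("_PostType") === "Answer").as("a")

// Join each question to the answers whose _ParentId equals the question's _Id.
val qanda = questions.join(answers, col("q._Id") === col("a._ParentId"), "inner")

// Show the question id next to the id of each matching answer.
qanda.select(
  col("q._Id").as("questionId"),
  col("a._Id").as("answerId"),
  col("a._ParentId")
).show()
```

If the aliased version returns rows while the original returns none, that would point to column resolution in the self-join rather than to the data itself.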