Running a SQL query over a joined DataFrame in Spark SQL

Date: 2017-06-14 09:19:24

Tags: apache-spark apache-spark-sql spark-dataframe

I have a DataFrame that is the join of two other DataFrames. I want to run a SQL query against it, but I can't figure out how to disambiguate the id columns. I tried qualifying them with the original table names, with no luck.

Schemas

Blogs:

root
 |-- id: integer (nullable = false)
 |-- author: string (nullable = true)
 |-- title: string (nullable = true)

Comments:

root
 |-- id: integer (nullable = false)
 |-- blog_id: integer (nullable = false)
 |-- author: string (nullable = true)
 |-- comment: string (nullable = true)

Blogs joined with comments:

root
 |-- id: integer (nullable = true)
 |-- author: string (nullable = true)
 |-- title: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- blog_id: integer (nullable = true)
 |-- author: string (nullable = true)
 |-- comment: string (nullable = true)

Attempted queries

scala> spark.sql("SELECT id FROM joined")
12:17:26.981 [run-main-0] INFO org.apache.spark.sql.execution.SparkSqlParser - Parsing command: SELECT id FROM joined
org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id#7, id#23.; line 1 pos 7

scala> spark.sql("SELECT blogs.id FROM joined")
org.apache.spark.sql.AnalysisException: cannot resolve '`blogs.id`' given input columns: [blog_id, id, comment, title, author, author, id]; line 1 pos 7;
'Project ['blogs.id]
+- SubqueryAlias joined, `joined`
   +- Join FullOuter, (id#7 = blog_id#24)
      :- Project [_1#0 AS id#7, _2#1 AS author#8, _3#2 AS title#9]
      :  +- LocalRelation [_1#0, _2#1, _3#2]
      +- Project [_1#14 AS id#23, _2#15 AS blog_id#24, _3#16 AS author#25, _4#17 AS comment#26]
         +- LocalRelation [_1#14, _2#15, _3#16, _4#17]
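
For reference, a minimal reproduction of this setup, inferred from the logical plan above (a full outer join of two local DataFrames on id = blog_id); the sample rows here are hypothetical:

import spark.implicits._

val blogs = Seq((1, "alice", "First post")).toDF("id", "author", "title")
val comments = Seq((10, 1, "bob", "Nice post!")).toDF("id", "blog_id", "author", "comment")

// Full outer join on id = blog_id, matching the plan above
val joined = blogs.join(comments, blogs("id") === comments("blog_id"), "full_outer")
joined.createOrReplaceTempView("joined")
// spark.sql("SELECT id FROM joined") now fails: 'id' is ambiguous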

2 answers:

Answer 0 (score: 0)

You probably joined the two DataFrames like this:

val df = left.join(right, left.col("name") === right.col("name"))

Here the join condition is a column expression on name, so that column ends up duplicated in the joined DataFrame.

Solve this by specifying the join column(s) by name instead:

val df = left.join(right, Seq("name"))

That way the duplicate column is dropped from the joined DataFrame, and the query runs without any problem.
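
A runnable sketch of the difference, assuming an existing SparkSession named spark and two hypothetical DataFrames that share a name column:

import spark.implicits._

val left  = Seq(("alice", 1)).toDF("name", "l_val")
val right = Seq(("alice", 10)).toDF("name", "r_val")

// Expression join: both "name" columns survive, so SQL over a view is ambiguous
left.join(right, left.col("name") === right.col("name")).printSchema()

// Seq join: the key column appears only once
val deduped = left.join(right, Seq("name"))
deduped.printSchema()
deduped.createOrReplaceTempView("joined")
spark.sql("SELECT name FROM joined").show()   // resolves without ambiguity

Note that in the question's schema the join keys have different names (id vs blog_id) and comments carries its own id column, so you would first have to rename columns, e.g. with withColumnRenamed, before a Seq-based join applies.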

Answer 1 (score: -2)

There is a typo in your query.

spark.sql("SELECT blogs.id FROM joined")

should be

spark.sql("SELECT blog.id FROM joined")