Spark SQL UNION - ORDER BY列不在SELECT中

时间:2017-10-19 16:28:04

标签: apache-spark apache-spark-sql

我正在做两个临时表的UNION并试图按列排序但是火花抱怨我订购的列无法解析。这是一个错误还是我错过了什么?

  lazy val spark: SparkSession = SparkSession.builder.master("local[*]").getOrCreate()
  import org.apache.spark.sql.types.StringType

  val oldOrders = Seq(
    Seq("old_order_id1", "old_order_name1", "true"),
    Seq("old_order_id2", "old_order_name2", "true")
  )

  val newOrders = Seq(
    Seq("new_order_id1", "new_order_name1", "false"),
    Seq("new_order_id2", "new_order_name2", "false")
  )
  val schema = new StructType()
    .add("id", StringType)
    .add("name", StringType)
    .add("is_old", StringType)

  val oldOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(oldOrders.map(x => Row(x:_*))), schema)
  val newOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(newOrders.map(x => Row(x:_*))), schema)

  oldOrdersDF.createOrReplaceTempView("old_orders")
  newOrdersDF.createOrReplaceTempView("new_orders")

  //ordering by column not in select works if I'm not doing UNION
  spark.sql(
    """
      |SELECT oo.id, oo.name FROM old_orders oo
      |ORDER BY oo.is_old
    """.stripMargin).show()

  //ordering by column not in select doesn't work as I'm doing a UNION
  spark.sql(
    """
      |SELECT oo.id, oo.name FROM old_orders oo
      |UNION
      |SELECT no.id, no.name FROM new_orders no
      |ORDER BY oo.is_old
    """.stripMargin).show()

上述代码的输出是:

+-------------+---------------+
|           id|           name|
+-------------+---------------+
|old_order_id1|old_order_name1|
|old_order_id2|old_order_name2|
+-------------+---------------+


cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
   +- Union
      :- Project [id#121, name#122]
      :  +- SubqueryAlias oo
      :     +- SubqueryAlias old_orders
      :        +- LogicalRDD [id#121, name#122, is_old#123]
      +- Project [id#131, name#132]
         +- SubqueryAlias no
            +- SubqueryAlias new_orders
               +- LogicalRDD [id#131, name#132, is_old#133]

org.apache.spark.sql.AnalysisException: cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
   +- Union
      :- Project [id#121, name#122]
      :  +- SubqueryAlias oo
      :     +- SubqueryAlias old_orders
      :        +- LogicalRDD [id#121, name#122, is_old#123]
      +- Project [id#131, name#132]
         +- SubqueryAlias no
            +- SubqueryAlias new_orders
               +- LogicalRDD [id#131, name#132, is_old#133]

因此,如果我没有执行UNION,那么按SELECT子句中没有的列进行排序会有效,如果我正在执行两个表的UNION,则会失败。

1 个答案:

答案 0 :(得分:2)

// So even the syntax of Spark SQL is very similar to SQL, 
// but they are working very differently. Under the hood of Spark, its all about Rdds/dataframes. 
// After the UNION statement, a new dataframe is generated, and we are not able to refer the fields from the old table/dataframe if we did not select them.

// how to fix
spark.sql(
  """
    |SELECT id, name
    |FROM (
    | SELECT oo.id, oo.name, oo.is_old FROM old_orders oo
    | UNION
    | SELECT no.id, no.name, no.is_old FROM new_orders no
    | ORDER BY oo.is_old
    | ) t
  """.stripMargin).show()

感谢。