Removing duplicate join columns in a Hive join

Asked: 2017-04-20 22:06:12

Tags: hive apache-spark-sql hiveql pyspark-sql

I am performing a join in Hive:

select * from
  (select * from 
      (select * from A join B on A.x = B.x) t1
  join C on t1.y = C.y) t2
join D on t2.x = D.x

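I get an error that column x cannot be resolved, because both A and B contain a column named x. How should I use qualified names here, or is there a way to drop the duplicated columns in Hive?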

3 answers:

Answer 0: (score: 0)

Because tables A and B both have a column named x, you have to give those columns distinct aliases in this select:

select * from A join B on A.x = B.x   

Something like this:

select A.x as x1, B.x as x2, ...
from A join B on A.x = B.x
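
When the tables have many columns, writing every alias by hand gets tedious. As a rough sketch, a small script can generate the aliased select list. The function name aliased_select_list, the table_column alias scheme, and the example column names are assumptions made for illustration and are not part of the answer above; the real column lists would have to come from your own table definitions.

```python
def aliased_select_list(tables):
    """Build a select list in which every column is prefixed with its table.

    `tables` maps a table name to its column names (hypothetical input).
    Each column is emitted as `table.column AS table_column`.
    """
    parts = []
    for table, columns in tables.items():
        for column in columns:
            parts.append(f"{table}.{column} AS {table}_{column}")
    return ", ".join(parts)


# Example with made-up column names for tables A and B.
select_list = aliased_select_list({"A": ["x", "y"], "B": ["x", "z"]})
query = f"SELECT {select_list} FROM A JOIN B ON A.x = B.x"
print(query)
```

The generated query can then be used as the inner subquery, and the outer joins refer to the aliased names (for example t1.A_x) instead of the ambiguous x.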

Answer 1: (score: 0)

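You can do something like the following, but it means you cannot use special characters in column names. With hive.support.quoted.identifiers set to none, Hive treats a backquoted name as a regular expression, so the pattern (x)?+.+ selects every column except x.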

set hive.support.quoted.identifiers=none;
select * from
  (select C.*,t1.`(y)?+.+` from 
      (select A.*,B.`(x)?+.+` from A join B on A.x = B.x) t1
  join C on t1.y = C.y) t2
join D on t2.x = D.x

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification

Answer 2: (score: 0)

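I had exactly the same problem, and the solution that worked for me was simply to rename the duplicated columns by recreating the DataFrame with a modified schema. Here is some sample code: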

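  import scala.collection.mutable
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.types.StructType

  // Renames every column whose name occurs more than once by appending "__<index>".
  def renameDuplicatedColumns(df: DataFrame): DataFrame = {
    // Names that appear more than once among the DataFrame's columns.
    val duplicatedColumns = df.columns
      .groupBy(identity)
      .filter(_._2.length > 1)
      .keys
      .toSet
    // Next suffix to hand out for each duplicated name (starts at 0).
    val newIndexes = mutable.Map[String, Int]().withDefaultValue(0)

    val schema: StructType = StructType(
      df.schema.fields.map {
        case field if duplicatedColumns.contains(field.name) =>
          val idx = newIndexes(field.name)
          newIndexes.update(field.name, idx + 1)
          field.copy(name = field.name + "__" + idx)
        case field =>
          field
      }
    )
    // Rebuild the DataFrame from the same rows using the renamed schema.
    df.sqlContext.createDataFrame(df.rdd, schema)
  }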
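
Since the question is also tagged pyspark-sql, here is a minimal Python sketch of the same renaming rule, operating on a plain list of column names. The function name rename_duplicated is an assumption for illustration; it is a port of the idea, not code from the answer above.

```python
from collections import Counter


def rename_duplicated(names):
    """Append `__<index>` to every name that occurs more than once.

    Names that occur only once are returned unchanged, mirroring the
    Scala sample.
    """
    counts = Counter(names)
    next_index = {}  # next suffix to hand out per duplicated name
    renamed = []
    for name in names:
        if counts[name] > 1:
            idx = next_index.get(name, 0)
            next_index[name] = idx + 1
            renamed.append(f"{name}__{idx}")
        else:
            renamed.append(name)
    return renamed


new_names = rename_duplicated(["x", "y", "x", "z", "y"])
print(new_names)
# With PySpark (assumed to be installed), the result could be applied as:
#   df = df.toDF(*rename_duplicated(df.columns))
```

Renaming through toDF avoids rebuilding the DataFrame from its RDD, which may be simpler when only the column names need to change.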