How to retrieve the alias of a DataFrame in Spark

Date: 2016-12-20 19:25:21

Tags: apache-spark apache-spark-sql

I'm using Spark 2.0.2. I have a DataFrame with an alias on it, and I'd like to be able to retrieve that alias. A simplified example of what I want is below.

def check(ds: DataFrame) = {
   assert(ds.count > 0, s"${ds.getAlias} has zero rows!")
}

The above code fails, of course, because DataFrame has no getAlias function. Is there a way to do this?

3 Answers:

Answer 0 (score: 5)

You can try something like this, but I wouldn't go so far as to claim it is supported:

  • Spark < 2.1:

    import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias
    import org.apache.spark.sql.Dataset
    
    def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
      case SubqueryAlias(alias, _) => Some(alias)
      case _ => None
    }
    
  • Spark 2.1+:

    def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
      case SubqueryAlias(alias, _, _) => Some(alias)
      case _ => None
    }
    

Example usage:

val plain = Seq((1, "foo")).toDF
getAlias(plain)
// Option[String] = None

val aliased = plain.alias("a dataset")
getAlias(aliased)
// Option[String] = Some(a dataset)
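
Tying this back to the question, a minimal sketch of how the check function from the question could use the helper above (assumptions: the getAlias variant matching your Spark version is in scope, and the "unnamed dataset" fallback is purely illustrative):

def check(ds: Dataset[_]): Unit = {
  // Fall back to a placeholder when the analyzed plan carries no SubqueryAlias.
  val name = getAlias(ds).getOrElse("unnamed dataset")
  assert(ds.count > 0, s"$name has zero rows!")
}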

Answer 1 (score: 1)

Disclaimer: as noted above, this code relies on undocumented APIs and may change. It works as of Spark 2.3.

After digging through mostly undocumented Spark methods, here is the full code to extract the list of fields, along with the table alias, for a DataFrame in PySpark:

def schema_from_plan(df):
    # Walk the analyzed logical plan and return the output fields of the
    # DataFrame together with the table alias each field originates from.
    plan = df._jdf.queryExecution().analyzed()
    all_fields = _schema_from_plan(plan)

    iterator = plan.output().iterator()
    output_fields = {}
    while iterator.hasNext():
        field = iterator.next()
        # look the field up by its unique expression ID to recover its table alias
        queryfield = all_fields.get(field.exprId().id(), {})
        tablealias = queryfield.get("tablealias", "")
        output_fields[field.exprId().id()] = {
            "tablealias": tablealias,
            "dataType": field.dataType().typeName(),
            "name": field.name()
        }
    return list(output_fields.values())

def _schema_from_plan(root, tablealias=None, fields=None):
    # Recursively collect fields (keyed by expression ID) from the plan tree.
    # A fresh dict is created per top-level call to avoid the mutable-default pitfall.
    if fields is None:
        fields = {}
    iterator = root.children().iterator()
    while iterator.hasNext():
        node = iterator.next()
        nodeClass = node.getClass().getSimpleName()
        if nodeClass == "SubqueryAlias":
            # get the alias and process the subnodes with this alias
            _schema_from_plan(node, node.alias(), fields)
        else:
            if tablealias:
                # add all the fields, with their unique IDs and the current table alias
                # (use a separate iterator so the outer loop is not disturbed)
                field_iterator = node.output().iterator()
                while field_iterator.hasNext():
                    field = field_iterator.next()
                    fields[field.exprId().id()] = {
                        "tablealias": tablealias,
                        "dataType": field.dataType().typeName(),
                        "name": field.name()
                    }
            _schema_from_plan(node, tablealias, fields)
    return fields

# example: fields = schema_from_plan(df)
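
To make that example line concrete, here is a hypothetical usage sketch; the SparkSession variable `spark`, the alias names, and the printed dict are illustrative assumptions. Aliases applied to the join inputs appear as SubqueryAlias nodes inside the analyzed plan, which is what the recursive walker above collects:

# Hypothetical usage sketch; assumes an active SparkSession bound to `spark`.
left = spark.createDataFrame([(1, "foo")], ["id", "val"]).alias("left_t")
right = spark.createDataFrame([(1, "bar")], ["id", "val"]).alias("right_t")
joined = left.join(right, "id")

for field in schema_from_plan(joined):
    print(field)  # e.g. {'tablealias': 'left_t', 'dataType': 'string', 'name': 'val'}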

Answer 2 (score: 0)

For Java

As @veinhorn mentioned, it is also possible to get the alias in Java. Here is an example utility method:

import java.util.Optional;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias;

public static <T> Optional<String> getAlias(Dataset<T> dataset) {
    final LogicalPlan analyzed = dataset.queryExecution().analyzed();
    if (analyzed instanceof SubqueryAlias) {
        SubqueryAlias subqueryAlias = (SubqueryAlias) analyzed;
        return Optional.of(subqueryAlias.alias());
    }
    return Optional.empty();
}
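
And a short, hypothetical usage sketch (the `spark` variable, the alias name, and an in-scope import of org.apache.spark.sql.Row are assumptions for illustration):

// Hypothetical usage: alias a Dataset and print the recovered alias, if any.
Dataset<Row> aliased = spark.range(1).toDF().alias("my_table");
getAlias(aliased).ifPresent(alias -> System.out.println("Alias: " + alias));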