我正在使用Spark 2.0.2。我有一个在其上有别名的DataFrame,我希望能够检索它。我想要的简化示例如下。
def check(ds: DataFrame) = {
assert(ds.count > 0, s"${df.getAlias} has zero rows!")
}
上述代码当然失败了,因为DataFrame没有 getAlias 函数。有没有办法做到这一点?
答案 0 :(得分:5)
你可以试试这样的事情,但我不会到目前为止声称它得到支持:
Spark< 2.1:
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias
import org.apache.spark.sql.Dataset
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
case SubqueryAlias(alias, _) => Some(alias)
case _ => None
}
Spark 2.1 +:
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
case SubqueryAlias(alias, _, _) => Some(alias)
case _ => None
}
使用示例:
val plain = Seq((1, "foo")).toDF
getAlias(plain)
Option[String] = None
val aliased = plain.alias("a dataset")
getAlias(aliased)
Option[String] = Some(a dataset)
答案 1 :(得分:1)
免责声明:如上所述,此代码依赖未公开的API,可能会发生变化。它从Spark 2.3开始工作。
在深入研究了大多数未公开的Spark方法之后,下面是完整的代码以提取字段列表以及PySpark中数据框的表别名:
def schema_from_plan(df):
plan = df._jdf.queryExecution().analyzed()
all_fields = _schema_from_plan(plan)
iterator = plan.output().iterator()
output_fields = {}
while iterator.hasNext():
field = iterator.next()
queryfield = all_fields.get(field.exprId().id(),{})
if not queryfield=={}:
tablealias = queryfield["tablealias"]
else:
tablealias = ""
output_fields[field.exprId().id()] = {
"tablealias": tablealias,
"dataType": field.dataType().typeName(),
"name": field.name()
}
return list(output_fields.values())
def _schema_from_plan(root,tablealias=None,fields={}):
iterator = root.children().iterator()
while iterator.hasNext():
node = iterator.next()
nodeClass = node.getClass().getSimpleName()
if (nodeClass=="SubqueryAlias"):
# get the alias and process the subnodes with this alias
_schema_from_plan(node,node.alias(),fields)
else:
if tablealias:
# add all the fields, along with the unique IDs, and a new tablealias field
iterator = node.output().iterator()
while iterator.hasNext():
field = iterator.next()
fields[field.exprId().id()] = {
"tablealias": tablealias,
"dataType": field.dataType().typeName(),
"name": field.name()
}
_schema_from_plan(node,tablealias,fields)
return fields
# example: fields = schema_from_plan(df)
答案 2 :(得分:0)
对于Java :
正如@veinhorn提到的,也可以在 Java 中获得别名。这是一个实用程序方法示例:
public static <T> Optional<String> getAlias(Dataset<T> dataset){
final LogicalPlan analyzed = dataset.queryExecution().analyzed();
if(analyzed instanceof SubqueryAlias) {
SubqueryAlias subqueryAlias = (SubqueryAlias) analyzed;
return Optional.of(subqueryAlias.alias());
}
return Optional.empty();
}