How to dynamically create column references?

Asked: 2018-05-01 19:40:34

Tags: scala apache-spark apache-spark-sql

My DataFrame has the following structure:

root
 |-- author: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- client: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- outbound_link: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- url: string (nullable = true)
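For reference, a minimal sketch of how a DataFrame with this shape could be built for testing; the sample values and the local session setup are assumptions, not taken from the question:

import org.apache.spark.sql.SparkSession

// Hypothetical local session and sample rows roughly matching the schema above
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (Seq(1, 2), Seq(10), Seq("https://a.example", "https://b.example"), "https://a.example"),
  (Seq(3),    Seq(20), Seq("https://c.example"),                      "https://b.example")
).toDF("author", "client", "outbound_link", "url")

df.printSchema()   // same columns and element types; nullability flags may differ slightly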

val sourceField = "outbound_link" // set automatically
val targetField = "url"           // set automatically
val nodeId = "client"             // set automatically
val timestampField = "timestamp"  // set automatically (value assumed here; not shown in the question)

val result = df.as("df1").join(df.as("df2"),
        $"df1."+sourceField === $"df2."+targetField
        ).groupBy(
          ($"df1."+nodeId).as("nodeId_1"),
          ($"df2."+nodeId).as("nodeId_2")
        )
        .agg(
          count("*") as "value", max($"df1."+timestampField) as "timestamp"
        )
        .toDF("source", "target", "value", "timestamp")

When I run this code, I get the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: df1.;

For some reason, the variables sourceField and targetField are not visible in the join operation. The variables are not empty and contain the field names. I have to use variables because they are defined automatically in an earlier step of the code.

1 Answer:

Answer 0 (score: 2)

This is indeed an interesting case. Look at $"df1."+sourceField and think about when $"df1." is converted to a Column, as opposed to the string concatenation "df1."+sourceField taking place. The $ interpolator turns "df1." into a Column right away, and "df1." on its own is not a valid attribute name (it ends with a dot), which is exactly the AnalysisException you see:

scala> val sourceField = "id"
sourceField: String = id

scala> $"df1."+sourceField
org.apache.spark.sql.AnalysisException: syntax error in attribute name: df1.;
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:151)
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:170)
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:142)
  at org.apache.spark.sql.Column.<init>(Column.scala:137)
  at org.apache.spark.sql.ColumnName.<init>(Column.scala:1203)
  at org.apache.spark.sql.SQLImplicits$StringToColumn.$(SQLImplicits.scala:45)
  ... 55 elided

Replace $"df1."+sourceField with the col or column standard function and you should be fine.

scala> col(s"df1.$sourceField")
res7: org.apache.spark.sql.Column = df1.id
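Applied to the query from the question, a sketch of the fix could look like the following; timestampField is assumed to be defined the same way as the other variables, since its definition is not shown:

import org.apache.spark.sql.functions.{col, count, max}

// Interpolate the field names into plain strings first,
// then turn them into column references with col
val result = df.as("df1").join(df.as("df2"),
    col(s"df1.$sourceField") === col(s"df2.$targetField"))
  .groupBy(
    col(s"df1.$nodeId").as("nodeId_1"),
    col(s"df2.$nodeId").as("nodeId_2"))
  .agg(
    count("*") as "value",
    max(col(s"df1.$timestampField")) as "timestamp")
  .toDF("source", "target", "value", "timestamp")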