Question

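I am trying to create a DataFrame with a nested structure. To do that, I first create the DataFrame with all the columns, including the ones that will end up inside the nested struct.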

var df = spark.read.text(inputFile)
        .select(substring(col("value"), 41, 1).alias("carrier"), substring(col("value"), 42, 1).alias("currency"),
          substring(col("value"), 43, 3).alias("amount"), substring(col("value"), 46, 3).alias("country"),
          substring(col("value"), 49, 4).alias("code"), substring(col("value"), 53, 8).alias("quantity"),.......)

Then I have a case class that specifies the columns inside the nested structure:

  case class temp(currency: String, amount: String, country: String, code: String, quantity: String,....)

Then I create the UDF:

  val makeStruct = udf((currency: String, amount: String, country: String, code: String, quantity: String,....) => temp(currency, amount, country, code, quantity,....))

And finally the resulting DataFrame:

df = df.withColumn("segments", makeStruct(col("currency"), col("amount"), col("country"), col("code"),
        col("quantity"),....))
        .drop("currency", "amount", "country", "code", "quantity",...)

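This works well when my UDF has fewer than 10 parameters, but mine has more. How can I get a DataFrame with the nested structure I want when the UDF needs to take more than 10 parameters?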

Answer 1

You can use functions.array.

Pass all the required columns to the UDF as a single array.

For example, I have the following UDF:

  import org.apache.spark.sql.functions
  import scala.collection.mutable

  val myUdf = functions.udf((r: mutable.WrappedArray[_]) => {
    // The elements of `r` are in the same order as the columns passed to array()
    true // a test value
  })

You can then pass the columns to the UDF as in the following example:

df.groupBy(df("id"),df("code"))
    .agg(count(myUdf(array(df("country"),df("city"),df("district")))))
    .show(10)

In the example above, I put all three columns (country, city, district) into a single array and then passed that array to my UDF.
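Applied to the columns from the question, the same idea might look like the sketch below. It is only an illustration under a few assumptions: the case class Segment, the object NestedSegments, the helper toSegment and the sample row are hypothetical names and data, the elided columns from the question are left out, and every column packed into the array is a string (array needs all its elements to share one type, which holds here because each column comes from substring).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, udf}

// Hypothetical case class standing in for the question's `temp`, cut down to five fields.
case class Segment(currency: String, amount: String, country: String, code: String, quantity: String)

object NestedSegments {
  // Builds the struct from the array; elements arrive in the order passed to array().
  def toSegment(r: Seq[String]): Segment = Segment(r(0), r(1), r(2), r(3), r(4))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nested-segments").getOrCreate()
    import spark.implicits._

    // Made-up sample row in place of the fixed-width file from the question.
    val flat = Seq(("A", "U", "100", "USA", "0001", "00000005"))
      .toDF("carrier", "currency", "amount", "country", "code", "quantity")

    // Columns that should move into the nested struct, in case-class field order.
    val segmentCols = Seq("currency", "amount", "country", "code", "quantity")

    // A single array argument, so the number of columns is not tied to the UDF's arity.
    val makeStruct = udf(toSegment _)

    val nested = flat
      .withColumn("segments", makeStruct(array(segmentCols.map(col): _*)))
      .drop(segmentCols: _*)

    nested.printSchema()
    nested.show(false)
    spark.stop()
  }
}
```

Because the UDF returns a case class, the segments column comes out as a struct whose fields carry the case-class field names. If no per-row logic is needed at all, the built-in struct function (for example struct(segmentCols.map(col): _*)) may be enough to build the nested column without a UDF.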

Creating a UDF with more than 10 parameters
