Question

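I am trying to figure out the new DataFrame API in Spark. It seems like a good step forward, but I am having trouble doing something that should be pretty simple. I have a dataframe with 2 columns, "ID" and "Amount". As a generic example, say I want to return a new column called "code" that returns a code based on the value of "Amt". I can write a function something like this: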

def coder(myAmt: Integer): String = {
  if (myAmt > 100) "Little"
  else "Big"
}

When I try to use it like so:

val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")

myDF.withColumn("Code", coder(myDF("Amt")))

I get a type mismatch error:

found   : org.apache.spark.sql.Column
required: Integer

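I have tried changing the input type of my function to org.apache.spark.sql.Column, but then I start getting errors when the function compiles, because it wants a Boolean in the if statement.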

Am I doing this wrong? Is there a better or different way to do this than using withColumn?

Thanks for your help.

Answer 1

Assuming you have an "Amt" column in your schema, I think withColumn is the right way to add a column.
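A minimal sketch of how this could look, not necessarily the code this answer originally had in mind: the plain function from the question is wrapped with udf so that it can take a Column, and the result is passed to withColumn. The local SparkSession (Spark 2.x or later), the object name CoderUdfSketch, and the two sample rows are assumptions made only so the example is self-contained; in the question's setup myDF would come from the parquet file instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object CoderUdfSketch {
  // Plain Scala function with the same rule as the question's coder
  def coder(amt: Int): String = if (amt > 100) "Little" else "Big"

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession stands in for the question's sqlContext
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("coder-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows in place of the parquet file
    val myDF = Seq((1, 50), (2, 150)).toDF("ID", "Amt")

    // Wrapping the function with udf lets it be applied to a Column
    val coderUdf = udf(coder _)

    // withColumn returns a new DataFrame; myDF itself is unchanged
    val coded = myDF.withColumn("Code", coderUdf(col("Amt")))
    coded.show()

    spark.stop()
  }
}
```

Calling coder(myDF("Amt")) directly fails because coder expects an Int, while myDF("Amt") is a Column; the udf wrapper is what bridges the two.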

Answer 2

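We should avoid defining udf functions as much as possible, because of the overhead of serializing and deserializing columns.

You can achieve the same result with Spark's built-in when function, as follows: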

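import org.apache.spark.sql.functions.when

val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")

// Note: amounts below 100 are labelled "Little" here, the reverse of the question's coder
myDF.withColumn("Code", when(myDF("Amt") < 100, "Little").otherwise("Big"))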

Answer 3

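Another way of doing this: you can write any function you like, but, as the error above shows, it has to be defined as a variable wrapped in udf so that it accepts a Column.

Example: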

import org.apache.spark.sql.functions.udf

val coder = udf((myAmt:Integer) => {
  if (myAmt > 100) "Little"
  else "Big"
})

Now this statement works as expected:

myDF.withColumn("Code", coder(myDF("Amt")))

Create new column with function in Spark Dataframe
