Spark DataFrames: accessing a column based on the value of another column

Asked: 2015-11-10 19:49:14

Tags: scala apache-spark dataframe apache-spark-sql

I have a DataFrame containing transactions joined with a price list:

+----------+----------+------+-------+-------+
|   paid   | currency | EUR  |  USD  |  GBP  |
+----------+----------+------+-------+-------+
|   49.5   |   EUR    | 99   |  79   |  69   |
+----------+----------+------+-------+-------+

The customer paid 49.5 in EUR, as indicated by the "currency" column. I now want to compare the paid price with the one from the price list.

So I need to access the right column based on the value of "currency", something like this:

df.withColumn("saved", df.col(df.col($"currency")) - df.col("paid"))

which I hoped would resolve to

df.withColumn("saved", df.col("EUR") - df.col("paid"))

However, this fails. I have tried everything I could imagine, including UDFs, and got nowhere.

I suppose there is some elegant solution for this? Can anybody help?

2 Answers:

Answer 0 (score: 5)

Assuming that the column names match the values in the currency column:

```scala
import org.apache.spark.sql.functions.{lit, col, coalesce, when}
import org.apache.spark.sql.Column

// Dummy data
val df = sc.parallelize(Seq(
  (49.5, "EUR", 99, 79, 69), (100.0, "GBP", 80, 120, 50)
)).toDF("paid", "currency", "EUR", "USD", "GBP")

// A list of available currencies
val currencies: List[String] = List("EUR", "USD", "GBP")

// Select the listed value
val listedPrice: Column = coalesce(
  currencies.map(c => when($"currency" === c, col(c)).otherwise(lit(null))): _*)

df.select($"*", (listedPrice - $"paid").alias("difference")).show

// +-----+--------+---+---+---+----------+
// | paid|currency|EUR|USD|GBP|difference|
// +-----+--------+---+---+---+----------+
// | 49.5|     EUR| 99| 79| 69|      49.5|
// |100.0|     GBP| 80|120| 50|     -50.0|
// +-----+--------+---+---+---+----------+
```

The SQL equivalent of the listedPrice expression is something like this:

```sql
COALESCE(
  CASE WHEN (currency = 'EUR') THEN EUR ELSE null END,
  CASE WHEN (currency = 'USD') THEN USD ELSE null END,
  CASE WHEN (currency = 'GBP') THEN GBP ELSE null END
)
```

An alternative is to use foldLeft:

```scala
import org.apache.spark.sql.functions.when

val listedPriceViaFold = currencies.foldLeft(
  lit(null))((acc, c) => when($"currency" === c, col(c)).otherwise(acc))

df.select($"*", (listedPriceViaFold - $"paid").alias("difference")).show

// +-----+--------+---+---+---+----------+
// | paid|currency|EUR|USD|GBP|difference|
// +-----+--------+---+---+---+----------+
// | 49.5|     EUR| 99| 79| 69|      49.5|
// |100.0|     GBP| 80|120| 50|     -50.0|
// +-----+--------+---+---+---+----------+
```

where listedPriceViaFold translates to the following SQL:

```sql
CASE WHEN (currency = 'GBP') THEN GBP ELSE
  CASE WHEN (currency = 'USD') THEN USD ELSE
    CASE WHEN (currency = 'EUR') THEN EUR ELSE null END
  END
END
```

Unfortunately, I am not aware of any built-in function that could directly express SQL like this:

```sql
CASE currency
    WHEN 'EUR' THEN EUR
    WHEN 'USD' THEN USD
    WHEN 'GBP' THEN GBP
    ELSE null
END
```

but you can use this construct in raw SQL.

If my assumption is not correct, you can simply add a mapping between the column names and the values in the currency column.

Edit:

Another option, which could be efficient if the source supports predicate pushdown and effective column pruning, is to subset by currency and union:

```scala
currencies.map(
  // For each currency, filter and add the difference column
  c => df.where($"currency" === c).withColumn("difference", col(c) - $"paid")
).reduce((df1, df2) => df1.unionAll(df2)) // Union
```

which is equivalent to SQL like this:

```sql
SELECT *, EUR - paid AS difference FROM df WHERE currency = 'EUR'
UNION ALL
SELECT *, USD - paid AS difference FROM df WHERE currency = 'USD'
UNION ALL
SELECT *, GBP - paid AS difference FROM df WHERE currency = 'GBP'
```
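The foldLeft idea above can also be illustrated without a Spark cluster. Below is a minimal plain-Scala sketch of how folding over a currency list builds up one "pick the matching column" expression; the helper name listedPrice and the tuple layout (paid, currency, EUR, USD, GBP) are illustrative assumptions, not part of any Spark API.

```scala
// A list of available currencies, mirroring the answer above
val currencies = List("EUR", "USD", "GBP")

// row: (paid, currency, EUR, USD, GBP)
// Fold over the currencies, keeping the value of the column
// whose name matches the currency field — None if no match.
def listedPrice(row: (Double, String, Int, Int, Int)): Option[Int] = {
  val columns = Map("EUR" -> row._3, "USD" -> row._4, "GBP" -> row._5)
  currencies.foldLeft(Option.empty[Int]) { (acc, c) =>
    if (row._2 == c) columns.get(c) else acc
  }
}

val row = (49.5, "EUR", 99, 79, 69)
println(listedPrice(row).map(_ - row._1))  // Some(49.5)
```

The fold plays the same role as the chain of when(...).otherwise(...) calls: each step either selects the matching column's value or passes the accumulator through unchanged.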


Answer 1 (score: 0)

I can't think of a way to do this with DataFrames, and I doubt there is an easy one, but if you take that table into an RDD:

```scala
// Off the top of my head, warn if wrong.
// Would be more elegant with match .. case
def d(l: (Double, String, Int, Int, Int)): Double = {
  if (l._2 == "EUR")
    l._3 - l._1
  else if (l._2 == "USD")
    l._4 - l._1
  else
    l._5 - l._1
}

// Convert each Row to a tuple first, then apply d
val rdd = df.rdd.map(r =>
  (r.getDouble(0), r.getString(1), r.getInt(2), r.getInt(3), r.getInt(4)))
val diff = rdd.map(t => (t, d(t)))
```

It is quite likely to throw type errors, which I hope you can work your way around.
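As the comment in the snippet above notes, this would be more elegant with match .. case. A hedged plain-Scala sketch of that variant follows; the helper name diff and the tuple layout (paid, currency, EUR, USD, GBP) are assumptions carried over from the examples in this thread.

```scala
// Match on the currency field and subtract paid from the
// corresponding price column; GBP acts as the fallback case.
def diff(l: (Double, String, Int, Int, Int)): Double = l match {
  case (paid, "EUR", eur, _, _) => eur - paid
  case (paid, "USD", _, usd, _) => usd - paid
  case (paid, _, _, _, gbp)     => gbp - paid
}

println(diff((49.5, "EUR", 99, 79, 69)))   // 49.5
println(diff((100.0, "GBP", 80, 120, 50))) // -50.0
```

Pattern matching keeps each currency's arithmetic on one line and makes the fallback branch explicit, at the cost of hard-coding the tuple positions.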