I have a DataFrame of transactions joined with a price list:
+----------+----------+------+-------+-------+
| paid | currency | EUR | USD | GBP |
+----------+----------+------+-------+-------+
| 49.5 | EUR | 99 | 79 | 69 |
+----------+----------+------+-------+-------+
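The customer paid 49.5 in EUR, as shown in the "currency" column. I now want to compare the price paid with the price from the price list.
So I need to access the correct column based on the value of "currency". Something like this: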
df.withColumn("saved", df.col(df.col($"currency")) - df.col("paid"))
which I hoped would become
df.withColumn("saved", df.col("EUR") - df.col("paid"))
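However, this fails. I have tried everything I could think of, including UDFs, and got nowhere.
I guess there is some elegant solution for this? Can anybody help?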
Answer 0 (score: 5)
Assuming that the column names match the values in the currency column:
import org.apache.spark.sql.functions.{lit, col, coalesce, when}
import org.apache.spark.sql.Column
// Dummy data
val df = sc.parallelize(Seq(
(49.5, "EUR", 99, 79, 69), (100.0, "GBP", 80, 120, 50)
)).toDF("paid", "currency", "EUR", "USD", "GBP")
// A list of available currencies
val currencies: List[String] = List("EUR", "USD", "GBP")
// Select listed value
val listedPrice: Column = coalesce(
currencies.map(c => when($"currency" === c, col(c)).otherwise(lit(null))): _*)
df.select($"*", (listedPrice - $"paid").alias("difference")).show
// +-----+--------+---+---+---+----------+
// | paid|currency|EUR|USD|GBP|difference|
// +-----+--------+---+---+---+----------+
// | 49.5| EUR| 99| 79| 69| 49.5|
// |100.0| GBP| 80|120| 50| -50.0|
// +-----+--------+---+---+---+----------+
The SQL equivalent of the listedPrice expression is something like this:
COALESCE(
  CASE WHEN (currency = 'EUR') THEN EUR ELSE null END,
  CASE WHEN (currency = 'USD') THEN USD ELSE null END,
  CASE WHEN (currency = 'GBP') THEN GBP ELSE null END
)
Alternative using foldLeft:
import org.apache.spark.sql.functions.when
val listedPriceViaFold = currencies.foldLeft(
lit(null))((acc, c) => when($"currency" === c, col(c)).otherwise(acc))
df.select($"*", (listedPriceViaFold - $"paid").alias("difference")).show
// +-----+--------+---+---+---+----------+
// | paid|currency|EUR|USD|GBP|difference|
// +-----+--------+---+---+---+----------+
// | 49.5| EUR| 99| 79| 69| 49.5|
// |100.0| GBP| 80|120| 50| -50.0|
// +-----+--------+---+---+---+----------+
where listedPriceViaFold translates to the following SQL:
CASE
  WHEN (currency = 'GBP') THEN GBP
  ELSE CASE
    WHEN (currency = 'USD') THEN USD
    ELSE CASE
      WHEN (currency = 'EUR') THEN EUR
      ELSE null
    END
  END
END
Unfortunately, I am not aware of any built-in function that can directly express SQL like this:
CASE currency
  WHEN 'EUR' THEN EUR
  WHEN 'USD' THEN USD
  WHEN 'GBP' THEN GBP
  ELSE null
END
but you can use this construct in raw SQL.
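As a minimal sketch of the raw SQL route, the simple CASE form can be written as a string and parsed into a Column with the expr function. This assumes Spark 1.5 or later (where org.apache.spark.sql.functions.expr is available) and a spark-shell session with sc and the implicits in scope. The names branches and listedPriceViaExpr are only illustrative, and the dummy data is repeated so the snippet stands on its own.

```scala
import org.apache.spark.sql.functions.expr

// Same dummy data as in the answer (assumes spark-shell, so sc and $ are in scope)
val df = sc.parallelize(Seq(
  (49.5, "EUR", 99, 79, 69), (100.0, "GBP", 80, 120, 50)
)).toDF("paid", "currency", "EUR", "USD", "GBP")

val currencies: List[String] = List("EUR", "USD", "GBP")

// Build "WHEN 'EUR' THEN `EUR` WHEN 'USD' THEN `USD` ..." from the currency list.
// Assumption: the currency codes contain no quotes or backticks.
val branches = currencies.map(c => s"WHEN '$c' THEN `$c`").mkString(" ")

// Simple CASE written as raw SQL and parsed into a Column
val listedPriceViaExpr = expr(s"CASE currency $branches ELSE null END")

df.select($"*", (listedPriceViaExpr - $"paid").alias("difference")).show
```

Generating the WHEN branches from the list keeps the expression in sync with currencies, at the cost of building SQL by string concatenation, so it is only reasonable for a trusted, fixed list of column names.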
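If my assumption is incorrect, you can simply add a mapping between the column names and the values in the currency column.
Edit:
Another option, which could be efficient if the source supports predicate pushdown and efficient column pruning, is to subset by currency and union:
currencies.map(
  // For each currency, filter the rows and add the difference
  c => df.where($"currency" === c).withColumn("difference", col(c) - $"paid")
).reduce((df1, df2) => df1.unionAll(df2)) // Union
It is equivalent to the following SQL:
SELECT *, EUR - paid AS difference FROM df WHERE currency = 'EUR'
UNION ALL
SELECT *, USD - paid AS difference FROM df WHERE currency = 'USD'
UNION ALL
SELECT *, GBP - paid AS difference FROM df WHERE currency = 'GBP'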
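Going back to the case where the column names do not match the values in the currency column, the mapping mentioned earlier could look roughly like the sketch below. The data, the lowercase currency names, and the names dfNamed, currencyToColumn and listedPriceViaMap are all hypothetical and chosen only for illustration; a spark-shell session with sc and the implicits in scope is assumed.

```scala
import org.apache.spark.sql.functions.{coalesce, col, when}

// Hypothetical data: the currency column holds names, the price columns use codes
val dfNamed = sc.parallelize(Seq(
  (49.5, "euro", 99, 79, 69), (100.0, "pound", 80, 120, 50)
)).toDF("paid", "currency", "EUR", "USD", "GBP")

// Hypothetical mapping: value in the currency column -> name of the price column
val currencyToColumn = Map("euro" -> "EUR", "dollar" -> "USD", "pound" -> "GBP")

// One branch per mapping entry; when without otherwise yields null for
// non-matching rows, so coalesce picks the single matching price
val listedPriceViaMap = coalesce(
  currencyToColumn.toSeq.map { case (value, column) =>
    when($"currency" === value, col(column))
  }: _*)

dfNamed.select($"*", (listedPriceViaMap - $"paid").alias("difference")).show
```

Rows whose currency value is missing from the map end up with a null difference, which makes unmapped currencies easy to spot.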
Answer 1 (score: 0)
I can't think of a way to do this with a DataFrame, and I doubt there is a simple one, but if you turn that table into an RDD:
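import org.apache.spark.sql.Row

// Off the top of my head, tell me if it is wrong.
// Returns the listed price minus the paid price for one row.
def d(r: Row): Double = {
  val paid = r.getAs[Double]("paid")
  r.getAs[String]("currency") match {
    case "EUR" => r.getAs[Int]("EUR") - paid
    case "USD" => r.getAs[Int]("USD") - paid
    case _     => r.getAs[Int]("GBP") - paid
  }
}
val rdd = df.rdd
// Pair each row with its difference
val diff = rdd.map(r => (r, d(r)))
This may still raise type errors if your column types differ from the ones assumed here; I hope you can work around those.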