Scala / Spark - Multiply every value in a DataFrame column by an integer

Time: 2017-04-18 07:02:46

Tags: scala apache-spark

I have a sample DataFrame:

df_that_I_have
+---------+---------+-------+
| country | members | some  |
+---------+---------+-------+
| India   | 50      | 1     |
+---------+---------+-------+
| Japan   | 20      | 3     |
+---------+---------+-------+
| India   | 20      | 1     |
+---------+---------+-------+
| Japan   | 10      | 3     |
+---------+---------+-------+

I want a DataFrame that looks like this:

df_that_I_want
+---------+---------+-------+
| country | members | some  |
+---------+---------+-------+
| India   | 70      | 10    | // 5 * Sum of "some" for India, i.e. (1 + 1)
+---------+---------+-------+
| Japan   | 30      | 30    | // 5 * Sum of "some" for Japan, i.e. (3 + 3)
+---------+---------+-------+

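The second DataFrame contains, for each country, the sum of "members" and the sum of "some" multiplied by 5.

This is what I tried in order to get there: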

val df_that_I_want = df_that_I_have
                        .select(df_that_I_have("country"),
                                df_that_I_have.groupBy("country").sum("members"),
                                5 * df_that_I_have.groupBy("country").sum("some")) //Problem here

But the compiler rejects this, apparently because I cannot multiply 5 by a Column.

How can I multiply an Integer value by the per-country sum of "some"?

3 answers:

Answer 0 (score: 3)

You can try the lit function.

scala> val df_that_I_have = Seq(("India",50,1),("India",20,1),("Japan",20,3),("Japan",10,3)).toDF("Country","Members","Some")
df_that_I_have: org.apache.spark.sql.DataFrame = [Country: string, Members: int, Some: int]

scala> val df1 = df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
df1: org.apache.spark.sql.DataFrame = [country: string, sum(members): bigint, ((sum(some),mode=Complete,isDistinct=false) * 5): bigint]

scala> val df_that_I_want= df1.select($"Country",$"sum(Members)".alias("Members"), $"((sum(Some),mode=Complete,isDistinct=false) * 5)".alias("Some"))
df_that_I_want: org.apache.spark.sql.DataFrame = [Country: string, Members: bigint, Some: bigint]

scala> df_that_I_want.show

+-------+-------+----+
|Country|Members|Some|
+-------+-------+----+
|  India|     70|  10|
|  Japan|     30|  30|
+-------+-------+----+
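
The generated column name selected above, ((sum(Some),mode=Complete,isDistinct=false) * 5), depends on the Spark version, so selecting by that exact string may not work elsewhere. A minimal sketch that names the aggregated columns inside agg instead, assuming a local SparkSession (the spark value and the app name are illustrative and not from the original post; in spark-shell a session already exists):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, sum}

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("multiply-sum")
  .getOrCreate()
import spark.implicits._

val df_that_I_have = Seq(
  ("India", 50, 1),
  ("Japan", 20, 3),
  ("India", 20, 1),
  ("Japan", 10, 3)
).toDF("country", "members", "some")

// Alias each aggregate directly, so no auto-generated column name
// has to be referenced afterwards.
val df_that_I_want = df_that_I_have
  .groupBy("country")
  .agg(
    sum("members").alias("members"),
    (sum("some") * lit(5)).alias("some")
  )
  .orderBy("country") // only to make the row order deterministic

df_that_I_want.show()
```

Note that groupBy(...).sum(...) returns a DataFrame rather than a Column, which is a second reason the select in the question cannot work; agg takes Column expressions, so both aggregates fit in a single call.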

Answer 1 (score: 1)

Please try this:

df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))

Answer 2 (score: 0)

df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))

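The lit function creates a Column holding a literal value, in this case 5.

Since an Int cannot be multiplied by a Column directly, lit wraps 5 in a Column, and the multiplication is then done between two Columns.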
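
To make the distinction concrete, here is a small sketch under the same assumptions as the question's data (the session setup, the df name, and the via_lit / via_column_op aliases are illustrative). It shows that the literal only needs wrapping when it is on the left-hand side: Column defines its own * operator that accepts a plain value, whereas Int has no * overload that accepts a Column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, sum}

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lit-demo")
  .getOrCreate()
import spark.implicits._

val df = Seq(("India", 1), ("India", 1), ("Japan", 3), ("Japan", 3))
  .toDF("country", "some")

// Does not compile: Int has no `*` method that takes a Column.
// df.groupBy("country").agg(5 * sum("some"))

val result = df
  .groupBy("country")
  .agg(
    (lit(5) * sum("some")).alias("via_lit"),      // literal wrapped as a Column
    (sum("some") * 5).alias("via_column_op")      // Column's `*` accepts a plain value
  )
  .orderBy("country") // only to make the row order deterministic

result.show()
```

Both expressions should produce the same values; which one to use is mostly a matter of readability.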