Summing DataFrame values based on a condition

Time: 2016-11-18 16:20:44

Tags: scala apache-spark dataframe apache-spark-sql

I have a DataFrame created as follows:

  df = sc
       .textFile("s3n://bucket/key/data.txt")
       .map(_.split(","))
       .toDF()

Here is the content of data.txt:

123,2016-11-09,1
124,2016-11-09,2
123,2016-11-10,1
123,2016-11-11,1
123,2016-11-12,1
124,2016-11-13,1
124,2016-11-14,1

Is it possible to filter df to get the sum of the 3rd column values for 123 over the last N days counted from now? I am interested in a flexible solution, so that N can be passed in as a parameter.

For example, if today were 2016-11-16 and N were 5, then the sum of the 3rd column values for 124 would be 2.

Here is my current solution:

  df = sc
       .textFile("s3n://bucket/key/data.txt")
       .map(_.split(","))
       .toDF(["key","date","qty"])

val starting_date = LocalDate.now().minusDays(x_last_days)
df.filter(col("key") === "124")
                    .filter(to_date(df("date")).gt(starting_date))
                    .agg(sum(col("qty")))

But it doesn't seem to work properly:

1. The line where I define the column names, ["key","date","qty"], does not compile for Scala 2.10.6 and Spark 1.6.2.
2. It also returns a DataFrame, whereas I need an Int. Should I do toString.toInt?

1 answer:

Answer 0 (score: 4):

Neither of the following will compile:

scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
// <console>:1: error: illegal start of simple expression
//       val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
                                                                                                                                                                                             ^

scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
// <console>:27: error: value toDF is not a member of org.apache.spark.rdd.RDD[Array[String]]
//       val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
                                                                                                                                                                                          ^

The first one fails because it is simply invalid syntax, and the second fails because, as the error says, toDF is not a member of RDD[Array[String]]; in other words, that behavior is not supported there.

The latter would compile with Spark 2.x, but the solution below still applies there as well; otherwise you would end up with a DataFrame whose single column is of type ArrayType.
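For illustration, here is a minimal sketch of that ArrayType case, assuming a Spark 2.x shell with spark.implicits._ already in scope (the variable name and the exact printed schema are mine, not from the answer above):

val arrayDf = sc.parallelize(Seq("123,2016-11-09,1", "124,2016-11-09,2"))
                .map(_.split(","))   // RDD[Array[String]]
                .toDF()              // compiles on 2.x, but yields a single array column
arrayDf.printSchema()
// expect one column of type array<string> instead of the three columns we actually want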

Now let's solve the problem:

scala> :pa
// Entering paste mode (ctrl-D to finish)
import sqlContext.implicits._ // you don't need to import this in the shell.
val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1"))
           .map{ _.split(",") match { case Array(a,b,c) => (a,b,c) }}.toDF("key","date","qty")

// Exiting paste mode, now interpreting.

// df: org.apache.spark.sql.DataFrame = [key: string, date: string, qty: string]

You can then apply whatever filters you need and compute the aggregations you want, e.g.:

scala> val df2 = df.filter(col("key") === "124").agg(sum(col("qty")))
// df2: org.apache.spark.sql.DataFrame = [sum(qty): double]

scala> df2.show
// +--------+                                                                      
// |sum(qty)|
// +--------+
// |     4.0|
// +--------+

PS: The code above has been tested with Spark 1.6.2 and 2.0.0.
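To cover the remaining parts of the question, namely the "last N days" filter and getting a plain Int instead of a one-row DataFrame, here is a minimal sketch built on top of the df above. The sumQtyForKey helper, its parameter names, and the getDouble(0) extraction are my own illustration (assuming java.time.LocalDate, as in the question), not part of the tested code above:

import java.time.LocalDate
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, sum, to_date}

// Hypothetical helper: sum qty for one key over the last `lastDays` days
// and return it as a plain Int rather than a one-row DataFrame.
def sumQtyForKey(df: DataFrame, key: String, lastDays: Int): Int = {
  val startingDate = LocalDate.now().minusDays(lastDays)
  df.filter(col("key") === key)
    // to_date turns the string column into a date; it is compared against
    // the ISO string form of startingDate (e.g. "2016-11-11")
    .filter(to_date(col("date")).gt(lit(startingDate.toString)))
    .agg(sum(col("qty")))
    .first()
    .getDouble(0) // the sum over the string qty column comes back as a double
    .toInt
}

// e.g. sumQtyForKey(df, "124", 5)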