I have a DataFrame created as follows:
df = sc
.textFile("s3n://bucket/key/data.txt")
.map(_.split(","))
.toDF()
This is data.txt:
123,2016-11-09,1
124,2016-11-09,2
123,2016-11-10,1
123,2016-11-11,1
123,2016-11-12,1
124,2016-11-13,1
124,2016-11-14,1
Is it possible to filter df so as to get the sum of the 3rd-column values for 123 over the last N days counting back from now? I'm interested in a flexible solution, so that N could be defined as a parameter. For example, if today were 2016-11-16 and N were equal to 5, the sum of the 3rd-column values for 124 would be 2.
This is my current solution:
df = sc
.textFile("s3n://bucket/key/data.txt")
.map(_.split(","))
.toDF(["key","date","qty"])
val starting_date = LocalDate.now().minusDays(x_last_days)
df.filter(col("key") === "124")
.filter(to_date(df("date")).gt(starting_date))
.agg(sum(col("qty")))
But it doesn't seem to work properly:
1. The line where I define the column names, ["key","date","qty"], doesn't compile for Scala 2.10.6 and Spark 1.6.2.
2. It also returns a DataFrame, while I need an Int. Should I simply do toString.toInt?
Answer 0 (score: 4):
Neither of the following compiles:
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
// <console>:1: error: illegal start of simple expression
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
^
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
// <console>:27: error: value toDF is not a member of org.apache.spark.rdd.RDD[Array[String]]
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
^
The first one fails because it is simply invalid syntax, while the second fails because, as the error says, toDF is not a member of RDD[Array[String]]; in other words, that behavior is not supported there. The latter would compile with Spark 2.x, but the solution below still applies; otherwise you would end up with a DataFrame containing a single column of ArrayType.
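For completeness, here is roughly what that looks like in a Spark 2.x shell (a sketch, assuming sc and spark.implicits._ are already in scope, as they are in spark-shell), along with one way to pull named columns back out of the array column:

import org.apache.spark.sql.functions.col

// Spark 2.x only: splitting yields a single array<string> column,
// which can then be unpacked into named columns by index.
val raw = sc
  .parallelize(Seq("123,2016-11-09,1", "124,2016-11-09,2"))
  .map(_.split(","))
  .toDF("value")                       // schema: value: array<string>

val named = raw.select(
  col("value").getItem(0).as("key"),
  col("value").getItem(1).as("date"),
  col("value").getItem(2).as("qty"))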
Now let's solve the problem:
scala> :pa
// Entering paste mode (ctrl-D to finish)
import sqlContext.implicits._ // you don't need to import this in the shell.
val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1"))
.map{ _.split(",") match { case Array(a,b,c) => (a,b,c) }}.toDF("key","date","qty")
// Exiting paste mode, now interpreting.
// df: org.apache.spark.sql.DataFrame = [key: string, date: string, qty: string]
You can apply any filter you want and compute the aggregation you need, e.g.:
scala> val df2 = df.filter(col("key") === "124").agg(sum(col("qty")))
// df2: org.apache.spark.sql.DataFrame = [sum(qty): double]
scala> df2.show
// +--------+
// |sum(qty)|
// +--------+
// | 4.0|
// +--------+
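To cover the remaining parts of the question (making N a parameter and getting a plain Int back instead of a DataFrame), here is a minimal sketch using only functions available in both 1.6 and 2.x (to_date, current_date, date_sub); lastNDays is a hypothetical parameter name, and the date column is assumed to be in yyyy-MM-dd format:

import org.apache.spark.sql.functions._

val lastNDays = 5  // hypothetical parameter: how far back from today to look

val total: Int = df
  .filter(col("key") === "124")
  // keep only rows whose date is newer than (today - lastNDays)
  .filter(to_date(col("date")) > date_sub(current_date(), lastNDays))
  // coalesce guards against a null sum when no rows match the filters
  .agg(coalesce(sum(col("qty")), lit(0.0)))
  .first()
  .getDouble(0)
  .toInt

first() brings the single aggregated row back to the driver, which is fine for a scalar result like this; the same pattern works for key "123" or any other filter.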
PS: the code above was tested on both Spark 1.6.2 and 2.0.0.