Question

data=
"""
user date      item1 item2
1    2015-12-01 14  5.6
1    2015-12-01 10  0.6
1    2015-12-02 8   9.4
1    2015-12-02 90  1.3
2    2015-12-01 30  0.3
2    2015-12-01 89  1.2
2    2015-12-30 70  1.9
2    2015-12-31 20  2.5
3    2015-12-01 19  9.3
3    2015-12-01 40  2.3
3    2015-12-02 13  1.4
3    2015-12-02 50  1.0
3    2015-12-02 19  7.8
"""

Given the data above, how can I get each user's records for their latest day? I tried using groupByKey, but I don't know how to proceed.

val user = data.map{
case(user,date,item1,item2)=>((user,date),Array(item1,item2))
}.groupByKey()

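After that I don't know how to handle it. Can anyone give me some suggestions? Thanks a lot :)

Update

I changed the data: a user can now have several records on the latest day, and I want to get all of them. Thanks :)

Second update:

The result I want is: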

user1 (2015-12-02,Array(8,9.4),Array(90,1.3))
user2 (2015-12-31,Array(20,2.5))
user3 (2015-12-02,Array(13,1.4),Array(50,1.0),Array(19,7.8))

I have now written some code:

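import scala.collection.mutable.ArrayBuffer

// drop(1) skips the header row so that toInt/toDouble only see data rows
val data2=data.trim.split("\\n").drop(1).map(_.split("\\s+")).map{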
f=>{(f(0),ArrayBuffer(
                    f(1),
                    f(2).toInt,
                    f(3).toDouble)
    )}
}
val data3 = sc.parallelize(data2)
data3.reduceByKey((x,y)=>
             if(x(0).toString.compareTo(y(0).toString)>=0) x++=y
                  else y).foreach(println)

The result is:

(2,ArrayBuffer(2015-12-31, 20, 2.5))
(1,ArrayBuffer(2015-12-02, 8, 9.4, 2015-12-02, 90, 1.3))
(3,ArrayBuffer(2015-12-02, 13, 1.4, 2015-12-02, 50, 1.0, 2015-12-02, 19, 7.8))

Is there anything that could be improved? :)

Answer 1

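I think your best option is to map the input data to an RDD of (user, (date, item1, item2)) tuples, so that the RDD has the type userRdd: RDD[(Int, (Date, Int, Double))]

From there you can write a reducer that takes two tuples and returns one of the same format, namely the one with the greater date value: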

// Assumes Date is a Comparable date type such as java.util.Date
def reduceMaxDate(a: (Date, Int, Double), b: (Date, Int, Double)): (Date, Int, Double) = {
     if (a._1.compareTo(b._1) > 0) a else b
}

From there, you can find the maximum for each user with:

userRdd.reduceByKey(reduceMaxDate)

This yields, for each user, the tuple with the maximum timestamp.
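Note that the reducer above keeps a single tuple per user, whereas the updated question asks for every record on the latest day. Here is a minimal sketch of one way to keep ties, written in plain Python with no Spark dependency so the merge logic can be checked on its own. `merge_latest` and `latest_by_user` are hypothetical names, and the dates are assumed to be ISO-formatted strings so that string comparison matches date order.

```python
# Sample rows from the question: (user, date, item1, item2)
rows = [
    (1, "2015-12-01", 14, 5.6), (1, "2015-12-01", 10, 0.6),
    (1, "2015-12-02", 8, 9.4), (1, "2015-12-02", 90, 1.3),
    (2, "2015-12-01", 30, 0.3), (2, "2015-12-01", 89, 1.2),
    (2, "2015-12-30", 70, 1.9), (2, "2015-12-31", 20, 2.5),
    (3, "2015-12-01", 19, 9.3), (3, "2015-12-01", 40, 2.3),
    (3, "2015-12-02", 13, 1.4), (3, "2015-12-02", 50, 1.0),
    (3, "2015-12-02", 19, 7.8),
]


def merge_latest(a, b):
    """Merge two (date, records) pairs, keeping only the later date's records."""
    if a[0] > b[0]:
        return a
    if a[0] < b[0]:
        return b
    # Same date: keep the records from both sides
    return (a[0], a[1] + b[1])


def latest_by_user(rows):
    """Fold every row into a per-user (date, records) pair using merge_latest."""
    result = {}
    for user, date, item1, item2 in rows:
        value = (date, [(item1, item2)])
        result[user] = merge_latest(result[user], value) if user in result else value
    return result


latest = latest_by_user(rows)
for user in sorted(latest):
    print(user, latest[user])
```

Because `merge_latest` only looks at its two arguments and is associative, a function of this shape should also be usable as the argument to reduceByKey on values of the form (date, [record]). That has not been run on a cluster here, and the order of records within a day would then depend on how partitions are merged.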

Answer 2

Here are the scripts.

For Scala

val data = sc.textFile("file:///home/cloudera/data.txt")
val dataMap = data.map(x => (x.split(" +")(0), x))
val dataReduce = dataMap.reduceByKey((x, y) =>
  if(x.split(" +")(1) >= y.split(" +")(1)) x 
  else y)

val dataUserAndDateKey = data.map(rec => ((rec.split(" +")(0), rec.split(" +")(1)), rec))

// rec is a (user, line) tuple, so the line is rec._2 (rec(1) does not compile on a tuple)
val dataReduceUserAndDateKey = dataReduce.map(rec => ((rec._2.split(" +")(0), rec._2.split(" +")(1)), rec._2))

val joinData = dataUserAndDateKey.join(dataReduceUserAndDateKey)

joinData.map(rec => rec._2._1).foreach(println)

For PySpark

import re

data = sc.textFile("file:///home/cloudera/data.txt")
# Raw strings (r'\s+') avoid invalid-escape warnings in the regex patterns
dataMap = data.map(lambda rec: (re.split(r'\s+', rec)[0], rec))
dataReduce = dataMap.reduceByKey(lambda x, y: x if(re.split(r'\s+', x)[1] >= re.split(r'\s+', y)[1]) else y)

dataUserAndDateKey = data.map(lambda rec: ((re.split(r'\s+', rec)[0], re.split(r'\s+', rec)[1]), rec))

dataReduceUserAndDateKey = dataReduce.map(lambda rec: ((re.split(r'\s+', rec[1])[0], re.split(r'\s+', rec[1])[1]), rec[1]))

joinData = dataUserAndDateKey.join(dataReduceUserAndDateKey)
for i in joinData.collect(): print(i[1][0])

Here is the output:

3    2015-12-02 13  1.4
3    2015-12-02 50  1.0
3    2015-12-02 19  7.8
2    2015-12-31 20  2.5
1    2015-12-02 8   9.4
1    2015-12-02 90  1.3

You can also do this in SQL with DataFrames, using a HiveContext built on the SparkContext.
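To illustrate what such a SQL query might look like, here is a minimal sketch that uses Python's built-in sqlite3 module rather than Spark, purely so the query can be run locally. The table name `records` and the column names `user_id` and `day` are assumptions made for this example. The query joins each row against the per-user maximum date, which is the same join idea as the scripts above.

```python
import sqlite3

# Sample rows from the question: (user, date, item1, item2)
rows = [
    (1, "2015-12-01", 14, 5.6), (1, "2015-12-01", 10, 0.6),
    (1, "2015-12-02", 8, 9.4), (1, "2015-12-02", 90, 1.3),
    (2, "2015-12-01", 30, 0.3), (2, "2015-12-01", 89, 1.2),
    (2, "2015-12-30", 70, 1.9), (2, "2015-12-31", 20, 2.5),
    (3, "2015-12-01", 19, 9.3), (3, "2015-12-01", 40, 2.3),
    (3, "2015-12-02", 13, 1.4), (3, "2015-12-02", 50, 1.0),
    (3, "2015-12-02", 19, 7.8),
]

# In-memory database standing in for a registered table (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (user_id INTEGER, day TEXT, item1 INTEGER, item2 REAL)"
)
conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)

# Keep only the rows whose day equals that user's maximum day
query = """
SELECT r.user_id, r.day, r.item1, r.item2
FROM records AS r
JOIN (
    SELECT user_id, MAX(day) AS max_day
    FROM records
    GROUP BY user_id
) AS m
  ON r.user_id = m.user_id AND r.day = m.max_day
ORDER BY r.user_id, r.rowid
"""
latest_rows = conn.execute(query).fetchall()
for row in latest_rows:
    print(row)
```

The join against an aggregated subquery is ordinary SQL, so the same query shape should carry over to a Spark SQL table, although it has not been run against Spark here. The `rowid` in the ORDER BY clause is specific to SQLite and would need to be dropped.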

Answer 3

Here is my solution, in the following 4 steps. Copy/paste it into the shell to see the output of each step.

//Step 1. Prepare data

val input="""user date      item1 item2
1    2015-12-01 14  5.6
1    2015-12-01 10  0.6
1    2015-12-02 8   9.4
1    2015-12-02 90  1.3
2    2015-12-01 30  0.3
2    2015-12-01 89  1.2
2    2015-12-30 70  1.9
2    2015-12-31 20  2.5
3    2015-12-01 19  9.3
3    2015-12-01 40  2.3
3    2015-12-02 13  1.4
3    2015-12-02 50  1.0
3    2015-12-02 19  7.8
"""
val inputLines=sc.parallelize(input.split("\\r?\\n"))
//filter the header row
val data=inputLines.filter(l=> !l.startsWith("user") )
data.foreach(println)

//Step 2. Find the latest date of each user

val keyByUser=data.map(line => { val a = line.split("\\s+"); ( a(0), line ) })
//For each user, find his latest date
val latestByUser = keyByUser.reduceByKey( (x,y) => if(x.split("\\s+")(1) > y.split("\\s+")(1)) x else y )
latestByUser.foreach(println)

//Step 3. Join the original data with the latest date to get the result

val latestKeyedByUserAndDate = latestByUser.map( x => (x._1 + ":"+x._2.split("\\s+")(1), x._2))
val originalKeyedByUserAndDate = data.map(line => { val a = line.split("\\s+"); ( a(0) +":"+a(1), line ) })
val result=latestKeyedByUserAndDate.join(originalKeyedByUserAndDate)
result.foreach(println)

//Step 4. Transform the result into the format you desire

def createCombiner(v:(String,String)):List[(String,String)] = List[(String,String)](v)
def mergeValue(acc:List[(String,String)], value:(String,String)) : List[(String,String)] = value :: acc
def mergeCombiners(acc1:List[(String,String)], acc2:List[(String,String)]) : List[(String,String)] = acc2 ::: acc1
//use combineByKey
val transformedResult=result.mapValues(l=> { val a=l._2.split(" +"); (a(2),a(3)) } ).combineByKey(createCombiner,mergeValue,mergeCombiners)
transformedResult.foreach(println)

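Step 1. Prepare the data
Step 2. Find the latest date of each user
Step 3. Join the original data with the latest date to get the result
Step 4. Transform the result into the format you want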
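The data flow of steps 2 to 4 can be traced without a cluster. The following is a rough plain-Python sketch of the same two-pass idea, with dictionaries standing in for reduceByKey, join and combineByKey. The variable names are chosen for this example only, and the dates are assumed to be ISO-formatted strings.

```python
from collections import defaultdict

# Step 1: sample rows from the question as (user, date, item1, item2)
rows = [
    (1, "2015-12-01", 14, 5.6), (1, "2015-12-01", 10, 0.6),
    (1, "2015-12-02", 8, 9.4), (1, "2015-12-02", 90, 1.3),
    (2, "2015-12-01", 30, 0.3), (2, "2015-12-01", 89, 1.2),
    (2, "2015-12-30", 70, 1.9), (2, "2015-12-31", 20, 2.5),
    (3, "2015-12-01", 19, 9.3), (3, "2015-12-01", 40, 2.3),
    (3, "2015-12-02", 13, 1.4), (3, "2015-12-02", 50, 1.0),
    (3, "2015-12-02", 19, 7.8),
]

# Step 2: latest date per user (the role played by reduceByKey)
latest_date = {}
for user, date, _item1, _item2 in rows:
    if user not in latest_date or date > latest_date[user]:
        latest_date[user] = date

# Step 3: keep rows whose (user, date) matches the latest date (the join)
matched = [row for row in rows if latest_date[row[0]] == row[1]]

# Step 4: collect the item pairs under each (user, date) key (combineByKey)
grouped = defaultdict(list)
for user, date, item1, item2 in matched:
    grouped[(user, date)].append((item1, item2))

for key in sorted(grouped):
    print(key, grouped[key])
```

The first pass is cheap because it carries only one date per user. The cost is a second pass over the original data for the join.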

Answer 4

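This problem is a classic use case for window functions. The answer is to partition by user and apply the rank function over rows ordered by date. All records with the same date get the same rank, so you can pick out the latest records with a simple rank = 1 filter, provided the ordering is by date descending.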

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val data = sc.textFile("/user/hadoop/data.txt")

val df = data.map(_.split("\\s+")).map { f => (f(0), f(1), f(2).toInt, f(3).toDouble) }.toDF()

val w = Window.partitionBy("_1").orderBy("_2")

df.withColumn("Rank", rank().over(w)).show()


+---+----------+---+---+----+
| _1|        _2| _3| _4|Rank|
+---+----------+---+---+----+
|  3|2015-12-01| 19|9.3|   1|
|  3|2015-12-01| 40|2.3|   1|
|  3|2015-12-02| 13|1.4|   3|
|  3|2015-12-02| 50|1.0|   3|
|  3|2015-12-02| 19|7.8|   3|
|  1|2015-12-01| 14|5.6|   1|
|  1|2015-12-01| 10|0.6|   1|
|  1|2015-12-02|  8|9.4|   3|
|  1|2015-12-02| 90|1.3|   3|
|  2|2015-12-01| 30|0.3|   1|
|  2|2015-12-01| 89|1.2|   1|
|  2|2015-12-30| 70|1.9|   3|
|  2|2015-12-31| 20|2.5|   4|
+---+----------+---+---+----+

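You can now filter for the rank = 1 records. Note that the window w as written orders by date ascending, so in the output above rank 1 marks each user's earliest day. Order by the date column descending, for example orderBy(desc("_2")), so that rank 1 marks the latest day.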
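To make the ranking semantics concrete, here is a small plain-Python sketch that mimics rank() over a window partitioned by user and ordered by date descending, then keeps rank 1. It is not Spark code: `rank_desc_by_date` is a hypothetical helper, and the dates are assumed to be ISO-formatted strings.

```python
from itertools import groupby

# Sample rows from the question: (user, date, item1, item2)
rows = [
    (1, "2015-12-01", 14, 5.6), (1, "2015-12-01", 10, 0.6),
    (1, "2015-12-02", 8, 9.4), (1, "2015-12-02", 90, 1.3),
    (2, "2015-12-01", 30, 0.3), (2, "2015-12-01", 89, 1.2),
    (2, "2015-12-30", 70, 1.9), (2, "2015-12-31", 20, 2.5),
    (3, "2015-12-01", 19, 9.3), (3, "2015-12-01", 40, 2.3),
    (3, "2015-12-02", 13, 1.4), (3, "2015-12-02", 50, 1.0),
    (3, "2015-12-02", 19, 7.8),
]


def rank_desc_by_date(rows):
    """Return (row, rank) pairs: rank restarts per user and ties share a rank."""
    ranked = []
    by_user = sorted(rows, key=lambda r: r[0])  # groupby needs sorted input
    for _user, group in groupby(by_user, key=lambda r: r[0]):
        # Newest first; the sort is stable, so ties keep their input order
        ordered = sorted(group, key=lambda r: r[1], reverse=True)
        rank = 0
        prev_date = None
        for position, row in enumerate(ordered, start=1):
            if row[1] != prev_date:
                # A new date takes its position as rank, leaving gaps after ties
                rank = position
                prev_date = row[1]
            ranked.append((row, rank))
    return ranked


ranked = rank_desc_by_date(rows)
latest = [row for row, rank in ranked if rank == 1]
for row in latest:
    print(row)
```

With descending order, every row on a user's newest date shares rank 1, so the filter returns all of that day's records rather than a single row.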

Answer 5

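Assuming this dataset is large, you may want to partition it by date if your data retrieval pattern is keyed by date.

This avoids a full scan/shuffle of all the data at read time. Instead, rows are placed in the correct partition at write time.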

Title: Scala Spark: how to get the latest day's records (5 answers)