Spark - Reduce input file by userId

Asked: 2018-12-17 02:07:30

Tags: scala apache-spark

I'm working with a structured input file containing userId, seqId, eventType and country. I need to reduce it by userId, taking the last non-empty value of each field after ordering by seqId. For the given input:

userId    seqId eventType country
A1600001    2   Update  JP
A1600001    3   Update  
B2301001    2   Update  CH
A1600001    1   Create  CH
C1200011    2   Update  
C1200011    1   Create  IN

The reduced result should be:

A1600001    3   Update  JP
C1200011    2   Update  IN
B2301001    2   Update  CH

I started with the following:

scala> val file = sc.textFile("/tmp/sample-events.tsv")
scala> val lines = file.map( x => (x.split("\t")(0), x) )
scala> lines.foreach(x => println(x))
(A1600001,A1600001 2 Update JP)
(A1600001,A1600001 3 Update )
(B2301001,B2301001 2 Update CH)
(A1600001,A1600001 1 Create CH)
(C1200011,C1200011 2 Update )
(C1200011,C1200011 1 Create IN)

Now I want to reduce lines (I guess?), but I'm quite new to this topic and I don't know how to construct the reduce function. Can someone help?

3 Answers:

Answer 0 (score: 1)

Using spark-sql and window functions.

scala> val df = Seq(("A1600001",2,"Update","JP"),("A1600001",3,"Update",""),("B2301001",2,"Update","CH"),("A1600001",1,"Create","CH"),("C1200011",2,"Update",""),("C1200011",1,"Create","IN")).toDF("userId","seqId","eventType","country")
df: org.apache.spark.sql.DataFrame = [userId: string, seqId: int ... 2 more fields]

scala> df.createOrReplaceTempView("samsu")

scala> spark.sql(""" with tb1(select userId, seqId, eventType, country, lag(country) over(partition by userid order by seqid) lg1, row_number() over(partition by userid order by seqid) rw1, count(*) over(partition by userid) cw1 from samsu) select userId, seqId, eventType, case when country="" then lg1 else country end country from tb1 where rw1=cw1 """).show(false)
+--------+-----+---------+-------+                                              
|userId  |seqId|eventType|country|
+--------+-----+---------+-------+
|A1600001|3    |Update   |JP     |
|C1200011|2    |Update   |IN     |
|B2301001|2    |Update   |CH     |
+--------+-----+---------+-------+


scala>
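For reference (not part of the original answer), the same window logic can also be written with the DataFrame API; a rough sketch reusing the df defined above might look like this:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// keep the last row per userId, falling back to the previous row's country when empty
val w = Window.partitionBy("userId").orderBy("seqId")

df.withColumn("lg1", lag("country", 1).over(w))
  .withColumn("rw1", row_number().over(w))
  .withColumn("cw1", count(lit(1)).over(Window.partitionBy("userId")))
  .where(col("rw1") === col("cw1"))
  .select(col("userId"), col("seqId"), col("eventType"),
    when(col("country") === "", col("lg1")).otherwise(col("country")).as("country"))
  .show(false)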

Answer 1 (score: 0)

One possible approach (assuming seqId is never empty):

  1. Prepare pair_rdd1 for eventType: first filter out all records with an empty eventType using a mapper, then apply reduceByKey on key = userId to find the latest non-empty eventType per userId. Assuming the reducer takes two [seqId, eventType] pairs and returns a [seqId, eventType] pair, the reduce function should look like: (v1, v2) => ( if (v1[seqId] > v2[seqId]) then v1 else v2 )
  2. Prepare pair_rdd2 for country in the same way: first filter out all records with an empty country using a mapper, then apply reduceByKey on key = userId to find the latest non-empty country per userId. Assuming the reducer takes two [seqId, country] pairs and returns a [seqId, country] pair, the reduce function should look like: (v1, v2) => ( if (v1[seqId] > v2[seqId]) then v1 else v2 )
  3. Since we also need the latest seqId per userId, we prepare pair_rdd3 by applying reduceByKey on key = userId with the reduce function (seqId1, seqId2) => max(seqId1, seqId2)
  4. Now we do pair_rdd3.leftOuterJoin(pair_rdd1) to get [userId, seqId, eventType], and then .leftOuterJoin(pair_rdd2) on the result of that join to finally get [userId, seqId, eventType, country] (both joins are on key = userId)

Note that we use leftOuterJoin here instead of join, because there may be userIds whose eventType or country values are all empty. A rough code sketch of these steps follows below.
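The following is only a sketch of the four steps above, assuming the header-less, whitespace-separated file from the question's transcript; the pair_rdd names follow the description (the original answer gives no code):

// parse each line into (userId, seqId, eventType, country); country may be missing
val raw = sc.textFile("/tmp/sample-events.tsv")
  .map(_.split("\\s+", 4))
  .map(a => (a(0), a(1).toInt, a(2), if (a.length > 3) a(3).trim else ""))

// 1. latest non-empty eventType per userId
val pair_rdd1 = raw
  .filter(_._3.nonEmpty)
  .map { case (userId, seqId, eventType, _) => (userId, (seqId, eventType)) }
  .reduceByKey((v1, v2) => if (v1._1 > v2._1) v1 else v2)

// 2. latest non-empty country per userId
val pair_rdd2 = raw
  .filter(_._4.nonEmpty)
  .map { case (userId, seqId, _, country) => (userId, (seqId, country)) }
  .reduceByKey((v1, v2) => if (v1._1 > v2._1) v1 else v2)

// 3. latest seqId per userId
val pair_rdd3 = raw
  .map { case (userId, seqId, _, _) => (userId, seqId) }
  .reduceByKey((s1, s2) => math.max(s1, s2))

// 4. two left outer joins on userId; leftOuterJoin because a userId may have
//    no non-empty eventType or country at all
pair_rdd3
  .leftOuterJoin(pair_rdd1)
  .leftOuterJoin(pair_rdd2)
  .map { case (userId, ((seqId, ev), co)) =>
    Seq(userId, seqId.toString, ev.map(_._2).getOrElse(""), co.map(_._2).getOrElse("")).mkString("\t")
  }
  .foreach(println)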

Answer 2 (score: 0)

The simplest solution I can think of, using reduceByKey, is below.

// 0: userId    1: seqId    2: eventType    3: country
val inputRdd = spark.sparkContext.textFile("data/input.txt")
  .map(_.split("\\s+", 4))

// Reduce by userId, keeping the record with the max seqId.
// Sort by seqId first so that, if the max-seqId record is missing the country,
// it can be merged in from the immediately preceding record.
inputRdd
  .map(ls => (ls(0), ls))
  .sortBy(_._2(1).toInt)
  .reduceByKey { (acc, y) =>
    if (acc(1).toInt < y(1).toInt)
      if (y.length == 3) y :+ acc(3) else y // carry the country over when the newer record lacks it
    else
      acc
  }
  .map(_._2.mkString("\t"))
  .foreach(println)

data/input.txt

A1600001    2   Update  JP
A1600001    3   Update
B2301001    2   Update  CH
A1600001    1   Create  CH
C1200011    2   Update
C1200011    1   Create  IN

Output:

B2301001    2   Update  CH
C1200011    2   Update  IN
A1600001    3   Update  JP