How to use reduceByKey in Scala when I have two keys

Time: 2018-06-01 02:58:41

Tags: scala mapreduce

The data format of one row:

id: 123456  
Topiclist: ABCDE:1_8;5_10#BCDEF:1_3;7_11 

One ID can appear in multiple rows:

id: 123456 
Topiclist:ABCDE:1_1;7_2;#BCDEF:1_2;7_11# 

Goal: (123456, (ABCDE,9,2),(BCDEF,5,2))

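Records in the topic list are separated by #, so ABCDE:1_8;5_10 is one record.

The format of a record is <topicid>:<topictype>_<topicvalue>, where several <topictype>_<topicvalue> pairs can follow the topic ID, separated by ;.

For example, ABCDE:1_8 has: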

topicid = ABCDE

topictype = 1

topicvalue = 8
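The record format described above can be parsed with plain string splitting. The following is a minimal sketch in plain Scala (no Spark needed); the names `TopicEntry`, `RecordParser`, `parseRecord` and `parseTopicList` are hypothetical and do not come from the question, and the sketch assumes that every type and value is a valid integer.

```scala
// Hypothetical helper type: one <topictype>_<topicvalue> pair of a topic.
final case class TopicEntry(topicId: String, topicType: Int, topicValue: Int)

object RecordParser {
  // Parses one record such as "ABCDE:1_8;5_10" into its entries.
  // Assumption: type and value are integers; toInt throws otherwise.
  def parseRecord(record: String): List[TopicEntry] =
    record.split(":", 2) match {
      case Array(topicId, pairs) =>
        pairs.split(";").toList
          .filter(_.nonEmpty)
          .flatMap { pair =>
            pair.split("_") match {
              case Array(t, v) => Some(TopicEntry(topicId.trim, t.trim.toInt, v.trim.toInt))
              case _           => None // skip malformed pairs
            }
          }
      case _ => Nil // no ":" found, so there is nothing to parse
    }

  // Splits a whole topic list on "#" and parses every non-empty record.
  def parseTopicList(topicList: String): List[TopicEntry] =
    topicList.split("#").toList.map(_.trim).filter(_.nonEmpty).flatMap(parseRecord)
}
```

Calling `RecordParser.parseRecord("ABCDE:1_8;5_10")` gives two entries for topic ABCDE, one with type 1 and value 8 and one with type 5 and value 10. Trailing `;` or `#` separators, as in the second sample row, are ignored.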

Goal: sum the values of topic type 1 and count how often topic type 1 occurs, per ID and topic. The result should be (id, (topicid, value, frequency)), for example: (123456, (ABCDE,9,2),(BCDEF,5,2))

1 answer:

Answer 0 (score: 0):

Assume your data is "123456!ABCDE:1_8;5_10#BCDEF:1_3;7_11" and "123456!ABCDE:1_1;7_2#BCDEF:1_2;7_11", so we split on "!" to get your user ID "123456".

rdd.flatMap { line =>
          // Input is "userID!topiclist"; records in the topic list are separated by "#"
          val parts  = line.split("!")
          val userID = parts(0)
          parts(1).split("#").flatMap { item =>
            val fields  = item.split(":")
            val topicID = fields(0)
            fields(1).split(";")
              .map(_.split("_"))
              // keep only topic type 1 and emit ((userID, topicID), (value, 1))
              .collect { case Array("1", value) => ((userID, topicID), (value.toInt, 1)) }
          }
        }
    // sum values and counts per composite (userID, topicID) key
    .reduceByKey { case (x, y) => (x._1 + y._1, x._2 + y._2) }
    // reshape to (userID, (topicID, valueSum, frequency))
    .map { case ((userID, topicID), (valueSum, frequency)) => (userID, (topicID, valueSum, frequency)) }

The output is ("123456",("ABCDE",9,2)), ("123456",("BCDEF",5,2)), which differs slightly from your desired output. You can group this result by user ID if you really need ("123456",("ABCDE",9,2),("BCDEF",5,2)).
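The composite-key reduction and the final grouping step can be checked locally without a cluster. The sketch below uses plain Scala collections as a stand-in: `groupBy` followed by `reduce` plays the role of `reduceByKey`, and a second `groupBy` plays the role of grouping the pair RDD by user ID. The names `GroupSketch`, `sumByCompositeKey` and `groupByUser` are hypothetical and only serve as illustration.

```scala
object GroupSketch {
  // Stand-in for reduceByKey on the composite (userID, topicID) key:
  // adds up the values and the counts of all rows that share a key.
  def sumByCompositeKey(
      rows: List[((String, String), (Int, Int))]
  ): Map[(String, String), (Int, Int)] =
    rows.groupBy(_._1).map { case (key, group) =>
      key -> group.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    }

  // Collects all (topicID, valueSum, frequency) tuples that share a user ID.
  def groupByUser(
      rows: List[(String, (String, Int, Int))]
  ): Map[String, List[(String, Int, Int)]] =
    rows.groupBy(_._1).map { case (userId, group) => userId -> group.map(_._2) }
}
```

On an actual pair RDD, calling `groupByKey()` on the result of the final `map` would give the same kind of regrouping, with one entry per user ID holding all of that user's topic tuples.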