Scala: How to do a GroupBy sum on String values?

Time: 2016-09-12 15:52:28

Tags: json scala apache-spark rdd spark-dataframe

I have an RDD[Row]:

  |---itemId----|----Country-------|---Type----------|
  |     11      |     US           |      Movie      | 
  |     11      |     US           |      TV         | 
  |     101     |     France       |      Movie      |     

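How can I do a GroupBy on itemId so that I can save the result as a list of JSON objects, one JSON object per itemId, with counts of each distinct Country and Type value: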

{"itemId" : 11, 
"Country": {"US" :2 },"Type": {"Movie" :1 , "TV" : 1} },
{"itemId" : 101, 
"Country": {"France" :1 },"Type": {"Movie" :1} }

RDD:

I tried:

import com.mapping.data.model.MappingUtils
import com.mapping.data.model.CountryInfo


val mappingPath = "s3://.../"    
val input = sc.textFile(mappingPath)

The input is a list of JSONs, one JSON per line. I map each line to the POJO class CountryInfo using MappingUtils, which takes care of the JSON parsing and conversion:

val MappingsList = input.map(x=> {
                    val countryInfo = MappingUtils.getCountryInfoString(x);
                    (countryInfo.getItemId(), countryInfo)
                 }).collectAsMap

MappingsList: scala.collection.Map[String,com.mapping.data.model.CountryInfo] 


def showCountryInfo(x: Option[CountryInfo]) = x match {
      case Some(s) => s
   }


val events = sqlContext.sql("select itemId from EventList")

val itemList = events.map(row => {
    val itemId = row.getAs[String]("itemId")
    val countryInfo = showCountryInfo(MappingsList.get(itemId))
    val country = if (countryInfo.getCountry() == "unknown") "US" else countryInfo.getCountry()
    val tpe = countryInfo.getType() // `type` is a reserved word in Scala

    Row(itemId, country, tpe)
})

Can someone tell me how I can achieve this?

Thanks!

1 Answer:

Answer 0: (score: 3)

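I can't afford the extra time to complete this, but I can give you a start.

The idea is to aggregate the RDD[Row] into a single Map that represents your JSON structure. Aggregation is a fold that requires two function parameters: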

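  1. seqOp: how to fold a collection of elements into the target type
  2. combOp: how to merge two values of the target type

The tricky part comes in combOp while merging, as you need to accumulate the counts of the values seen in seqOp. I have left this as an exercise, as I have a plane to catch! Hopefully someone else can fill in the gaps if you get into trouble.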
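To make the two roles concrete, here is a minimal sketch on plain Scala collections rather than Spark. It assumes each inner `Seq` stands in for one partition. `AggregateSketch`, `Counts`, and `aggregateLocally` are hypothetical names used only for illustration. seqOp runs inside each partition, and combOp merges the per-partition results.

```scala
object AggregateSketch {
  type Counts = Map[String, Int]

  // seqOp: fold one element into a partition-local accumulator.
  def seqOp(acc: Counts, word: String): Counts =
    acc.updated(word, acc.getOrElse(word, 0) + 1)

  // combOp: merge two accumulators by summing counts key by key.
  def combOp(l: Counts, r: Counts): Counts =
    r.foldLeft(l) { case (acc, (k, n)) => acc.updated(k, acc.getOrElse(k, 0) + n) }

  // Hypothetical local stand-in for an aggregate over partitions:
  // fold each "partition" with seqOp, then merge the results with combOp.
  def aggregateLocally(partitions: Seq[Seq[String]]): Counts =
    partitions
      .map(_.foldLeft(Map.empty[String, Int])(seqOp))
      .foldLeft(Map.empty[String, Int])(combOp)

  // Two "partitions"; "US" appears in both, so combOp has to add the counts.
  val result: Counts = aggregateLocally(Seq(Seq("US", "US"), Seq("France", "US")))
}
```

The same shape applies to the RDD[Row] case in the answer's code that follows, only with a nested Map keyed by item id as the accumulator.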

      import org.apache.spark.rdd.RDD
    
      case class Row(id: Int, country: String, tpe: String)
    
      def foo: Unit = {
    
        val rows: RDD[Row] = ???
    
        def seqOp(acc: Map[Int, (Map[String, Int], Map[String, Int])], r: Row) = {
          acc.get(r.id) match {
            case None => acc.updated(r.id, (Map(r.country -> 1), Map(r.tpe -> 1)))
            case Some((countries, types)) =>
              val countries_ = countries.updated(r.country, countries.getOrElse(r.country, 0) + 1)
              val types_ = types.updated(r.tpe, types.getOrElse(r.tpe, 0) + 1)
              acc.updated(r.id, (countries_, types_))
          }
        }
    
        val z = Map.empty[Int, (Map[String, Int], Map[String, Int])]
    
        def combOp(l: Map[Int, (Map[String, Int], Map[String, Int])], r: Map[Int, (Map[String, Int], Map[String, Int])]) = {
          l.foldLeft(z) { case (acc, (id, (countries, types))) =>
              r.get(id) match {
                case None => acc.updated(id, (countries, types))
                case Some((otherCountries, otherTypes)) =>
                  // todo - continue by merging countries with otherCountries
                  // and types with otherTypes, then update acc
                  ???
              }
          }
        }
    
        val summaryMap = rows.aggregate(z)(seqOp, combOp)
      }
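One possible way to finish the merge step is sketched below. It is not the answerer's code, and `MergeSketch`, `Summary`, and `mergeCounts` are hypothetical names. It assumes the same accumulator shape as above (item id mapped to a pair of country counts and type counts). Note one deliberate difference: it folds `r` into `l` instead of folding `l` into an empty map, because folding over `l` alone would drop ids that appear only in `r`.

```scala
object MergeSketch {
  type Counts  = Map[String, Int]
  type Summary = Map[Int, (Counts, Counts)]

  // Sum two count maps key by key.
  def mergeCounts(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, n)) => acc.updated(k, acc.getOrElse(k, 0) + n) }

  // Start from l and fold every entry of r into it,
  // so ids present on only one side are kept.
  def combOp(l: Summary, r: Summary): Summary =
    r.foldLeft(l) { case (acc, (id, (countries, types))) =>
      acc.get(id) match {
        case None => acc.updated(id, (countries, types))
        case Some((otherCountries, otherTypes)) =>
          acc.updated(id, (mergeCounts(otherCountries, countries), mergeCounts(otherTypes, types)))
      }
    }

  // Sample data taken from the table in the question, split across two partial results.
  val left: Summary  = Map(11 -> (Map("US" -> 1), Map("Movie" -> 1)))
  val right: Summary = Map(
    11  -> (Map("US" -> 1), Map("TV" -> 1)),
    101 -> (Map("France" -> 1), Map("Movie" -> 1))
  )
  val merged: Summary = combOp(left, right)
}
```

With a combOp of this shape, the resulting Map holds one entry per itemId, which can then be serialized to one JSON object per entry with whatever JSON library is already in use.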