SPARK: How to create aggregations from an RDD[Row] in Scala

Asked: 2016-09-12 04:58:55

Tags: json scala apache-spark dataframe rdd

How do I build a List/Map inside an RDD/DataFrame so that I can compute aggregations?

I have a file in which every line is a JSON object:

{
    "itemId": 1122334,
    "language": [
        {
            "name": ["US", "FR"],
            "value": ["english", "french"]
        },
        {
            "name": ["IND"],
            "value": ["hindi"]
        }
    ],
    "country": [
        {
            "US": [
                {
                    "startTime": "2016-06-06T17:39:35.000Z",
                    "endTime": "2016-07-28T07:00:00.000Z"
                }
            ],
            "CANADA": [
                {
                    "startTime": "2016-06-06T17:39:35.000Z",
                    "endTime": "2016-07-28T07:00:00.000Z"
                }
            ],
            "DENMARK": [
                {
                    "startTime": "2016-06-06T17:39:35.000Z",
                    "endTime": "2016-07-28T07:00:00.000Z"
                }
            ],
            "FRANCE": [
                {
                    "startTime": "2016-08-06T17:39:35.000Z",
                    "endTime": "2016-07-28T07:00:00.000Z"
                }
            ]
        }
    ]
},

{
    "itemId": 1122334,
    "language": [
        {
            "name": ["US", "FR"],
            "value": ["english", "french"]
        },
        {
            "name": ["IND"],
            "value": ["hindi"]
        }
    ],
    "country": [
        {
            "US": [
                {
                    "startTime": "2016-06-06T17:39:35.000Z",
                    "endTime": "2016-07-28T07:00:00.000Z"
                }
            ],
            "CANADA": [
                {
                    "startTime": "2016-07-06T17:39:35.000Z",
                    "endTime": "2016-07-28T07:00:00.000Z"
                }
            ],
            "DENMARK": [
                {
                    "startTime": "2016-06-06T17:39:35.000Z",
                    "endTime": "2016-07-28T07:00:00.000Z"
                }
            ],
            "FRANCE": [
                {
                    "startTime": "2016-08-06T17:39:35.000Z",
                    "endTime": "2016-07-28T07:00:00.000Z"
                }
            ]
        }
    ]
}

I have a matching POJO that pulls the values out of the JSON.

import com.mapping.data.model.MappingUtils
import com.mapping.data.model.CountryInfo


val mappingPath = "s3://.../"

val timeStamp = "2016-06-06T17:39:35.000Z"
val endTimeStamp = "2016-06-07T17:39:35.000Z"


val COUNTRY_US = "US"
val COUNTRY_CANADA = "CANADA"
val COUNTRY_DENMARK = "DENMARK"
val COUNTRY_FRANCE = "FRANCE"


val input = sc.textFile(mappingPath)

The input is a list of JSONs, one per line; I map each line to the POJO class CountryInfo using MappingUtils, which handles the JSON parsing and conversion:

val MappingsList = input.map(x=> {
                    val countryInfo = MappingUtils.getCountryInfoString(x);
                    (countryInfo.getItemId(), countryInfo)
                 }).collectAsMap

MappingsList: scala.collection.Map[String,com.mapping.data.model.CountryInfo] 


def showCountryInfo(x: Option[CountryInfo]): CountryInfo = x match {
      case Some(s) => s
      case None    => throw new NoSuchElementException("itemId not found in MappingsList")
   }

But I need to create a DataFrame/RDD so that I can compute aggregations over country and language keyed by itemId.

In the example above, if a country's startTime is not earlier than "2016-06-07T17:39:35.000Z", its value should be zero.
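Because all the timestamps share the same fixed ISO-8601 UTC format, a plain lexicographic string comparison is enough for this cutoff check; a minimal sketch (the object and method names here are illustrative, not from the original code):

```scala
// ISO-8601 UTC timestamps in one fixed format sort correctly as plain strings,
// so the cutoff check can be a lexicographic comparison.
object TimeFilter {
  val endTimeStamp = "2016-06-07T17:39:35.000Z"

  // 1 if the country entry started before the cutoff, 0 otherwise
  def countIfStartedBefore(startTime: String): Int =
    if (startTime < endTimeStamp) 1 else 0
}
```

This is why the `< endTimeStamp` comparisons on raw strings in the attempts below are safe without any date parsing.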

Which format would be best for building the final aggregated JSON:

1. A List?

    |-----itemId-------|----country-------------------|-----language---------------------|
    |     1122334      |  [US, CANADA,DENMARK]        |      [english,hindi,french]      | 
    |     1122334      |  [US,DENMARK]                |      [english]                   | 
    |------------------|------------------------------|----------------------------------|

2. A Map?



    |-----itemId-------|----country---------------------------------|-----language---------------------|
    |     1122334      |  (US,2) (CANADA,1) (DENMARK,2) (FRANCE,0)  | (english,2) (hindi,1) (french,1) |
    |     ....         |  ....                                      | ....                             |
    |------------------|--------------------------------------------|----------------------------------|

I want to produce a final JSON with the aggregated values, like this:

{
    "itemId": "1122334",
    "country": {
        "US": 2,
        "CANADA": 1,
        "DENMARK": 2,
        "FRANCE": 0
    },
    "language": {
        "english": 2,
        "french": 1,
        "hindi": 1
    }
}
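Once the per-item count maps exist, output of that shape can be assembled driver-side; a minimal sketch with hand-rolled serialization (all names here are illustrative, and a real pipeline would use a JSON library such as json4s or Jackson instead of string building):

```scala
// Assemble the target JSON shape from plain Scala count maps.
object AggregateJson {
  def toJson(itemId: String, country: Map[String, Int], language: Map[String, Int]): String = {
    // render a Map[String, Int] as a JSON object, e.g. {"US": 2, "CANADA": 1}
    def obj(m: Map[String, Int]): String =
      m.map { case (k, v) => s""""$k": $v""" }.mkString("{", ", ", "}")
    s"""{"itemId": "$itemId", "country": ${obj(country)}, "language": ${obj(language)}}"""
  }
}
```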

I tried a List:

val events = sqlContext.sql("select itemId from EventList")

    import scala.collection.mutable.ListBuffer

    val itemList = events.map(row => {
        val itemId = row.getAs[String](0)
        val countryInfo = showCountryInfo(MappingsList.get(itemId))

        // only keep a country if its startTime is before the cutoff
        val country = new ListBuffer[String]()
        if (countryInfo.getCountry().getUS().get(0).getStartTime() < endTimeStamp) country += COUNTRY_US
        if (countryInfo.getCountry().getCANADA().get(0).getStartTime() < endTimeStamp) country += COUNTRY_CANADA
        if (countryInfo.getCountry().getDENMARK().get(0).getStartTime() < endTimeStamp) country += COUNTRY_DENMARK
        if (countryInfo.getCountry().getFRANCE().get(0).getStartTime() < endTimeStamp) country += COUNTRY_FRANCE

        val languageList = new ListBuffer[String]()
        // getLanguages() is assumed to return a Scala collection; for a Java list, convert with .asScala first
        countryInfo.getLanguages().foreach(x => languageList += x.getValue())

        Row(itemId, country.toList, languageList.toList)
    })

and a Map:

    val itemList = events.map(row => {
        val itemId = row.getAs[String](0)
        val countryInfo = showCountryInfo(MappingsList.get(itemId))

        // a mutable Map so += compiles; keys are the country-name constants, not string literals
        val country = scala.collection.mutable.Map[String, Int]()
        country += (COUNTRY_US -> (if (countryInfo.getCountry().getUS().get(0).getStartTime() < endTimeStamp) 1 else 0))
        country += (COUNTRY_CANADA -> (if (countryInfo.getCountry().getCANADA().get(0).getStartTime() < endTimeStamp) 1 else 0))
        country += (COUNTRY_DENMARK -> (if (countryInfo.getCountry().getDENMARK().get(0).getStartTime() < endTimeStamp) 1 else 0))
        country += (COUNTRY_FRANCE -> (if (countryInfo.getCountry().getFRANCE().get(0).getStartTime() < endTimeStamp) 1 else 0))

        val language = scala.collection.mutable.Map[String, Int]()
        countryInfo.getLanguages().foreach(x => language += (x.getValue() -> 1))

        Row(itemId, country.toMap, language.toMap)
    })

But both versions froze in Zeppelin. Is there a better way to produce the aggregate as JSON? And which is better for building the final aggregate, a List or a Map?

1 Answer:

Answer 0 (score: 0)

It would help if you restated your question in terms of Spark DataFrame/Dataset and Row; I understand you ultimately want JSON, but the details of JSON input/output are a separate problem.

The feature you are looking for is a Spark SQL aggregate function (see the set of them on that page). The functions collect_list and collect_set are related, but neither implements the functionality you need.

You could implement what I would call count_by_value by deriving from org.apache.spark.sql.expressions.UserDefinedAggregateFunction. Doing so requires a deep dive into how Spark SQL works.

Once count_by_value is implemented, you can use it like this:

df.groupBy("itemId").agg(count_by_value(df("country")), count_by_value(df("language")))
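Note that Spark ships no built-in count_by_value; its intended semantics can be sketched in plain Scala (this is only a driver-side model of what the UDAF would compute per group, not the UDAF itself, and the names are illustrative):

```scala
// Driver-side model of count_by_value: map each distinct value in a column
// to the number of times it occurs. A real UDAF would maintain this map
// incrementally in its aggregation buffer and merge partial maps across partitions.
object CountByValue {
  def countByValue[T](values: Seq[T]): Map[T, Int] =
    values.groupBy(identity).map { case (v, occurrences) => (v, occurrences.size) }
}
```

Applied to the country column of one itemId group, this yields exactly the `{"US": 2, "CANADA": 1, ...}` maps the question asks for.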