Question

I have an RDD where rowkey = client_id and campaigns = a JSON array of {campaign_id: campaign_name} objects.

val clientsRDD = resultRDD.map(ClientRow.parseClientRow)
// Convert the RDD of ClientRow objects to a DataFrame
val clientsDF = clientsRDD.toDF()
// Return the schema of this DataFrame
clientsDF.printSchema()
// Print each row of the DataFrame
clientsDF.collect().foreach(println)

Output:

root
 |-- rowkey: string (nullable = true)
 |-- campaigns: string (nullable = true)

[1,[{"1000":"campaign1"},{"1001":"campaign2"}]]
[2,[{"1002":"campaign3"}]]

I also have another RDD that contains all the client and campaign data records from HBase.

recordsRDD

rowkey                 type         body
client_id-campaign_id, record_type, record_text

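My goal is to generate statistics for each client (across all of its campaigns) and for each campaign: for example, count all records for a client_id grouped by type, and count the records of each campaign grouped by type.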

client1
records:100, login:20, actions:80

client1 campaign1  
records:70, login:16, actions:50

client1 campaign2
records:30, login:4, actions:30

Finally, I want to write the statistics out.

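What is the best way to do this in Spark with Scala? Do I have to iterate over the clients RDD (with map?) and, for each row, generate a separate RDD by mapping over the records RDD?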

Answer 1

First, you need to define a schema for the campaigns field:

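import org.apache.spark.sql.types._

// campaigns is declared as an array of (id, name) structs so that explode can flatten it
val schema = StructType(Seq(
  StructField("rowkey", StringType, true),
  StructField("campaigns", ArrayType(StructType(
    StructField("id", StringType, true) ::
      StructField("name", StringType, true) :: Nil
  )), true)
))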

You can then use the explode method on the campaigns field to flatten it into one row per campaign.

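import org.apache.spark.sql.functions.{col, explode}
// createDataFrame(rdd, schema) expects an RDD[Row] whose rows match the schema
val df = sqlContext.createDataFrame(clientsRDD, schema)
df.select(col("rowkey"), explode(col("campaigns")).as("campaign")).filter(col("campaign.id") === 1)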
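
For the statistics themselves, nested iteration over the two RDDs is probably not needed. A minimal sketch, assuming each record's rowkey has the form client_id-campaign_id as shown in the question: emit one key per record at the client level and one at the campaign level, then count with reduceByKey. The Record case class, the computeStats helper, and the sample rows are hypothetical names and data introduced only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CampaignStats {
  // Hypothetical record shape, modelled on the rowkey/type/body layout in the question.
  case class Record(rowkey: String, recordType: String, body: String)

  // Key: (clientId, campaignId), where campaignId = None means "all campaigns of the client".
  // Value: count per record type, plus a "records" entry holding the total.
  // Assumes every rowkey contains a '-' (a MatchError is thrown otherwise)
  // and that no record type is literally named "records".
  def computeStats(records: RDD[Record]): RDD[((String, Option[String]), Map[String, Long])] = {
    val countsByType = records.flatMap { r =>
      val Array(clientId, campaignId) = r.rowkey.split("-", 2)
      Seq(
        ((clientId, Option.empty[String], r.recordType), 1L), // client-level key
        ((clientId, Option(campaignId), r.recordType), 1L)    // campaign-level key
      )
    }.reduceByKey(_ + _)

    countsByType
      .map { case ((client, campaign, recordType), n) => ((client, campaign), Map(recordType -> n)) }
      .reduceByKey(_ ++ _) // record types are already distinct per key, so the merge is safe
      .mapValues(byType => byType + ("records" -> byType.values.sum))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("campaign-stats"))
    // Made-up sample rows standing in for recordsRDD.
    val recordsRDD = sc.parallelize(Seq(
      Record("1-1000", "login", "..."),
      Record("1-1000", "actions", "..."),
      Record("1-1001", "actions", "..."),
      Record("2-1002", "login", "...")
    ))
    computeStats(recordsRDD).collect().foreach(println)
    sc.stop()
  }
}
```

The result is a single pair RDD, so it can be written out in one pass (for example with saveAsTextFile), or joined with the exploded campaigns DataFrame above if campaign names are needed next to the ids.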

Apache Spark nested iteration in Scala to generate a statistics RDD
