I want to flatten out the Map inside each element of an RDD.
A sample input is:
Log("key1", "key2", "key3", Map(tk1 -> tv1, tk2 -> tv2, tk3 -> tv3))
The output I want is:
RDD[(String, String, String, String, String)]
("key1", "key2", "key3", "tk1", "tv1")
("key1", "key2", "key3", "tk2", "tv2")
("key1", "key2", "key3", "tk3", "tv3")
Finally, I want to run the reduce operation shown below, but it does not work:
val mapCnt = logs.map(log => {
  log.textMap.foreach { tmap =>
    var tkey = tmap._1
    var tvalue = tmap._2
  }
  ((log.key1, log.key2, log.key3, tkey, tvalue), 1L)
}).reduceByKey(_ + _)
This is the input object I am using:
case class Log(
  val key1: String,
  val key2: String,
  val key3: String,
  val TextMap: Map[String, String]
)
How can I do this transformation?
Thanks for your help.
Answer 0 (score: 0)
You compute the results inside the foreach and immediately discard them; moreover, tkey and tvalue are out of scope by the time the tuple is built. It is better to use flatMap here:
val mapCnt = logs.flatMap(log => {
  for {
    (tkey, tvalue) <- log.TextMap
  } yield ((log.key1, log.key2, log.key3, tkey, tvalue), 1L)
}).reduceByKey(_ + _)
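
For completeness, here is a minimal self-contained sketch of the same approach, assuming a spark-shell session where sc is already defined; the collect/println lines at the end are just for verification:

case class Log(key1: String, key2: String, key3: String, TextMap: Map[String, String])

val logs = sc.parallelize(Seq(
  Log("key1", "key2", "key3", Map("tk1" -> "tv1", "tk2" -> "tv2", "tk3" -> "tv3"))
))

// Emit one ((key1, key2, key3, tkey, tvalue), 1L) pair per map entry,
// then sum the counts per distinct 5-tuple.
val mapCnt = logs.flatMap { log =>
  log.TextMap.map { case (tkey, tvalue) =>
    ((log.key1, log.key2, log.key3, tkey, tvalue), 1L)
  }
}.reduceByKey(_ + _)

mapCnt.collect().foreach(println)
// e.g. ((key1,key2,key3,tk1,tv1),1) -- one line per map entry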
Answer 1 (score: 0)
Not sure about the second part, but below is a DataFrame solution for the first part.
scala> case class Log(
| val key1: String,
| val key2: String,
| val key3: String,
| val TextMap: Map[String, String]
| )
defined class Log
scala> val df = Seq(Log("key1", "key2", "key3", Map("tk1" -> "tv1", "tk2" -> "tv2", "tk3" -> "tv3"))).toDF().as[Log]
df: org.apache.spark.sql.Dataset[Log] = [key1: string, key2: string ... 2 more fields]
scala> val df2 = df.withColumn("mapk",map_keys('TextMap))
df2: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]
scala> val df3 = df2.select('key1,'key2,'key3,'TextMap,'mapk, explode('mapk).as("exp1")).withColumn("exp2",('TextMap)('exp1)).drop("TextMap","mapk")
df3: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]
scala> df3.show
+----+----+----+----+----+
|key1|key2|key3|exp1|exp2|
+----+----+----+----+----+
|key1|key2|key3| tk1| tv1|
|key1|key2|key3| tk2| tv2|
|key1|key2|key3| tk3| tv3|
+----+----+----+----+----+
scala> df3.printSchema
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: string (nullable = true)
|-- exp1: string (nullable = true)
|-- exp2: string (nullable = true)
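As for the "second part" (the count the question ends with), a hedged sketch: grouping by all five columns and counting should match the RDD reduceByKey(_ + _) result. Continuing from df3 above:

// Count occurrences per distinct (key1, key2, key3, exp1, exp2) row,
// analogous to reduceByKey(_ + _) in the question.
val counts = df3.groupBy('key1, 'key2, 'key3, 'exp1, 'exp2).count()
counts.show()
// For the sample input each row appears once, so every count is 1.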