I have a path to a CSV file that I need to read. The CSV has three columns: Topic, Key, Value. I'm reading this file with Spark as a CSV. The file looks like this (lookupFile.csv):
Topic,Key,Value
fruit,aaa,apple
fruit,bbb,orange
animal,ccc,cat
animal,ddd,dog
// I'm reading the file as follows
val lookup = spark.read.option("delimiter", ",").option("header", "true").csv(lookupFile)
I want to take what I just read and get back a map with the following shape:
val result = Map("fruit" -> Map("aaa" -> "apple", "bbb" -> "orange"),
"animal" -> Map("ccc" -> "cat", "ddd" -> "dog"))
Any ideas on how to do this?
Answer 0 (score: 1)
scala> val in = spark.read.option("header", true).option("inferSchema", true).csv("""Topic,Key,Value
| fruit,aaa,apple
| fruit,bbb,orange
| animal,ccc,cat
| animal,ddd,dog""".split("\n").toSeq.toDS)
in: org.apache.spark.sql.DataFrame = [Topic: string, Key: string ... 1 more field]
scala> val res = in.groupBy('Topic).agg(map_from_entries(collect_list(struct('Key, 'Value))).as("subMap"))
res: org.apache.spark.sql.DataFrame = [Topic: string, subMap: map<string,string>]
scala> val scalaMap = res.collect.map{
| case org.apache.spark.sql.Row(k : String, v : Map[String, String]) => (k, v)
| }.toMap
<console>:26: warning: non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure
case org.apache.spark.sql.Row(k : String, v : Map[String, String]) => (k, v)
^
scalaMap: scala.collection.immutable.Map[String,Map[String,String]] = Map(animal -> Map(ccc -> cat, ddd -> dog), fruit -> Map(aaa -> apple, bbb -> orange))
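The driver-side transformation above is easy to verify without a cluster: the same grouping can be sketched in plain Scala collections. This is a minimal illustration (the object and method names here are hypothetical, not from Spark), assuming the rows have already been collected as (Topic, Key, Value) tuples:

```scala
// Plain-Scala sketch of the same transformation, no Spark needed:
// group rows by topic, then turn each group's (Key, Value) pairs into a map.
object LookupMapSketch {
  def toNestedMap(rows: Seq[(String, String, String)]): Map[String, Map[String, String]] =
    rows.groupBy(_._1).map { case (topic, group) =>
      topic -> group.map { case (_, k, v) => k -> v }.toMap
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("fruit", "aaa", "apple"),
      ("fruit", "bbb", "orange"),
      ("animal", "ccc", "cat"),
      ("animal", "ddd", "dog")
    )
    println(toNestedMap(rows))
    // Map(fruit -> Map(aaa -> apple, bbb -> orange), animal -> Map(ccc -> cat, ddd -> dog))
  }
}
```

Doing the groupBy in Spark first (as above) and only collecting the small aggregated result is preferable when the CSV is large; the plain-collections version only works once everything fits on the driver.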
Answer 1 (score: 0)
Read in the data:
val df1= spark.read.format("csv").option("inferSchema", "true").option("header", "true").load(path)
First collect the Key and Value columns into an array, and group by Topic so that each topic's key/value pairs end up together:
val df2 = df1.groupBy("Topic").agg(collect_list(array($"Key", $"Value")).as("arr"))
Now convert it to a Dataset:
val ds = df2.as[(String, Seq[Seq[String]])]
Apply the mapping logic on the fields to build each inner map, then collect:
val ds1 = ds.map(x => (x._1, x._2.map(y => (y(0), y(1))).toMap)).collect
Now you have the data with Topic as the key and the (Key, Value) pairs as the value, so applying toMap gives the result:
ds1.toMap
Map(animal -> Map(ccc -> cat, ddd -> dog), fruit -> Map(aaa -> apple, bbb -> orange))