Processing a map structure with Spark

时间:2018-03-03 11:39:05

标签: apache-spark spark-dataframe

I have a file containing a map structure that needs to be processed. I used the code below and got an intermediate result of RDD[Row]. The data is shown below.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("student-example").setMaster("local")
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val studentdataframe = sqlcontext.read.parquet("C:\\student_marks.parquet")
studentdataframe.take(4).foreach(println)

The data looks like this.

  [("Name=aaa","sub=math",Map("weekly" -> Array(25,24,23),"quaterly" -> Array(25,20,19),"annual" -> Array(90,95,97)),"2018-02-03")],
  [("Name=bbb","sub=science",Map("weekly" -> Array(25,24,23),"quaterly" -> Array(25,20,19)),"2018-02-03")],
  [("Name=ccc","sub=math",Map("weekly" -> Array(20,21,18),"quaterly" -> Array(25,16,25)),"2018-02-03")],
  [("Name=ddd","sub=math",Map("weekly" -> Array(25,24,23),"quaterly" -> Array(21,19,15),"annual" -> Array(91,86,64)),"2018-02-03")]

The data is in RDD[Row] format. I only want the sum of the annual marks; if a record has no annual marks, I want to skip it. I want output like this.

Name=aaa|sub=math|282
Name=ddd|sub=math|241

Please help me.

1 Answer:

Answer 0 (score: 1)

You can use a udf function to meet your requirement; you don't even need to convert to an rdd.

I used the sample data you provided to form the dataframe:
// toDF on a local Seq needs the SQLContext implicits in scope
import sqlcontext.implicits._

val studentdataframe = Seq(
  ("Name=aaa","sub=math",Map("weekly" -> Array(25,24,23),"quaterly" -> Array(25,20,19),"annual" -> Array(90,95,97)),"2018-02-03"),
  ("Name=bbb","sub=science",Map("weekly" -> Array(25,24,23),"quaterly" -> Array(25,20,19)),"2018-02-03"),
  ("Name=ccc","sub=math",Map("weekly" -> Array(20,21,18),"quaterly" -> Array(25,16,25)),"2018-02-03"),
  ("Name=ddd","sub=math",Map("weekly" -> Array(25,24,23),"quaterly" -> Array(21,19,15),"annual" -> Array(91,86,64)),"2018-02-03")
).toDF("name", "sub", "marks", "date")

which gave me

+--------+-----------+-----------------------------------------------------------------------------------------------------------------+----------+
|name    |sub        |marks                                                                                                            |date      |
+--------+-----------+-----------------------------------------------------------------------------------------------------------------+----------+
|Name=aaa|sub=math   |Map(weekly -> WrappedArray(25, 24, 23), quaterly -> WrappedArray(25, 20, 19), annual -> WrappedArray(90, 95, 97))|2018-02-03|
|Name=bbb|sub=science|Map(weekly -> WrappedArray(25, 24, 23), quaterly -> WrappedArray(25, 20, 19))                                    |2018-02-03|
|Name=ccc|sub=math   |Map(weekly -> WrappedArray(20, 21, 18), quaterly -> WrappedArray(25, 16, 25))                                    |2018-02-03|
|Name=ddd|sub=math   |Map(weekly -> WrappedArray(25, 24, 23), quaterly -> WrappedArray(21, 19, 15), annual -> WrappedArray(91, 86, 64))|2018-02-03|
+--------+-----------+-----------------------------------------------------------------------------------------------------------------+----------+

As I said, a simple udf function should solve your requirement, and the udf function can look like this:

import org.apache.spark.sql.functions._

// returns the sum of the "annual" marks, or 0 when the key is absent
def sumAnnual = udf((annual: Map[String, collection.mutable.WrappedArray[Int]]) => if (annual.keySet.contains("annual")) annual("annual").sum else 0)
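
This udf assumes the marks map itself is never null. If some rows might carry a null map, a slightly more defensive variant (a sketch, not part of the original answer) wraps it in an Option so missing values come out as null:

// hypothetical null-safe variant: a None result becomes null in the output
// column, so you would filter with col("sum").isNotNull instead of =!= 0
def sumAnnualSafe = udf((marks: Map[String, collection.mutable.WrappedArray[Int]]) =>
  Option(marks).flatMap(_.get("annual")).map(_.sum))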

which you can use as follows:

studentdataframe.select(col("name"), col("sub"), sumAnnual(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)

and that will give you the dataframe you need (rows without annual marks get 0 from the udf and are dropped by the filter):

+--------+--------+---+
|name    |sub     |sum|
+--------+--------+---+
|Name=aaa|sub=math|282|
|Name=ddd|sub=math|241|
+--------+--------+---+
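
As a side note, if you happen to be on Spark 2.4 or later (an assumption; the question does not say), you can skip the udf entirely and use the built-in higher-order function aggregate. marks['annual'] evaluates to null when the key is absent, so filtering on isNotNull drops those records:

// Spark 2.4+ sketch: sum the "annual" array with a SQL higher-order function
studentdataframe
  .select(col("name"), col("sub"),
    expr("aggregate(marks['annual'], 0, (acc, x) -> acc + x)").as("sum"))
  .filter(col("sum").isNotNull)
  .show(false)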

I hope the answer is helpful.