Scala / Spark: how to do an outer join on common columns

Time: 2018-08-22 21:49:39

Tags: scala apache-spark

I have 2 DataFrames:

  • The first DataFrame contains temperature information.

  • The second DataFrame contains precipitation information.

I read the files and create the DataFrames as follows:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.functions.desc
// Parse each whitespace-delimited line into a Temparature record, then into a Row
val dataRecordsTemp = sc.textFile(tempFile).map{rec=>
            val splittedRec = rec.split("\\s+")
            Temparature(splittedRec(0),splittedRec(1),splittedRec(2),splittedRec(3),splittedRec(4))
        }.map{x => Row.fromSeq(x.getDataFields())}

val headerFieldsForTemp = Seq("YEAR","MONTH","DAY","MAX_TEMP","MIN_TEMP")
val schemaTemp = StructType(headerFieldsForTemp.map{f => StructField(f, StringType, nullable=true)})
val dfTemp = session.createDataFrame(dataRecordsTemp,schemaTemp)
              .orderBy(desc("year"), desc("month"), desc("day"))

println("Printing temparature data ...............................")
dfTemp.select("YEAR","MONTH","DAY","MAX_TEMP","MIN_TEMP").take(10).foreach(println)
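
The snippet above assumes a Temparature case class with a getDataFields() helper, which the question does not show; a minimal sketch consistent with the parsing code (the field names are guesses) could be:

// Hypothetical reconstruction; the real class is not shown in the question
case class Temparature(year: String, month: String, day: String,
                       maxTemp: String, minTemp: String) {
  // Returns the fields in schema order: YEAR, MONTH, DAY, MAX_TEMP, MIN_TEMP
  def getDataFields(): Seq[String] = Seq(year, month, day, maxTemp, minTemp)
}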

// Same pattern for precipitation: split on whitespace and build Rows
val dataRecordsPrecip = sc.textFile(precipFile).map{rec=>
        val splittedRec = rec.split("\\s+")
        Precipitation(splittedRec(0),splittedRec(1),splittedRec(2),splittedRec(3),splittedRec(4),splittedRec(5))
      }.map{x => Row.fromSeq(x.getDataFields())}

val headerFieldsForPrecipitation = Seq("YEAR","MONTH","DAY","PRECIPITATION","SNOW","SNOW_COVER")
val schemaPrecip = StructType(headerFieldsForPrecipitation.map{f => StructField(f, StringType, nullable=true)})
val dfPrecip = session.createDataFrame(dataRecordsPrecip,schemaPrecip)
      .orderBy(desc("year"), desc("month"), desc("day"))

println("Printing precipitation data ...............................")
dfPrecip.select("YEAR","MONTH","DAY","PRECIPITATION","SNOW","SNOW_COVER").take(10).foreach(println)
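
Likewise, a minimal sketch of the assumed Precipitation class (again a guess, matching the six-column schema):

// Hypothetical reconstruction; the real class is not shown in the question
case class Precipitation(year: String, month: String, day: String,
                         precipitation: String, snow: String, snowCover: String) {
  // Returns the fields in schema order: YEAR, MONTH, DAY, PRECIPITATION, SNOW, SNOW_COVER
  def getDataFields(): Seq[String] = Seq(year, month, day, precipitation, snow, snowCover)
}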

I have to join the two DataFrames on their common columns (year, month, day). The input files have headers, and the output file should have a header as well. The first file contains temperature information, for example:

year month day min-temp max-temp
2017 12    13  13       25
2017 12    16  25       32
2017 12    25  34       56

The second file contains precipitation information, for example:

year month day precipitation snow snow-cover
2018  7    6   0.00          0.0  0
2017  12   13  0.04          0.0  0
2017  12   16  0.4           0.04 1

My expected output should be (sorted ascending by date, with a blank wherever a value is missing):

year month day min-temp max-temp precipitation snow snow-cover
2017 12    13  13       25       0.04          0.0  0
2017 12    16  25       32       0.4           0.04 1
2017 12    25  34       56                 
2018  7    6                     0.00          0.0  0

Can someone help me do this in Scala?

1 Answer:

Answer 0 (score: 1)

You need to outer join the two datasets and then order the result, as follows:

import org.apache.spark.sql.functions._

val joined = dfTemp
      .join(dfPrecip, Seq("year", "month", "day"), "outer")
      .orderBy(asc("year"), asc("month"), asc("day"))
      .na.fill("")
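
One caveat: every column here is StringType, so the ordering above is lexicographic (as strings, "7" sorts after "12"). If that matters for your data, a sketch of the same join that casts the key columns to Int before ordering:

// col and asc come from org.apache.spark.sql.functions._ imported above.
// Casting the string keys to Int makes months and days sort numerically
// rather than lexicographically.
val joined = dfTemp
      .join(dfPrecip, Seq("year", "month", "day"), "outer")
      .orderBy(col("year").cast("int").asc,
               col("month").cast("int").asc,
               col("day").cast("int").asc)
      .na.fill("")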

If you don't need blank values and are fine with null, you can leave out the .na.fill("").
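
Since the question also asks for a header in the output file, one way to write the joined result out with a header row (the output path and delimiter below are placeholders, not from the question):

// coalesce(1) yields a single part file, which is fine for small data
joined.coalesce(1)
      .write
      .option("header", "true")      // emit the column names as the first row
      .option("delimiter", "\t")     // placeholder delimiter; adjust as needed
      .csv("/path/to/output")        // placeholder output path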

Hope it helps!