Scala: convert a list of dataframes into a single dataframe, then join it with specific columns of another dataframe

Asked: 2017-05-16 18:56:46

Tags: scala apache-spark spark-dataframe

I have a list of dataframes that I want to convert into a single dataframe and then merge with another dataframe, taking only specific columns of that dataframe; the column data types should carry over unchanged into the new dataframe. Here dfList is a List[sql.DataFrame]. Any help would be appreciated.

dfList: List[sql.DataFrame] = List([A: int]: DataFrame, [B: string]: DataFrame, [C: long]: DataFrame, [D: string]: DataFrame)

dfList = List( +-------+----------+--------+--------+
               |  A    |     B    |     C  |   D    |
               +-------+----------+--------+--------+
               |     41|    912AEQ| 2016022|      UJ|
               |     82|    912ARD| 2016022|      GH|
               |    903|    912AYQ| 2016022|      KL|
               |    454|    912AKK| 2016022|      KL|
               |     95|    912AHG| 2016022|      KH|
               +-------+----------+--------+--------+ )

The data type of df is Id: int, v1: string, v2: long, v3: string.

df: DataFrame =
+---+---+-----------+-----+
| Id| v1|    v2     | v3  |
+---+---+-----------+-----+
| 11| AS| 0989765498|SDAWQ|
| 12| GH| 7654998599|TRUDR|
| 13| IO|10654998580|ABUCK|
| 14|1JG|65499855101|KLBCK|
| 15| RT|10265499852|BCKKL|
+---+---+-----------+-----+            

The newDF will be a combination of dfList and df.
The data type of newDF should be Id: int, A: int, B: string, C: long, D: string.


newDF =
    +---+------+----------+--------+--------+
    | Id| A    |     B    |     C  |   D    |
    +---+------+----------+--------+--------+
    | 11|    41|    912AEQ| 2016022|      UJ|
    | 12|    82|    912ARD| 2016022|      GH|
    | 13|   903|    912AYQ| 2016022|      KL|
    | 14|   454|    912AKK| 2016022|      KL|
    | 15|    95|    912AHG| 2016022|      KH|
    +---+------+----------+--------+--------+

1 Answer:

Answer 0 (score: 0):

Below is the complete solution. Since you have no key to join the two dataframes on, you need to add an index column to both dataframes, join them on that index, and finally drop it. The single dataframe is created from the list of dataframes via reduce and union.

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

import spark.implicits._

  //Create dfList dataframe

  val d1 = spark.sparkContext
    .parallelize(
      Seq(
        (41, "912AEQ", 2016022L, "UJ"),
        (82, "912ARD", 2016022L, "GH"),
        (903, "912AYQ", 2016022L, "KL")
      ))
    .toDF("A", "B", "C", "D")


  val d2 = spark.sparkContext
    .parallelize(
      Seq(
        (454, "912AKK", 2016022L, "KL"),
        (95, "912AHG", 2016022L, "KH")
      ))
    .toDF("A", "B", "C", "D")

  val allD = List(d1, d2)  //list of dataframe

  //create a singe dataframe from list of dataframe
  val dfList = allD.reduce(_ union _)


  //Create df dataframe
  val df = spark.sparkContext
    .parallelize(
      Seq(
        (11, "AS", 989765498L, "SDAWQ"),
        (12, "GH", 7654998599L, "TRUDR"),
        (13, "IO", 10654998580L, "ABUCK"),
        (14, "1JG", 65499855101L, "KLBCK"),
        (15, "RT", 10265499852L, "BCKKL")
      ))
    .toDF("Id", "v1", "v2", "v3")

  val dfListWithIndex = addIndex(dfList) // add index column
  val dfWithIndex = addIndex(df).drop("v1", "v2", "v3") // add index column and drop the columns we don't need

  val newDF = dfWithIndex.join(dfListWithIndex, "index").drop("index") // join the two dataframes on index, then drop it


  dfListWithIndex.show()
  dfWithIndex.show()
  newDF.printSchema()
  newDF.show

  // Append a consecutive index column by zipping each row with its position.
  // Rows are paired purely by position, so both dataframes must already be in the desired order.
  def addIndex(df: DataFrame) = spark.sqlContext.createDataFrame(
    df.rdd.zipWithIndex.map {
      case (row, index) => Row.fromSeq(row.toSeq :+ index)
    },
    // Extend the original schema with a non-nullable long index column
    StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))
  )