I have a list of dataframes that I want to combine into a single dataframe and then merge with another dataframe, keeping the original column data types in the new dataframe. Here dfList is a List[sql.DataFrame]. Any help would be appreciated.
dfList: List[sql.DataFrame], where the dataframes share the schema A: int, B: string, C: long, D: string
dfList (rows across all dataframes) =
+-------+----------+--------+--------+
| A | B | C | D |
+-------+----------+--------+--------+
| 41| 912AEQ| 2016022| UJ|
| 82| 912ARD| 2016022| GH|
| 903| 912AYQ| 2016022| KL|
| 454| 912AKK| 2016022| KL|
| 95| 912AHG| 2016022| KH|
+-------+----------+--------+--------+
The schema of df is Id: int, v1: string, v2: long, v3: string.
df =
+---+---+-----------+-----+
| Id| v1| v2 | v3 |
+---+---+-----------+-----+
| 11| AS| 0989765498|SDAWQ|
| 12| GH| 7654998599|TRUDR|
| 13| IO|10654998580|ABUCK|
| 14|1JG|65499855101|KLBCK|
| 15| RT|10265499852|BCKKL|
+---+---+-----------+-----+
newDF will be the combination of dfList and df.
The schema of newDF should be Id: int, A: int, B: string, C: long, D: string.
newDF =
+---+------+----------+--------+--------+
| Id| A | B | C | D |
+---+------+----------+--------+--------+
| 11| 41| 912AEQ| 2016022| UJ|
| 12| 82| 912ARD| 2016022| GH|
| 13| 903| 912AYQ| 2016022| KL|
| 14| 454| 912AKK| 2016022| KL|
| 15| 95| 912AHG| 2016022| KH|
+---+------+----------+--------+--------+
Answer 0 (score: 0)
Below is the complete solution. Since you have no key to join the two dataframes on, you need to add an index column to both dataframes, join them on that index, and finally drop it. The single dataframe is created from the list of dataframes via reduce and union.
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import spark.implicits._ // spark is the active SparkSession

// Helper: append a zero-based "index" column to a dataframe via zipWithIndex
def addIndex(df: DataFrame): DataFrame = spark.sqlContext.createDataFrame(
  df.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Extend the original schema with the index column
  StructType(df.schema.fields :+ StructField("index", LongType, false))
)

// Create the dataframes that make up dfList
val d1 = spark.sparkContext
  .parallelize(
    Seq(
      (41, "912AEQ", 2016022L, "UJ"),
      (82, "912ARD", 2016022L, "GH"),
      (903, "912AYQ", 2016022L, "KL")
    ))
  .toDF("A", "B", "C", "D")

val d2 = spark.sparkContext
  .parallelize(
    Seq(
      (454, "912AKK", 2016022L, "KL"),
      (95, "912AHG", 2016022L, "KH")
    ))
  .toDF("A", "B", "C", "D")

val allD = List(d1, d2) // list of dataframes

// Create a single dataframe from the list of dataframes
val dfList = allD.reduce(_ union _)

// Create the df dataframe
val df = spark.sparkContext
  .parallelize(
    Seq(
      (11, "AS", 989765498L, "SDAWQ"),
      (12, "GH", 7654998599L, "TRUDR"),
      (13, "IO", 10654998580L, "ABUCK"),
      (14, "1JG", 65499855101L, "KLBCK"),
      (15, "RT", 10265499852L, "BCKKL")
    ))
  .toDF("Id", "v1", "v2", "v3")

val dfListWithIndex = addIndex(dfList)                // add index column
val dfWithIndex = addIndex(df).drop("v1", "v2", "v3") // add index column and drop unneeded columns

// Join the two dataframes on the index, then drop it
val newDF = dfWithIndex.join(dfListWithIndex, "index").drop("index")

dfListWithIndex.show()
dfWithIndex.show()
newDF.printSchema()
newDF.show()
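If you would rather stay in the DataFrame API instead of dropping to RDDs, a similar index can be built with built-in functions. A minimal sketch, with a hypothetical helper name addIndexViaWindow: monotonically_increasing_id() alone is increasing but not consecutive across partitions, so a row_number() window over it is used to produce a dense index.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Hypothetical alternative to addIndex using DataFrame functions only.
// monotonically_increasing_id() is not consecutive, so row_number() over it
// yields 1, 2, 3, ... The unpartitioned window moves all rows into a single
// partition, so this suits small dataframes only.
def addIndexViaWindow(df: DataFrame): DataFrame =
  df.withColumn("mono", monotonically_increasing_id())
    .withColumn("index", row_number().over(Window.orderBy("mono")))
    .drop("mono")

Note that both dataframes being joined must get their index from the same helper: addIndex produces a zero-based long, while this sketch produces a one-based integer, so mixing them would misalign the rows.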