I am trying to convert a list of DataFrames into a single DataFrame, as shown below, where dfList is a List[sql.DataFrame].
dfList=List([ID: bigint, A: string], [ID: bigint, B: string], [ID: bigint, C: string], [ID: bigint, D: string])
dfList = List(
+------+---+  +------+--------+  +------+---------+  +------+----+
|    ID|  A|  |    ID|       B|  |    ID|        C|  |    ID|   D|
+------+---+  +------+--------+  +------+---------+  +------+----+
|  9574|  F|  |  9574|  005912|  |  9574|  2016022|  |  9574|  VD|
|  9576|  F|  |  9576|  005912|  |  9576|  2016022|  |  9576|  VD|
|  9578|  F|  |  9578|  005912|  |  9578|  2016022|  |  9578|  VD|
|  9580|  F|  |  9580|  005912|  |  9580|  2016022|  |  9580|  VD|
|  9582|  F|  |  9582|  005912|  |  9582|  2016022|  |  9582|  VD|
+------+---+, +------+--------+, +------+---------+, +------+----+ )
Expected output:
+--------+-------------+----------+--------+-------+
| ID | A | B | C | D |
+--------+-------------+----------+--------+-------+
| 9574| F| 005912| 2016022| 00|
| 9576| F| 005912| 2016022| 01|
| 9578| F| 005912| 2016022| 20|
| 9580| F| 005912| 2016022| 19|
| 9582| F| 005912| 2016022| 89|
+--------+-------------+----------+--------+-------+
Answer 0 (score: 3)
You need to use foldLeft together with join.
scala> val dfList = ('a' to 'd').map(col => (1 to 5).zip(col.toInt to col.toInt + 4).toDF("ID", col.toString)).toList
dfList: List[org.apache.spark.sql.DataFrame] = List([ID: int, a: int], [ID: int, b: int], [ID: int, c: int], [ID: int, d: int])
This gives me the following DataFrames:
+---+---+ +---+---+ +---+---+ +---+---+
| ID| a| | ID| b| | ID| c| | ID| d|
+---+---+ +---+---+ +---+---+ +---+---+
| 1| 97| | 1| 98| | 1| 99| | 1|100|
| 2| 98| | 2| 99| | 2|100| | 2|101|
| 3| 99| | 3|100| | 3|101| | 3|102|
| 4|100| | 4|101| | 4|102| | 4|103|
| 5|101| | 5|102| | 5|103| | 5|104|
+---+---+ +---+---+ +---+---+ +---+---+
scala> val joinedDF = dfList.tail.foldLeft(dfList.head)((accDF, newDF) => accDF.join(newDF, Seq("ID")))
joinedDF: org.apache.spark.sql.DataFrame = [ID: int, a: int ... 3 more fields]
scala> joinedDF.show
+---+---+---+---+---+
| ID| a| b| c| d|
+---+---+---+---+---+
| 1| 97| 98| 99|100|
| 2| 98| 99|100|101|
| 3| 99|100|101|102|
| 4|100|101|102|103|
| 5|101|102|103|104|
+---+---+---+---+---+
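In Scala, a fold is a way to reduce a collection to a single element. In this case, we start with the head of the list (dfList.head) and then join each element of the tail (dfList.tail) onto it in turn, which yields one final DataFrame. accDF is the accumulated DataFrame, passed along from "iteration" to "iteration", and newDF is the next DataFrame to be joined in.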
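To make the accumulation step concrete without needing a Spark session, here is a minimal sketch using plain Scala collections. The Table type alias and the joinOnId helper are hypothetical stand-ins for a DataFrame and for join(..., Seq("ID")); they are not part of Spark, and the sample values are made up for illustration.

```scala
// Plain Scala collections only; no Spark required.
object FoldJoinSketch {
  // Hypothetical stand-in for a DataFrame: ID -> (column name -> value)
  type Table = Map[Int, Map[String, String]]

  // Inner join on ID: keep only IDs present on both sides and merge their columns.
  def joinOnId(left: Table, right: Table): Table =
    left.collect {
      case (id, cols) if right.contains(id) => id -> (cols ++ right(id))
    }

  def main(args: Array[String]): Unit = {
    // Three single-column tables sharing the IDs 1 to 3 (made-up data).
    val tables: List[Table] = List("A", "B", "C").map { name =>
      (1 to 3).map(id => id -> Map(name -> s"$name$id")).toMap
    }

    // Same shape as the Spark version: start from the head, fold in the tail.
    val joined = tables.tail.foldLeft(tables.head)((acc, next) => joinOnId(acc, next))

    joined.toSeq.sortBy(_._1).foreach { case (id, cols) => println(s"$id -> $cols") }
  }
}
```

Each step of the fold takes the table accumulated so far and merges one more table into it, which is the role accDF and newDF play in the Spark version.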
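Answer 1 (score: 1)
@evan058 provided a working solution, but I would like to add that reduce, which needs no separate start value and does not prescribe an evaluation order, may be a better option for parallelized operations: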
val joinedDF = dfList.reduce((accDF, nextDF) => accDF.join(nextDF, Seq("ID")))
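One caveat worth illustrating: on an empty list, reduce throws an UnsupportedOperationException, and dfList.head in the foldLeft version throws a NoSuchElementException. A minimal sketch of a safer variant, assuming a hypothetical helper named joinAll (not part of Spark or the standard library), uses reduceOption instead. Strings stand in for DataFrames here so the example runs without Spark.

```scala
object SafeJoinSketch {
  // Hypothetical helper: combine all elements with `join`,
  // returning None instead of throwing when the list is empty.
  def joinAll[T](items: List[T])(join: (T, T) => T): Option[T] =
    items.reduceOption(join)

  def main(args: Array[String]): Unit = {
    // Strings stand in for DataFrames; concatenation stands in for join.
    println(joinAll(List("A", "B", "C"))(_ + _))  // Some(ABC)
    println(joinAll(List.empty[String])(_ + _))   // None
  }
}
```

With Spark on the classpath, the same helper could presumably be called as joinAll(dfList)(_.join(_, Seq("ID"))), leaving the caller to decide what an empty list should mean.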