Removing a duplicate column with the same values from a Spark DataFrame

Time: 2020-07-16 12:10:39

Tags: apache-spark apache-spark-sql

Input:

import sparkSession.sqlContext.implicits._
val table_df = Seq((1, 20, 1), (2, 200, 2), (3, 222, 3), (4, 2123, 4), (5, 2321, 5)).toDF("ID", "Weight", "ID")
table_df.show(false)

Output:

+---+------+---+
|ID |Weight|ID |
+---+------+---+
|1  |20    |1  |
|2  |200   |2  |
|3  |222   |3  |
|4  |2123  |4  |
|5  |2321  |5  |
+---+------+---+

Expected output:

+---+------+
|ID |Weight|
+---+------+
|1  |20    |
|2  |200   |
|3  |222   |
|4  |2123  |
|5  |2321  |
+---+------+

I am using drop on the "ID" column, but this drops both "ID" columns. How can I drop only the second, duplicate "ID" column here?

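2 Answers:

Answer 0: (score: 2)

You can use the DataFrame map method to drop the duplicate ID column, as shown below. The columns are read by position, so the ambiguous name "ID" is never referenced: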

table_df.map(row => (row.getInt(0),row.getInt(1))).toDF("ID","Weight").show() 


+---+------+
| ID|Weight|
+---+------+
|  1|    20|
|  2|   200|
|  3|   222|
|  4|  2123|
|  5|  2321|
+---+------+

The new schema is as follows:

table_df.map(row => (row.getInt(0),row.getInt(1))).toDF("ID","Weight").schema.treeString

root
 |-- ID: integer (nullable = false)
 |-- Weight: integer (nullable = false)
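
The map call above hard-codes two positions and the Int type. As a rough sketch only, the same position-based idea can be generalized by first computing which positions hold the first occurrence of each column name. FirstOccurrences, indices and project are hypothetical names introduced here, not Spark APIs, and the Spark usage in the trailing comments assumes the sparkSession and table_df from the question.

```scala
// Hypothetical helper (not a Spark API): find the positions of the first
// occurrence of each column name, preserving the original column order.
object FirstOccurrences {
  def indices(columns: Seq[String]): Seq[Int] =
    columns.zipWithIndex.collect {
      case (name, idx) if columns.indexOf(name) == idx => idx
    }

  // Project one row's values onto the kept positions.
  def project(values: Seq[Any], keep: Seq[Int]): Seq[Any] =
    keep.map(i => values(i))
}

// Possible use with table_df from the question (assumes a SparkSession named
// sparkSession, plus imports of org.apache.spark.sql.Row and
// org.apache.spark.sql.types.StructType):
//   val keep   = FirstOccurrences.indices(table_df.columns.toSeq)
//   val schema = StructType(keep.map(i => table_df.schema.fields(i)))
//   val rows   = table_df.rdd.map(r => Row.fromSeq(FirstOccurrences.project(r.toSeq, keep)))
//   sparkSession.createDataFrame(rows, schema).show(false)
```

This keeps columns purely by name and position; it does not check that the dropped column really holds the same values as the one that is kept.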

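Answer 1: (score: 0)

You can rename the specific instance of the column that you want to remove, and then drop it by its new name.

Sample code for this requirement: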

val table_df = Seq((1, 20, 1), (2, 200, 2), (3, 222, 3), (4, 2123, 4), (5, 2321, 5)).toDF("ID", "Weight", "ID")

val newColNames = Seq("ID","Weight","X1")

table_df.toDF(newColNames:_*).show(false)
+---+------+---+
|ID |Weight|X1 |
+---+------+---+
|1  |20    |1  |
|2  |200   |2  |
|3  |222   |3  |
|4  |2123  |4  |
|5  |2321  |5  |
+---+------+---+


table_df.toDF(newColNames:_*).drop("X1").show(false)
+---+------+
|ID |Weight|
+---+------+
|1  |20    |
|2  |200   |
|3  |222   |
|4  |2123  |
|5  |2321  |
+---+------+
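
When the duplicate names are not known in advance, the rename list does not have to be written by hand. The following is a minimal sketch under that assumption: DedupColumns and plan are hypothetical names introduced for illustration, and the "__dup_" placeholder suffix is an arbitrary choice that is assumed not to clash with any real column name.

```scala
// Hypothetical helper (not part of Spark): give every repeated column name a
// unique positional placeholder so that it can be dropped unambiguously.
object DedupColumns {
  // Returns (names to pass to toDF, placeholder names to pass to drop).
  def plan(columns: Seq[String]): (Seq[String], Seq[String]) = {
    val renamed = columns.zipWithIndex.map { case (name, idx) =>
      // Keep the first occurrence as-is; rename every later occurrence.
      if (columns.indexOf(name) == idx) name else s"${name}__dup_$idx"
    }
    // Only the renamed (duplicate) columns are scheduled for dropping.
    val toDrop = renamed.zip(columns).collect { case (r, c) if r != c => r }
    (renamed, toDrop)
  }
}

// Possible use with table_df from the answer above (needs a Spark session):
//   val (unique, toDrop) = DedupColumns.plan(table_df.columns.toSeq)
//   table_df.toDF(unique: _*).drop(toDrop: _*).show(false)
```

Like the hand-written version, this decides what to drop from the column names alone and does not compare the values in the two columns.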