Input:
import sparkSession.sqlContext.implicits._
val table_df = Seq((1, 20, 1), (2, 200, 2), (3, 222, 3), (4, 2123, 4), (5, 2321, 5)).toDF("ID", "Weight", "ID")
table_df.show(false)
Output:
+---+------+---+
|ID |Weight|ID |
+---+------+---+
|1 |20 |1 |
|2 |200 |2 |
|3 |222 |3 |
|4 |2123 |4 |
|5 |2321 |5 |
+---+------+---+
I want to use drop to get the following output:
+---+------+
|ID |Weight|
+---+------+
|1 |20 |
|2 |200 |
|3 |222 |
|4 |2123 |
|5 |2321 |
+---+------+
But drop("ID") removes both "ID" columns. How can I drop only the duplicate second "ID" column here?
Answer 0 (score: 2)
You can use the DataFrame map method to trim off the duplicate ID column, as shown below:
table_df.map(row => (row.getInt(0),row.getInt(1))).toDF("ID","Weight").show()
+---+------+
| ID|Weight|
+---+------+
| 1| 20|
| 2| 200|
| 3| 222|
| 4| 2123|
| 5| 2321|
+---+------+
The new schema looks like this:
table_df.map(row => (row.getInt(0),row.getInt(1))).toDF("ID","Weight").schema.treeString
root
|-- ID: integer (nullable = false)
|-- Weight: integer (nullable = false)
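The map-based trim above works positionally: each row is rebuilt from only the fields you keep. The same idea can be seen on the underlying tuples with plain Scala collections (a sketch for illustration only, no Spark required; TrimDemo is a hypothetical name):

```scala
object TrimDemo {
  // The same rows the question builds the DataFrame from.
  val rows = Seq((1, 20, 1), (2, 200, 2), (3, 222, 3), (4, 2123, 4), (5, 2321, 5))

  // Keep only the first two fields of each tuple, by position --
  // mirroring row.getInt(0) / row.getInt(1) in the Spark map above.
  val trimmed: Seq[(Int, Int)] = rows.map { case (id, weight, _) => (id, weight) }
}
```

Note that selecting by position avoids the ambiguity of referring to either "ID" column by name.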
Answer 1 (score: 0)
You can first rename the specific duplicate instance you want to remove, and then drop that column.
Sample code for this requirement:
val table_df = Seq((1, 20, 1), (2, 200, 2), (3, 222, 3), (4, 2123, 4), (5, 2321, 5)).toDF("ID", "Weight", "ID")
val newColNames = Seq("ID","Weight","X1")
table_df.toDF(newColNames:_*).show(false)
+---+------+---+
|ID |Weight|X1 |
+---+------+---+
|1 |20 |1 |
|2 |200 |2 |
|3 |222 |3 |
|4 |2123 |4 |
|5 |2321 |5 |
+---+------+---+
table_df.toDF(newColNames:_*).drop("X1").show(false)
+---+------+
|ID |Weight|
+---+------+
|1 |20 |
|2 |200 |
|3 |222 |
|4 |2123 |
|5 |2321 |
+---+------+
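The manual rename above can be generalized to any number of duplicate columns. Here is a sketch in plain Scala (no Spark required; the helper name dedup is an assumption, not a Spark API) that builds a unique name list by suffixing every repeated name with its position. The result could then be fed to toDF(newNames: _*) before dropping the suffixed columns:

```scala
object DedupNames {
  // Append "_<index>" to every repeated occurrence of a column name,
  // leaving the first occurrence untouched.
  def dedup(names: Seq[String]): Seq[String] = {
    val seen = scala.collection.mutable.Set[String]()
    names.zipWithIndex.map { case (name, i) =>
      if (seen.add(name)) name else s"${name}_$i"
    }
  }
}
```

For example, dedup(Seq("ID", "Weight", "ID")) yields Seq("ID", "Weight", "ID_2"), after which table_df.toDF(renamed: _*).drop("ID_2") would remove only the duplicate.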