I have this DataFrame:
val df = Seq(
("LeBron", 36, 18, 12),
("Kevin", 42, 8, 9),
("Russell", 44, 5, 14)).
toDF("player", "points", "rebounds", "assists")
df.show()
+-------+------+--------+-------+
| player|points|rebounds|assists|
+-------+------+--------+-------+
| LeBron| 36| 18| 12|
| Kevin| 42| 8| 9|
|Russell| 44| 5| 14|
+-------+------+--------+-------+
I want to add "season_high" to every column name except player. I also want to do this with a function, because my real dataset has 250 columns. I've come up with the approach below, which gets the output I want, but I'd like to know whether there is a way to pass a rule into the renamedColumns map function so that the column name player is never switched to season_high_player in the first place, which would let me drop the extra .withColumnRenamed call afterwards.
val renamedColumns = df.columns.map(name => col(name).as(s"season_high_$name"))
val df2 = df.select(renamedColumns : _*).
withColumnRenamed("season_high_player", "player")
df2.show()
+-------+------------------+--------------------+-------------------+
| player|season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
| LeBron| 36| 18| 12|
| Kevin| 42| 8| 9|
|Russell| 44| 5| 14|
+-------+------------------+--------------------+-------------------+
Answer 0 (score: 4)
@philantrovert is right, but he just forgot to show you how to use the "formula", so here you go:
val selection : Seq[Column] = Seq(col("player")) ++ df.columns.filter(_ != "player")
.map(name => col(name).as(s"season_high_$name"))
df.select(selection : _*).show
// +-------+------------------+--------------------+-------------------+
// | player|season_high_points|season_high_rebounds|season_high_assists|
// +-------+------------------+--------------------+-------------------+
// | LeBron| 36| 18| 12|
// | Kevin| 42| 8| 9|
// |Russell| 44| 5| 14|
// +-------+------------------+--------------------+-------------------+
So what we're doing here is filtering out the column name we don't want to rename (that part is plain Scala), and then mapping each of the remaining column names to its renamed column.
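Since the real dataset has 250 columns, it may help to factor the renaming rule itself into a small function. The sketch below (names are hypothetical, not from the answer above) expresses the rule on plain strings, so it can be tested without a SparkSession and then applied via col(...).as(...):

```scala
// Hypothetical helper: compute the new header for a column name,
// leaving any name in `keep` untouched. Plain Scala, no Spark needed.
def renameRule(keep: Set[String], prefix: String)(name: String): String =
  if (keep.contains(name)) name else s"$prefix$name"

// Sketch of how it would plug into the Spark code above:
// val selection = df.columns.map(n => col(n).as(renameRule(Set("player"), "season_high_")(n)))
// df.select(selection: _*).show
```

Keeping the rule separate from the select also makes it easy to exclude several columns at once by growing the `keep` set.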
Answer 1 (score: 1)
You can do this by keeping the column you don't want to rename as the first column and applying the following logic:
import org.apache.spark.sql.functions._
val columnsRenamed = col(df.columns.head) +: df.columns.tail.map(name => col(name).as(s"season_high_$name"))
df.select(columnsRenamed :_*).show(false)
You should get the following output:
+-------+------------------+--------------------+-------------------+
|player |season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
|LeBron |36 |18 |12 |
|Kevin |42 |8 |9 |
|Russell|44 |5 |14 |
+-------+------------------+--------------------+-------------------+
Answer 2 (score: 0)
Another variation, which does not depend on the position of the field:
scala> val df = Seq(
| ("LeBron", 36, 18, 12),
| ("Kevin", 42, 8, 9),
| ("Russell", 44, 5, 14)).
| toDF("player", "points", "rebounds", "assists")
df: org.apache.spark.sql.DataFrame = [player: string, points: int ... 2 more fields]
scala> val newColumns = df.columns.map( x => x match { case "player" => col("player") case x => col(x).as(s"season_high_$x")} )
newColumns: Array[org.apache.spark.sql.Column] = Array(player, points AS `season_high_points`, rebounds AS `season_high_rebounds`, assists AS `season_high_assists`)
scala> df.select(newColumns:_*).show(false)
+-------+------------------+--------------------+-------------------+
|player |season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
|LeBron |36 |18 |12 |
|Kevin |42 |8 |9 |
|Russell|44 |5 |14 |
+-------+------------------+--------------------+-------------------+
scala>
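The pattern match above can also be extended to skip several columns at once, which may be useful across 250 columns. A minimal sketch (the `keep` set and `newName` helper are my own names, not from the answer):

```scala
// Columns whose names should stay as-is (hypothetical set).
val keep = Set("player")

// Same pattern-match idea as above, generalized to a set of exclusions.
def newName(name: String): String = name match {
  case n if keep(n) => n
  case n            => s"season_high_$n"
}

// Sketch of the Spark usage, as in the answer above:
// val newColumns = df.columns.map(n => col(n).as(newName(n)))
// df.select(newColumns: _*).show(false)
```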