I need to transform the following dataframe:
╔══════╦════════╦════════╦════════╗
║ Year ║ ColA ║ ColB ║ ColC ║
╠══════╬════════╬════════╬════════╣
║ 2017 ║ 1 ║ 2 ║ 3 ║
║ 2018 ║ 4 ║ 5 ║ 6 ║
║ 2019 ║ 7 ║ 8 ║ 9 ║
╚══════╩════════╩════════╩════════╝
into this:
╔══════╦════════╦═══════╗
║ Year ║ColName ║ Value ║
╠══════╬════════╬═══════╣
║ 2017 ║ ColA ║ 1 ║
║ 2017 ║ ColB ║ 2 ║
║ 2017 ║ ColC ║ 3 ║
║ 2018 ║ ColA ║ 4 ║
║ 2018 ║ ColB ║ 5 ║
║ 2018 ║ ColC ║ 6 ║
║ 2019 ║ ColA ║ 7 ║
║ 2019 ║ ColB ║ 8 ║
║ 2019 ║ ColC ║ 9 ║
╚══════╩════════╩═══════╝
Besides the first column, "Year" (there may be one or many such fixed columns), it needs to support an arbitrary number of columns. The solution should also be generic: it must not hard-code column names anywhere, but read them directly from the original dataframe.
I am using Databricks with notebooks written in Scala, and I am new to both Spark and Scala.
Update
I have found a solution in Python that works well, but I am struggling to translate it to Scala.
from pyspark.sql import functions as F

def columnsToRows(df, by):
    # Filter df.dtypes and split into column names and type descriptions,
    # keeping only the columns not listed in "by".
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Create and explode an array of (column_name, column_value) structs.
    kvs = F.explode(F.array([
        F.struct(F.lit(c.strip()).alias("ColName"), F.col(c).alias("Value")) for c in cols
    ])).alias("kvs")
    # Keep the "by" columns and flatten the exploded struct into two columns.
    return df.select(by + [kvs]).select(by + ["kvs.ColName", "kvs.Value"])
Answer 0 (score: 2)
You can use stack to transpose the data:
import org.apache.spark.sql.functions.{col, expr}

// Columns to keep as-is; add any further fixed columns to this list.
val fixedColumns = Seq("Year")

// Build a "'name', `name`" pair for every non-fixed column, reading the
// column names directly from the dataframe.
val cols = df.columns
  .filter(c => !fixedColumns.contains(c))
  .map(c => s"'$c', `$c`")

// stack(n, 'ColA', ColA, 'ColB', ColB, ...) emits one output row per pair.
// Note that all stacked columns must share a compatible type.
val exp = cols.mkString(s"stack(${cols.length}, ", ", ", ") as (ColName, Value)")

df.select(fixedColumns.map(col) :+ expr(exp): _*)
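For the sample data, the generated expression exp expands to:

stack(3, 'ColA', `ColA`, 'ColB', `ColB`, 'ColC', `ColC`) as (ColName, Value)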
Output:
+----+-------+-----+
|Year|ColName|Value|
+----+-------+-----+
|2017|ColA   |1    |
|2017|ColB   |2    |
|2017|ColC   |3    |
|2018|ColA   |4    |
|2018|ColB   |5    |
|2018|ColC   |6    |
|2019|ColA   |7    |
|2019|ColB   |8    |
|2019|ColC   |9    |
+----+-------+-----+
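To satisfy the "no hard-coded column names" requirement in one place, the same idea can be wrapped in a helper. A minimal sketch; the function name unpivot is my own, not part of the answer above:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, expr}

// Unpivots every column not listed in `fixed`, reading the column names
// from the dataframe itself. All unpivoted columns must share a type.
def unpivot(df: DataFrame, fixed: Seq[String]): DataFrame = {
  val cols = df.columns.filterNot(fixed.contains)
  val stackExpr = cols
    .map(c => s"'$c', `$c`")
    .mkString(s"stack(${cols.length}, ", ", ", ") as (ColName, Value)")
  df.select(fixed.map(col) :+ expr(stackExpr): _*)
}

Calling unpivot(df, Seq("Year")) then reproduces the output shown above.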
Answer 1 (score: 0)
Your Python code translates to Scala as follows:
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

val keep = Seq("Year")
val colsToKeep = keep.map(col)
// Read the remaining column names from the dataframe, as the Python version does.
val colsToTransform = df.columns.filterNot(keep.contains)

df.select(colsToKeep :+
    explode(
      // One (ColName, Value) struct per column; explode yields one row per struct.
      array(colsToTransform.map(c => struct(lit(c).alias("ColName"), col(c).alias("Value"))): _*)
    ).as("kvs"): _*)
  .select(colsToKeep :+ col("kvs.ColName") :+ col("kvs.Value"): _*)
  .show()
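To try either answer end to end, the sample dataframe from the question can be built like this (a sketch, assuming a Databricks notebook where spark and its implicits are in scope):

import spark.implicits._

// The sample data from the question.
val df = Seq(
  (2017, 1, 2, 3),
  (2018, 4, 5, 6),
  (2019, 7, 8, 9)
).toDF("Year", "ColA", "ColB", "ColC")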