Question

I have a dataframe with many pairs of (count and score) columns. This is not a pivot but something closer to an unpivot. Example:

|house_score | house_count | mobile_score | mobile_count | sport_score | sport_count | ....<other couple columns>.....| 
|   20            2              48              6             6             78     |
|   40            78             47              74            69             6     |

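I want a new dataframe with just two columns, score and count. The new dataframe should collapse all the column pairs into that single pair of columns.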

_________________
| score | count |
|   20  |   2   |
|   40  |   78  |
|   48  |   6   |
|   47  |   74  |
|   6   |   78  |
|   69  |   6   |
|_______________|

What is the best solution (elegant code / performance)?

Answer 1

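You can achieve this with a foldLeft over the column name prefixes (the part before the _). This is reasonably efficient, since all the intensive operations are distributed, and the code is fairly concise.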

// needed for col(...); toDF also needs spark.implicits._ (both are preloaded in spark-shell)
import org.apache.spark.sql.functions.col

// df from example
val df = sc.parallelize(List((20,2,48,6,6,78), (40,78,47,74,69,6))).toDF("house_score", "house_count", "mobile_score", "mobile_count", "sport_score", "sport_count")

// grab column name prefixes (part before the _)
val cols = df.columns.map(name => name.split("_")(0)).distinct

// fold left over the remaining prefixes
val result = cols.tail.foldLeft(
   // init with the score/count pair of cols.head
   df.select(col(s"${cols.head}_score").as("score"), col(s"${cols.head}_count").as("count"))
){ case (acc, c) =>
   // union the score/count pair for prefix c
   // (on Spark 2.0+ the same operation is also available as union)
   acc.unionAll(df.select(col(s"${c}_score").as("score"), col(s"${c}_count").as("count")))
}

result.show

Answer 2

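The unionAll approach suggested in the other answer requires scanning the data multiple times, projecting the df down to just 2 columns on each scan. From a performance standpoint, you should avoid scanning the data multiple times if you can do the work in one pass, especially if you have a large dataset that cannot be cached or if many scans would be needed.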

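You can do it in one pass by generating all the (score, count) tuples for each row and then flatMapping them. I'll let you decide how elegant it is: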
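As a rough sketch of the single-pass idea (one way it could look, not necessarily the exact code this answer had in mind): the object name `UnpivotOnePass` and the helper `unpivot` are made up for illustration, and the sketch assumes Spark 2.x or later, non-null integer columns, and that every `<prefix>_score` column has a matching `<prefix>_count` column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnpivotOnePass {

  // Hypothetical helper: turns every (<prefix>_score, <prefix>_count) pair
  // of a row into its own (score, count) row, in a single pass over df.
  def unpivot(df: DataFrame): DataFrame = {
    val spark = df.sparkSession
    import spark.implicits._

    // Resolve the column positions once on the driver, by name.
    val prefixes = df.columns.filter(_.endsWith("_score")).map(_.stripSuffix("_score"))
    val idxPairs = prefixes.map { p =>
      (df.schema.fieldIndex(s"${p}_score"), df.schema.fieldIndex(s"${p}_count"))
    }.toList

    // Each input row emits one tuple per column pair.
    // Assumption: the columns hold non-null Ints (getInt fails otherwise).
    df.flatMap { row =>
      idxPairs.map { case (s, c) => (row.getInt(s), row.getInt(c)) }
    }.toDF("score", "count")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unpivot-one-pass")
      .getOrCreate()
    import spark.implicits._

    // df from the question's example
    val df = Seq((20, 2, 48, 6, 6, 78), (40, 78, 47, 74, 69, 6))
      .toDF("house_score", "house_count", "mobile_score", "mobile_count", "sport_score", "sport_count")

    unpivot(df).show()
    spark.stop()
  }
}
```

Note that the tuples come out grouped by input row rather than by column prefix, so the row order differs from the table in the question; if the order matters, it would have to be imposed explicitly, for example by also emitting the prefix and sorting on it.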

Scala Spark - How to reduce a dataframe with many pairs of columns into a single pair of columns?
