Question

Given the following:

val pivotDF = df.groupBy("Product").pivot("Country").sum("Amount")
pivotDF.show()

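I don't recall seeing any way to control the order of the pivoted columns. What ordering is assumed? Is it always ascending? I can't find this documented. Or is it non-deterministic?

Tips welcome.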

Answer 1

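According to the Scala docs:

There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

Here is how the latter one works: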

// This is to prevent unintended OOM errors when the number of distinct values is large
val maxValues = df.sparkSession.sessionState.conf.dataFramePivotMaxValues
// Get the distinct values of the column and sort them so its consistent
val values = df.select(pivotColumn)
  .distinct()
  .limit(maxValues + 1)
  .sort(pivotColumn)  // ensure that the output columns are in a consistent logical order
  .collect()
  .map(_.get(0))
  .toSeq

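and `values` is then passed to the former version. So when you use the version that auto-detects the values, the columns are always sorted by the natural ordering of the values. If you need a different order, it is easy enough to replicate the auto-detection mechanism and then call the version that takes explicit values: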

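import spark.implicits._ // needed for toDF; assumes a SparkSession named `spark`, as in spark-shell
val df = Seq(("Foo", "UK", 1), ("Bar", "UK", 1), ("Foo", "FR", 1), ("Bar", "FR", 1))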
  .toDF("Product", "Country", "Amount")
df.groupBy("Product")
  .pivot("Country", Seq("UK", "FR")) // natural ordering would be "FR", "UK"
  .sum("Amount")
  .show()

Output:

+-------+---+---+
|Product| UK| FR|
+-------+---+---+
|    Bar|  1|  1|
|    Foo|  1|  1|
+-------+---+---+
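
If the desired order is a rule rather than a hand-written list, the auto-detection step can be replicated with a custom `Ordering`. The following is only a minimal sketch: the object `PivotOrderSketch` and the helper `orderedPivotValues` are hypothetical names, it assumes the pivot column holds non-null strings, and it skips the `maxValues` guard that Spark's own implementation applies, so it is only suitable when the number of distinct values is known to be small.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PivotOrderSketch {
  // Hypothetical helper: collect the distinct values of `pivotColumn`
  // and sort them on the driver with the caller's ordering.
  // Assumes non-null string values and a small number of distinct values.
  def orderedPivotValues(df: DataFrame, pivotColumn: String, ord: Ordering[String]): Seq[String] =
    df.select(pivotColumn)
      .distinct()
      .collect()
      .map(_.getString(0))
      .sorted(ord)
      .toSeq

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-order-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Foo", "UK", 1), ("Bar", "UK", 1), ("Foo", "FR", 1), ("Bar", "FR", 1))
      .toDF("Product", "Country", "Amount")

    // Descending order instead of the natural (ascending) one
    val countries = orderedPivotValues(df, "Country", Ordering[String].reverse)

    df.groupBy("Product")
      .pivot("Country", countries) // explicit values keep the order given here
      .sum("Amount")
      .show()

    spark.stop()
  }
}
```

With the sample data above, the pivoted columns should come out as UK followed by FR, matching the hand-written `Seq("UK", "FR")` example. Any other `Ordering[String]` can be passed in the same way.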
