Question

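I have a data frame (Scala), and I want to perform an operation like the following on it:

I want to group by column 'a', pick any one of the values of column b within each group, and apply it to all rows of that group. For a = 1, b should be x or y or h on all 3 rows, and the remaining columns should be unaffected. Any help with this?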

Answer 1

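You can try this: create another data frame with columns a and b that holds exactly one value of b per value of a, then join it back to the original data frame: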

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
// Needed for the $"col" syntax outside spark-shell; assumes the SparkSession is named spark.
import spark.implicits._

// Window over each distinct value of a, ordered by b, so that every row
// can be given a row number within its group.
val w = Window.partitionBy($"a").orderBy($"b")

// Number the rows of each group and keep only the first one, which gives
// a reduced data frame with one (a, b) row per group.
val firstB = df.withColumn("rn", row_number().over(w)).where($"rn" === 1).select("a", "b")

// Join the reduced data frame back to the a and c columns of the original,
// so that b has just one value within each group.
firstB.join(df.select("a", "c"), Seq("a"), "inner").show

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  h|  g|
|  1|  h|  y|
|  1|  h|  x|
|  2|  c|  d|
|  2|  c|  x|
+---+---+---+
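
For anyone who wants to experiment, here is a minimal self-contained sketch of a variant that skips the join. It is not the approach above but an alternative under stated assumptions: the sample rows are hypothetical (the question's full data is not shown), the local SparkSession is only for trying it out, and min is used as the "pick any value" rule because it is deterministic.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.min

// Local session for experimentation only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("one-b-per-group")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data, chosen to resemble the question's description.
val df = Seq(
  (1, "x", "g"),
  (1, "y", "y"),
  (1, "h", "x"),
  (2, "c", "d"),
  (2, "e", "x")
).toDF("a", "b", "c")

// A window with no ordering spans the whole partition, so min(b) is the same
// for every row sharing a value of a. Columns a and c are left untouched.
val result = df.withColumn("b", min($"b").over(Window.partitionBy($"a")))

result.show()
```

With this sample data, b becomes h on every row where a = 1 and c on every row where a = 2. The row count is preserved without a join, which may matter if column a contains nulls, since an inner join on a would drop those rows.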
