Question

I have a Spark 2.0 DataFrame, example, with the following structure:

id, hour, count
id1, 0, 12
id1, 1, 55
..
id1, 23, 44
id2, 0, 12
id2, 1, 89
..
id2, 23, 34
etc.

Each id has 24 entries (one for each hour of the day), and the DataFrame is ordered by id and hour using the orderBy function.

I have created an Aggregator, groupConcat:

  def groupConcat(separator: String, columnToConcat: Int) = new Aggregator[Row, String, String] with Serializable {
    override def zero: String = ""

    override def reduce(b: String, a: Row) = b + separator + a.get(columnToConcat)

    override def merge(b1: String, b2: String) = b1 + b2

    override def finish(b: String) = b.substring(1)

    override def bufferEncoder: Encoder[String] = Encoders.STRING

    override def outputEncoder: Encoder[String] = Encoders.STRING
  }.toColumn

It helps me concatenate columns into strings to obtain this final DataFrame:

id, hourly_count
id1, 12:55:..:44
id2, 12:89:..:34
etc.

My question is: if I run example.orderBy($"id",$"hour").groupBy("id").agg(groupConcat(":",2) as "hourly_count"), is it guaranteed that the hourly counts will be ordered correctly within their respective buckets?

I read that this is not necessarily the case for RDDs (see Spark sort by key and then group by to get ordered iterable?), but maybe it is different for DataFrames?

If not, how can I work around it?

Answer 1

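groupBy after orderBy does not maintain order, as others have pointed out. What you want to do is use a Window function, partitioned by id and ordered by hour. You can run collect_list over this window and then take the max (largest) of the resulting lists, since they accumulate (i.e. the first hour will only have itself in the list, the second hour will have 2 elements in the list, and so on).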

Complete example code:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val data = Seq(( "id1", 0, 12),
  ("id1", 1, 55),
  ("id1", 23, 44),
  ("id2", 0, 12),
  ("id2", 1, 89),
  ("id2", 23, 34)).toDF("id", "hour", "count")

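// Join the collected values into a single ":"-separated string.
val mergeList = udf{(strings: Seq[String]) => strings.mkString(":")}

// Build a running list of counts per id (ordered by hour), then keep the
// largest list per id, which is the complete one.
data.withColumn("collected", collect_list($"count")
    .over(Window.partitionBy("id").orderBy("hour")))
  .groupBy("id")
  .agg(max($"collected").as("collected"))
  .withColumn("hourly_count", mergeList($"collected"))
  .select("id", "hourly_count").show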

This keeps us within the DataFrame world. I also simplified the UDF code the OP was using.

Output:

+---+------------+
| id|hourly_count|
+---+------------+
|id1|    12:55:44|
|id2|    12:89:34|
+---+------------+
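
To see why taking the max of the windowed lists yields the complete list, here is a minimal plain-Python sketch (no Spark involved) of what the window step produces for a single id. The variable names are illustrative only, and the sketch assumes that arrays compare element by element with a shorter prefix ranking lower, which is how Python lists compare.

```python
from itertools import accumulate

# (hour, count) rows for a single id, already ordered by hour,
# mirroring what a window partitioned by id and ordered by hour sees.
rows = [(0, 12), (1, 55), (23, 44)]

# A running collect_list: each row carries the counts collected so far.
running_lists = list(
    accumulate(([count] for _, count in rows), lambda acc, nxt: acc + nxt)
)

# Each list is a prefix of the next one, so the largest is the full list.
full_list = max(running_lists)
hourly_count = ":".join(str(c) for c in full_list)
print(hourly_count)
```

Because every intermediate list is a prefix of the final one, the maximum is always the list built from the last row of the window.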

Answer 2

I have a case where the order is not always kept: sometimes yes, mostly no.

My DataFrame has 200 partitions and runs on Spark 1.6.

df_group_sort = data.orderBy(times).groupBy(group_key).agg(
    F.sort_array(F.collect_list(times)),
    F.collect_list(times)
)

To check the ordering, I compare the return values of

F.sort_array(F.collect_list(times))

and

F.collect_list(times)

which gives, for example (left: sort_array(collect_list()); right: collect_list()):

2016-12-19 08:20:27.172000 2016-12-19 09:57:03.764000
2016-12-19 08:20:30.163000 2016-12-19 09:57:06.763000
2016-12-19 08:20:33.158000 2016-12-19 09:57:09.763000
2016-12-19 08:20:36.158000 2016-12-19 09:57:12.763000
2016-12-19 08:22:27.090000 2016-12-19 09:57:18.762000
2016-12-19 08:22:30.089000 2016-12-19 09:57:33.766000
2016-12-19 08:22:57.088000 2016-12-19 09:57:39.811000
2016-12-19 08:23:03.085000 2016-12-19 09:57:45.770000
2016-12-19 08:23:06.086000 2016-12-19 09:57:57.809000
2016-12-19 08:23:12.085000 2016-12-19 09:59:56.333000
2016-12-19 08:23:15.086000 2016-12-19 10:00:11.329000
2016-12-19 08:23:18.087000 2016-12-19 10:00:14.331000
2016-12-19 08:23:21.085000 2016-12-19 10:00:17.329000
2016-12-19 08:23:24.085000 2016-12-19 10:00:20.326000

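The left column is always sorted, while the right column only consists of sorted blocks. For different executions of take(), the order of the blocks in the right column is different.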

Answer 3

If you want to work around this in a Java implementation (Scala and Python should be similar):

example.orderBy("hour").groupBy("id").agg(functions.sort_array(functions.collect_list(functions.struct(example.col("hour"), example.col("count"))), false).as("hourly_count"));
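
The idea in that one-liner is to collect (hour, count) pairs and sort them inside each group, so the result no longer depends on the order in which rows arrive; the false argument to sort_array requests descending order. A minimal plain-Python sketch of the same idea follows. It does not use Spark, and the helper name hourly_counts is made up for illustration.

```python
import random
from collections import defaultdict

def hourly_counts(rows):
    """Group (id, hour, count) rows by id and sort each group by hour, descending."""
    groups = defaultdict(list)
    for row_id, hour, count in rows:
        groups[row_id].append((hour, count))  # plays the role of struct(hour, count)
    # Tuples compare field by field, so each group is ordered by hour first.
    return {row_id: sorted(pairs, reverse=True) for row_id, pairs in groups.items()}

rows = [("id1", 0, 12), ("id1", 1, 55), ("id1", 23, 44),
        ("id2", 0, 12), ("id2", 1, 89), ("id2", 23, 34)]
shuffled = random.sample(rows, k=len(rows))  # arbitrary arrival order

print(hourly_counts(shuffled) == hourly_counts(rows))  # True
```

Sorting after collecting makes the preceding orderBy unnecessary for correctness, since the per-group sort alone determines the final order.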

Answer 4

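The order may or may not be the same, depending on the number of partitions and the distribution of the data. We can solve this using the RDD itself.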

For example:

I saved the sample data below in a file and loaded it into HDFS.

1,type1,300
2,type1,100
3,type2,400
4,type2,500
5,type1,400
6,type3,560
7,type2,200
8,type3,800

and executed the command below:

sc.textFile("/spark_test/test.txt").map(x=>x.split(",")).filter(x=>x.length==3).groupBy(_(1)).mapValues(x=>x.toList.sortBy(_(2)).map(_(0)).mkString("~")).collect()

Output:

Array[(String, String)] = Array((type3,6~8), (type1,2~1~5), (type2,7~3~4))

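That is, we grouped the data by type, then sorted by price, and concatenated the ids with "~" as the separator. Note that sortBy(_(2)) compares the prices as strings, which matches numeric order here only because every price has three digits. The above command can be broken down as follows: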

val validData=sc.textFile("/spark_test/test.txt").map(x=>x.split(",")).filter(x=>x.length==3)

val groupedData=validData.groupBy(_(1))  //group data rdds

val sortedJoinedData=groupedData.mapValues(x=>{
   val list=x.toList
   val sortedList=list.sortBy(_(2))
   val idOnlyList=sortedList.map(_(0))
   idOnlyList.mkString("~")
}
)
sortedJoinedData.collect()

We can then fetch a particular group using the command

sortedJoinedData.filter(_._1=="type1").collect()

Output:

Array[(String, String)] = Array((type1,2~1~5))
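
For comparison, here is a minimal plain-Python sketch of the same group-then-sort approach on the sample data, with no Spark involved. It is only an illustration: the variable names are made up, and unlike the string comparison in sortBy(_(2)) it converts the price to an integer before sorting, so it would also handle prices with different numbers of digits.

```python
from collections import defaultdict

lines = ["1,type1,300", "2,type1,100", "3,type2,400", "4,type2,500",
         "5,type1,400", "6,type3,560", "7,type2,200", "8,type3,800"]

groups = defaultdict(list)
for line in lines:
    fields = line.split(",")
    if len(fields) == 3:  # same validity filter as the Spark command
        row_id, row_type, price = fields
        groups[row_type].append((int(price), row_id))  # numeric sort key first

# Sort each group by price, then join the ids with "~".
result = {
    row_type: "~".join(row_id for _, row_id in sorted(pairs))
    for row_type, pairs in groups.items()
}
print(result)
```

The sort happens inside each group after grouping, so the outcome does not depend on the order in which the lines were read.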

Answer 5

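No, sorting within groupByKey will not necessarily be maintained, but this is notoriously difficult to reproduce in memory on one node. As was previously said, the most typical way this happens is when things need to be repartitioned for the groupByKey to take place. I managed to reproduce this by manually doing a repartition after the sort. Then I passed the results into the groupByKey.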

import scala.util.Random
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

case class Numbered(num:Int, group:Int, otherData:Int)

// configure spark with "spark.sql.shuffle.partitions" = 2 or some other small number

val v =
  (1 to 100000)
    // Make way more groups than partitions. I added an extra integer just to mess with the sort hash computation (i.e. so it won't be monotonic, not sure if needed)
    .map(Numbered(_, Random.nextInt(300), Random.nextInt(1000000))).toDS()
    // Be sure they are stored in a small number of partitions
    .repartition(2)
    .sort($"num")
    // Repartition again with a much bigger number than there are groups so that when things need to be merged you can get them out of order.
    .repartition(200)
    .groupByKey(_.group)
    .mapGroups {
      case (g, nums) =>
        nums             // all you need is .sortBy(_.num) here to fix the problem          
          .map(_.num)
          .mkString("~")
    }
    .collect()

// Walk through the concatenated strings. If any number ahead 
// is smaller than the number before it, you know that something
// is out of order.
v.zipWithIndex.map { case (r, i) =>
  r.split("~").map(_.toInt).foldLeft(0) { case (prev, next) =>
    if (next < prev) {
      println(s"*** Next: ${next} less than ${prev} for dataset ${i + 1} ***")
    }
    next
  }
}
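
The effect can also be illustrated without a cluster. The following plain-Python sketch is only a toy model and not how Spark's shuffle is implemented: it assumes sorted rows are dealt round-robin into partitions and that a group is assembled by appending one partition's rows after another. All names in it are made up for illustration.

```python
from collections import defaultdict

# 12 rows sorted by num; the group is num % 2, so each group starts out ordered.
rows = [(num, num % 2) for num in range(1, 13)]

# Model a repartition: deal the sorted rows round-robin into 3 partitions.
partitions = [rows[i::3] for i in range(3)]

def group_concat(parts):
    """Append each partition's rows to their group, one partition after another."""
    groups = defaultdict(list)
    for part in parts:
        for num, group in part:
            groups[group].append(num)
    return dict(groups)

unsorted_groups = group_concat(partitions)
# Sorting inside each group restores the order, like .sortBy(_.num) above.
fixed_groups = {g: sorted(nums) for g, nums in unsorted_groups.items()}

print(unsorted_groups[1])
print(fixed_groups[1])
```

Each partition's contribution is sorted on its own, but the concatenation across partitions is not, which matches the "sorted blocks" pattern seen when a sort is followed by a repartition.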

Answer 6

The short answer is yes, the hourly counts will maintain the same order.

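To generalize, it is important that you sort before you group. Also, the sort must cover the same columns as the group, plus the column by which you actually want the values ordered.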

An example would be:

employees
    .sort("company_id", "department_id", "employee_role")
    .groupBy("company_id", "department_id")
    .agg(Aggregators.groupConcat(":", 2) as "count_per_role")

Spark DataFrame: does groupBy after orderBy maintain that order?
