Question

I have a DataFrame like the one below.

I would like the output DataFrame to look like this:

range              sum
1-100              109
101-10000          202
10001-1000000      10005
1000001-100000000  2000002
...                ...

How can I achieve this? I am new to Spark and Scala.

Answer 1

I suggest you first use when / otherwise to find the range each value falls into, then group by that range and perform a sum aggregation on articles.
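The snippet that accompanied this answer is not present here, so the following is only a minimal sketch of the when / otherwise approach it describes. The sample values, the label strings, the "other" fallback label, and the names df, withRange and result are assumptions made for illustration; the column name articles is taken from the answer text.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

// In spark-shell `spark` is predefined and this line can be dropped.
val spark = SparkSession.builder().master("local[*]").appName("range-sum").getOrCreate()
import spark.implicits._

// Assumed sample data; the question's input DataFrame is not shown.
val df = Seq(10, 99, 101, 101, 10005, 1000001, 1000001).toDF("articles")

// Label every row with the range it falls into (between is inclusive on both ends).
// "other" is an arbitrary label for values outside all listed ranges.
val withRange = df.withColumn("range",
  when(col("articles").between(1, 100), "1-100")
    .when(col("articles").between(101, 10000), "101-10000")
    .when(col("articles").between(10001, 1000000), "10001-1000000")
    .when(col("articles").between(1000001, 100000000), "1000001-100000000")
    .otherwise("other"))

// Group by the label and sum the original values.
val result = withRange.groupBy("range").agg(sum("articles").as("sum"))
result.show()
```

The order of the rows printed by show() is not guaranteed; add an orderBy if a stable order matters.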

Answer 2

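You can use the groupByKey method on Dataset to define your key freely, instead of grouping by a single column value as you normally would with groupBy. The following example can be run in spark-shell; otherwise remember to create a SparkSession and to import spark.implicits._ and org.apache.spark.sql.functions.sum: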

// relevant types: one for actual data, the other to define ranges
final case class Data(articles: Int)
final case class Range(from: Int, to: Int)

// the data we want to process
val dataset = spark.createDataset(
  Seq(Data(10), Data(99), Data(101), Data(101), Data(10005), Data(1000001), Data(1000001)))

// the ranges we want to _bucket_ our data in
val ranges = spark.sparkContext.broadcast(
  Seq(Range(1, 100), Range(101, 10000), Range(10001, 1000000), Range(1000001, 100000000)))

// the actual operation: group by range and sum the values in each bucket
dataset.groupByKey {
  d =>
    ranges.value.find(r => d.articles >= r.from && d.articles <= r.to).orNull
}.agg(sum("articles").as[Long])

This is the output of the snippet:

+-------------------+-------------+
|                key|sum(articles)|
+-------------------+-------------+
|            [1,100]|          109|
|        [101,10000]|          202|
|    [10001,1000000]|        10005|
|[1000001,100000000]|      2000002|
+-------------------+-------------+

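What we did: we grouped the data by bucket, using the broadcast set of ranges to place each value in its range, then summed articles and cast the result to Long (required when working with typed Datasets).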

Data that does not fall into any of the ranges ends up in a row whose range is null.
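To make the null case concrete, here is a small plain-Scala sketch (no Spark required) of the same lookup that is passed to groupByKey. The object name RangeLookup and the helpers keyFor and labelFor are hypothetical names introduced for illustration, and labelFor is an alternative that is not part of the answer: it returns a printable label with an explicit fallback instead of null.

```scala
object RangeLookup {
  final case class Range(from: Int, to: Int)

  // The same ranges that the answer broadcasts.
  val ranges: Seq[Range] =
    Seq(Range(1, 100), Range(101, 10000), Range(10001, 1000000), Range(1000001, 100000000))

  // Same lookup as in the answer: the first range containing the value, or null if none does.
  def keyFor(articles: Int): Range =
    ranges.find(r => articles >= r.from && articles <= r.to).orNull

  // Hypothetical alternative: a string label, with "out of range" instead of null.
  def labelFor(articles: Int): String =
    ranges
      .find(r => articles >= r.from && articles <= r.to)
      .map(r => s"${r.from}-${r.to}")
      .getOrElse("out of range")
}
```

For example, RangeLookup.keyFor(0) is null because 0 lies below the first range, while RangeLookup.labelFor(0) is "out of range".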

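Note that I used the word bucket to convey the idea of grouping by range; this has nothing to do with Hive bucketing (which you may hear about a lot when trying to optimize joins on Spark).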

Answer 3

I would use a UDF to classify (bucketize) the articles, then use a plain groupBy().agg() to compute the sum.

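import org.apache.spark.sql.functions.{sum, udf}
import spark.implicits._ // for the $"..." column syntax; already in scope in spark-shell

case class Bucket(start: Long, end: Long) {
  // both bounds are inclusive
  def contains(l: Long): Boolean = start <= l && end >= l
  override def toString: String = s"$start - $end"
}

// same ranges as listed in the question
val buckets = Seq(
  Bucket(1L, 100L),
  Bucket(101L, 10000L),
  Bucket(10001L, 1000000L),
  Bucket(1000001L, 100000000L)
)

// maps a value to its bucket label; values outside every bucket become null
val bucketize = udf((l: Long) => buckets.find(_.contains(l)).map(_.toString))

// df is assumed to have a numeric column named "article"; adjust the name to match your data
df
  .withColumn("bucket", bucketize($"article"))
  .groupBy($"bucket")
  .agg(
    sum($"article").as("sum")
  )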

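As a quick sanity check of the bucketing logic, the same lookup can be exercised on a local collection without Spark. This is only a sketch: the object name BucketCheck is hypothetical, the bucket bounds are the ranges listed in the question, and the sample values are borrowed from Answer 2 because the question's input is not shown.

```scala
object BucketCheck {
  final case class Bucket(start: Long, end: Long) {
    def contains(l: Long): Boolean = start <= l && l <= end
    override def toString: String = s"$start - $end"
  }

  // Ranges as listed in the question.
  val buckets: Seq[Bucket] = Seq(
    Bucket(1L, 100L),
    Bucket(101L, 10000L),
    Bucket(10001L, 1000000L),
    Bucket(1000001L, 100000000L))

  // Assumed sample values (taken from Answer 2).
  val articles: Seq[Long] = Seq(10L, 99L, 101L, 101L, 10005L, 1000001L, 1000001L)

  // Same lookup as the UDF body: Some(label) for a match, None otherwise.
  def bucketize(l: Long): Option[String] = buckets.find(_.contains(l)).map(_.toString)

  // Local equivalent of groupBy("bucket").agg(sum(...)).
  val sums: Map[Option[String], Long] =
    articles.groupBy(bucketize).map { case (bucket, values) => bucket -> values.sum }
}
```

With these inputs BucketCheck.sums maps Some("1 - 100") to 109 and Some("101 - 10000") to 202, matching the first two rows of the table in the question.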
Summing a column's values by range into a new column in Spark with Scala
