Iterating over grouped datasets in Spark 1.6

Date: 2019-02-14 17:04:40

Tags: apache-spark apache-spark-1.6

In an ordered dataset, I want to aggregate data until a condition is met, grouped by a certain key.

To set the context for my question, I have reduced it to the following problem statement:

In Spark, I need to aggregate strings, grouped by key, until the user stops "shouting" (the second character in the string is not uppercase).

Dataset example:

ID, text, timestamps

1, "OMG I like bananas", 123
1, "Bananas are the best", 234
1, "MAN I love banana", 1235
2, "ORLY? I'm more into grapes", 123565
2, "BUT I like apples too", 999
2, "unless you count veggies", 9999
2, "THEN don't forget tomatoes", 999999

The expected result would be:

1, "OMG I like bananas Bananas are the best"
2, "ORLY? I'm more into grapes BUT I like apples too unless you count veggies"

With groupBy and agg I can't seem to set a condition like "stop when an uppercase character is found".
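
For illustration, the stop condition could be written as a small predicate (the helper below is my sketch, not code from the question):

// Hypothetical helper: a line counts as "shouting" when its first two
// characters are both uppercase.
def isShouting(s: String): Boolean =
  s.length >= 2 && s.charAt(0).isUpper && s.charAt(1).isUpper

The aggregation should keep lines while isShouting holds, include the first non-shouting line, and then stop.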

1 Answer:

Answer 0: (score: 2)

This will only work with Spark 2.1 or later.

You can probably do what you want to do, but it may be very expensive: the window function below collects a growing list of texts for every row in each key's partition.

First, let's create some test data. As general advice, when you ask a question on Stack Overflow, please provide something like this so that people have something to work with.

import spark.sqlContext.implicits._
import org.apache.spark.sql.expressions.Window  // needed for the window spec below
import org.apache.spark.sql.functions._

val df = List(
    (1,  "OMG I like bananas", 1),
    (1, "Bananas are the best", 2),
    (1, "MAN I love banana", 3),
    (2, "ORLY? I'm more into grapes", 1),
    (2, "BUT I like apples too", 2),
    (2, "unless you count veggies", 3),
    (2, "THEN don't forget tomatoes", 4)
).toDF("ID", "text", "timestamps")

To get the texts in each group in the right order, we need to add a new column using a window function. collect_list over a window ordered by timestamps gives each row the list of all texts seen so far, so the longest list per ID is the complete one in order; since a prefix compares as smaller than the full array, max picks exactly that list.

Using the spark shell:

scala> val df2 = df.withColumn("coll", collect_list("text").over(Window.partitionBy("id").orderBy("timestamps")))
df2: org.apache.spark.sql.DataFrame = [ID: int, text: string ... 2 more fields]

scala> val x = df2.groupBy("ID").agg(max($"coll").as("texts"))
x: org.apache.spark.sql.DataFrame = [ID: int, texts: array<string>]

scala> x.collect.foreach(println)
[1,WrappedArray(OMG I like bananas, Bananas are the best, MAN I love banana)]
[2,WrappedArray(ORLY? I'm more into grapes, BUT I like apples too, unless you count veggies, THEN don't forget tomatoes)]

To get the actual text, we probably need a UDF. Here's mine (I'm far from an expert in Scala, so bear with me):

import scala.collection.mutable

// Walk the ordered texts, keeping each line until (and including) the first
// one that is not "shouting" (its first two characters not both uppercase).
val aggText: Seq[String] => String = (list: Seq[String]) => {
    def tex(arr: Seq[String], accum: Seq[String]): Seq[String] = arr match {
        case Seq()       => accum
        case Seq(single) => accum :+ single
        case Seq(str, xs @ _*) =>
            if (str.length >= 2 && !(str.charAt(0).isUpper && str.charAt(1).isUpper))
                tex(Nil, accum :+ str)  // not shouting: keep it and stop
            else
                tex(xs, accum :+ str)   // still shouting: keep it and continue
    }

    tex(list, Seq()).mkString(" ")
}

// Spark hands array columns to a Scala UDF as a mutable.WrappedArray,
// hence the explicit parameter type when wrapping the function.
val textUDF = udf(aggText(_: mutable.WrappedArray[String]))
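
As a quick sanity check (my addition, not part of the original answer), the function can be tried on plain Scala data before wiring it into Spark:

aggText(Seq("OMG I like bananas", "Bananas are the best", "MAN I love banana"))
// returns "OMG I like bananas Bananas are the best" (stops after the first non-shouting line)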

So now we have a dataframe with the texts collected in the correct order, and a Scala function (wrapped as a UDF). Let's piece it together:

scala> val x = df2.groupBy("ID").agg(max($"coll").as("texts"))
x: org.apache.spark.sql.DataFrame = [ID: int, texts: array<string>]

scala> val y = x.select($"ID", textUDF($"texts"))
y: org.apache.spark.sql.DataFrame = [ID: int, UDF(texts): string]

scala> y.collect.foreach(println)
[1,OMG I like bananas Bananas are the best]
[2,ORLY? I'm more into grapes BUT I like apples too unless you count veggies]

scala>

I think this is the result you're looking for.
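
For reference, the same steps can be written as a single chain (a sketch with identical behavior to the step-by-step version above):

val result = df
    .withColumn("coll", collect_list("text").over(Window.partitionBy("ID").orderBy("timestamps")))
    .groupBy("ID")
    .agg(max($"coll").as("texts"))
    .select($"ID", textUDF($"texts").as("text"))

result.collect.foreach(println)
// [1,OMG I like bananas Bananas are the best]
// [2,ORLY? I'm more into grapes BUT I like apples too unless you count veggies]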