Groupby with a regex in Spark Scala

Posted: 2019-05-21 12:42:42

Tags: scala apache-spark group-by apache-spark-sql

Let's say I have a dataframe that looks like this:

import spark.implicits._ // assumes a SparkSession named spark is in scope

val df2 = Seq("A:job_1, B:whatever1", "A:job_1, B:whatever2", "A:job_2, B:whatever3").toDF("values")
df2.show()
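For reference, df2.show() on this data should print something like:

+--------------------+
|              values|
+--------------------+
|A:job_1, B:whatever1|
|A:job_1, B:whatever2|
|A:job_2, B:whatever3|
+--------------------+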

How can I group it by a regex like "job_" and then keep the first element per group, so that I end up with something like:

|A:job_1, B:whatever1|
|A:job_2, B:whatever3|

Thanks a lot.

2 answers:

Answer 0 (score: 2):

You should probably just create a new column with regexp_extract, group on it, and drop it afterwards!

import org.apache.spark.sql.{functions => F}
import spark.implicits._ // for the $"..." column syntax

df2.
    withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0)). // Extract the key of the groupBy
    groupBy("A").
    agg(F.first("values").as("first value")). // Get the first value per group
    drop("A").
    show()
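On the example data this should print something like the following (note that without an explicit ordering, first() gives no guarantee about which row of a group is kept, nor about row order):

+--------------------+
|         first value|
+--------------------+
|A:job_1, B:whatever1|
|A:job_2, B:whatever3|
+--------------------+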

If you want to understand what is going on, here is Catalyst!

As you can see in the optimized logical plan, the following two are strictly equivalent:

  • Explicitly creating a new column with: .withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0))
  • Grouping directly on the expression: .groupBy(F.regexp_extract($"values", "job_[0-9]+", 0).alias("A")) (see the sketch right after this list)
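For completeness, a minimal sketch of the second form on df2 (same imports as above; it produces the same result as the withColumn variant):

df2.
    groupBy(F.regexp_extract($"values", "job_[0-9]+", 0).alias("A")). // group directly on the extracted key
    agg(F.first("values").as("first value")).
    drop("A").
    show()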

Here are the Catalyst plans:

== Parsed Logical Plan ==
'Aggregate [A#198], [A#198, first('values, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
   +- Project [value#1 AS values#3]
      +- LocalRelation [value#1]

== Analyzed Logical Plan ==
A: string, first value: string
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
   +- Project [value#1 AS values#3]
      +- LocalRelation [value#1]

== Optimized Logical Plan ==
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- LocalRelation [values#3, A#198]

Answer 1 (score: 1):

Transform the data into a Seq with two columns and operate on that:

val aux = Seq("A:job_1, B:whatever1", "A:job_1, B:whatever2", "A:job_2, B:whatever3")
  .map(x => (x.split(",")(0).replace("A:", ""),
             x.split(",")(1).replace("B:", ""))) // ("job_1", " whatever1"), ...
  .toDF("A", "B")
  .groupBy("A")

I stripped the A: and B: prefixes, but that is not required.

Or you can try:
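Note that groupBy on its own only returns a RelationalGroupedDataset; to end up with one row per group as in the desired output, an aggregation is still needed. A minimal sketch, assuming you want to keep the first B per A:

import org.apache.spark.sql.functions.first

aux.agg(first("B").as("B")).show() // one row per A, keeping the first B encountered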

import org.apache.spark.sql.functions.col

df2.withColumn("A", col("values").substr(3, 5)) // positional extract of "job_N"; assumes a fixed-width key
  .groupBy("A")
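As a design note, the positional substr only works while the token after "A:" has a fixed width (e.g. job IDs never grow beyond one digit); for variable-length IDs, the regexp_extract approach from the first answer is more robust.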