Let's say I have a dataframe that looks like this:
// toDF requires spark.implicits._ (imported by default in spark-shell)
val df2 = Seq("A:job_1, B:whatever1", "A:job_1, B:whatever2", "A:job_2, B:whatever3").toDF("values")
df2.show()
How can I group it by a regular expression like "job_" and then take the first element, so that I end up with something like:
|A:job_1, B:whatever1|
|A:job_2, B:whatever3|
Thanks a lot.
Answer 0 (score: 2)
You should probably just use regexp_extract to create a new column, and then drop it!
import org.apache.spark.sql.{functions => F}
df2.
withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0)). // Extract the key of the groupBy
groupBy("A").
agg(F.first("values").as("first value")). // Get the first value
drop("A").
show()
If you want to understand it in more depth, here is what Catalyst does!
As you can see in the optimized logical plan, the following two are strictly equivalent:
.withColumn("A", F.regexp_extract($"values", "job_[0-9]+", 0))
.groupBy(F.regexp_extract($"values", "job_[0-9]+", 0).alias("A"))
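For reference, here is a sketch of the second form written out as a full query (same df2 as in the question; first() is just one possible aggregation and is not deterministic without an ordering):
import org.apache.spark.sql.{functions => F}

df2
  .groupBy(F.regexp_extract(F.col("values"), "job_[0-9]+", 0).alias("A")) // group directly on the extracted key
  .agg(F.first("values").as("first value"))
  .drop("A")
  .show()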
Here is the Catalyst plan:
== Parsed Logical Plan ==
'Aggregate [A#198], [A#198, first('values, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
+- Project [value#1 AS values#3]
+- LocalRelation [value#1]
== Analyzed Logical Plan ==
A: string, first value: string
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- Project [values#3, regexp_extract(values#3, job_[0-9]+, 0) AS A#198]
+- Project [value#1 AS values#3]
+- LocalRelation [value#1]
== Optimized Logical Plan ==
Aggregate [A#198], [A#198, first(values#3, false) AS first value#206]
+- LocalRelation [values#3, A#198]
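If you want to reproduce these plans yourself, a minimal sketch: calling explain(true) on the query prints the parsed, analyzed, optimized and physical plans (the exact output varies by Spark version):
import org.apache.spark.sql.{functions => F}

df2
  .withColumn("A", F.regexp_extract(F.col("values"), "job_[0-9]+", 0))
  .groupBy("A")
  .agg(F.first("values").as("first value"))
  .drop("A")
  .explain(true) // prints the parsed, analyzed, optimized and physical plans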
Answer 1 (score: 1)
Convert the data into a Seq with two columns and operate on it:
val aux = Seq("A:job_1, B:whatever1", "A:job_1, B:whatever2", "A:job_2, B:whatever3")
  .map(x => (x.split(",")(0).replace("A:", ""),
             x.split(",")(1).replace("B:", "")))
  .toDF("A", "B")
  .groupBy("A") // yields a RelationalGroupedDataset (see the completed sketch below)
I removed the A: and B: prefixes, but that is not required.
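Note that groupBy alone returns a RelationalGroupedDataset rather than a DataFrame, so an aggregation is still needed. A minimal sketch of one way to finish it, assuming first() is an acceptable choice (it is not deterministic without an ordering):
import org.apache.spark.sql.functions.first

aux
  .agg(first("B").as("first value")) // one B value per A group
  .show()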
Or you can try:
df2.withColumn("A",col("value").substr(4,8))
.groupBy("A")