Removing duplicate words in a long string using Scala

Time: 2018-10-03 07:44:10

Tags: scala pyspark apache-spark-sql

I am curious to learn how to remove duplicate words from a string contained in a dataframe column, and I would like to do it with Scala. As an example, below you can find the dataframe I want to transform.

Dataframe:

val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID") 

+----+-------+---+
|KEY1|   KEY2| ID|
+----+-------+---+
|  66|a,b,c,a|  4|
|  67|a,f,g,t|  0|
|  70|b,b,b,d|  4|
+----+-------+---+

Using pyspark I have already managed to obtain the result shown below, but I was not able to rewrite that code in Scala. Do you have any suggestions? Thank you in advance, and have a nice day.

Result:

+----+----------+---+
|KEY1|      KEY2| ID|
+----+----------+---+
|  66|   a, b, c|  4|
|  67|a, f, g, t|  0|
|  70|      b, d|  4|
+----+----------+---+

3 Answers:

Answer 0 (score: 0)

There may be a more optimized solution, but this should help you.

val rdd2 = dataset1.rdd.map(x => x(1).toString.split(",").distinct.mkString(", ")) // RDD[String] holding only the deduplicated KEY2 values

This returns a plain RDD[String], so you then need to convert it back to a Dataset.
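A minimal sketch of that conversion, assuming spark.implicits._ is available from an active SparkSession (the column name "KEY2" and the variable dedupDF are just illustrative):

import spark.implicits._

// turn the RDD[String] of deduplicated values back into a single-column DataFrame
val dedupDF = rdd2.toDF("KEY2")

Alternatively, register the same logic as a UDF and apply it in SQL: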

val distinctUDF = spark.udf.register("distinctUDF", (s: String) => s.split(",").distinct.mkString(", "))

dataset1.createTempView("dataset1")

spark.sql("Select KEY1, distinctUDF(KEY2), ID from dataset1").show
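Since spark.udf.register also returns the UserDefinedFunction, the same function can be applied directly through the DataFrame API, with no temp view needed; a minimal sketch:

import org.apache.spark.sql.functions.col

dataset1.withColumn("KEY2", distinctUDF(col("KEY2"))).show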

Answer 1 (score: 0)

import org.apache.spark.sql._

val dfUpdated = dataset1.rdd.map{
    // pattern-match each Row and rebuild KEY2 without duplicate words
    case Row(x: String, y: String, z: String) => (x, y.split(",").distinct.mkString(", "), z)
}.toDF(dataset1.columns:_*)
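Note that outside the spark-shell this snippet also needs a SparkSession and its implicits in scope for .toDF to compile; a minimal sketch, with a hypothetical app name:

import org.apache.spark.sql.{Row, SparkSession}

// hypothetical local session; adjust the master and app name as needed
val spark = SparkSession.builder().appName("dedup-words").master("local[*]").getOrCreate()
import spark.implicits._ // provides .toDF on RDDs of tuples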

In the spark-shell:

scala> val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")    
dataset1: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]

scala> dataset1.show
+----+-------+---+
|KEY1|   KEY2| ID|
+----+-------+---+
|  66|a,b,c,a|  4|
|  67|a,f,g,t|  0|
|  70|b,b,b,d|  4|
+----+-------+---+

scala> val dfUpdated = dataset1.rdd.map{
           case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
       }.toDF(dataset1.columns:_*)
dfUpdated: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]

scala> dfUpdated.show
+----+----------+---+
|KEY1|      KEY2| ID|
+----+----------+---+
|  66|   a, b, c|  4|
|  67|a, f, g, t|  0|
|  70|      b, d|  4|
+----+----------+---+

Answer 2 (score: 0)

A DataFrame solution:

scala> val df = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
df: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]

scala> val distinct :String => String = _.split(",").toSet.mkString(",")
distinct: String => String = <function1>

scala> val distinct_id = udf (distinct)
distinct_id: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.select('key1,distinct_id('key2).as("distinct"),'id).show
+----+--------+---+
|key1|distinct| id|
+----+--------+---+
|  66|   a,b,c|  4|
|  67| a,f,g,t|  0|
|  70|     b,d|  4|
+----+--------+---+
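One caveat: .toSet only happens to keep insertion order for very small sets, so with longer strings the word order may change. If order matters, .distinct is the safer building block; a minimal sketch of the same UDF on that basis (distinct_ordered is an illustrative name):

import org.apache.spark.sql.functions.udf

// .distinct keeps the first occurrence of each word, preserving order
val distinct_ordered = udf((s: String) => s.split(",").distinct.mkString(","))

df.select('key1, distinct_ordered('key2).as("distinct"), 'id).show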

