Spark: finding duplicate records on a field in an RDD

Time: 2016-08-02 19:26:53

Tags: apache-spark duplicates rdd

I have a dataset:

10,"Name",2016,"Country"
11,"Name1",2016,"Country1"
10,"Name",2016,"Country"
10,"Name",2016,"Country"
12,"Name2",2017,"Country2"

My problem statement is that I have to find the total record count and the number of duplicates per year. My result should be (year, total records, duplicates):

2016,4,3
2017,1,0

I tried to solve it with:
val records = rdd.map { x =>
  val array = x.split(",")
  (array(2), x)                    // key by year (third field)
}.groupByKey()

val duplicates = records.map { x =>
  val totalcount = x._2.size       // total records for the year
  val dupcount   = ???             // find duplicates in the iterator
  (x._1, totalcount, dupcount)
}

This works fine up to roughly 10 GB of data, but on larger data it takes a very long time. I found that groupByKey is not the best approach.

Please suggest the best way to solve this problem.
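
For scale, here is a minimal sketch of an aggregation that avoids groupByKey by pre-aggregating with reduceByKey. It assumes the year is always the third comma-separated field and that "duplicates" means every occurrence of a row that appears more than once within a year (which is what makes 2016 yield 3); the function name countsPerYear is made up for illustration.

import org.apache.spark.rdd.RDD

// Sketch: pre-aggregate with reduceByKey instead of collecting all rows per year.
def countsPerYear(rdd: RDD[String]): RDD[(String, Int, Int)] = {
  rdd
    .map { line =>
      val cols = line.split(",")
      ((cols(2), line), 1)                     // key = (year, whole row)
    }
    .reduceByKey(_ + _)                        // occurrences of each distinct row
    .map { case ((year, _), cnt) =>
      (year, (cnt, if (cnt > 1) cnt else 0))   // (rows, rows that are duplicated)
    }
    .reduceByKey { case ((t1, d1), (t2, d2)) => (t1 + t2, d1 + d2) }
    .map { case (year, (total, dups)) => (year, total, dups) }
}

Because reduceByKey combines counts on each partition before shuffling, only one small pair per distinct row (and then one per year) crosses the network, instead of every raw row for a year being pulled into a single iterator.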

1 Answer:

Answer 0 (score: 0)

I'm not enough of an SQL expert to count duplicates exactly the way your example shows, but I think this will get you started with DataFrames. My understanding is that DataFrames perform significantly better than raw RDDs.

scala> import com.databricks.spark.csv._
import com.databricks.spark.csv._

scala> 

scala> val  s = List("""10,"Name",2016,"Country"""", """11,"Name1",2016,"country1"""", """10,"Name",2016,"Country"""", """10,"Name",2016,"Country"""", """12,"Name2",2017,"Country2"""")
s: List[String] = List(10,"Name",2016,"Country", 11,"Name1",2016,"country1", 10,"Name",2016,"Country", 10,"Name",2016,"Country", 12,"Name2",2017,"Country2")

scala> val rdd = sc.parallelize(s)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[92] at parallelize at <console>:38

scala> 

scala> val df = new CsvParser().withDelimiter(',').withInferSchema(true).withParseMode("DROPMALFORMED").csvRdd(sqlContext, rdd)
df: org.apache.spark.sql.DataFrame = [C0: int, C1: string, C2: int, C3: string]

scala> 

scala> df.registerTempTable("test")

scala> 

scala> val dfCount = sqlContext.sql("select C2, count(*), count(distinct C0,C2,C1,C3) from test group by C2")
dfCount: org.apache.spark.sql.DataFrame = [C2: int, _c1: bigint, _c2: bigint]

scala> 

scala> dfCount.show
+----+---+---+                                                                  
|  C2|_c1|_c2|
+----+---+---+
|2016|  4|  2|
|2017|  1|  1|
+----+---+---+
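
If the duplicates column should count every occurrence of a repeated row (so that 2016 would match the (2016,4,3) expectation in the question rather than the distinct count of 2 shown above), one possible variation is sketched below against the same temp table; it assumes the SQL dialect in use supports subqueries and CASE expressions.

// Sketch: first count each full row, then treat every occurrence of a
// row whose count exceeds 1 as a duplicate.
val dfDup = sqlContext.sql("""
  select C2,
         sum(cnt)                                   as total,
         sum(case when cnt > 1 then cnt else 0 end) as duplicates
  from (select C0, C1, C2, C3, count(*) as cnt
        from test
        group by C0, C1, C2, C3) t
  group by C2""")

dfDup.show()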