I have this dataset:

10,"Name",2016,"Country"
11,"Name1",2016,"Country1"
10,"Name",2016,"Country"
10,"Name",2016,"Country"
12,"Name2",2017,"Country2"
My problem statement is that I have to find the total count and the number of duplicates per year. The result should be (year, total records, duplicates):

2016,4,3
2017,1,0
I tried to solve it with:

val records = rdd.map { x =>
  // key each record by the year column
  val array = x.split(",")
  (array(2), x)
}.groupByKey()

val duplicates = records.map { x =>
  val totalcount = x._2.size
  val duplicates = ???   // find duplicates in the iterator
  (x._1, totalcount, duplicates)
}
This works fine for up to about 10 GB of data, but on larger inputs it takes a very long time. I found that groupByKey is not the best approach here.

Please suggest the best way to solve this problem.
Answer 0 (score: 0)
I am not enough of a SQL expert to count duplicates exactly the way your example shows, but I think this will get you started with DataFrames. My understanding is that DataFrames perform significantly better than raw RDDs.
scala> import com.databricks.spark.csv._
import com.databricks.spark.csv._
scala>
scala> val s = List("""10,"Name",2016,"Country"""", """11,"Name1",2016,"country1"""", """10,"Name",2016,"Country"""", """10,"Name",2016,"Country"""", """12,"Name2",2017,"Country2"""")
s: List[String] = List(10,"Name",2016,"Country", 11,"Name1",2016,"country1", 10,"Name",2016,"Country", 10,"Name",2016,"Country", 12,"Name2",2017,"Country2")
scala> val rdd = sc.parallelize(s)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[92] at parallelize at <console>:38
scala>
scala> val df = new CsvParser().withDelimiter(',').withInferSchema(true).withParseMode("DROPMALFORMED").csvRdd(sqlContext, rdd)
df: org.apache.spark.sql.DataFrame = [C0: int, C1: string, C2: int, C3: string]
scala>
scala> df.registerTempTable("test")
scala>
scala> val dfCount = sqlContext.sql("select C2, count(*), count(distinct C0,C2,C1,C3) from test group by C2")
dfCount: org.apache.spark.sql.DataFrame = [C2: int, _c1: bigint, _c2: bigint]
scala>
scala> dfCount.show
+----+---+---+
| C2|_c1|_c2|
+----+---+---+
|2016| 4| 2|
|2017| 1| 1|
+----+---+---+
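Note that the distinct count above (2 for 2016) is not quite the duplicate figure from the question (3 for 2016), which appears to count every occurrence of a record that shows up more than once. If that exact figure is needed while still avoiding groupByKey, one option is a reduceByKey-based aggregation over the original RDD. The following is only a sketch under that assumed definition of "duplicates"; names like perRecord and perYear are illustrative.

// Sketch: per-year totals and duplicate occurrences without groupByKey.
// Assumes "duplicates" = all occurrences of records that appear more than once.
val perRecord = rdd
  .map { line =>
    val cols = line.split(",")
    ((cols(2), line), 1L)                     // key: (year, full record)
  }
  .reduceByKey(_ + _)                         // occurrences of each distinct record

val perYear = perRecord
  .map { case ((year, _), cnt) =>
    (year, (cnt, if (cnt > 1) cnt else 0L))   // (records, duplicate occurrences)
  }
  .reduceByKey { case ((t1, d1), (t2, d2)) => (t1 + t2, d1 + d2) }

perYear.collect().foreach { case (year, (total, dups)) =>
  println(s"$year,$total,$dups")              // e.g. 2016,4,3 and 2017,1,0
}

Both reduceByKey steps combine values map-side, so no per-key group is ever materialized in memory, which is usually why groupByKey starts to struggle at the multi-GB scale mentioned in the question.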