How to compare Datasets?

Date: 2019-02-28 15:41:38

Tags: apache-spark

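I have a Spark application that runs custom, well-formed queries against a dataset. Each query operates on only a subset of the whole dataset, called a "group". A group is really just a filter on the dataset, and it can be defined by the programmer.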

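import org.apache.spark.sql.DataFrame

// A group is a filter over the full dataset
type Group = DataFrame => DataFrame
val groupA: Group = _.filter($"column1" > 0)
val groupB: Group = _.filter($"column2" > 0 && $"column3" === 0)

// Each constraint pairs a group with a predicate on that group's rows
val constraint1 = constraint(groupA, _.count == 0)
val constraint2 = constraint(groupA, _.dropDuplicates("column3").count == 1)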
// and so on
val constraint3 = constraint(groupB, _.count == 0)
...

framework.add(constraint1, constraint2, constraint3)
framework.execute()

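Each group has many constraints, so to speed things up I want to collect the constraints by group, cache the group, and run its constraints sequentially (or in parallel).

To determine whether two constraints belong to the same group, I therefore need some way to compare Datasets for equality.

My idea is to compare them using the semanticHash of the Dataset's logical plan, but several logical plans are associated with a single Dataset, and I am not sure which one to choose.

What is the best way to do this?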

1 Answer:

Answer 0: (score: 0)

So I ran some experiments and found the following on Spark 2.4.0:

import org.apache.spark.sql.{Dataset, Row}
import spark.implicits._ // for $"..." and createDataset; spark-shell imports this automatically

// Prints four comparisons of a and b. The "hashCode" labels report semanticHash equality.
def equal(a: Dataset[Row], b: Dataset[Row], expected: Boolean): Unit = {
  println(s"by logical hashCode ${a.queryExecution.logical.semanticHash == b.queryExecution.logical.semanticHash}")
  println(s"by logical sameResult ${a.queryExecution.logical.sameResult(b.queryExecution.logical)}")
  println(s"by optimized hashCode ${a.queryExecution.optimizedPlan.semanticHash == b.queryExecution.optimizedPlan.semanticHash}")
  println(s"by optimized sameResult ${a.queryExecution.optimizedPlan.sameResult(b.queryExecution.optimizedPlan)}")
  println(s"expected: $expected")
  println("\n")
}

val a = spark.createDataset(Seq(1, 2)).filter($"value" > 1).filter($"value" > 1).toDF
val b = spark.createDataset(Seq(1, 2)).filter($"value" > 1).toDF
val c = spark.createDataset(Seq(2, 3)).filter($"value" > 1).toDF
val d = spark.createDataset(Seq(2, 3)).filter($"value" < 1).toDF
val e = spark.read.parquet("/test_1")
val f = spark.read.parquet("/test_1")
val g = spark.read.parquet("/test_2")
val h = spark.read.parquet("/test_1").filter($"value" < 1)
val i = spark.read.parquet("/test_1").filter($"value" > 1)

equal(a, b, true)
// by logical hashCode false 
// by logical sameResult false 
// by optimized hashCode true 
// by optimized sameResult true 
// expected: true 

equal(b, c, false)
// by logical hashCode false 
// by logical sameResult false 
// by optimized hashCode false 
// by optimized sameResult false 
// expected: false 


equal(c, d, false)
// by logical hashCode true 
// by logical sameResult false 
// by optimized hashCode false 
// by optimized sameResult false 
// expected: false 


equal(e, f, true)
// by logical hashCode true 
// by logical sameResult true 
// by optimized hashCode true 
// by optimized sameResult true 
// expected: true 

equal(e, g, false)
// by logical hashCode false 
// by logical sameResult false 
// by optimized hashCode false 
// by optimized sameResult false 
// expected: false

equal(h, i, false)
// by logical hashCode true 
// by logical sameResult false
// by optimized hashCode true
// by optimized sameResult false
// expected: false

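So I think I will go with sameResult on the optimized plan: it is the only one of the four checks that matched the expected result in every case above.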
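Since sameResult compares two plans pairwise, it cannot serve directly as a map key, so collecting constraints by group amounts to bucketing with a pairwise check. Below is a minimal sketch of that bucketing in plain Scala, with no Spark dependency. The names `GroupBySameResult` and `groupByEquivalence` are hypothetical, and the case-insensitive string comparison is only a stand-in for comparing optimized plans. It assumes the check behaves like an equivalence relation, so comparing against one representative per bucket is enough.

```scala
// Hypothetical helper (not part of Spark): partitions items into buckets
// using a pairwise equivalence check instead of a hash key.
object GroupBySameResult {
  def groupByEquivalence[A](items: Seq[A])(same: (A, A) => Boolean): Vector[Vector[A]] =
    items.foldLeft(Vector.empty[Vector[A]]) { (buckets, item) =>
      // Compare the item against the first member (the representative) of each bucket.
      val idx = buckets.indexWhere(bucket => same(bucket.head, item))
      if (idx >= 0) buckets.updated(idx, buckets(idx) :+ item) // join the matching bucket
      else buckets :+ Vector(item)                              // start a new bucket
    }

  def main(args: Array[String]): Unit = {
    // Stand-in data: strings compared case-insensitively play the role of
    // DataFrames compared by sameResult on their optimized plans.
    val grouped = groupByEquivalence(Seq("groupA", "GROUPA", "groupB"))(_.equalsIgnoreCase(_))
    println(grouped)
  }
}
```

With DataFrames, the `same` argument would presumably be `(a, b) => a.queryExecution.optimizedPlan.sameResult(b.queryExecution.optimizedPlan)`. Each item is compared with one representative per bucket, so the number of comparisons grows with the number of distinct groups rather than with the square of the number of constraints.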