Finding data completeness across multiple datasets using Spark

Date: 2018-04-07 11:10:36

Tags: apache-spark apache-spark-sql

I need to use Spark 2 to verify that certain data exists across multiple datasets (or CSV files). Concretely: match two or more keys across all the datasets and produce a report of every matched and unmatched key from all of them. For example, there are four datasets, each containing the two matching keys plus other columns, and all datasets need to be matched on those two keys. Assume userId and userName are the two matching keys in the datasets below:

Dataset A: userId,userName,age,contactNumber
Dataset B: orderId,orderDetails,userId,userName
Dataset C: departmentId,userId,userName,departmentName
Dataset D: userId,userName,address,pin

Dataset A:    
userId,userName,age,contactNumber
1,James,29,1111
2,Ferry,32,2222
3,Lotus,21,3333

Dataset B:    
orderId,orderDetails,userId,userName
DF23,Chocholate,1,James
DF45,Gifts,3,Lotus

Dataset C:    
departmentId,userId,userName,departmentName
N99,1,James,DE
N100,2,Ferry,AI

Dataset D:    
userId,userName,address,pin
1,James,minland street,cvk-dfg

A report like the following (or something similar) needs to be generated:
------------------------------------------
userId,userName,status
------------------------------------------
1,James,MATCH
2,Ferry,MISSING IN B, MISSING IN D
3,Lotus,MISSING IN C, MISSING IN D

I have tried joining the datasets as follows:

Dataset A-B:
userId,userName,age,contactNumber,orderId,orderDetails,userId,userName,status
1,James,29,1111,DF23,Chocholate,1,James,MATCH
2,Ferry,32,2222,,,,,Missing IN Left
3,Lotus,21,3333,DF45,Gifts,3,Lotus,MATCH

Dataset C-D:
departmentId,userId,userName,departmentName,userId,userName,address,pin,status
N99,1,James,DE,1,James,minland street,cvk-dfg,MATCH
N100,2,Ferry,AI,,,,,Missing IN Right

Dataset AB-CD:
Joining criteria: userId and userName of A with C, userId and userName of B with D
userId,userName,age,contactNumber,orderId,orderDetails,userId,userName,status,departmentId,userId,userName,departmentName,userId,userName,address,pin,status,status
1,James,29,1111,DF23,Chocholate,1,James,MATCH,N99,1,James,DE,1,James,minland street,cvk-dfg,MATCH,MATCH
2,Ferry,32,2222,,,,,Missing IN Left,N100,2,Ferry,AI,,,,,Missing IN Right,Missing IN Right

There is no row at all for userId 3.
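
For what it's worth, a minimal sketch of an alternative that keeps every key: chain full outer joins on the two keys and tag each dataset with a presence flag. It assumes the dfA..dfD DataFrames defined in the answer below; the in_A..in_D column names are illustrative, not part of the question.

import org.apache.spark.sql.functions._

// Chain full outer joins on the shared keys so every (userId, userName) pair
// survives, including userId 3; the in_* columns are null where the key is absent.
val keyCols = Seq("userId", "userName")
val flagged = Seq("A" -> dfA, "B" -> dfB, "C" -> dfC, "D" -> dfD).map {
  case (name, df) => df.select(keyCols.map(col): _*).distinct.withColumn(s"in_$name", lit(true))
}
val allKeys = flagged.reduce((l, r) => l.join(r, keyCols, "full_outer"))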

1 answer:

Answer 0: (score: 1)

If the data is defined as:

import spark.implicits._  // required for toDF on local Seqs (spark is the SparkSession)

val dfA = Seq((1, "James", 29, 1111), (2, "Ferry", 32, 2222), (3, "Lotus", 21, 3333)).toDF("userId,userName,age,contactNumber".split(","): _*)
val dfB = Seq(("DF23", "Chocholate", 1, "James"), ("DF45", "Gifts", 3, "Lotus")).toDF("orderId,orderDetails,userId,userName".split(","): _*)
val dfC = Seq(("N99", 1, "James", "DE"), ("N100", 2, "Ferry", "AI")).toDF("departmentId,userId,userName,departmentName".split(","): _*)
val dfD = Seq((1, "James", "minland street", "cvk-dfg")).toDF("userId,userName,address,pin".split(","): _*)

Define the keys:

import org.apache.spark.sql.functions._

val keys = Seq("userId", "userName")

Combine the Datasets:

val dfs = Map("A" -> dfA, "B" -> dfB, "C" -> dfC, "D" -> dfD)

// Tag each row with the name of its source dataset, keep only the key columns,
// and union everything into a single long-format DataFrame.
val combined = dfs.map {
  case (key, df) => df.withColumn("df", lit(key)).select("df", keys: _*)
}.reduce(_ unionByName _)

Pivot and convert the result to booleans:

// One row per key, one column per dataset; count() is non-null exactly when
// the key appears in that dataset.
val pivoted = combined
  .groupBy(keys.head, keys.tail: _*)
  .pivot("df", dfs.keys.toSeq)
  .count()

// Replace the counts with presence booleans.
val result = dfs.keys.foldLeft(pivoted)(
  (df, c) => df.withColumn(c, col(c).isNotNull.alias(c))
)

// +------+--------+----+-----+-----+-----+
// |userId|userName|   A|    B|    C|    D|
// +------+--------+----+-----+-----+-----+
// |     1|   James|true| true| true| true|
// |     3|   Lotus|true| true|false|false|
// |     2|   Ferry|true|false| true|false|
// +------+--------+----+-----+-----+-----+

Use the resulting boolean matrix to generate the final report, for example as sketched below.
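
A minimal sketch of that last step (the MATCH/MISSING wording follows the report requested in the question; the report name is ours):

// Build the status column: list the datasets whose flag is false, or MATCH if none.
val names = dfs.keys.toSeq.sorted
val missing = names.map(c => when(!col(c), s"MISSING IN $c"))

val report = result
  .withColumn("status", concat_ws(", ", missing: _*))
  .withColumn("status", when(col("status") === "", "MATCH").otherwise(col("status")))
  .select("userId", "userName", "status")

report.show(false)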

This can become quite expensive as the datasets grow. If you know that one dataset contains all possible keys, and you do not need exact results (some false positives are acceptable), you can use Bloom filters.

We can use dfA as the reference:

val expectedNumItems = dfA.count
val fpp = 0.00001

val key = struct(keys map col: _*).cast("string").alias("key")

// Build one Bloom filter per dataset (all except the reference A) over the
// stringified key, and wrap each filter in a membership-test UDF.
val filters = dfs.filterKeys(_ != "A").mapValues { df =>
  val f = df.select(key).stat.bloomFilter("key", expectedNumItems, fpp)
  udf((s: String) => f.mightContain(s))
}

filters.foldLeft(dfA.select(keys map col: _*)){
  case (df, (c, f)) => df.withColumn(c, f(key))
}.show

// +------+--------+-----+-----+-----+
// |userId|userName|    B|    C|    D|
// +------+--------+-----+-----+-----+
// |     1|   James| true| true| true|
// |     2|   Ferry|false| true|false|
// |     3|   Lotus| true|false|false|
// +------+--------+-----+-----+-----+
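
The same status construction as above can be applied to this approximate matrix; note that a Bloom-filter false positive can only turn a genuinely missing key into a spurious match, never hide a key that exists. A sketch, re-binding the DataFrame shown above to an illustrative approx name:

// Re-run the fold to keep a handle on the approximate presence matrix.
val approx = filters.foldLeft(dfA.select(keys map col: _*)) {
  case (df, (c, f)) => df.withColumn(c, f(key))
}

val approxReport = approx
  .withColumn("status", concat_ws(", ",
    filters.keys.toSeq.sorted.map(c => when(!col(c), s"MISSING IN $c")): _*))
  .withColumn("status", when(col("status") === "", "MATCH").otherwise(col("status")))
  .select("userId", "userName", "status")

approxReport.show(false)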