I need to use Spark 2 to validate that certain data is present across multiple datasets (or CSVs). This can be defined as matching two or more keys across all the datasets and generating a report of all matching and non-matching keys from all the datasets.
For example, there are four datasets, each with two matching keys plus some other keys. This means all the datasets need to be matched on the two defined matching keys. Suppose userId and userName are the two matching keys in the datasets below:
Dataset A: userId,userName,age,contactNumber
Dataset B: orderId,orderDetails,userId,userName
Dataset C: departmentId,userId,userName,departmentName
Dataset D: userId,userName,address,pin
Dataset A:
userId,userName,age,contactNumber
1,James,29,1111
2,Ferry,32,2222
3,Lotus,21,3333
Dataset B:
orderId,orderDetails,userId,userName
DF23,Chocholate,1,James
DF45,Gifts,3,Lotus
Dataset C:
departmentId,userId,userName,departmentName
N99,1,James,DE
N100,2,Ferry,AI
Dataset D:
userId,userName,address,pin
1,James,minland street,cvk-dfg
A report like the following (or something similar) needs to be generated:
------------------------------------------
userId,userName,status
------------------------------------------
1,James,MATCH
2,Ferry,MISSING IN B, MISSING IN D
3,Lotus,MISSING IN B, MISSING IN C, MISSING IN D
I have tried joining the datasets as follows:
Dataset A-B:
userId,userName,age,contactNumber,orderId,orderDetails,userId,userName,status
1,James,29,1111,DF23,Chocholate,1,James,MATCH
2,Ferry,32,2222,,,,,Missing IN Left
3,Lotus,21,3333,DF45,Gifts,3,Lotus,MATCH
Dataset C-D:
departmentId,userId,userName,departmentName,userId,userName,address,pin,status
N99,1,James,DE,1,James,minland street,cvk-dfg,MATCH
N100,2,Ferry,AI,,,,,Missing IN Right
Dataset AB-CD:
Joining criteria: userId and userName of A with C, userId and userName of B with D
userId,userName,age,contactNumber,orderId,orderDetails,userId,userName,status,departmentId,userId,userName,departmentName,userId,userName,address,pin,status,status
1,James,29,1111,DF23,Chocholate,1,James,MATCH,N99,1,James,DE,1,James,minland street,cvk-dfg,MATCH,MATCH
2,Ferry,32,2222,,,,,Missing IN Left,N100,2,Ferry,AI,,,,,Missing IN Right,Missing IN Right
There is no row for userId 3. In code, each pairwise step was roughly a full outer join on the two matching keys plus a derived status column, as sketched below.
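A simplified sketch of what I tried (assuming the CSVs are already loaded as DataFrames dfA and dfB; the exact status labels and column pruning are omitted):
import org.apache.spark.sql.functions.when
// Sketch of one pairwise step (Dataset A-B); C-D and AB-CD follow the same pattern.
val joinCond = dfA("userId") === dfB("userId") && dfA("userName") === dfB("userName")
val dfAB = dfA
  .join(dfB, joinCond, "full_outer")
  .withColumn("status",
    when(dfB("userId").isNull, "MISSING IN B")
      .when(dfA("userId").isNull, "MISSING IN A")
      .otherwise("MATCH"))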
Answer 0 (score: 1):
If the data is defined as:
val dfA = Seq((1, "James", 29, 1111), (2, "Ferry", 32, 2222),(3, "Lotus", 21, 3333)).toDF("userId,userName,age,contactNumber".split(","): _*)
val dfB = Seq(("DF23", "Chocholate", 1, "James"), ("DF45", "Gifts", 3, "Lotus")).toDF("orderId,orderDetails,userId,userName".split(","): _*)
val dfC = Seq(("N99", 1, "James", "DE"), ("N100", 2, "Ferry", "AI")).toDF("departmentId,userId,userName,departmentName".split(","): _*)
val dfD = Seq((1, "James", "minland street", "cvk-dfg")).toDF("userId,userName,address,pin".split(","): _*)
Define the keys:
import org.apache.spark.sql.functions._
val keys = Seq("userId", "userName")
Combine the Datasets:
val dfs = Map("A" -> dfA, "B" -> dfB, "C" -> dfC, "D" -> dfD)
// Tag each dataset's key columns with its label and stack everything into a single DataFrame.
val combined = dfs.map {
  case (key, df) => df.withColumn("df", lit(key)).select("df", keys: _*)
}.reduce(_ unionByName _)
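For the sample data above, this yields one row per dataset label and key pair (row order may differ):
combined.show
// +--+------+--------+
// |df|userId|userName|
// +--+------+--------+
// | A|     1|   James|
// | A|     2|   Ferry|
// | A|     3|   Lotus|
// | B|     1|   James|
// | B|     3|   Lotus|
// | C|     1|   James|
// | C|     2|   Ferry|
// | D|     1|   James|
// +--+------+--------+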
Pivot the result and convert it to booleans:
// Pivot: one row per key with a per-dataset count; a null count means the key is absent there.
val pivoted = combined
  .groupBy(keys.head, keys.tail: _*)
  .pivot("df", dfs.keys.toSeq)
  .count()
// Turn the counts into presence flags.
val result = dfs.keys.foldLeft(pivoted)(
  (df, c) => df.withColumn(c, col(c).isNotNull.alias(c))
)
// +------+--------+----+-----+-----+-----+
// |userId|userName| A| B| C| D|
// +------+--------+----+-----+-----+-----+
// | 1| James|true| true| true| true|
// | 3| Lotus|true| true|false|false|
// | 2| Ferry|true|false| true|false|
// +------+--------+----+-----+-----+-----+
Use the resulting boolean matrix to generate the final report.
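A minimal sketch of that last step (assuming, as in the sample data, that the reference dataset A contains every key; the column names follow the report format requested above):
// Sketch only: derive the status column from the boolean matrix `result` built above.
val others = dfs.keys.toSeq.filter(_ != "A").sorted
// Emit "MISSING IN X" for every false flag; concat_ws skips the nulls produced for true flags.
val missing = concat_ws(", ", others.map(c => when(!col(c), s"MISSING IN $c")): _*)
val report = result
  .withColumn("status", when(missing === "", "MATCH").otherwise(missing))
  .select("userId", "userName", "status")
// For the sample data this gives:
//   1,James,MATCH
//   2,Ferry,MISSING IN B, MISSING IN D
//   3,Lotus,MISSING IN B, MISSING IN C, MISSING IN D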
This can become quite expensive when the datasets grow large. If you know that one dataset contains all possible keys, and you don't need exact results (some false positives are acceptable), you can use Bloom filters.
We can use dfA as the reference:
val expectedNumItems = dfA.count
val fpp = 0.00001
// Serialize the composite key into a single string column.
val key = struct(keys map col: _*).cast("string").alias("key")
// Build a Bloom filter over the keys of each remaining dataset and wrap it in a UDF.
val filters = dfs.filterKeys(_ != "A").mapValues(df => {
  val f = df.select(key).stat.bloomFilter("key", expectedNumItems, fpp)
  udf((s: String) => f.mightContain(s))
})
// Check every key from the reference dataset against each filter.
filters.foldLeft(dfA.select(keys map col: _*)){
  case (df, (c, f)) => df.withColumn(c, f(key))
}.show
// +------+--------+-----+-----+-----+
// |userId|userName| B| C| D|
// +------+--------+-----+-----+-----+
// | 1| James| true| true| true|
// | 2| Ferry|false| true|false|
// | 3| Lotus| true|false|false|
// +------+--------+-----+-----+-----+
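The same report-building step shown above can then be applied to this matrix as well, keeping in mind that mightContain may return false positives, so a key can occasionally be reported as present in a dataset where it is actually missing.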