How can I compare two files using Spark?

Time: 2016-09-15 19:21:27

Tags: scala apache-spark hadoop2 hadoop-streaming bigdata

I want to compare two files and load any records that do not match into a separate file containing the mismatched records. I need to compare every field in the files as well as the number of records.

1 Answer:

Answer 0 (score: 5)

Let's assume you have two files:

scala> val a = spark.read.option("header", "true").csv("a.csv").alias("a"); a.show
+---+-----+
|key|value|
+---+-----+
|  a|    b|
|  b|    c|
+---+-----+

a: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> val b = spark.read.option("header", "true").csv("b.csv").alias("b"); b.show
+---+-----+
|key|value|
+---+-----+
|  b|    c|
|  c|    d|
+---+-----+

b: org.apache.spark.sql.DataFrame = [key: string, value: string]

It's not entirely clear which kind of mismatched records you are looking for, but they are easy to find with any flavor of join:

scala> a.join(b, Seq("key")).show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
|  b|    c|    c|
+---+-----+-----+

scala> a.join(b, Seq("key"), "left_outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
|  a|    b| null|
|  b|    c|    c|
+---+-----+-----+

scala> a.join(b, Seq("key"), "right_outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
|  b|    c|    c|
|  c| null|    d|
+---+-----+-----+

scala> a.join(b, Seq("key"), "outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
|  c| null|    d|
|  b|    c|    c|
|  a|    b| null|
+---+-----+-----+
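If "mismatched" means rows whose key appears in only one of the files, or whose fields differ between the two, one possible sketch (reusing the aliased a and b DataFrames above; the exact filter conditions and the "mismatched.csv" output path are just illustrative) builds on the full outer join:

// Sketch: keep rows where the key exists in only one file, or where the
// value fields differ between the two files (assumes the aliased `a` and
// `b` DataFrames above; the output path is illustrative).
val mismatched = a.join(b, Seq("key"), "outer")
  .filter($"a.value".isNull || $"b.value".isNull || $"a.value" =!= $"b.value")
mismatched.show
// mismatched.write.csv("mismatched.csv")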

If you are looking for the records in b.csv that are not present in a.csv:

scala> val diff = a.join(b, Seq("key"), "right_outer").filter($"a.value".isNull).drop($"a.value")
scala> diff.show
+---+-----+
|key|value|
+---+-----+
|  c|    d|
+---+-----+

scala> diff.write.csv("diff.csv")
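The question also asks to compare the number of records in the two files. A minimal sketch for that check, assuming the same a and b DataFrames defined above:

// Sketch: compare the total record counts of the two files
// (assumes the `a` and `b` DataFrames defined above).
val countA = a.count()
val countB = b.count()
if (countA != countB)
  println(s"Record counts differ: a.csv has $countA rows, b.csv has $countB rows")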