How to detect changes to rows coming from different DataFrames

Date: 2016-04-12 09:33:00

Tags: scala apache-spark apache-spark-sql spark-dataframe

I have two DataFrames that hold the values for a set of people at two different timestamps. The possible changes between the before and after states are listed in the code below.

val before = Seq(
(1, "soccer", "1", "2", "3", "4", ""),
(2, "soccer", "",  "",  "",  "",  ""),
(3, "soccer", "1", "",  "",  "",  ""),
(4, "soccer", "1", "",  "",  "",  ""),
(5, "soccer", "1", "",  "",  "",  ""),
(6, "soccer", "1", "",  "",  "",  "")
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")

before.show                   //> +---+------+----+----+----+----+----+
                              //| | id| sport|var1|var2|var3|var4|var5|
                              //| +---+------+----+----+----+----+----+
                              //| |  1|soccer|   1|   2|   3|   4|    |
                              //| |  2|soccer|    |    |    |    |    |
                              //| |  3|soccer|   1|    |    |    |    |
                              //| |  4|soccer|   1|    |    |    |    |
                              //| |  5|soccer|   1|    |    |    |    |
                              //| |  6|soccer|   1|    |    |    |    |
                              //| +---+------+----+----+----+----+----+
                              //| 

val after = Seq(
(1, "soccer", "1", "2", "3", "4", ""), // Same
(2, "soccer", "1", "",  "",  "",  ""), // Addition
(3, "soccer", "1", "1", "",  "",  ""), // Addition
(4, "soccer", "",  "",  "",  "",  ""), // Remove
(5, "soccer", "2", "1", "",  "",  ""), // Slide
(6, "soccer", "2", "",  "",  "",  "")  // Change
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")

after.show                    //> +---+------+----+----+----+----+----+
                              //| | id| sport|var1|var2|var3|var4|var5|
                              //| +---+------+----+----+----+----+----+
                              //| |  1|soccer|   1|   2|   3|   4|    |
                              //| |  2|soccer|   1|    |    |    |    |
                              //| |  3|soccer|   1|   1|    |    |    |
                              //| |  4|soccer|    |    |    |    |    |
                              //| |  5|soccer|   2|   1|    |    |    |
                              //| |  6|soccer|   2|    |    |    |    |
                              //| +---+------+----+----+----+----+----+
                              //| 

So things can stay the same, values can be added or removed, and finally there can be a change or a slide.

My ideal output would be something that compares each row of the before and after DataFrames and attaches a label:

outcome.show                   //> +---+------+------+
                               //| | id| sport|  diff|
                               //| +---+------+------+
                               //| |  1|soccer|  same|
                               //| |  2|soccer|   add|
                               //| |  3|soccer|   add|
                               //| |  4|soccer|remove|
                               //| |  5|soccer| slide|
                               //| |  6|soccer|change|
                               //| +---+------+------+
                               //| 
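Before wiring anything into Spark, the five labels can be pinned down as a plain Scala function over the two variable lists. This is only a sketch of one possible reading of the categories (the `classify` name and the prefix/suffix rules are my interpretation of the sample rows, not something stated in the question):

```scala
// One reading of the five outcomes, based on the sample rows above:
// values fill the var slots from the left, and "" means an empty slot.
def classify(before: Seq[String], after: Seq[String]): String = {
  val b = before.filter(_.nonEmpty)
  val a = after.filter(_.nonEmpty)
  if (a == b) "same"
  else if (a.startsWith(b)) "add"     // old values kept, new ones appended
  else if (b.startsWith(a)) "remove"  // some trailing values dropped
  else if (a.endsWith(b)) "slide"     // old values shifted right by a new head
  else "change"                       // otherwise a value was replaced
}

classify(Seq("1", "2", "3", "4", ""), Seq("1", "2", "3", "4", "")) // same
classify(Seq("1", "", "", "", ""),    Seq("2", "1", "", "", ""))   // slide
classify(Seq("1", "", "", "", ""),    Seq("2", "", "", "", ""))    // change
```

Checking it against the six sample ids reproduces the same, add, add, remove, slide, change sequence above.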

This question is related to this question, but there the focus was on counting how many differences exist between two rows... this time I am trying to understand the differences at a finer grain, and I am stuck on defining the different possible options.

Edit

Since I am working with DataFrames, I would like to stick with that structure instead of using case classes. I have therefore tried to adapt what @iboss proposed to work on DataFrames.

I have this UDF that should do all the work:

val diff = udf { (bef:DataFrame, aft:DataFrame) => {
  "hello" // return just this string for now
  } : String
}

This udf will do all the work, as @iboss suggested, to produce the output shown in outcome.show: the possible result of matching two rows will be a String, namely "same", "add", "remove", "slide" or "change".

I then have this code to join the two DataFrames and create the new column:

val mydiff = before.join(after, "id")
  .withColumn("diff", diff( before, after ) )
  .select("id", "diff")

However, I get an error when calling diff, which complains:

type mismatch; found : org.apache.spark.sql.DataFrame required: org.apache.spark.sql.Column

I don't understand why it doesn't like the DataFrame arguments, or how to fix it...

1 Answer:

Answer 0 (score: 0)

I'm not quite sure what those variables are, but if I were you I would group them into a tuple or a case class, which makes further processing easier. It could look like this:

val before = Seq(
    (1, "soccer", ("1", "2", "3", "4", "")),
    (2, "soccer", ("",  "",  "",  "",  "")),
    (3, "soccer", ("1", "",  "",  "",  "")),
    (4, "soccer", ("1", "",  "",  "",  "")),
    (5, "soccer", ("1", "",  "",  "",  "")),
    (6, "soccer", ("1", "",  "",  "",  ""))
).toDF("id", "sport", "vars")


val after = Seq(
    (1, "soccer", ("1", "2", "3", "4", "")),
    (2, "soccer", ("1",  "",  "",  "",  "")),
    (3, "soccer", ("1", "1",  "",  "",  "")),
    (4, "soccer", ("", "",  "",  "",  "")),
    (5, "soccer", ("2", "1",  "",  "",  "")),
    (6, "soccer", ("2", "",  "",  "",  ""))
).toDF("id", "sport", "vars")

Then you can use a user-defined function to compute your diff:

type MyVars = (String, String, String, String, String)

val diff = udf { (before_vars: MyVars, after_vars: MyVars) =>
    // your implementation of diff function
}

before
    .join(after, Seq("id", "sport")) // join on the keys; a bare join would be a cross join
    .withColumn("diff", diff(before("vars"), after("vars")))
    .select("id", "sport", "diff")
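One caveat worth adding here (my note, not part of the original answer): when a struct column is passed into a udf, Spark hands it to the function as an `org.apache.spark.sql.Row` rather than a Scala tuple, so declaring the parameters as `MyVars` is likely to fail at runtime. A sketch that accepts `Row` instead:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val diff = udf { (beforeVars: Row, afterVars: Row) =>
  // struct fields arrive as a Row; pull them out as strings
  val b = beforeVars.toSeq.map(_.toString)
  val a = afterVars.toSeq.map(_.toString)
  if (a == b) "same" else "changed" // placeholder comparison
}
```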

Edit

For udfs, the return type is usually inferred for you, so you may not need to declare it. But if you do want to declare it, you can do it like this:

udf { (firstName: String, lastName: String) => s"$firstName $lastName": String }

or with a block:

udf { (name: String) => {
    val hello = "hello, "
    hello + name
}: String }

You can also use a def:

def getFullName(firstName: String, lastName: String): String =
    s"$firstName $lastName"

udf(getFullName _)

because a def defines a method rather than a function, and udf needs a function, so we have to convert it using partial application notation.
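The method-versus-function distinction can be illustrated in plain Scala, independent of Spark:

```scala
def fullName(first: String, last: String): String = s"$first $last" // a method
val asFunction: (String, String) => String = fullName _ // eta-expansion: method -> function value
asFunction("Jane", "Doe") // "Jane Doe"
```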

For more details, see Difference between method and function in Scala.

修改2

It seems I misunderstood your question. The diff udf has to be applied to each row separately, so you cannot pass the whole DataFrame into it.

I would suggest combining the variables (in each row) into a tuple, since it is easier to read. But if you still want to use the original table layout, you can do this:

val diff = udf { (
    beforeVar1: String, 
    beforeVar2: String, 
    beforeVar3: String, 
    beforeVar4: String, 
    beforeVar5: String, 
    afterVar1: String, 
    afterVar2: String, 
    afterVar3: String, 
    afterVar4: String, 
    afterVar5: String
  ) => {
    "hello" // return just this string for now
  } : String
}

before.join(after, "id")
  .withColumn("diff", diff(
     before("var1"),
     before("var2"),
     before("var3"),
     before("var4"),
     before("var5"),
     after("var1"),
     after("var2"),
     after("var3"),
     after("var4"),
     after("var5")
  ))
  .select("id", "diff")
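To close the loop on the question, the 10-argument udf from the answer can carry the actual labelling logic. The classification rules below are my own reading of the five categories from the sample data, not something given in the thread:

```scala
import org.apache.spark.sql.functions.udf

// Assumed interpretation: values fill the slots from the left, "" is an
// empty slot, and "slide" means the old values shifted right by a new head.
val diff = udf { (b1: String, b2: String, b3: String, b4: String, b5: String,
                  a1: String, a2: String, a3: String, a4: String, a5: String) =>
  val b = Seq(b1, b2, b3, b4, b5).filter(_.nonEmpty)
  val a = Seq(a1, a2, a3, a4, a5).filter(_.nonEmpty)
  if (a == b) "same"
  else if (a.startsWith(b)) "add"
  else if (b.startsWith(a)) "remove"
  else if (a.endsWith(b)) "slide"
  else "change"
}

val outcome = before.join(after, "id")
  .withColumn("diff", diff(
    before("var1"), before("var2"), before("var3"), before("var4"), before("var5"),
    after("var1"),  after("var2"),  after("var3"),  after("var4"),  after("var5")))
  .select("id", "diff")
```

On the six sample rows this should reproduce the same, add, add, remove, slide, change labels from outcome.show, though the rules would need revisiting if the real data allows gaps between filled slots.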