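I have two DataFrames that hold values for some people at two different timestamps. The code below lists the changes that can happen between "before" and "after".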
val before = Seq(
  (1, "soccer", "1", "2", "3", "4", ""),
  (2, "soccer", "", "", "", "", ""),
  (3, "soccer", "1", "", "", "", ""),
  (4, "soccer", "1", "", "", "", ""),
  (5, "soccer", "1", "", "", "", ""),
  (6, "soccer", "1", "", "", "", "")
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")
before.show
// +---+------+----+----+----+----+----+
// | id| sport|var1|var2|var3|var4|var5|
// +---+------+----+----+----+----+----+
// |  1|soccer|   1|   2|   3|   4|    |
// |  2|soccer|    |    |    |    |    |
// |  3|soccer|   1|    |    |    |    |
// |  4|soccer|   1|    |    |    |    |
// |  5|soccer|   1|    |    |    |    |
// |  6|soccer|   1|    |    |    |    |
// +---+------+----+----+----+----+----+
val after = Seq(
  (1, "soccer", "1", "2", "3", "4", ""), // Same
  (2, "soccer", "1", "", "", "", ""),    // Addition
  (3, "soccer", "1", "1", "", "", ""),   // Addition
  (4, "soccer", "", "", "", "", ""),     // Remove
  (5, "soccer", "2", "1", "", "", ""),   // Slide
  (6, "soccer", "2", "", "", "", "")     // Change
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")
after.show
// +---+------+----+----+----+----+----+
// | id| sport|var1|var2|var3|var4|var5|
// +---+------+----+----+----+----+----+
// |  1|soccer|   1|   2|   3|   4|    |
// |  2|soccer|   1|    |    |    |    |
// |  3|soccer|   1|   1|    |    |    |
// |  4|soccer|    |    |    |    |    |
// |  5|soccer|   2|   1|    |    |    |
// |  6|soccer|   2|    |    |    |    |
// +---+------+----+----+----+----+----+
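So a row can stay the same, have values added or removed, or have its values changed or slid to another position.
My ideal output would compare each row of the before and after DataFrames and attach a label, something like this: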
outcome.show
// +---+------+------+
// | id| sport|  diff|
// +---+------+------+
// |  1|soccer|  same|
// |  2|soccer|   add|
// |  3|soccer|   add|
// |  4|soccer|remove|
// |  5|soccer| slide|
// |  6|soccer|change|
// +---+------+------+
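This question is related to this one, but that one focused on counting how many differences exist between two rows. This time I am trying to understand the differences at a finer granularity, and I am stuck on defining the different possible outcomes.
Edit
Since I am working with DataFrames, I would like to stick with that structure rather than use case classes. So I tried to adapt what @iboss proposed to plain DataFrames.
I have this UDF, which is supposed to do all the work: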
val diff = udf { (bef:DataFrame, aft:DataFrame) => {
"hello" // return just this string for now
} : String
}
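This UDF would do all the work, as @iboss suggested, to produce the output shown in outcome.show. The result of matching two rows would therefore be a String, more precisely one of "same", "add", "remove", "slide" or "change".
Then I have this code to join the two DataFrames and create the new column: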
val mydiff = before.join(after, "id")
.withColumn("diff", diff( before, after ) )
.select("id", "diff")
However, the call to diff fails with this error:
type mismatch; found : org.apache.spark.sql.DataFrame required: org.apache.spark.sql.Column
I don't understand why it rejects a DataFrame, or how to fix it...
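Answer 0 (score: 0)
I'm not sure what those variables represent, but if I were you I would group them into a tuple or a case class, which makes further processing easier. It could look like this: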
val before = Seq(
  (1, "soccer", ("1", "2", "3", "4", "")),
  (2, "soccer", ("", "", "", "", "")),
  (3, "soccer", ("1", "", "", "", "")),
  (4, "soccer", ("1", "", "", "", "")),
  (5, "soccer", ("1", "", "", "", "")),
  (6, "soccer", ("1", "", "", "", ""))
).toDF("id", "sport", "vars")
val after = Seq(
  (1, "soccer", ("1", "2", "3", "4", "")),
  (2, "soccer", ("1", "", "", "", "")),
  (3, "soccer", ("1", "1", "", "", "")),
  (4, "soccer", ("", "", "", "", "")),
  (5, "soccer", ("2", "1", "", "", "")),
  (6, "soccer", ("2", "", "", "", ""))
).toDF("id", "sport", "vars")
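Then you can use a user-defined function to compute the diff:
type MyVars = (String, String, String, String, String)
val diff = udf { (before_vars: MyVars, after_vars: MyVars) =>
  // your implementation of diff function
}
before
  .join(after, "id") // join on the key; a bare join(after) would pair every row with every row
  .withColumn("diff", diff(before("vars"), after("vars")))
  .select($"id", before("sport"), $"diff") // "sport" exists on both sides, so qualify it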
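As a rough illustration of what the body of diff might look like, here is a minimal sketch in plain Scala (no Spark needed). The rules are an assumption inferred from the six example rows in the question, not something the original post defines: empty strings are ignored, "add" means the old values are a prefix of the new ones, "remove" is the reverse, "slide" means the old values still appear at the end of the new ones, and anything else is a "change". The names DiffLabels and classify are hypothetical.

```scala
object DiffLabels {
  // Assumed rules, inferred from the example rows (not stated in the question):
  //   same   : non-empty values are identical
  //   add    : new values start with the old ones and have extra entries
  //   remove : old values start with the new ones and have extra entries
  //   slide  : old values still appear, but pushed to the end of the new ones
  //   change : anything else
  def classify(before: Seq[String], after: Seq[String]): String = {
    // Note: dropping empty strings discards positional information.
    val b = before.filter(_.nonEmpty)
    val a = after.filter(_.nonEmpty)
    if (a == b) "same"
    else if (a.length > b.length && a.startsWith(b)) "add"
    else if (a.length < b.length && b.startsWith(a)) "remove"
    else if (a.length > b.length && a.endsWith(b)) "slide"
    else "change"
  }
}
```

Under these assumed rules, row 5 of the question, DiffLabels.classify(Seq("1", "", "", "", ""), Seq("2", "1", "", "", "")), is labelled "slide". If position matters in your data, the filtering step would need to be revisited.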
Edit
For a UDF, the return type is usually inferred for you, so you may not need to declare it. If you do want to declare it, you can do it like this:
udf { (firstName: String, lastName: String) => s"$firstName $lastName": String }
or with a block:
udf { (name: String) => {
  val hello = "hello, "
  hello + name
}: String }
You can also use a def:
def getFullName(firstName: String, lastName: String): String =
s"$firstName $lastName"
udf(getFullName _)
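The trailing underscore is needed because def defines a method rather than a function, and udf requires a function. The underscore notation (eta-expansion) converts the method into a function value.
For details, see Difference between method and function in Scala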
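To make the method-versus-function point concrete, here is a small sketch in plain Scala with no Spark dependency. The object name MethodVsFunction and the val names are hypothetical, chosen only for illustration.

```scala
object MethodVsFunction {
  // A method: it belongs to the enclosing object and is not itself a value.
  def getFullName(firstName: String, lastName: String): String =
    s"$firstName $lastName"

  // The trailing underscore turns the method into a function value
  // of type (String, String) => String.
  val asFunction: (String, String) => String = getFullName _

  // An explicit lambda that wraps the method is equivalent.
  val asLambda: (String, String) => String =
    (first, last) => getFullName(first, last)
}
```

Either asFunction or asLambda is a value that can be handed to anything expecting a two-argument function, which is the shape udf asks for.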
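Edit 2
It seems I misunderstood your question. The diff UDF is applied to each row separately, so you cannot pass an entire DataFrame to it; its arguments must be Columns.
I suggested combining the variables (in each row) into a tuple because that is easier to read. But if you still want to use the original table, you can do this: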
val diff = udf { (
  beforeVar1: String,
  beforeVar2: String,
  beforeVar3: String,
  beforeVar4: String,
  beforeVar5: String,
  afterVar1: String,
  afterVar2: String,
  afterVar3: String,
  afterVar4: String,
  afterVar5: String
) => {
  "hello" // return just this string for now
} : String
}
before.join(after, "id")
  .withColumn("diff", diff(
    before("var1"),
    before("var2"),
    before("var3"),
    before("var4"),
    before("var5"),
    after("var1"),
    after("var2"),
    after("var3"),
    after("var4"),
    after("var5") // no trailing comma: older Scala versions reject it
  ))
  .select("id", "diff")
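If ten separate parameters feel unwieldy, one alternative is to pack each side's columns into a struct and let the UDF receive two Row values. The sketch below is only an illustration under assumptions: it needs Spark on the classpath, it uses a reduced two-variable dataset to stay short, and the names StructDiffSketch and label are hypothetical. The label function here only tells "same" from "different"; a finer classification would replace it.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{struct, udf}

object StructDiffSketch {
  // Hypothetical placeholder logic: compares the two rows value by value.
  def label(b: Seq[Any], a: Seq[Any]): String =
    if (b == a) "same" else "different"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("struct-diff-sketch")
      .getOrCreate()
    import spark.implicits._

    // Reduced illustrative data: two variables instead of five.
    val before = Seq((1, "soccer", "1", ""), (2, "soccer", "", ""))
      .toDF("id", "sport", "var1", "var2")
    val after = Seq((1, "soccer", "1", ""), (2, "soccer", "1", ""))
      .toDF("id", "sport", "var1", "var2")

    val varCols = Seq("var1", "var2")

    // A struct column reaches the UDF as a Row, evaluated once per joined row.
    val diff = udf { (b: Row, a: Row) => label(b.toSeq, a.toSeq) }

    val outcome = before.join(after, "id")
      .withColumn("diff", diff(
        struct(varCols.map(before(_)): _*),
        struct(varCols.map(after(_)): _*)
      ))
      .select($"id", before("sport"), $"diff")

    outcome.show()
    spark.stop()
  }
}
```

The design choice is that adding a sixth variable only means extending varCols, rather than changing the UDF signature.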