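I have a DataFrame in Spark with 8 columns. I want to compare the first four columns with the last four, i.e. Name_a with Name_b, status_a with status_b, and so on. How can I do this comparison in Spark using Scala?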
Answer 0 (score: 2)
Question: I want to compare the first four columns with the last four, i.e. Name_a with Name_b, status_a with status_b, and so on. How can I do this using Scala?
Option 1:
You can do this with except. The sample code below is self-explanatory.
package com.examples
import org.apache.log4j.{Level, Logger}
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession

/**
  * @author : Ram Ghadiyaram
  */
object FindDataFrameColumnDifferences extends App with Logging {
  Logger.getLogger("org").setLevel(Level.WARN)

  case class Employee(Name_a: String, status_a: Int, date_a: String, ID_a: Int,
                      Name_b: String, status_b: Int, date_b: String, ID_b: Int)

  val spark: SparkSession = SparkSession.builder().appName(this.getClass.getName).master("local[*]").getOrCreate()
  //spark.sparkContext.setLogLevel("ERROR")

  import spark.implicits._

  val df = List(
    Employee("Ram", 1, "21-Mar-2019", 20048965, "Ram", 1, "21-Mar-2019", 20048965),
    Employee("Ram", 1, "21-Mar-2019", 20048965, "Ram", 1, "21-Mar-2019", 20048965),
    Employee("Mishy_tics", 1, "21-Mar-2019", 20048965, "Mishy", 1, "21-Mar-2019", 20048965),
    Employee("Mishy_tics", 1, "21-Mar-2019", 20048965, "tics", 1, "21-Mar-2019", 20048965)
  ).toDF
  logInfo("original dataframe with 8 columns")
  df.show(false)

  logInfo("Now take first 4 columns in the original dataframe and rename using alias ")
  val firstDataFrame = df.selectExpr("Name_a as name", "status_a as status", "date_a as date", "ID_a as id")
  logInfo("printing first dataframe ")
  firstDataFrame.show

  logInfo("Now take last 4 columns in the original dataframe and rename using alias ")
  val secondDataFrame = df.selectExpr("Name_b as name", "status_b as status", "date_b as date", "ID_b as id")
  logInfo("printing second dataframe ")
  secondDataFrame.show

  val columns = firstDataFrame.schema.fields.map(_.name)

  logInfo("first except second")
  // For each column, show the distinct values found in the first DataFrame but not in the second.
  columns
    .map(col => firstDataFrame.select(col).except(secondDataFrame.select(col)))
    .foreach(diff => if (diff.count > 0) diff.show)

  logInfo("second except first")
  // For each column, show the distinct values found in the second DataFrame but not in the first.
  columns
    .map(col => secondDataFrame.select(col).except(firstDataFrame.select(col)))
    .foreach(diff => if (diff.count > 0) diff.show)
}
2019-05-05 19:10:05 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-05-05 19:10:13 INFO FindDataFrameColumnDifferences:54 - original dataframe with 8 columns
+----------+--------+-----------+--------+------+--------+-----------+--------+
|Name_a    |status_a|date_a     |ID_a    |Name_b|status_b|date_b     |ID_b    |
+----------+--------+-----------+--------+------+--------+-----------+--------+
|Ram       |1       |21-Mar-2019|20048965|Ram   |1       |21-Mar-2019|20048965|
|Ram       |1       |21-Mar-2019|20048965|Ram   |1       |21-Mar-2019|20048965|
|Mishy_tics|1       |21-Mar-2019|20048965|Mishy |1       |21-Mar-2019|20048965|
|Mishy_tics|1       |21-Mar-2019|20048965|tics  |1       |21-Mar-2019|20048965|
+----------+--------+-----------+--------+------+--------+-----------+--------+
2019-05-05 19:10:13 INFO FindDataFrameColumnDifferences:54 - Now take first 4 columns in the original dataframe and rename using alias
2019-05-05 19:10:14 INFO FindDataFrameColumnDifferences:54 - printing first dataframe
+----------+------+-----------+--------+
|      name|status|       date|      id|
+----------+------+-----------+--------+
|       Ram|     1|21-Mar-2019|20048965|
|       Ram|     1|21-Mar-2019|20048965|
|Mishy_tics|     1|21-Mar-2019|20048965|
|Mishy_tics|     1|21-Mar-2019|20048965|
+----------+------+-----------+--------+
2019-05-05 19:10:14 INFO FindDataFrameColumnDifferences:54 - Now take last 4 columns in the original dataframe and rename using alias
2019-05-05 19:10:14 INFO FindDataFrameColumnDifferences:54 - printing second dataframe
+-----+------+-----------+--------+
| name|status|       date|      id|
+-----+------+-----------+--------+
|  Ram|     1|21-Mar-2019|20048965|
|  Ram|     1|21-Mar-2019|20048965|
|Mishy|     1|21-Mar-2019|20048965|
| tics|     1|21-Mar-2019|20048965|
+-----+------+-----------+--------+
2019-05-05 19:10:14 INFO FindDataFrameColumnDifferences:54 - first except second
+----------+
|      name|
+----------+
|Mishy_tics|
+----------+
2019-05-05 19:10:29 INFO FindDataFrameColumnDifferences:54 - second except first
+-----+
| name|
+-----+
|Mishy|
| tics|
+-----+
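Note that except works on the set of distinct values in each column, so it reports which values are missing on one side but not which rows disagree. If a row-by-row comparison is what is needed, a minimal sketch could look like the one below. The object name RowWiseComparisonSketch, the helpers withMatchFlags and mismatches, and the <base>_match column names are hypothetical and not part of the original answer; <=> is Spark's null-safe equality operator on Column.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RowWiseComparisonSketch {
  // Base names shared by the *_a and *_b columns of the question's DataFrame.
  val bases: Seq[String] = Seq("Name", "status", "date", "ID")

  // Hypothetical helper: adds one boolean column <base>_match per pair,
  // true when <base>_a and <base>_b are equal (null-safe).
  def withMatchFlags(df: DataFrame): DataFrame =
    bases.foldLeft(df) { (acc, base) =>
      acc.withColumn(s"${base}_match", col(s"${base}_a") <=> col(s"${base}_b"))
    }

  // Hypothetical helper: keeps only the rows where at least one pair differs.
  def mismatches(df: DataFrame): DataFrame = {
    val allMatch: Column = bases.map(b => col(s"${b}_match")).reduce(_ && _)
    withMatchFlags(df).filter(!allMatch)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RowWiseComparisonSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Same sample values as in the answer above.
    val df = Seq(
      ("Ram", 1, "21-Mar-2019", 20048965, "Ram", 1, "21-Mar-2019", 20048965),
      ("Mishy_tics", 1, "21-Mar-2019", 20048965, "Mishy", 1, "21-Mar-2019", 20048965),
      ("Mishy_tics", 1, "21-Mar-2019", 20048965, "tics", 1, "21-Mar-2019", 20048965)
    ).toDF("Name_a", "status_a", "date_a", "ID_a", "Name_b", "status_b", "date_b", "ID_b")

    withMatchFlags(df).show(false) // all rows, with four extra boolean columns
    mismatches(df).show(false)     // only the two rows whose names differ
    spark.stop()
  }
}
```

Keeping the flags as separate columns makes it possible to see which pair differs in a given row, not just that the row differs.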
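Option 2: Another way: once you have the first DataFrame with its 8 columns, you can do an equi join / self join on name and status and find the differences between the two sides.
See: Joining Spark dataframes on the key
Option 2 is the simplest way, in my opinion.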
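One possible reading of Option 2 is sketched below: split the 8-column DataFrame into its two halves as in Option 1, then use a left anti join on name and status to keep the rows of one half that have no partner in the other. The object name JoinDifferencesSketch and the helper onlyIn are hypothetical, and the choice of a left_anti join is an assumption; the answer does not say which join type it has in mind.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinDifferencesSketch {
  // Hypothetical helper: rows of `left` with no matching row in `right` on `keys`.
  def onlyIn(left: DataFrame, right: DataFrame, keys: Seq[String]): DataFrame =
    left.join(right, keys, "left_anti")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JoinDifferencesSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Same sample values as in Option 1.
    val df = Seq(
      ("Ram", 1, "21-Mar-2019", 20048965, "Ram", 1, "21-Mar-2019", 20048965),
      ("Mishy_tics", 1, "21-Mar-2019", 20048965, "Mishy", 1, "21-Mar-2019", 20048965),
      ("Mishy_tics", 1, "21-Mar-2019", 20048965, "tics", 1, "21-Mar-2019", 20048965)
    ).toDF("Name_a", "status_a", "date_a", "ID_a", "Name_b", "status_b", "date_b", "ID_b")

    // Give both halves the same column names so they can be joined on them.
    val first = df.selectExpr("Name_a as name", "status_a as status")
    val second = df.selectExpr("Name_b as name", "status_b as status")
    val keys = Seq("name", "status")

    // Unlike except, a left anti join does not deduplicate the left side,
    // so both Mishy_tics rows are kept here.
    onlyIn(first, second, keys).show(false)
    onlyIn(second, first, keys).show(false)
    spark.stop()
  }
}
```

Adding date and id to keys would extend the comparison to all four column pairs.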
Answer 1 (score: 1)
Suppose a record can be represented as:
Person(name: String, status: Boolean, date: String, id: Int)
In your case, each row contains the Person record twice. You can wrap the two persons into a single row like this:
case class Person(name: String, status: Boolean, date: String, id: Int)
case class TuplePerson(a: Person, b: Person)
Then you can use a Dataset to compare a with b. Here is the complete code:
// Assumes spark-shell, where a SparkSession named spark is in scope;
// the import supplies toDS() and the $"..." column syntax.
import spark.implicits._

case class Person(name: String, status: Boolean, date: String, id: Int)
case class TuplePerson(a: Person, b: Person)

val df = Seq(
  TuplePerson(Person("John", true, "15-05-2019", 54), Person("John", true, "15-05-2019", 54)),
  TuplePerson(Person("Sofia", true, "15-05-2019", 54), Person("John", true, "15-05-2019", 53)),
  TuplePerson(Person("John", true, "15-05-2019", 52), Person("John", true, "15-05-2019", 52))
).toDS()

// Keep only the rows where both halves are identical.
df.where($"a" === $"b").show(false)
Output:
+----------------------------+----------------------------+
|a                           |b                           |
+----------------------------+----------------------------+
|[John, true, 15-05-2019, 54]|[John, true, 15-05-2019, 54]|
|[John, true, 15-05-2019, 52]|[John, true, 15-05-2019, 52]|
+----------------------------+----------------------------+
Or, to get the rows where the left and right parts differ:
df.where($"a" =!= $"b").show(false)
+-----------------------------+----------------------------+
|a                            |b                           |
+-----------------------------+----------------------------+
|[Sofia, true, 15-05-2019, 54]|[John, true, 15-05-2019, 53]|
+-----------------------------+----------------------------+
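The DataFrame in the question is flat rather than nested, so to apply the same idea there, one option is to pack each half into a struct column first. The sketch below assumes the eight column names from the question; the object name StructComparisonSketch and the helper asPairs are hypothetical. The fields are aliased to the same names on both sides so that the two structs have the same type, which the comparison relies on.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

object StructComparisonSketch {
  // Base names shared by the *_a and *_b columns of the question's DataFrame.
  private val bases = Seq("Name", "status", "date", "ID")

  // Hypothetical helper: packs the *_a columns into struct column "a"
  // and the *_b columns into struct column "b", with identical field names.
  def asPairs(df: DataFrame): DataFrame = {
    def side(suffix: String): Column =
      struct(bases.map(b => col(s"${b}_$suffix").as(b.toLowerCase)): _*)
    df.select(side("a").as("a"), side("b").as("b"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StructComparisonSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Flat sample rows reusing the values from this answer (status kept as Boolean).
    val df = Seq(
      ("John", true, "15-05-2019", 54, "John", true, "15-05-2019", 54),
      ("Sofia", true, "15-05-2019", 54, "John", true, "15-05-2019", 53),
      ("John", true, "15-05-2019", 52, "John", true, "15-05-2019", 52)
    ).toDF("Name_a", "status_a", "date_a", "ID_a", "Name_b", "status_b", "date_b", "ID_b")

    val pairs = asPairs(df)
    pairs.where(col("a") === col("b")).show(false) // rows where both halves agree
    pairs.where(col("a") =!= col("b")).show(false) // rows where they differ
    spark.stop()
  }
}
```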