Comparing 2 dataframes in Apache Spark Scala

Date: 2018-07-13 16:39:16

Tags: scala apache-spark dataframe

I have 2 datasets:

Dataset 1:

CustID  CustName    CustRegion  
1       Joe         Canada  
2       Jane        UK  

Dataset 2:

CustID   CustName   CustCity    CustPrice
1        Joe        Berlin      20
2        Jane       UK          11
3        Bill       France      30

After merging the two dataframes I want to get the following output:

CustID  CustName        CustRegion      CustCity            CustPrice
1       Joe             Canada          (null, Berlin)      (null, 20)
2       Jane            UK              (null, UK)          (null, 11)
3       (null, Bill)    (null, null)    (null, France)      (null, 30)

I tried the code below, but I get an error when using ...:

val cols = df1.columns.filter(_ != "CustID").toList

// function to create an expression that results in null for similar values,
// and with a two-item array with the differing values otherwise
def mapDiffs(name: String) = {
  when($"df1.$name" === $"df2.$name", df2.$name)
    .otherwise(array($"l.$name", $"r.$name"))
    .as(name)
}

// joining the two DFs on CustID
val result = df1.as("l")
  .join(targetDF.as("r"), "CustID")
  .select($"CustID" :: cols.map(mapDiffs): _*)

result.show()

1 Answer:

Answer 0 (score: 0)

I see two mistakes in your attempt:

  1. You aliased the dataframes as l and r, but inside the function you refer to df1 and df2.
  2. The condition in the function is not sufficient to produce the expected output dataframe.

Here is one possible solution. I have created temporary dataframes matching the ones mentioned in the question, and added comments to each part for explanation and clarification.

//imports needed for toDF, $-columns, when, lit and concat_ws
import org.apache.spark.sql.functions._
import spark.implicits._

//creating temporary dataframes for testing
val df1 = Seq(
  ("1", "Joe", "Canada"),
  ("2", "Jane", "UK")
).toDF("CustID", "CustName", "CustRegion")

val targetDF = Seq(
  ("1", "Joe", "Berlin", "20"),
  ("2", "Jane", "UK", "11"),
  ("3", "Bill", "France", "30")
).toDF("CustID", "CustName", "CustCity", "CustPrice")

//selecting column names of both dataframe without CustID
val cols1 = df1.columns.filter(_ != "CustID").toList
val cols2 = targetDF.columns.filter(_ != "CustID").toList

//all column names of both dataframes without duplicates, except CustID
val allCols = (cols1 ++ cols2).toSet

//extra columns needed for both dataframes columns to be same
val extraColsDf1 = allCols -- cols1
val extraColsTargetDf = allCols -- cols2

//adding the extra columns and populating with null
val tempDf1 = extraColsDf1.foldLeft(df1){(temp, c) => temp.withColumn(c, lit(null))}
val tempTargetDf = extraColsTargetDf.foldLeft(targetDF){(temp, c) => temp.withColumn(c, lit(null))}

//function for getting your desired output
def mapDiffs(name: String) = {
  when($"l.$name".isNull && $"r.$name".isNull, concat_ws(",", lit("null"), lit("null")))  // for outputs like null,null
      .otherwise(when($"r.$name".isNull, $"l.$name")  //for outputs like Canada and UK
        .otherwise(when($"l.$name".isNull, concat_ws(",", lit("null"), $"r.$name")) //for outputs like null,Bill
        .otherwise(when($"l.$name" === $"r.$name" && $"l.$name".isNotNull && $"r.$name".isNotNull, $"l.$name")  // for outputs like Joe
          .otherwise(concat_ws(",", $"l.$name", $"r.$name"))  //for rest of the outputs
        )
      )
    )
    .as(name)
}

// outer joining the two DFs on CustID and calling the above function
val result = tempDf1.as("l")
  .join(tempTargetDf.as("r"), Seq("CustID"), "outer")
  .select(Seq($"CustID") ++ allCols.map(mapDiffs): _*)

//showing the output
result.show(false)

which should give you

+------+---------+----------+-----------+---------+
|CustID|CustName |CustRegion|CustCity   |CustPrice|
+------+---------+----------+-----------+---------+
|3     |null,Bill|null,null |null,France|null,30  |
|1     |Joe      |Canada    |null,Berlin|null,20  |
|2     |Jane     |UK        |null,UK    |null,11  |
+------+---------+----------+-----------+---------+
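The when-chain inside mapDiffs encodes one rule applied to every cell. As a sketch of that rule in plain Scala over Option values (mergeCell is a hypothetical helper, not part of the Spark code above; it does not need a Spark session to run):

```scala
// Hypothetical pure-Scala analogue of the mapDiffs when-chain:
// merge one cell from the left (l) and right (r) dataframes.
def mergeCell(l: Option[String], r: Option[String]): String =
  (l, r) match {
    case (None, None)                     => "null,null" // both sides null
    case (Some(lv), None)                 => lv          // only left present, e.g. Canada
    case (None, Some(rv))                 => s"null,$rv" // only right present, e.g. null,Bill
    case (Some(lv), Some(rv)) if lv == rv => lv          // equal values collapse, e.g. Joe
    case (Some(lv), Some(rv))             => s"$lv,$rv"  // differing values joined
  }

mergeCell(Some("Joe"), Some("Joe"))   // "Joe"
mergeCell(None, Some("Bill"))         // "null,Bill"
```

The Spark version has to express the same cases as nested when/otherwise calls because Column expressions cannot be pattern-matched.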

You can modify it according to your needs.
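For reference, the set arithmetic that aligns the two schemas before the join can be checked in plain Scala, independent of Spark (the column lists below are copied from the example dataframes):

```scala
// Columns of each dataframe, minus the join key CustID,
// mirroring cols1 / cols2 in the solution above.
val cols1 = List("CustName", "CustRegion")
val cols2 = List("CustName", "CustCity", "CustPrice")

// Union of all non-key columns, de-duplicated.
val allCols = (cols1 ++ cols2).toSet

// Columns each side is missing and must gain as null-filled columns:
// df1 lacks CustCity and CustPrice, targetDF lacks CustRegion.
val extraColsDf1 = allCols -- cols1
val extraColsTargetDf = allCols -- cols2
```

Each foldLeft then walks one of these sets and calls withColumn(c, lit(null)) once per missing column, so both sides end up with the same schema.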

I hope the answer is helpful.