Compare two DataFrame columns and show the rows present in df1 but not in df2

Time: 2019-04-03 20:36:52

Tags: sql apache-spark apache-spark-sql pyspark-sql

Compare two DataFrames, df1 (recent data) and df2 (previous data), which were exported from the same table at different timestamps, and pull the rows from df1 whose key column (id) is not available in df2.

I used row numbers to extract the recent and previous data and stored them in df1 (latest data) and df2 (previous data). I tried a left join and subtract, but I'm not sure I'm on the right track.
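For context, a minimal sketch of how such snapshots could be derived with row numbers; source_df and the window ordering are illustrative assumptions, not taken from the post:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical source table holding both snapshots for each id
w = Window.partitionBy("ID").orderBy(F.col("Timestamp").desc())
ranked = source_df.withColumn("RowNum", F.row_number().over(w))

df1 = ranked.filter(F.col("RowNum") == 1)  # latest snapshot per id
df2 = ranked.filter(F.col("RowNum") == 2)  # previous snapshot per id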

df1 =

+---+--------------------+------+
| ID|           Timestamp|RowNum|
+---+--------------------+------+
|  1|2019-04-03 14:45:...|     1|
|  2|2019-04-03 14:45:...|     1|
|  3|2019-04-03 14:45:...|     1|
+---+--------------------+------+

df2 =

+---+--------------------+------+
| ID|           Timestamp|RowNum|
+---+--------------------+------+
|  2|2019-04-03 13:45:...|     2|
|  3|2019-04-03 13:45:...|     2|
+---+--------------------+------+


%%spark
result2 = df1.join(df2.select(['id']), ['id'], how='left')
result2.show(10)

but it didn't give the desired output.

Required output:

+---+--------------------+------+
| ID|           Timestamp|RowNum|
+---+--------------------+------+
|  1|2019-04-03 14:45:...|     1|
+---+--------------------+------+

3 Answers:

Answer 0 (score: 1)

You can use the left_anti join type to do exactly what you want:

result2 = df1.join(df2, ['id'], how='left_anti')

It isn't explained very well in the Spark documentation itself, but you can find more information about this join type here, for example.
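With the sample data from the question, this keeps only the row whose ID has no match in df2, which is exactly the required output:

result2.show()

+---+--------------------+------+
| ID|           Timestamp|RowNum|
+---+--------------------+------+
|  1|2019-04-03 14:45:...|     1|
+---+--------------------+------+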

Answer 1 (score: 1)

There are two ways to achieve this:

1. isin() with a collected list: build a list (df2_list) from the lookup DataFrame and use it in isin() to keep the ids that are not in it.

from pyspark.sql.functions import col

df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C"), (4, "D")], ['id', 'item'])

df2 = spark.createDataFrame([(1, 10), (2, 20)], ['id', 'otherItem'])

# Collect the lookup ids to the driver; fine while df2 stays small
df2_list = df2.select('id').rdd.map(lambda row: row[0]).collect()

# Keep only the df1 rows whose id is NOT in the collected list
df1.where(~col('id').isin(df2_list)).show()

2. Left anti join: put the main table on the left.

df1.join(df2,  df1.id==df2.id, 'left_anti').show()
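With this sample data, both approaches return the df1 rows whose id does not appear in df2:

+---+----+
| id|item|
+---+----+
|  3|   C|
|  4|   D|
+---+----+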

Answer 2 (score: 0)

Try this.

scala> val df1 = Seq(("1","2019-04-03 14:45:00","1"),("2","2019-04-03 14:45:00","1"),("3","2019-04-03 14:45:00","1")).toDF("ID","Timestamp","RowNum")
df1: org.apache.spark.sql.DataFrame = [ID: string, Timestamp: string ... 1 more field]

scala> df1.show
+---+-------------------+------+
| ID|          Timestamp|RowNum|
+---+-------------------+------+
|  1|2019-04-03 14:45:00|     1|
|  2|2019-04-03 14:45:00|     1|
|  3|2019-04-03 14:45:00|     1|
+---+-------------------+------+

scala> val df2 = Seq(("2","2019-04-03 13:45:00","2"),("3","2019-04-03 13:45:00","2")).toDF("ID","Timestamp","RowNum")
df2: org.apache.spark.sql.DataFrame = [ID: string, Timestamp: string ... 1 more field]

scala> df2.show
+---+-------------------+------+
| ID|          Timestamp|RowNum|
+---+-------------------+------+
|  2|2019-04-03 13:45:00|     2|
|  3|2019-04-03 13:45:00|     2|
+---+-------------------+------+

scala> val idDiff = df1.select("ID").except(df2.select("ID"))
idDiff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: string]

scala> idDiff.show
+---+
| ID|
+---+
|  1|
+---+


scala> val outputDF = df1.join(idDiff, "ID")
outputDF: org.apache.spark.sql.DataFrame = [ID: string, Timestamp: string ... 1 more field]

scala> outputDF.show
+---+-------------------+------+
| ID|          Timestamp|RowNum|
+---+-------------------+------+
|  1|2019-04-03 14:45:00|     1|
+---+-------------------+------+
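For comparison, a minimal PySpark sketch of the same except-then-join idea, assuming the df1/df2 from the question; DataFrame.subtract behaves like SQL EXCEPT DISTINCT, matching except in the Scala code above:

# Ids present in df1 but not in df2
id_diff = df1.select("ID").subtract(df2.select("ID"))

# Inner join back to df1 to recover the full rows
output_df = df1.join(id_diff, "ID")
output_df.show()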