Compare two dataframes df1 (recent data) and df2 (previous data), exported from the same table at different timestamps, and pull from df1 the rows whose id values are not present in df2
I extract the most recent and the previous data using a row number and store them in df1 (latest data) and df2 (previous data). I have tried a left join and subtract, but I'm not sure whether I'm on the right track.
df1 =
+---+--------------------+------+
| ID|           Timestamp|RowNum|
+---+--------------------+------+
|  1|2019-04-03 14:45:...|     1|
|  2|2019-04-03 14:45:...|     1|
|  3|2019-04-03 14:45:...|     1|
+---+--------------------+------+
df2 =
+---+--------------------+------+
| ID|           Timestamp|RowNum|
+---+--------------------+------+
|  2|2019-04-03 13:45:...|     2|
|  3|2019-04-03 13:45:...|     2|
+---+--------------------+------+
%%spark
result2 = df1.join(df2.select(['id']), ['id'], how='left')
result2.show(10)
but this didn't give the desired output: a left join keeps every row of df1, so all three rows come back.
Required Output:
+---+--------------------+------+
| ID|           Timestamp|RowNum|
+---+--------------------+------+
|  1|2019-04-03 14:45:...|     1|
+---+--------------------+------+
Answer 0 (score: 1)
You can use the left_anti join type to do exactly what you want:
result2 = df1.join(df2, ['id'], how='left_anti')
It isn't explained very well in the Spark documentation itself, but you can find more information about this join type here, for example.
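For instance, applied to the question's sample data (a minimal sketch; the truncated timestamps are filled in with assumed full values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rebuilt from the question's sample data (full timestamps assumed).
df1 = spark.createDataFrame(
    [(1, "2019-04-03 14:45:00", 1),
     (2, "2019-04-03 14:45:00", 1),
     (3, "2019-04-03 14:45:00", 1)],
    ["ID", "Timestamp", "RowNum"])
df2 = spark.createDataFrame(
    [(2, "2019-04-03 13:45:00", 2),
     (3, "2019-04-03 13:45:00", 2)],
    ["ID", "Timestamp", "RowNum"])

# left_anti keeps only the df1 rows whose ID has no match in df2.
df1.join(df2, ['ID'], how='left_anti').show()
# +---+-------------------+------+
# | ID|          Timestamp|RowNum|
# +---+-------------------+------+
# |  1|2019-04-03 14:45:00|     1|
# +---+-------------------+------+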
Answer 1 (score: 1)
There are two ways to achieve this:
1. Not-in filter: build a list (df2_list) of ids from the lookup dataframe and filter df1 with isin(df2_list) == False.
from pyspark.sql.functions import col

# Toy data: df1 is the main table, df2 is the lookup table.
df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C"), (4, "D")], ['id', 'item'])
df2 = spark.createDataFrame([(1, 10), (2, 20)], ['id', 'otherItem'])

# Collect the lookup ids into a plain Python list on the driver.
df2_list = df2.select('id').rdd.map(lambda row: row[0]).collect()

# Keep only the df1 rows whose id is not in that list.
df1.where(col('id').isin(df2_list) == False).show()
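With this toy data, ids 3 and 4 are the ones absent from df2, so the filter should print:

+---+----+
| id|item|
+---+----+
|  3|   C|
|  4|   D|
+---+----+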
2. left_anti join: keep the main table on the left.
df1.join(df2, df1.id==df2.id, 'left_anti').show()
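This returns the same two rows as the first method, but without collecting the lookup ids to the driver, so it scales better for large lookup tables; note that a left_anti join returns only the left table's columns.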
Answer 2 (score: 0)
Try this:
scala> val df1 = Seq(("1","2019-04-03 14:45:00","1"),("2","2019-04-03 14:45:00","1"),("3","2019-04-03 14:45:00","1")).toDF("ID","Timestamp","RowNum")
df1: org.apache.spark.sql.DataFrame = [ID: string, Timestamp: string ... 1 more field]
scala> df1.show
+---+-------------------+------+
| ID| Timestamp|RowNum|
+---+-------------------+------+
| 1|2019-04-03 14:45:00| 1|
| 2|2019-04-03 14:45:00| 1|
| 3|2019-04-03 14:45:00| 1|
+---+-------------------+------+
scala> val df2 = Seq(("2","2019-04-03 13:45:00","2"),("3","2019-04-03 13:45:00","2")).toDF("ID","Timestamp","RowNum")
df2: org.apache.spark.sql.DataFrame = [ID: string, Timestamp: string ... 1 more field]
scala> df2.show
+---+-------------------+------+
| ID| Timestamp|RowNum|
+---+-------------------+------+
| 2|2019-04-03 13:45:00| 2|
| 3|2019-04-03 13:45:00| 2|
+---+-------------------+------+
scala> val idDiff = df1.select("ID").except(df2.select("ID"))
idDiff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: string]
scala> idDiff.show
+---+
| ID|
+---+
| 1|
+---+
scala> val outputDF = df1.join(idDiff, "ID")
outputDF: org.apache.spark.sql.DataFrame = [ID: string, Timestamp: string ... 1 more field]
scala> outputDF.show
+---+-------------------+------+
| ID| Timestamp|RowNum|
+---+-------------------+------+
| 1|2019-04-03 14:45:00| 1|
+---+-------------------+------+
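Note that except computes a distinct set difference, so idDiff holds each unmatched ID once, and the final join then brings back the full matching rows from df1. A left_anti join, as in the other answers, does the same in a single step.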