How to get only the records present in one dataframe but not in another, using pyspark?

Date: 2018-10-06 11:06:25

Tags: dataframe pyspark pyspark-sql

I have two dataframes, df1 and df2. df1 has 70 rows and 7 columns, while df2 has 80 rows and 7 columns.

How can I fetch only the records from df2 that are new relative to df1, i.e. rows whose values do not exist in df1, using pyspark 2.2.0?

I tried this left-join approach, but could not get it to run with sqlContext.sql():

# Register both dataframes as temp views so they are visible to SQL
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

sqlContext.sql("""
select df2.*
from df2
left join df1
  on df2.col1 = df1.col1
 and df2.col2 = df1.col2
 and df2.col3 = df1.col3
 and df2.col4 = df1.col4
 and df2.col5 = df1.col5
 and df2.col6 = df1.col6
 and df2.col7 = df1.col7
where df1.col1 is null
  and df1.col2 is null
  and df1.col3 is null
  and df1.col4 is null
  and df1.col5 is null
  and df1.col6 is null
  and df1.col7 is null
""").show()

1 Answer:

Answer 0 (score: 0)

Use the DataFrame method subtract [1]. Example:


[1] https://spark.apache.org/docs/1.3.0/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.subtract