I want to take the subtraction (set difference) of two data frames in PySpark. The challenge is that I have to ignore some columns while subtracting, but the final data frame should contain all columns, including the ignored ones.
Here is an example:
from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u's.kent@email.com',
        date1=u'2017-02-08'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace@email.com',
        date1=u'2017-02-09'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh@email.com',
        date1=u'2017-02-10')
]).toDF()
userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace@email.com',
        date1=u'2017-02-11'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh@email.com',
        date1=u'2017-02-12')
]).toDF()
Expected:

ActiveDF = userLeft.subtract(userRight)  # but ignore the "date1" column while subtracting

The final result should look like this, including the "date1" column:
+----------+--------------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+--------------------+----------+---+---------+
|2017-02-08| s.kent@email.com| Steve| 1| Kent|
+----------+--------------------+----------+---+---------+
Answer 0 (score: 1)
It seems you need an anti-join:
userLeft.join(userRight, ["id"], "leftanti").show()
+----------+----------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+----------------+----------+---+---------+
|2017-02-08|s.kent@email.com| Steve| 1| Kent|
+----------+----------------+----------+---+---------+
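A left anti-join keeps exactly the left-side rows whose join key has no match on the right, without adding any right-side columns, which is why date1 survives untouched. As a rough plain-Python analogue of that semantics (hypothetical data, no Spark required):

```python
# Plain-Python sketch of left anti-join semantics (hypothetical, not Spark).
left = [
    {"id": "1", "first_name": "Steve", "date1": "2017-02-08"},
    {"id": "2", "first_name": "Margaret", "date1": "2017-02-09"},
]
right = [
    {"id": "2", "first_name": "Margaret", "date1": "2017-02-11"},
]

right_ids = {row["id"] for row in right}
# Keep left rows whose "id" does not appear on the right; all columns,
# including date1, come from the left side untouched.
anti = [row for row in left if row["id"] not in right_ids]
print(anti)  # only the id=1 row survives, date1 included
```

Note this joins on "id" alone, so it only reproduces the expected output because "id" identifies a row here.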
Answer 1 (score: 0)
You can also use a full join and keep only the rows with null values:
import pyspark.sql.functions as psf

userLeft.join(
    userRight,
    [c for c in userLeft.columns if c != "date1"],
    "full"
).filter(psf.isnull(userLeft.date1) | psf.isnull(userRight.date1)).show()
+------------------+----------+---+---------+----------+----------+
| email|first_name| id|last_name| date1| date1|
+------------------+----------+---+---------+----------+----------+
|marge.hh@email.com| null| 3| hh|2017-02-10| null|
|marge.hh@email.com| null| 3| hh| null|2017-02-12|
| s.kent@email.com| Steve| 1| Kent|2017-02-08| null|
+------------------+----------+---+---------+----------+----------+
If you want to use a join, whether leftanti or full, you will need to fill in a default value for the nulls in the join columns (I think we discussed that in a previous thread).
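The reason is SQL's three-valued logic: NULL = NULL does not evaluate to true, so a row with a null in a join column never matches, not even an identical copy of itself. A small plain-Python sketch of the problem and of the sentinel-default fix (the sentinel string is hypothetical; in Spark you would use something like df.fillna on the join columns before joining):

```python
def sql_eq(a, b):
    # SQL three-valued logic: any comparison involving NULL is never True,
    # so NULL = NULL does not count as a match in a join condition.
    if a is None or b is None:
        return False
    return a == b

row = {"id": "3", "first_name": None, "last_name": "hh"}
cols = ["id", "first_name", "last_name"]

# Under SQL semantics the row does not even match an identical copy of itself:
matches_itself = all(sql_eq(row[c], row[c]) for c in cols)
print(matches_itself)  # False, because first_name is None

# Filling a sentinel default first (mimicking a fillna on the join columns;
# "<missing>" is a hypothetical sentinel) restores the match:
filled = {c: (v if v is not None else "<missing>") for c, v in row.items()}
matches_after_fill = all(sql_eq(filled[c], filled[c]) for c in cols)
print(matches_after_fill)  # True
```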
You can also drop the column that bothers you, then subtract and join:
df = userLeft.drop("date1").subtract(userRight.drop("date1"))
userLeft.join(df, df.columns).show()
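The drop/subtract/join pattern above can be sketched without Spark: take the set difference on the rows projected onto every column except the ignored one, then recover date1 by filtering the original left rows (hypothetical plain-Python analogue):

```python
left = [
    {"id": "1", "first_name": "Steve", "date1": "2017-02-08"},
    {"id": "2", "first_name": "Margaret", "date1": "2017-02-09"},
]
right = [
    {"id": "2", "first_name": "Margaret", "date1": "2017-02-11"},
]

ignore = {"date1"}

def key(row):
    # The row projected onto all columns except the ignored ones,
    # mimicking df.drop("date1") before the subtract.
    return tuple(sorted((k, v) for k, v in row.items() if k not in ignore))

diff_keys = {key(r) for r in left} - {key(r) for r in right}  # subtract after drop
result = [r for r in left if key(r) in diff_keys]             # join back to recover date1
print(result)  # only the id=1 row, with its original date1
```

This mirrors the last snippet: the subtract happens on the dropped frames, and the final join re-attaches the ignored column from userLeft.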
+----------------+----------+---+---------+----------+
| email|first_name| id|last_name| date1|
+----------------+----------+---+---------+----------+
|s.kent@email.com| Steve| 1| Kent|2017-02-08|
+----------------+----------+---+---------+----------+