I want to take the subtraction (set difference) of two data frames in PySpark. The challenge is that I have to ignore some columns while subtracting, but the final data frame should contain all columns, including the ignored ones.
Here is an example:
from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u's.kent@email.com',
        date1=u'2017-02-08'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace@email.com',
        date1=u'2017-02-09'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh@email.com',
        date1=u'2017-02-10')
]).toDF()
userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace@email.com',
        date1=u'2017-02-11'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh@email.com',
        date1=u'2017-02-12')
]).toDF()
Expected:

ActiveDF = userLeft.subtract(userRight)  # but ignore the "date1" column while subtracting

The final result should look like this, including the "date1" column:
+----------+--------------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+--------------------+----------+---+---------+
|2017-02-08| s.kent@email.com| Steve| 1| Kent|
+----------+--------------------+----------+---+---------+
Answer 0 (score: 1)
It seems you need an anti-join:
userLeft.join(userRight, ["id"], "leftanti").show()
+----------+----------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+----------------+----------+---+---------+
|2017-02-08|s.kent@email.com| Steve| 1| Kent|
+----------+----------------+----------+---+---------+
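A left anti-join keeps exactly the left-side rows whose join key has no match on the right, without adding any right-side columns, which is why date1 survives untouched. As a rough plain-Python analogue of that semantics (hypothetical data, no Spark required):

```python
# Plain-Python sketch of left anti-join semantics (hypothetical, not Spark).
left = [
    {"id": "1", "first_name": "Steve", "date1": "2017-02-08"},
    {"id": "2", "first_name": "Margaret", "date1": "2017-02-09"},
]
right = [
    {"id": "2", "first_name": "Margaret", "date1": "2017-02-11"},
]

right_ids = {row["id"] for row in right}
# Keep left rows whose "id" does not appear on the right; all columns,
# including date1, come from the left side untouched.
anti = [row for row in left if row["id"] not in right_ids]
print(anti)  # only the id=1 row survives, date1 included
```

Note this joins on "id" alone, so it only reproduces the expected output because "id" identifies a row here.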
Answer 1 (score: 0)
You can also use a full join and keep only the rows with null values:
import pyspark.sql.functions as psf

userLeft.join(
    userRight,
    [c for c in userLeft.columns if c != "date1"],
    "full"
).filter(psf.isnull(userLeft.date1) | psf.isnull(userRight.date1)).show()
+------------------+----------+---+---------+----------+----------+
| email|first_name| id|last_name| date1| date1|
+------------------+----------+---+---------+----------+----------+
|marge.hh@email.com| null| 3| hh|2017-02-10| null|
|marge.hh@email.com| null| 3| hh| null|2017-02-12|
| s.kent@email.com| Steve| 1| Kent|2017-02-08| null|
+------------------+----------+---+---------+----------+----------+
If you want to use a join, whether leftanti or full, you will need to fill in a default value for the nulls in the join columns (I think we discussed that in a previous thread).
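The reason is SQL's three-valued logic: NULL = NULL does not evaluate to true, so a row with a null in a join column never matches, not even an identical copy of itself. A small plain-Python sketch of the problem and of the sentinel-default fix (the sentinel string is hypothetical; in Spark you would use something like df.fillna on the join columns before joining):

```python
def sql_eq(a, b):
    # SQL three-valued logic: any comparison involving NULL is never True,
    # so NULL = NULL does not count as a match in a join condition.
    if a is None or b is None:
        return False
    return a == b

row = {"id": "3", "first_name": None, "last_name": "hh"}
cols = ["id", "first_name", "last_name"]

# Under SQL semantics the row does not even match an identical copy of itself:
matches_itself = all(sql_eq(row[c], row[c]) for c in cols)
print(matches_itself)  # False, because first_name is None

# Filling a sentinel default first (mimicking a fillna on the join columns;
# "<missing>" is a hypothetical sentinel) restores the match:
filled = {c: (v if v is not None else "<missing>") for c, v in row.items()}
matches_after_fill = all(sql_eq(filled[c], filled[c]) for c in cols)
print(matches_after_fill)  # True
```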
You can also drop the column that bothers you, then subtract and join:
df = userLeft.drop("date1").subtract(userRight.drop("date1"))
userLeft.join(df, df.columns).show()
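The drop/subtract/join pattern above can be sketched without Spark: take the set difference on the rows projected onto every column except the ignored one, then recover date1 by filtering the original left rows (hypothetical plain-Python analogue):

```python
left = [
    {"id": "1", "first_name": "Steve", "date1": "2017-02-08"},
    {"id": "2", "first_name": "Margaret", "date1": "2017-02-09"},
]
right = [
    {"id": "2", "first_name": "Margaret", "date1": "2017-02-11"},
]

ignore = {"date1"}

def key(row):
    # The row projected onto all columns except the ignored ones,
    # mimicking df.drop("date1") before the subtract.
    return tuple(sorted((k, v) for k, v in row.items() if k not in ignore))

diff_keys = {key(r) for r in left} - {key(r) for r in right}  # subtract after drop
result = [r for r in left if key(r) in diff_keys]             # join back to recover date1
print(result)  # only the id=1 row, with its original date1
```

This mirrors the last snippet: the subtract happens on the dropped frames, and the final join re-attaches the ignored column from userLeft.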
+----------------+----------+---+---------+----------+
| email|first_name| id|last_name| date1|
+----------------+----------+---+---------+----------+
|s.kent@email.com| Steve| 1| Kent|2017-02-08|
+----------------+----------+---+---------+----------+