Boolean logic for comparing data from two Pandas DataFrames

Time: 2018-01-04 19:03:52

Tags: python python-3.x pandas numpy scikit-learn

I have the following DataFrame, df1:

          Date  Invoice          Name  Price  Coupon Location
0   2017-12-24   700349      John Doe  59.95    NONE    VAGG1
1   2017-12-24   700347     Joe Smith  59.95    GBMR       GG
2   2017-12-24   700345  Dave Johnson  35.00  CHANGE    VAGG1
3   2017-12-24   700342     Sue Davis  35.00  GADSLR    VAGG1
4   2017-12-23   700329   Betty Clark  84.95  GADSLR      GG2

and a second DataFrame, df2:

           Date  Invoice         Name  Price    Coupon    Location
0   2017-12-24    800349     John Doe  59.95   NONE      VAGG1
1   2017-12-24    800347    Joe Smith  59.95   GBMR      GG
2   2017-12-24    800345     John Doe  17.95   CHANGE    VAGG1
3   2017-12-24    800342     John Doe   9.95   GADSLR    VAGG1
4   2017-12-23    800329  Sue Simpson  34.95   GADSLR    GG2

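I want to create a third DataFrame, df3, using the following logic:

  • For each name in df1, check whether there is a match in df2.
  • If there is a match, add the matching row from df2 to df3, provided that the price in that row does not match the price associated with that name in df1.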

So the output DataFrame df3 should look like this:

+------------+---------+----------+-------+--------+----------+
|    Date    | Invoice |   Name   | Price | Coupon | Location |
+------------+---------+----------+-------+--------+----------+
| 2017-12-24 |  800345 | John Doe | 17.95 | CHANGE | VAGG1    |
| 2017-12-24 |  800342 | John Doe |  9.95 | GADSLR | VAGG1    |
+------------+---------+----------+-------+--------+----------+
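
Read literally, the rule above can be sketched without a merge, as a pair of membership tests on df2. This is only a minimal sketch: the frames are rebuilt from the tables shown above, and the helper names (`names_in_df1`, `pairs_in_df1`) are made up for illustration. It also assumes that a df2 price counts as "matching" if it equals any price recorded for that name in df1.

```python
import pandas as pd

cols = ["Date", "Invoice", "Name", "Price", "Coupon", "Location"]
df1 = pd.DataFrame([
    ("2017-12-24", 700349, "John Doe", 59.95, "NONE", "VAGG1"),
    ("2017-12-24", 700347, "Joe Smith", 59.95, "GBMR", "GG"),
    ("2017-12-24", 700345, "Dave Johnson", 35.00, "CHANGE", "VAGG1"),
    ("2017-12-24", 700342, "Sue Davis", 35.00, "GADSLR", "VAGG1"),
    ("2017-12-23", 700329, "Betty Clark", 84.95, "GADSLR", "GG2"),
], columns=cols)
df2 = pd.DataFrame([
    ("2017-12-24", 800349, "John Doe", 59.95, "NONE", "VAGG1"),
    ("2017-12-24", 800347, "Joe Smith", 59.95, "GBMR", "GG"),
    ("2017-12-24", 800345, "John Doe", 17.95, "CHANGE", "VAGG1"),
    ("2017-12-24", 800342, "John Doe", 9.95, "GADSLR", "VAGG1"),
    ("2017-12-23", 800329, "Sue Simpson", 34.95, "GADSLR", "GG2"),
], columns=cols)

# Hypothetical helpers: the names seen in df1 and the (Name, Price) pairs seen in df1.
names_in_df1 = set(df1["Name"])
pairs_in_df1 = set(zip(df1["Name"], df1["Price"]))

# A df2 row qualifies when its name occurs in df1 but its (Name, Price) pair does not.
name_matches = df2["Name"].isin(names_in_df1)
price_matches = pd.Series(
    [pair in pairs_in_df1 for pair in zip(df2["Name"], df2["Price"])],
    index=df2.index,
)
df3 = df2[name_matches & ~price_matches]
print(df3)
```

On the sample data this keeps only the two John Doe rows from df2 whose prices differ from his price in df1.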

2 Answers:

Answer 0: (score: 1)

Use merge + query:

df1.merge(df2[['Name', 'Price']], on='Name')\
   .query('Price_x != Price_y')\
   .drop(columns='Price_x')\
   .rename(columns={'Price_y' : 'Price'})

         Date  Invoice      Name Coupon Location  Price
1  2017-12-24   700349  John Doe   NONE    VAGG1  17.95
2  2017-12-24   700349  John Doe   NONE    VAGG1   9.95

Here df1 and df2 are your respective DataFrames.
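
Note that the result above carries df1's Date, Invoice, Coupon and Location columns, while the expected output in the question shows the rows of df2. One possible variant, sketched here under the assumption that names are unique in df1 (otherwise df2 rows would be duplicated by the merge), is to put df2 on the left. The `_df1` suffix is an arbitrary choice for illustration, and the frames are rebuilt from the question's tables.

```python
import pandas as pd

cols = ["Date", "Invoice", "Name", "Price", "Coupon", "Location"]
df1 = pd.DataFrame([
    ("2017-12-24", 700349, "John Doe", 59.95, "NONE", "VAGG1"),
    ("2017-12-24", 700347, "Joe Smith", 59.95, "GBMR", "GG"),
    ("2017-12-24", 700345, "Dave Johnson", 35.00, "CHANGE", "VAGG1"),
    ("2017-12-24", 700342, "Sue Davis", 35.00, "GADSLR", "VAGG1"),
    ("2017-12-23", 700329, "Betty Clark", 84.95, "GADSLR", "GG2"),
], columns=cols)
df2 = pd.DataFrame([
    ("2017-12-24", 800349, "John Doe", 59.95, "NONE", "VAGG1"),
    ("2017-12-24", 800347, "Joe Smith", 59.95, "GBMR", "GG"),
    ("2017-12-24", 800345, "John Doe", 17.95, "CHANGE", "VAGG1"),
    ("2017-12-24", 800342, "John Doe", 9.95, "GADSLR", "VAGG1"),
    ("2017-12-23", 800329, "Sue Simpson", 34.95, "GADSLR", "GG2"),
], columns=cols)

df3 = (
    # Inner merge (the default): df2 rows whose name is absent from df1 drop out.
    df2.merge(df1[["Name", "Price"]], on="Name", suffixes=("", "_df1"))
       # Keep only rows whose df2 price differs from the df1 price for that name.
       .query("Price != Price_df1")
       # Discard the helper column so only df2's original columns remain.
       .drop(columns="Price_df1")
)
print(df3)
```

Because the left suffix is empty, df2's columns keep their original names and order, so no renaming or reordering is needed afterwards.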

Answer 1: (score: 1)

The following code block:

df3 = pd.merge(df1, df2, on='Name', how='right')\
        .query('Price_x != Price_y')\
        .drop(columns='Price_x')\
        .rename(columns={'Price_y' : 'Price'})

results in df3 =

            Date_x  Invoice_x         Name Coupon_x Location_x        Date_y  \
1   2017-12-24   700349.0     John Doe     NONE      VAGG1   2017-12-24
2   2017-12-24   700349.0     John Doe     NONE      VAGG1   2017-12-24
4          NaN        NaN  Sue Simpson      NaN        NaN   2017-12-23

   Invoice_y  Price  Coupon_y  Location_y
1     800345  17.95   CHANGE    VAGG1
2     800342   9.95   GADSLR    VAGG1
4     800329  34.95   GADSLR    GG2

Extending the code block:

df3 = pd.merge(df1, df2, on='Name', how='right')\
        .query('Price_x != Price_y')\
        .drop(columns=['Price_x', 'Location_x', 'Coupon_x', 'Date_x', 'Invoice_x'])\
        .rename(columns={'Price_y' : 'Price', 'Date_y' : 'Date',
                         'Invoice_y' : 'Invoice', 'Coupon_y' : 'Coupon',
                         'Location_y' : 'Location'})

results in df3 =

          Name          Date  Invoice  Price    Coupon    Location
1     John Doe   2017-12-24    800345  17.95   CHANGE    VAGG1
2     John Doe   2017-12-24    800342   9.95   GADSLR    VAGG1
4  Sue Simpson   2017-12-23    800329  34.95   GADSLR    GG2

This is problematic because it leaves the columns out of order. Adding:

df3=df3[['Date', 'Invoice', 'Name', 'Price', 'Coupon', 'Location']]

we get df3 =

           Date  Invoice         Name  Price    Coupon    Location
1   2017-12-24    800345     John Doe  17.95   CHANGE    VAGG1
2   2017-12-24    800342     John Doe   9.95   GADSLR    VAGG1
4   2017-12-23    800329  Sue Simpson  34.95   GADSLR    GG2

The only remaining problem is that "Sue Simpson" is included, when she should be absent.
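
A likely explanation is that how='right' keeps every row of df2, including those with no match in df1. For such rows Price_x is NaN, and NaN != 34.95 evaluates to True, so the row passes the query. One possible fix, sketched below with the frames rebuilt from the question's tables, is to use an inner merge so that unmatched names never enter the result. The column selection and the renaming lambda are illustrative choices, not the only way to do this.

```python
import pandas as pd

cols = ["Date", "Invoice", "Name", "Price", "Coupon", "Location"]
df1 = pd.DataFrame([
    ("2017-12-24", 700349, "John Doe", 59.95, "NONE", "VAGG1"),
    ("2017-12-24", 700347, "Joe Smith", 59.95, "GBMR", "GG"),
    ("2017-12-24", 700345, "Dave Johnson", 35.00, "CHANGE", "VAGG1"),
    ("2017-12-24", 700342, "Sue Davis", 35.00, "GADSLR", "VAGG1"),
    ("2017-12-23", 700329, "Betty Clark", 84.95, "GADSLR", "GG2"),
], columns=cols)
df2 = pd.DataFrame([
    ("2017-12-24", 800349, "John Doe", 59.95, "NONE", "VAGG1"),
    ("2017-12-24", 800347, "Joe Smith", 59.95, "GBMR", "GG"),
    ("2017-12-24", 800345, "John Doe", 17.95, "CHANGE", "VAGG1"),
    ("2017-12-24", 800342, "John Doe", 9.95, "GADSLR", "VAGG1"),
    ("2017-12-23", 800329, "Sue Simpson", 34.95, "GADSLR", "GG2"),
], columns=cols)

# how='inner' keeps only names present in both frames, so Sue Simpson is never merged in.
merged = pd.merge(df1, df2, on="Name", how="inner", suffixes=("_x", "_y"))

df3 = (
    merged.query("Price_x != Price_y")
          # Select the df2-side columns in the desired order.
          .loc[:, ["Date_y", "Invoice_y", "Name", "Price_y", "Coupon_y", "Location_y"]]
          # Strip the '_y' suffix so the columns match df2's original names.
          .rename(columns=lambda c: c[:-2] if c.endswith("_y") else c)
          .reset_index(drop=True)
)
print(df3)
```

Selecting the columns explicitly also makes the separate reordering step unnecessary.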