Replace a repeated group of values with NaN

Date: 2019-05-08 15:27:01

Tags: python python-3.x pandas numpy

Suppose I have the following data:

+---------------+---------------------+---------------------+----------+--------------+
| email         | date_opened         | order_date          | order_id | product_name |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:20:12 | 2019-01-03 09:21:43 | 1234     | xyz          |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:45:09 | 2019-01-03 09:21:43 | 1234     | xyz          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:13:46 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:15:20 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:24:43 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-12 00:39:21 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-09 01:24:54 | 2018-08-10 11:12:14 | 5678     | zyx          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 15:22:34 | 2018-08-10 11:12:14 | 5678     | zyx          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 00:12:14 | 2018-08-10 11:12:14 | 5678     | zyx          |
+---------------+---------------------+---------------------+----------+--------------+
| ...           | ...                 | ...                 | ...      | ...          |
+---------------+---------------------+---------------------+----------+--------------+

How can I keep a single order_date, order_id and product_name on the row with the earliest (lowest) date_opened for each email, and replace all the other repeated order_dates, order_ids and product_names with NaN?

Code:

import pandas as pd
import numpy as np
import psycopg2
import pyodbc

dwh_conn = psycopg2.connect(...)
dm_query = ...
dm = pd.read_sql(dm_query, dwh_conn, parse_dates=['date_opened'], index_col='email')

dfdev_conn = pyodbc.connect(...)
bkgs_query = ...
bkgs = pd.read_sql(bkgs_query, dfdev_conn, parse_dates=['order_date'], index_col='email')

dm_bkgs = pd.merge(dm, bkgs, how='left', left_index=True, right_index=True)
dm_bkgs['diff_days'] = dm_bkgs['date_opened'] - dm_bkgs['order_date']
dm_bkgs['diff_days'] = dm_bkgs['diff_days']/np.timedelta64(1,'D')

dm_bkgs.index.name = 'email'
dm_bkgs.sort_values(by=['email','diff_days'], inplace=True)

# Flag every row after the first occurrence of each order_id (rows are sorted above)
is_dup = dm_bkgs.duplicated('order_id').values
# mask() blanks the flagged rows (NaN, or NaT for datetimes); each column keeps its own values
dm_bkgs['order_date'] = dm_bkgs['order_date'].mask(is_dup)
dm_bkgs['product_name'] = dm_bkgs['product_name'].mask(is_dup)
dm_bkgs['diff_days'] = dm_bkgs['diff_days'].mask(is_dup)
dm_bkgs['order_id'] = dm_bkgs['order_id'].mask(is_dup)

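My code sort of works, but I noticed that the dm DataFrame has 1433 rows, and after the merge (or a join) the row count rises to 1448. I'm not sure why; the bkgs DataFrame on its own has no duplicates...

It feels like the code is a bit messy...

Expected: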

+---------------+---------------------+---------------------+----------+--------------+
| email         | date_opened         | order_date          | order_id | product_name |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:20:12 | 2019-01-03 09:21:43 | 1234     | xyz          |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:45:09 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:13:46 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:15:20 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:24:43 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-12 00:39:21 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-09 01:24:54 | 2018-08-10 11:12:14 | 5678     | zyx          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 15:22:34 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 00:12:14 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| ...           | ...                 | ...                 | ...      | ...          |
+---------------+---------------------+---------------------+----------+--------------+

1 answer:

Answer 0 (score: 1)

How about:

duplicated = dm_bkgs.duplicated('order_id')

# np.nan rather than np.NaN: the uppercase alias was removed in NumPy 2.0
dm_bkgs.loc[duplicated, ['order_date', 'order_id', 'product_name']] = np.nan
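
As a rough check, here is a minimal sketch of how that assignment behaves on a small frame shaped like the sample in the question. The values are copied from the question's table (with only two of the def rows, for brevity); the frame name `df`, keeping email as an ordinary column, and sorting by date_opened instead of diff_days are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; `df` is a hypothetical stand-in for dm_bkgs.
df = pd.DataFrame({
    "email": ["abc@email.com"] * 2 + ["def@email.com"] * 2 + ["ghi@email.com"] * 3,
    "date_opened": pd.to_datetime([
        "2019-01-01 10:20:12", "2019-01-01 10:45:09",
        "2019-02-11 08:13:46", "2019-02-11 08:15:20",
        "2018-08-09 01:24:54", "2018-08-10 15:22:34", "2018-08-10 00:12:14",
    ]),
    "order_date": pd.to_datetime([
        "2019-01-03 09:21:43", "2019-01-03 09:21:43",
        None, None,
        "2018-08-10 11:12:14", "2018-08-10 11:12:14", "2018-08-10 11:12:14",
    ]),
    "order_id": [1234, 1234, np.nan, np.nan, 5678, 5678, 5678],
    "product_name": ["xyz", "xyz", np.nan, np.nan, "zyx", "zyx", "zyx"],
})

# Sort so that the earliest date_opened comes first within each email.
df = df.sort_values(["email", "date_opened"]).reset_index(drop=True)

# True for every row after the first occurrence of an order_id.
# Rows whose order_id is already NaN may be flagged too, which is harmless here.
dup = df.duplicated("order_id")

# Blank the order columns on the flagged rows (the datetime column gets NaT).
df.loc[dup, ["order_date", "order_id", "product_name"]] = np.nan

print(df)
```

If the same order_id could ever appear under two different emails, passing `["email", "order_id"]` to `duplicated` would keep one row per email instead of one row overall; whether that matters depends on your data.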

Basically, this is just the general form of what you were already doing.
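
As for the row count going from 1433 to 1448: a left merge repeats a left-hand row once for every matching right-hand row, so this would happen if some email occurs more than once in bkgs (for example, one customer with several orders), even when bkgs contains no fully duplicated rows. That is only a guess about your data. A sketch for checking it, using made-up stand-ins for dm and bkgs:

```python
import pandas as pd

# Hypothetical stand-ins for dm and bkgs; all values are invented for illustration.
dm = pd.DataFrame(
    {"date_opened": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-02-11"])},
    index=pd.Index(["abc@email.com", "abc@email.com", "def@email.com"], name="email"),
)
bkgs = pd.DataFrame(
    {"order_id": [1234, 4321]},
    index=pd.Index(["abc@email.com", "abc@email.com"], name="email"),
)

# No row is a complete duplicate, yet the merge key (the index) repeats.
full_dups = int(bkgs.reset_index().duplicated().sum())
key_dups = int(bkgs.index.duplicated().sum())
print("fully duplicated rows:", full_dups, "| repeated emails:", key_dups)

# Each abc row in dm matches both abc rows in bkgs, so the result grows.
merged = pd.merge(dm, bkgs, how="left", left_index=True, right_index=True)
print("rows before:", len(dm), "| rows after:", len(merged))

# validate="many_to_one" makes pandas raise instead of silently adding rows.
try:
    pd.merge(dm, bkgs, how="left", left_index=True, right_index=True,
             validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("merge rejected by validate:", raised)
```

If `bkgs.index.duplicated().sum()` is non-zero on your real data, you would need to decide which order to keep per email (for instance the earliest order_date) before merging.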