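I have a dataset with thousands of records. What is the best way to show how the values in column X changed compared with the previous month, based on the key column A?
Here is a sample table.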
+----------+---+-----+
| Date     | A | X   |
+----------+---+-----+
| Jan 2017 | z | 123 |
| Jan 2017 | y | 234 |
| Feb 2017 | w | 123 |
| Feb 2017 | z | 456 |
+----------+---+-----+
Output:
+----------+-----+-----------+
| Date     | X   | Changes   |
+----------+-----+-----------+
| Feb 2017 | 234 | Deleted |
| Feb 2017 | 456 | Added |
+----------+-----+-----------+
Thanks!
Answer 0 (score: 0)
There may be a simpler way, but here is one solution:
In [1]: import pandas as pd
   ...:
   ...: df = pd.DataFrame({'Date': ['Jan 2017', 'Jan 2017', 'Feb 2017', 'Feb 2017'],
   ...:                    'A': list('zywz'), 'X': [123, 234, 123, 456]})
   ...: df = df[['Date', 'A', 'X']]
   ...: df['Date'] = pd.to_datetime(df['Date'])
   ...: df.set_index('Date', inplace=True)
   ...: df  # input dataframe
   ...:
Out[1]:
            A    X
Date
2017-01-01  z  123
2017-01-01  y  234
2017-02-01  w  123
2017-02-01  z  456
In [2]: # count X values per month
   ...: wdf = df.reset_index().groupby(['Date', 'X']).X.count().unstack(level='X')
   ...: wdf
   ...:
Out[2]:
X           123  234  456
Date
2017-01-01  1.0  1.0  NaN
2017-02-01  1.0  NaN  1.0
In [3]: # detect the changes between the first and the second month
   ...: import numpy as np
   ...: def get_status(col):
   ...:     # col holds the per-month counts for one X value (NaN = absent)
   ...:     if np.isnan(col.iloc[0]) and col.iloc[1]:
   ...:         return 'Added'
   ...:     if col.iloc[0] and np.isnan(col.iloc[1]):
   ...:         return 'Deleted'
   ...:     return 'no change'
   ...:
   ...: status = wdf.apply(get_status)
   ...: status.name = 'Changes'
   ...:
In [4]: # back to df
   ...: # reuse the working dataframe `wdf` so the initial `df` stays untouched
   ...: wdf = df.join(status, on='X').reset_index()[['Date', 'X', 'Changes']]
   ...: wdf[wdf['Changes'] != 'no change'].set_index('Date')
   ...:
Out[4]:
              X  Changes
Date
2017-01-01  234  Deleted
2017-02-01  456    Added
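As a possibly simpler alternative, here is a minimal sketch that compares the sets of X values of two months with an outer merge and `indicator=True`. It assumes, as the answer above does, that a "change" means an X value present in one month but not the other. The helper name `month_over_month_changes` and its `prev`/`curr` parameters are made up for illustration. Both rows are labeled with the later month, as in the output the question asks for.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["Jan 2017", "Jan 2017", "Feb 2017", "Feb 2017"],
                           format="%b %Y"),
    "A": list("zywz"),
    "X": [123, 234, 123, 456],
})


def month_over_month_changes(df, prev, curr):
    """Hypothetical helper: label X values that appear in only one of two months."""
    # Distinct X values seen in each month
    prev_x = df.loc[df["Date"] == prev, ["X"]].drop_duplicates()
    curr_x = df.loc[df["Date"] == curr, ["X"]].drop_duplicates()

    # The outer merge adds a `_merge` column: left_only / right_only / both
    merged = prev_x.merge(curr_x, on="X", how="outer", indicator=True)

    # Keep only values that are missing on one side and translate the labels
    changes = merged[merged["_merge"] != "both"].copy()
    labels = {"left_only": "Deleted", "right_only": "Added"}
    changes["Changes"] = changes["_merge"].astype(str).map(labels)

    # Report every change against the later month
    changes.insert(0, "Date", curr)
    return changes[["Date", "X", "Changes"]].sort_values("X").reset_index(drop=True)


result = month_over_month_changes(
    df, pd.Timestamp("2017-01-01"), pd.Timestamp("2017-02-01")
)
print(result)
```

For more than two months, the same helper could be called on each pair of consecutive dates and the results concatenated with `pd.concat`.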