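I want to create a new column in a pandas DataFrame based on values in previous rows.
Specifically, I want to add a column holding the difference, in days, between the date in the current row and the date in the most recent previous row that has the same UserId and Amount > 0.
I have this: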
+--------+------------+-----------+
| UserId | Date | Amount |
+--------+------------+-----------+
| 1 | 2017-01-01 | 0 |
| 1 | 2017-01-03 | 10 |
| 2 | 2017-01-04 | 20 |
| 2 | 2017-01-07 | 15 |
| 1 | 2017-01-09 | 7 |
+--------+------------+-----------+
I want this:
+--------+------------+-----------+-------------+
| UserId | Date | Amount | Difference |
+--------+------------+-----------+-------------+
| 1 | 2017-01-01 | 0 | -1 |
| 1 | 2017-01-03 | 10 | -1 |
| 2 | 2017-01-04 | 20 | -1 |
| 2 | 2017-01-07 | 15 | 3 |
| 1 | 2017-01-09 | 7 | 6 |
+--------+------------+-----------+-------------+
Answer 0 (score: 0)
You were really close; I just modified your code.
"""
UserId Date Amount
1 2017-01-01 0
1 2017-01-03 10
2 2017-01-04 20
2 2017-01-07 15
1 2017-01-09 7
"""
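import pandas as pd
# Copy the table above to the clipboard first; read_clipboard parses it into a DataFrame.
df = pd.read_clipboard(parse_dates=["Date"])
# Per-user day difference between consecutive rows with Amount > 0; the first such row per user gets -1.
df['difference'] = df[df['Amount'] > 0].groupby(['UserId'])['Date'].diff().dt.days.fillna(-1)
# Row 0 has Amount == 0, so it was filtered out and is still NaN after the assignment; set it by hand.
df.loc[0, "difference"] = -1
df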
Output:
UserId Date Amount difference
0 1 2017-01-01 0 -1.0
1 1 2017-01-03 10 -1.0
2 2 2017-01-04 20 -1.0
3 2 2017-01-07 15 3.0
4 1 2017-01-09 7 6.0
With help from "Python: Convert timedelta to int in a dataframe".
Obviously, I changed the first row manually; how does this code behave on the rest of your df?
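The same idea can be made reproducible without the clipboard and without patching row 0 by hand. This is only a minimal sketch, assuming the sample data from the question is embedded as a string; the names `raw` and `diff_days` are illustrative and not part of the original answer.

```python
import io

import pandas as pd

# Sample data from the question, embedded so the snippet runs on its own.
raw = """UserId Date Amount
1 2017-01-01 0
1 2017-01-03 10
2 2017-01-04 20
2 2017-01-07 15
1 2017-01-09 7
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", parse_dates=["Date"])

# Day difference between consecutive rows of the same user,
# computed only over rows with Amount > 0.
diff_days = df[df["Amount"] > 0].groupby("UserId")["Date"].diff().dt.days

# Reindex to the full frame so filtered-out rows become NaN,
# then fill every NaN with -1 in a single step.
df["difference"] = diff_days.reindex(df.index).fillna(-1).astype(int)
print(df)
```

Filling after the result is aligned to the full index means rows that were filtered out (Amount of 0) receive -1 together with each user's first positive row, so no manual assignment is needed.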
Answer 1 (score: 0)
Another approach to consider:
First, convert the Date column to datetime using the pandas function to_datetime.
df['Date'] = pd.to_datetime(df['Date'])
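Now use groupby to compute the difference in days. Rows that have a previous matching row get the difference; the remaining values are NaN: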
df['Difference'] = df[df['Amount'] > 0].groupby(['UserId'])['Date'].diff().dt.days
df
UserId Date Amount Difference
0 1 2017-01-01 0 NaN
1 1 2017-01-03 10 NaN
2 2 2017-01-04 20 NaN
3 2 2017-01-07 15 3.0
4 2 2017-01-09 7 2.0
Finally, fill all the NaNs in the Difference column of the DataFrame with -1.
df['Difference'] = df['Difference'].fillna(-1).astype(int)
# df = df.fillna(-1) would also do the job, but it would replace NaNs anywhere else in df with -1 as well
Result:
df
UserId Date Amount Difference
0 1 2017-01-01 0 -1
1 1 2017-01-03 10 -1
2 2 2017-01-04 20 -1
3 2 2017-01-07 15 3
4 2 2017-01-09 7 2
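Filtering on Amount > 0 leaves every zero-amount row at -1, even when an earlier positive row exists for that user. If the intent is to measure each row against the most recent earlier row with the same UserId and Amount > 0, one possible sketch is below. It assumes the rows are already in date order within each user, and the names `positive_dates`, `shifted` and `prev_positive` are hypothetical helpers introduced here.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "UserId": [1, 1, 2, 2, 1],
    "Date": pd.to_datetime(
        ["2017-01-01", "2017-01-03", "2017-01-04", "2017-01-07", "2017-01-09"]
    ),
    "Amount": [0, 10, 20, 15, 7],
})

# Keep the date only where Amount > 0; other rows become NaT.
positive_dates = df["Date"].where(df["Amount"] > 0)

# Shift within each user so the current row is excluded,
# then forward-fill to carry the last positive date down to later rows.
shifted = positive_dates.groupby(df["UserId"]).shift()
prev_positive = shifted.groupby(df["UserId"]).ffill()

# Rows with no earlier positive row have NaT here and end up as -1.
df["Difference"] = (df["Date"] - prev_positive).dt.days.fillna(-1).astype(int)
print(df)
```

On the question's data this gives the same column as the filtering approach; the two only differ when a zero-amount row follows a positive one for the same user.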