Question

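I want to create a new column in a pandas DataFrame based on values in a previous row.

Specifically, I want to add a column holding the difference, in days, between the date in the current row and the date in the most recent previous row that has the same UserId and Amount > 0.

I have this: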

+--------+------------+-----------+
| UserId |    Date    |    Amount |
+--------+------------+-----------+
|      1 | 2017-01-01 |         0 |
|      1 | 2017-01-03 |        10 |
|      2 | 2017-01-04 |        20 |
|      2 | 2017-01-07 |        15 |
|      1 | 2017-01-09 |         7 |
+--------+------------+-----------+

I want this:

+--------+------------+-----------+-------------+
| UserId |    Date    |    Amount |  Difference |
+--------+------------+-----------+-------------+
|      1 | 2017-01-01 |         0 |          -1 |
|      1 | 2017-01-03 |        10 |          -1 |
|      2 | 2017-01-04 |        20 |          -1 |
|      2 | 2017-01-07 |        15 |           3 |
|      1 | 2017-01-09 |         7 |           6 |
+--------+------------+-----------+-------------+

Answer 1

You were really close; I just modified your code slightly.

"""
UserId     Date        Amount 
1  2017-01-01          0 
1  2017-01-03         10 
2  2017-01-04         20 
2  2017-01-07         15 
1  2017-01-09          7 
"""
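import pandas as pd

# Copy the table above to the clipboard first, then read it in.
df = pd.read_clipboard(parse_dates=["Date"])

# Day difference between consecutive rows with Amount > 0, per user.
df['difference'] = df[df['Amount'] > 0].groupby(['UserId'])['Date'].diff().dt.days.fillna(-1)
# Row 0 has Amount == 0, so it is filtered out above and stays NaN; set it by hand.
df.loc[0, "difference"] = -1
df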

Output:

   UserId       Date  Amount  difference
0       1 2017-01-01       0        -1.0
1       1 2017-01-03      10        -1.0
2       2 2017-01-04      20        -1.0
3       2 2017-01-07      15         3.0
4       1 2017-01-09       7         6.0

With help from "Python: Convert timedelta to int in a dataframe".

Obviously, I set the first row by hand; how does this code behave on the rest of your df?
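The manual fix for the first row can probably be avoided by filling the missing values after the assignment instead of before it. The following is only a sketch under that assumption; it builds the question's sample data inline rather than reading it from the clipboard, and the intermediate names `positive` and `diff_days` are mine, not part of the original answer.

```python
import pandas as pd

# Sample data from the question, built inline instead of read from the clipboard.
df = pd.DataFrame({
    "UserId": [1, 1, 2, 2, 1],
    "Date": pd.to_datetime(
        ["2017-01-01", "2017-01-03", "2017-01-04", "2017-01-07", "2017-01-09"]
    ),
    "Amount": [0, 10, 20, 15, 7],
})

# Day differences between consecutive positive-amount rows of the same user.
positive = df[df["Amount"] > 0]
diff_days = positive.groupby("UserId")["Date"].diff().dt.days

# Assignment aligns on the index, so rows that were filtered out get NaN.
# Filling afterwards covers them together with each user's first positive row.
df["difference"] = diff_days
df["difference"] = df["difference"].fillna(-1).astype(int)

print(df["difference"].tolist())  # [-1, -1, -1, 3, 6]
```

Because the fill happens on the full column, any row with Amount == 0 ends up as -1, wherever it sits in the frame.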

Answer 2

Another way to approach this:

First, convert the Date column to datetime using the pandas function to_datetime.

df['Date'] = pd.to_datetime(df['Date'])

Now use groupby to compute the difference in days. This fills in the differences; the remaining values are left as NaN.

df['Difference'] = df[df['Amount'] > 0].groupby(['UserId'])['Date'].diff().dt.days

df
   UserId       Date  Amount  Difference
0       1 2017-01-01       0         NaN
1       1 2017-01-03      10         NaN
2       2 2017-01-04      20         NaN
3       2 2017-01-07      15         3.0
4       2 2017-01-09       7         2.0

Finally, fill all the NaNs in the Difference column with -1.

# Fill with the number -1 (not the string "-1") and cast to int so the column stays numeric.
df['Difference'] = df['Difference'].fillna(-1).astype(int)
# df = df.fillna(-1) would also do the job, but it would replace NaNs in every other column of df with -1 as well

Result:

df
   UserId       Date  Amount Difference
0       1 2017-01-01       0         -1
1       1 2017-01-03      10         -1
2       2 2017-01-04      20         -1
3       2 2017-01-07      15          3
4       2 2017-01-09       7          2
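Both answers filter on Amount > 0 before taking the difference, so a row whose own Amount is 0 always gets -1. The question's wording could also be read as wanting such a row compared with the last earlier positive row of the same user. A possible sketch of that reading follows; the sixth row (UserId 2, 2017-01-10, Amount 0) is a hypothetical addition of mine to show the difference in behavior, and `positive_date` and `prev_positive` are illustrative names.

```python
import pandas as pd

# Question data plus one HYPOTHETICAL extra row (UserId 2, Amount 0) for illustration.
df = pd.DataFrame({
    "UserId": [1, 1, 2, 2, 1, 2],
    "Date": pd.to_datetime(
        ["2017-01-01", "2017-01-03", "2017-01-04",
         "2017-01-07", "2017-01-09", "2017-01-10"]
    ),
    "Amount": [0, 10, 20, 15, 7, 0],
})

# Keep the date only where Amount > 0; other rows become NaT.
positive_date = df["Date"].where(df["Amount"] > 0)

# Within each user, look one row back, then carry the last known
# positive-amount date forward across rows that had Amount == 0.
prev_positive = positive_date.groupby(df["UserId"]).shift()
prev_positive = prev_positive.groupby(df["UserId"]).ffill()

# Days since that date, or -1 when the user has no earlier positive row.
df["Difference"] = (df["Date"] - prev_positive).dt.days.fillna(-1).astype(int)

print(df["Difference"].tolist())  # [-1, -1, -1, 3, 6, 3]
```

On the five original rows this gives the same values as the filtered diff; only the hypothetical last row differs, getting 3 instead of -1.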
