Calculate the difference in months between dates within a groupby

Time: 2021-02-14 23:33:33

Tags: python pandas datetime pandas-groupby

I have the following DataFrame:

id, country, date
1, ar, 2019-01-01
1, ar, 2019-02-01
1, ar, 2019-03-01
1, it, 2019-01-01
1, it, 2019-02-01
1, it, 2019-03-01
1, it, 2019-04-01
1, it, 2019-03-01
2, ar, 2019-01-01
2, ar, 2019-02-01
2, ar, 2019-03-01
2, it, 2019-01-01
2, it, 2019-02-01
3, it, 2019-03-01
3, it, 2019-04-01
4, it, 2019-05-01

I need to group by id and country and, within each group, calculate the difference between consecutive dates in months.

I tried:

df['daysdiff'] = df.sort_values('date').groupby(['id','country'])['date'].diff()

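But that gives the difference in days. I need the difference in months. I don't think dividing 'daysdiff' by 30 is accurate, because months have different numbers of days... and there are leap years...

Any help is welcome!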

1 Answer:

Answer 0: (score: 0)

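I adapted this approach to your case.

Basically, you have to deal with the NaT values that diff() produces for the first row of each group. I chose to treat them as 0.

If you like, you can round the months to integers.

Note that your example contains a duplicate row: "1", "it", "2019-03-01"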

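That row comes out as 7, 1, it, 2019-03-01, 0 days, 0 (after sorting by date it sits right after the other 2019-03-01 row of its group, so the difference is 0 days)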

It seems to work for this case, although I haven't tested it on others.
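If the repeated row is not wanted, one option is to drop exact duplicates before taking the difference. The snippet below is only a sketch: it assumes that a repeated (id, country, date) combination carries no extra information, and the names `dup`, `deduped` and `gaps` are made up for illustration. It uses just the (1, it) group from the example.

```python
import pandas as pd

# Only the (1, it) group from the example, including the repeated 2019-03-01 row
dup = pd.DataFrame({
    "id": ["1"] * 5,
    "country": ["it"] * 5,
    "date": pd.to_datetime(
        ["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01", "2019-03-01"]
    ),
})

# Assumption: an exact repeat of (id, country, date) is redundant and can go
deduped = dup.drop_duplicates(subset=["id", "country", "date"])

# Day gaps between consecutive dates within each group;
# the first row of each group has no predecessor and stays NaT
gaps = (
    deduped
    .sort_values("date", kind="stable")
    .groupby(["id", "country"])["date"]
    .diff()
)
print(gaps)
```

With the duplicate gone, no 0-day gap is left in the group. Whether dropping it is right depends on what the repeated row means in your data.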

import pandas as pd

df = pd.DataFrame(
    columns=["id", "country", "date"],
    data=[
    ["1", "ar", "2019-01-01"],
    ["1", "ar", "2019-02-01"],
    ["1", "ar", "2019-03-01"],
    ["1", "it", "2019-01-01"],
    ["1", "it", "2019-02-01"],
    ["1", "it", "2019-03-01"],
    ["1", "it", "2019-04-01"],
    ["1", "it", "2019-03-01"],
    ["2", "ar", "2019-01-01"],
    ["2", "ar", "2019-02-01"],
    ["2", "ar", "2019-03-01"],
    ["2", "it", "2019-01-01"],
    ["2", "it", "2019-02-01"],
    ["3", "it", "2019-03-01"],
    ["3", "it", "2019-04-01"],
    ["4", "it", "2019-05-01"]
])
df["date"] = pd.to_datetime(df["date"])

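# Difference in days between consecutive dates within each (id, country) group
df['daysdiff'] = df.sort_values('date').groupby(['id','country'])['date'].diff()
df['monthsdiff'] = (
    df
    .sort_values('date')
    .groupby(['id','country'])['date']
    .diff()
    # The first row of each group has no predecessor (NaT); treat it as 0
    .fillna(pd.Timedelta(0))
    # 365.25 [days/year] / (12 [months/year]) = 30.4375 [days/month]
    .div(pd.Timedelta(days=365.25/12))
    .round()
    .astype(int)
    )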
print(df)
#    id country       date daysdiff  monthsdiff
# 0   1      ar 2019-01-01      NaT           0
# 1   1      ar 2019-02-01  31 days           1
# 2   1      ar 2019-03-01  28 days           1
# 3   1      it 2019-01-01      NaT           0
# 4   1      it 2019-02-01  31 days           1
# 5   1      it 2019-03-01  28 days           1
# 6   1      it 2019-04-01  31 days           1
# 7   1      it 2019-03-01   0 days           0
# 8   2      ar 2019-01-01      NaT           0
# 9   2      ar 2019-02-01  31 days           1
# 10  2      ar 2019-03-01  28 days           1
# 11  2      it 2019-01-01      NaT           0
# 12  2      it 2019-02-01  31 days           1
# 13  3      it 2019-03-01      NaT           0
# 14  3      it 2019-04-01  31 days           1
# 15  4      it 2019-05-01      NaT           0
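
The code above still divides by an average month length, so it is an approximation. If what you want is the number of calendar months between consecutive dates, a possible alternative (a sketch, not tested beyond this small sample) is to turn each date into a running month count, `year * 12 + month`, and diff that within each group. The `month_index` name is made up for illustration, and the rows with id "9" are hypothetical, added only to show a gap that crosses a year boundary.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "1", "1", "3", "3", "9", "9"],
    "country": ["ar", "ar", "ar", "it", "it", "it", "it"],
    "date": pd.to_datetime([
        "2019-01-01", "2019-02-01", "2019-03-01",
        "2019-03-01", "2019-04-01",
        # hypothetical rows (not in the question): gap across a year boundary
        "2019-11-15", "2020-02-01",
    ]),
})

# Sort so that diff() compares each date with the previous one in its group
df = df.sort_values(["id", "country", "date"], kind="stable")

# Running month count; the day of the month is deliberately ignored
month_index = df["date"].dt.year * 12 + df["date"].dt.month

df["monthsdiff"] = (
    month_index
    .groupby([df["id"], df["country"]])
    .diff()
    # first row of each group has no predecessor; treat it as 0
    .fillna(0)
    .astype(int)
)
print(df)
```

This counts month boundaries crossed, so 2019-01-31 to 2019-02-01 would count as 1 month. For dates that all fall on the first of the month, as in the question, that is the same as the number of whole months between them.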