I have the following DataFrame:
id, country, date
1, ar, 2019-01-01
1, ar, 2019-02-01
1, ar, 2019-03-01
1, it, 2019-01-01
1, it, 2019-02-01
1, it, 2019-03-01
1, it, 2019-04-01
1, it, 2019-03-01
2, ar, 2019-01-01
2, ar, 2019-02-01
2, ar, 2019-03-01
2, it, 2019-01-01
2, it, 2019-02-01
3, it, 2019-03-01
3, it, 2019-04-01
4, it, 2019-05-01
I need to group by id and country and compute the difference, in months, between consecutive dates within each group.
I tried:
df['daysdiff'] = df.sort_values('date').groupby(['id','country'])['date'].diff()
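But that gives the difference in days, and I need it in months. I don't think dividing 'daysdiff' by 30 is accurate, because months have different numbers of days, and there are leap years too...
Any help is welcome!
Answer 0 (score: 0)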
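I adapted this approach to your case.
Basically, you have to handle the NaT values. I chose to treat them as 0.
If you want, you can round the months to an integer.
As in your example, there is a duplicated row: "1", "it", "2019-03-01"
For this row the result is 7, 1, it, 2019-03-01, 0 days, 0
(because once the group is sorted by date, it comes right after the other row with the same date, so the difference is 0 days)
It seems to work for this case, although I haven't tested it on others.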
import pandas as pd
df = pd.DataFrame(
    columns=["id", "country", "date"],
    data=[
        ["1", "ar", "2019-01-01"],
        ["1", "ar", "2019-02-01"],
        ["1", "ar", "2019-03-01"],
        ["1", "it", "2019-01-01"],
        ["1", "it", "2019-02-01"],
        ["1", "it", "2019-03-01"],
        ["1", "it", "2019-04-01"],
        ["1", "it", "2019-03-01"],
        ["2", "ar", "2019-01-01"],
        ["2", "ar", "2019-02-01"],
        ["2", "ar", "2019-03-01"],
        ["2", "it", "2019-01-01"],
        ["2", "it", "2019-02-01"],
        ["3", "it", "2019-03-01"],
        ["3", "it", "2019-04-01"],
        ["4", "it", "2019-05-01"],
    ],
)
df["date"] = pd.to_datetime(df["date"])
df['daysdiff'] = df.sort_values('date').groupby(['id','country'])['date'].diff()
df['monthsdiff'] = (
    df
    .sort_values('date')
    .groupby(['id', 'country'])['date']
    .diff()
    # 365.25 [days/year] / (12 [months/year]) = 30.4375 [days/month]
    .div(pd.Timedelta(days=365.25 / 12))
    # the first row of each group has no previous date (NaT -> NaN); treat it as 0
    .fillna(0)
    .round()
    .astype(int)
)
print(df)
# id country date daysdiff monthsdiff
# 0 1 ar 2019-01-01 NaT 0
# 1 1 ar 2019-02-01 31 days 1
# 2 1 ar 2019-03-01 28 days 1
# 3 1 it 2019-01-01 NaT 0
# 4 1 it 2019-02-01 31 days 1
# 5 1 it 2019-03-01 28 days 1
# 6 1 it 2019-04-01 31 days 1
# 7 1 it 2019-03-01 0 days 0
# 8 2 ar 2019-01-01 NaT 0
# 9 2 ar 2019-02-01 31 days 1
# 10 2 ar 2019-03-01 28 days 1
# 11 2 it 2019-01-01 NaT 0
# 12 2 it 2019-02-01 31 days 1
# 13 3 it 2019-03-01 NaT 0
# 14 3 it 2019-04-01 31 days 1
# 15 4 it 2019-05-01 NaT 0
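If the average month length of 30.4375 days is still too approximate, one alternative is to count calendar months directly instead of converting days. The following is only a minimal sketch of that idea, not part of the approach adapted above: it turns each date into a running month number (year * 12 + month) and takes the group-wise diff of that number. The small sample frame and the names `ordered` and `month_index` are made up for illustration. Note that this counts month boundaries crossed and ignores the day of the month, so 2019-12-15 to 2020-02-01 counts as 2 months.

```python
import pandas as pd

# Small illustrative sample (not the full data from the question)
df = pd.DataFrame(
    columns=["id", "country", "date"],
    data=[
        ["1", "ar", "2019-01-01"],
        ["1", "ar", "2019-03-01"],
        ["1", "it", "2019-12-15"],
        ["1", "it", "2020-02-01"],
        ["2", "it", "2019-05-01"],
    ],
)
df["date"] = pd.to_datetime(df["date"])

# Sort by date; a stable sort keeps duplicated dates in their original order
ordered = df.sort_values("date", kind="stable")

# Running month number: independent of month length and leap years
month_index = ordered["date"].dt.year * 12 + ordered["date"].dt.month

# Difference to the previous row within each (id, country) group;
# the first row of a group has no predecessor and is treated as 0
df["monthsdiff"] = (
    month_index
    .groupby([ordered["id"], ordered["country"]])
    .diff()
    .fillna(0)
    .astype(int)
)

print(df)
```

The result is assigned back by index, so the original row order of `df` is preserved. If partial months should count as fractions, this sketch would not be enough and the day-based division above may be the better fit.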