I am using Python and Pandas. I have a DataFrame that looks like this:
codename date
AAA 13-03-2015
AAB 20-02-2015
AAB 15-04-2015
AAB 20-04-2015
AAB 21-04-2015
AAB 21-05-2015
I am looking for help with counting series of events within a 30-day window. The table below illustrates what I am trying to achieve:
codename date daysBetween series
AAA 13-03-2015 NaN 1
AAB 20-02-2015 NaN 1
AAB 15-04-2015 54 1
AAB 20-04-2015 5 0
AAB 21-04-2015 6 0
AAB 21-05-2015 36 1
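If more than 30 days have passed between cell 1 (20-02-2015) and the cell with 15-04-2015, compute the number of days between them (54), put the result in daysBetween, and put 1 in series.
If the gap between the two cells is 30 days or fewer, compute the number of days and put 0 in series.
Each date should be compared with the last date whose series value is 1.
I have managed to sort by codename and date: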
import pandas as pd
file = pd.read_excel('sample.xlsx')
sortedData = file.sort_values(by=['codename', 'date'])
Answer 0 (score: 0)
I think you need to compare the values with Series.gt and then convert the resulting True/False to 1/0 integers with astype:
#convert column to datetimes
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')
#sorting
df = df.sort_values(by=['codename', 'date'])
#days elapsed since the first date in each codename group
df['daysBetween'] = df['date'].sub(df.groupby('codename')['date'].transform('first')).dt.days
#compare by gt (>) and cast to int
df['series'] = df['daysBetween'].gt(30).astype(int)
print (df)
codename date daysBetween series
0 AAA 2015-03-13 0 0
1 AAB 2015-02-20 0 0
2 AAB 2015-04-15 54 1
3 AAB 2015-04-20 59 1
4 AAB 2015-04-21 60 1
5 AAB 2015-05-21 90 1
If you need the difference between consecutive values instead:
df['daysBetween'] = df.groupby('codename')['date'].diff().dt.days
df['series'] = df['daysBetween'].gt(30).astype(int)
print (df)
codename date daysBetween series
0 AAA 2015-03-13 NaN 0
1 AAB 2015-02-20 NaN 0
2 AAB 2015-04-15 54.0 1
3 AAB 2015-04-20 5.0 0
4 AAB 2015-04-21 1.0 0
5 AAB 2015-05-21 30.0 0
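Neither variant above reproduces the table in the question exactly, because the question compares each date with the last date whose series value is 1, and that reference date moves as rows are flagged. The following is only a rough sketch of that rule under some assumptions: the first row of each codename is flagged 1 with NaN in daysBetween (as in the question's table), the helper name mark_series and its window parameter are hypothetical, and a plain Python loop per group is used because the running reference date is hard to express with vectorized operations.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "codename": ["AAA", "AAB", "AAB", "AAB", "AAB", "AAB"],
    "date": ["13-03-2015", "20-02-2015", "15-04-2015",
             "20-04-2015", "21-04-2015", "21-05-2015"],
})
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")
df = df.sort_values(by=["codename", "date"]).reset_index(drop=True)


def mark_series(dates, window=30):
    """Hypothetical helper: compare each date with the last date flagged 1."""
    days_between, series = [], []
    anchor = None  # last date whose series value is 1
    for current in dates:
        if anchor is None:
            # Assumption: the first row of a group starts a series.
            days_between.append(float("nan"))
            series.append(1)
            anchor = current
            continue
        gap = (current - anchor).days
        days_between.append(gap)
        if gap > window:
            series.append(1)
            anchor = current  # this row becomes the new reference date
        else:
            series.append(0)
    return pd.DataFrame(
        {"daysBetween": days_between, "series": series}, index=dates.index
    )


# Apply the helper to each codename group and align the result on the index.
parts = [mark_series(group) for _, group in df.groupby("codename")["date"]]
df = df.join(pd.concat(parts))
print(df)
```

With the sample data this should give 54, 5, 6 and 36 in daysBetween for the last four AAB rows, and 1, 0, 0, 1 in series, which matches the table in the question. For large groups the loop may be slow, so treat it as a starting point rather than a tuned solution.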