Suppose my dataset is:
Name date
A 15-01-01
A 15-01-31
A 16-02-02
A 16-03-04
B 17-04-05
B 17-05-08
B 17-07-09
C 18-01-02
C 18-02-03
I want to add each person's first and last date as new columns.
The result I want:
Name date startdate enddate
A 15-01-01 15-01-01 16-03-04
A 15-01-31 15-01-01 16-03-04
A 16-02-02 15-01-01 16-03-04
A 16-03-04 15-01-01 16-03-04
B 17-04-05 17-04-05 17-07-09
B 17-05-08 17-04-05 17-07-09
B 17-07-09 17-04-05 17-07-09
C 18-01-02 18-01-02 18-02-03
C 18-02-03 18-01-02 18-02-03
Is there a way to do this? I would appreciate it if you could let me know.
Thanks for reading.
Answer 0 (score: 3)
Use GroupBy.transform with GroupBy.first or GroupBy.last and assign the results to new columns:
g = df.groupby('Name')['date']
df = df.assign(startdate = g.transform('first'), enddate = g.transform('last'))
print (df)
Name date startdate enddate
0 A 15-01-01 15-01-01 16-03-04
1 A 15-01-31 15-01-01 16-03-04
2 A 16-02-02 15-01-01 16-03-04
3 A 16-03-04 15-01-01 16-03-04
4 B 17-04-05 17-04-05 17-07-09
5 B 17-05-08 17-04-05 17-07-09
6 B 17-07-09 17-04-05 17-07-09
7 C 18-01-02 18-01-02 18-02-03
8 C 18-02-03 18-01-02 18-02-03
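For a version that runs on its own, the sample data from the question can be built inline. This is a minimal sketch: the DataFrame construction is an assumption about how the data is loaded. Note that 'first' and 'last' follow row order, so they give the earliest and latest dates only when each group is already sorted by date, as it is here.

```python
import pandas as pd

# Sample data from the question, kept as strings in yy-mm-dd form
df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'date': ['15-01-01', '15-01-31', '16-02-02', '16-03-04',
             '17-04-05', '17-05-08', '17-07-09', '18-01-02', '18-02-03'],
})

# transform broadcasts each group's first/last value back to every row
g = df.groupby('Name')['date']
df = df.assign(startdate=g.transform('first'), enddate=g.transform('last'))
print(df)
```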
Or, if you need the minimum and maximum values, use GroupBy.min and GroupBy.max:
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
g = df.groupby('Name')['date']
df = df.assign(startdate = g.transform('min'), enddate = g.transform('max'))
print (df)
Name date startdate enddate
0 A 2015-01-01 2015-01-01 2016-03-04
1 A 2015-01-31 2015-01-01 2016-03-04
2 A 2016-02-02 2015-01-01 2016-03-04
3 A 2016-03-04 2015-01-01 2016-03-04
4 B 2017-04-05 2017-04-05 2017-07-09
5 B 2017-05-08 2017-04-05 2017-07-09
6 B 2017-07-09 2017-04-05 2017-07-09
7 C 2018-01-02 2018-01-02 2018-02-03
8 C 2018-02-03 2018-01-02 2018-02-03
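The difference between the two approaches only shows up when rows are not in date order. The sketch below uses a small hypothetical frame (named shuffled here, not part of the question's data) to compare 'first' with 'min' on out-of-order rows.

```python
import pandas as pd

# Hypothetical out-of-order rows for a single person
shuffled = pd.DataFrame({
    'Name': ['A', 'A', 'A'],
    'date': pd.to_datetime(['16-02-02', '15-01-01', '16-03-04'],
                           format='%y-%m-%d'),
})

g = shuffled.groupby('Name')['date']
by_position = g.transform('first')  # value of the first row in the group
by_value = g.transform('min')       # earliest date in the group

print(by_position.iloc[0])
print(by_value.iloc[0])
```

Here 'first' returns 2016-02-02 because that row comes first, while 'min' returns 2015-01-01, so min and max are the safer choice when the sort order is not guaranteed.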
An alternative with GroupBy.agg and DataFrame.join:
df1 = df.groupby('Name').agg(startdate=('date', 'first'), enddate=('date', 'last'))
df = df.join(df1, on='Name')
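As a self-contained sketch of the agg and join route: the inline DataFrame below is an assumed, shortened stand-in for the question's data, and the summary columns are named startdate and enddate to match the desired output. Named aggregation in this form needs pandas 0.25 or later.

```python
import pandas as pd

# Shortened stand-in for the question's data
df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'B', 'C'],
    'date': ['15-01-01', '16-03-04', '17-04-05', '17-07-09', '18-01-02'],
})

# One row per Name, indexed by Name
summary = df.groupby('Name').agg(startdate=('date', 'first'),
                                 enddate=('date', 'last'))

# join matches the 'Name' column of df against the index of summary
out = df.join(summary, on='Name')
print(out)
```

Compared with transform, this builds the per-person summary table once, which is convenient if that table is also needed on its own.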
Answer 1 (score: 1)
There is a way to do this with a plain loop. I'm not entirely sure about it, but it should work.
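import pandas as pd

df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
start_date = {}  # earliest date seen so far for each Name
end_date = {}    # latest date seen so far for each Name
for name, d in zip(df['Name'], df['date']):
    if name not in start_date:
        # the first row for this Name initializes both bounds
        start_date[name] = d
        end_date[name] = d
    elif d > end_date[name]:
        end_date[name] = d
    elif d < start_date[name]:
        start_date[name] = d
# map the per-Name results back onto every row
df['startdate'] = df['Name'].map(start_date)
df['enddate'] = df['Name'].map(end_date)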