I have a DataFrame like this:
peak-date
0 17 Jan
1 17 Jan
2 31 Mar
3 30 Apr
4 31 May
5 26 Jun
6 26 Jun
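I want to find the row immediately before the rows where the peak-date value starts repeating. In this case, that would be the row whose peak-date is 31 May. I can use df['peak-date'].diff() to compute the differences, but what do I do from there?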
Answer 0 (score: 1)
One possible approach is as follows:
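First, parse the date strings with pandas' to_datetime and use the diff function to get the difference between each row and the one before it. Converting those differences to seconds gives plain floats to compare against. Then shift the difference column up by 2 rows and look for the first 0. A 0 in the shifted column at row i means that rows i+1 and i+2 hold the same date, so row i is the row just before the repeat, and its peak-date is the value you want.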
import pandas as pd

# Read the CSV; columns are separated by two or more whitespace characters
df = pd.read_csv('test.csv', sep=r'\s\s+', engine='python')
# Parse the "day month" strings and take the row-to-row difference (a Timedelta)
df['diff'] = pd.to_datetime(df['peak-date'], format='%d %b').diff()
# Store the difference in seconds in a separate column, then shift it up by 2 rows
df['diff_seconds'] = df['diff'].dt.total_seconds()
df['diff_seconds'] = df['diff_seconds'].shift(-2)
Looking at the DataFrame (the test file evidently contains a few more rows than the sample in the question):
   peak-date     diff  diff_seconds
0     17 Jan      NaT     6307200.0
1     17 Jan   0 days     2592000.0
2     31 Mar  73 days     2678400.0
3     30 Apr  30 days     2246400.0
4     31 May  31 days           0.0
5     26 Jun  26 days     2592000.0
6     26 Jun   0 days     2246400.0
6     26 Jul  30 days           0.0
6     21 Aug  26 days           NaN
6     21 Aug   0 days           NaN
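To see where the shift of 2 comes from, the same columns can be rebuilt on the seven rows from the question alone. This is only a minimal sketch: it assumes the sample is typed inline rather than read from test.csv, and it uses whole days instead of seconds for readability.

```python
import pandas as pd

# The seven sample rows from the question, typed inline (assumption: no CSV file)
df = pd.DataFrame({
    "peak-date": ["17 Jan", "17 Jan", "31 Mar", "30 Apr",
                  "31 May", "26 Jun", "26 Jun"]
})

# Without a year in the format, pandas parses the dates into the year 1900
dates = pd.to_datetime(df["peak-date"], format="%d %b")

# Difference to the previous row, in days (NaN for the first row)
day_diff = dates.diff().dt.days

# After shifting up by 2, row i holds the difference between rows i+1 and i+2
shifted = day_diff.shift(-2)

print(pd.DataFrame({"peak-date": df["peak-date"],
                    "day_diff": day_diff,
                    "shifted": shifted}))
```

The 0 in the shifted column sits on row 4 (31 May), because rows 5 and 6 are both 26 Jun. The repeated 17 Jan at the top has no row before it, so the shift simply drops it.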
Now get the value just before the first pair of identical consecutive dates:
# Find the index label of the first 0 in the shifted column
first_occur_index = df.diff_seconds.eq(0.0).idxmax()
# idxmax returns a label, so look the row up with .loc rather than .iloc
df.loc[first_occur_index, 'peak-date']
Result:
'31 May'
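One caveat: idxmax returns the first label when the boolean mask contains no True at all, so the lookup above would return the first row even if no date ever repeats. Below is a hedged sketch of a guarded variant that works on a boolean mask instead of seconds. The helper name row_before_first_repeat is hypothetical and not part of pandas.

```python
import pandas as pd

def row_before_first_repeat(dates: pd.Series):
    """Return the value just before the first pair of equal consecutive
    dates, or None if there is no such row (hypothetical helper)."""
    parsed = pd.to_datetime(dates, format="%d %b")
    # True where a row has the same date as the row directly above it
    same_as_prev = parsed.diff().eq(pd.Timedelta(0))
    # Shift up by 2: True at row i means rows i+1 and i+2 share a date
    before_repeat = same_as_prev.shift(-2, fill_value=False)
    if not before_repeat.any():
        return None  # explicit "not found" instead of idxmax's first label
    return dates[before_repeat].iloc[0]

sample = pd.Series(["17 Jan", "17 Jan", "31 Mar", "30 Apr",
                    "31 May", "26 Jun", "26 Jun"])
print(row_before_first_repeat(sample))
```

Returning None makes the "no repeat" case explicit; whether None, an exception, or NaN is the better signal depends on how the result is used downstream.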