Find the row before rows with the same value in a pandas dataframe

Time: 2017-10-31 00:09:30

Tags: python pandas

I have a dataframe like this:

  peak-date  
0    17 Jan  
1    17 Jan  
2    31 Mar  
3    30 Apr  
4    31 May  
5    26 Jun  
6    26 Jun  

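I want to find the row just before the point where the peak-date values become identical. In this case, that would be the row with peak-date 31 May. I can use df['peak-date'].diff() to compute the differences, but how do I go on from there?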

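1 answer:

Answer 0: (score: 1)

One possible approach is as follows. First, parse the strings as dates with pd.to_datetime and compute the difference between consecutive rows with diff. Next, convert the differences to seconds so that they are plain floats. Then shift the differences up by 2 rows and search for the first 0 difference; the row where it lands holds the peak-date value you want.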

import pandas as pd

# Read the CSV; columns are separated by two or more whitespace characters
df = pd.read_csv('test.csv', sep=r'\s\s+', engine='python')

# Parse day and month as dates and take the difference between consecutive rows
df['diff'] = pd.to_datetime(df['peak-date'], format='%d %b').diff()

# Store the difference in seconds in a separate column, then shift it up by 2 rows
df['diff_seconds'] = df['diff'].dt.total_seconds()
df['diff_seconds'] = df['diff_seconds'].shift(-2)

The resulting dataframe (the test file used here has a few more rows than the sample in the question):

  peak-date    diff  diff_seconds
0    17 Jan     NaT     6307200.0
1    17 Jan  0 days     2592000.0
2    31 Mar 73 days     2678400.0
3    30 Apr 30 days     2246400.0
4    31 May 31 days           0.0
5    26 Jun 26 days     2592000.0
6    26 Jun  0 days     2246400.0
6    26 Jul 30 days           0.0
6    21 Aug 26 days           NaN
6    21 Aug  0 days           NaN
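
To see why the shift is by 2 rather than 1, here is a minimal sketch on toy data. The series and its name are made up for illustration: a 1 marks a row that equals the row before it, which is the role a 0-second difference plays above.

```python
import pandas as pd

# Hypothetical toy flags: 1 means "this row equals the previous row".
# Rows 2 and 3 form the equal pair, so the flag sits on row 3.
same_as_prev = pd.Series([0, 0, 0, 1, 0])

# shift(-2) moves every flag two rows up, so it lands on row 1,
# the row immediately before the equal pair (rows 2 and 3).
flag_before_pair = same_as_prev.shift(-2)

print(flag_before_pair.tolist())  # [0.0, 1.0, 0.0, nan, nan]
```

The last two rows become NaN because there is nothing below them to shift in. For the same reason, a pair at the very top of the frame (such as the two 17 Jan rows) never produces a hit: no row exists before it.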

Now get the value just before the first pair of consecutive identical dates:

# Find the index label of the first 0.0 and fetch that row by label
first_occur_index = df['diff_seconds'].eq(0.0).idxmax()
df.loc[first_occur_index, 'peak-date']

Result:

'31 May'
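
As a self-contained variant that needs no test.csv, the sketch below builds the frame from the question inline. It is not the method above: it compares the raw strings of the next two rows directly instead of going through datetime differences, and it assumes that identical dates are always written identically. The helper name first_row_before_pair is hypothetical. It also guards against the case where no pair exists, since idxmax() on an all-False mask would otherwise return the first label.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({'peak-date': ['17 Jan', '17 Jan', '31 Mar', '30 Apr',
                                 '31 May', '26 Jun', '26 Jun']})

col = df['peak-date']
next_row = col.shift(-1)
row_after_next = col.shift(-2)

# True where the next two rows hold the same value, i.e. this row sits
# immediately before a pair of equal rows; missing values never match.
before_pair = next_row.eq(row_after_next) & row_after_next.notna()


def first_row_before_pair(frame, mask):
    """Hypothetical helper: peak-date of the first flagged row, or None."""
    if not mask.any():
        # idxmax() on an all-False mask would return the first label,
        # so report "no match" explicitly instead.
        return None
    return frame.loc[mask.idxmax(), 'peak-date']


print(first_row_before_pair(df, before_pair))  # 31 May
```

If the dates can be written in different forms, converting the column with pd.to_datetime first, as in the answer above, is the safer route.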