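I have a time series DataFrame and want to find the difference between the date in each record and the last (maximum) date in that DataFrame. However, I get an error: TypeError: unsupported operand type(s) for -: 'DatetimeIndex' and 'SeriesGroupBy'. From the error it seems the DataFrame is not of the "right" type to allow these operations. How can I avoid this, or convert the data to some other format so that the operation can be performed? Below is sample code that reproduces the error: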
import pandas as pd
df = pd.DataFrame([[54.7,36.3,'2010-07-20'],[54.7,36.3,'2010-07-21'],[52.3,38.7,'2010-07-26'],[52.3,38.7,'2010-07-30']],
columns=['col1','col2','date'])
df.date = pd.to_datetime(df.date)
df.index = df.date
df = df.resample('D')
print(type(df))
diff = (df.date.max() - df.date).values
Answer 0 (score: 1)
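I think you first need to create a DatetimeIndex with DataFrame.set_index. Calling resample on its own returns a lazy resampler object rather than a DataFrame, which is why the subtraction in the question fails. If you then aggregate with max, you get a DataFrame with consecutive daily values: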
df = pd.DataFrame([[54.7,36.3,'2010-07-20'],
[54.7,36.3,'2010-07-21'],
[52.3,38.7,'2010-07-26'],
[52.3,38.7,'2010-07-30']],
columns=['col1','col2','date'])
df.date = pd.to_datetime(df.date)
df1 = df.set_index('date').resample('D').max()
# alternative if there are no duplicated datetimes
#df1 = df.set_index('date').asfreq('D')
print (df1)
            col1  col2
date
2010-07-20  54.7  36.3
2010-07-21  54.7  36.3
2010-07-22   NaN   NaN
2010-07-23   NaN   NaN
2010-07-24   NaN   NaN
2010-07-25   NaN   NaN
2010-07-26  52.3  38.7
2010-07-27   NaN   NaN
2010-07-28   NaN   NaN
2010-07-29   NaN   NaN
2010-07-30  52.3  38.7
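As an aside, it may help to check what resample returns before and after aggregation. The following is a minimal sketch, assuming a reasonably recent pandas version; the two-row frame and the variable names resampler and aggregated are made up for illustration, and the exact class name printed for the resampler may differ between versions.

```python
import pandas as pd

# Small illustrative frame (not the data from the question)
df = pd.DataFrame({'col1': [54.7, 52.3],
                   'date': pd.to_datetime(['2010-07-20', '2010-07-23'])})

# resample alone gives a lazy, group-by-like object, not a DataFrame
resampler = df.set_index('date').resample('D')

# applying an aggregation such as max turns it back into a DataFrame
aggregated = resampler.max()

print(type(resampler).__name__)
print(type(aggregated).__name__)
```

Only the aggregated object supports the usual DataFrame operations, such as reading its index and subtracting it from a timestamp.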
Then subtract the index of df1 from its own maximum value and convert the resulting timedeltas to days with TimedeltaIndex.days:
df1['diff'] = (df1.index.max() - df1.index).days
print (df1)
            col1  col2  diff
date
2010-07-20  54.7  36.3    10
2010-07-21  54.7  36.3     9
2010-07-22   NaN   NaN     8
2010-07-23   NaN   NaN     7
2010-07-24   NaN   NaN     6
2010-07-25   NaN   NaN     5
2010-07-26  52.3  38.7     4
2010-07-27   NaN   NaN     3
2010-07-28   NaN   NaN     2
2010-07-29   NaN   NaN     1
2010-07-30  52.3  38.7     0
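If the daily gaps do not need to be filled, the difference can probably be computed directly on the date column without resampling at all. This is a sketch under that assumption, not part of the original answer; it keeps date as an ordinary column and uses the Series.dt.days accessor in place of TimedeltaIndex.days.

```python
import pandas as pd

df = pd.DataFrame([[54.7, 36.3, '2010-07-20'],
                   [54.7, 36.3, '2010-07-21'],
                   [52.3, 38.7, '2010-07-26'],
                   [52.3, 38.7, '2010-07-30']],
                  columns=['col1', 'col2', 'date'])
df['date'] = pd.to_datetime(df['date'])

# Timestamp minus a datetime Series gives a timedelta Series;
# .dt.days extracts the whole-day component of each element
df['diff'] = (df['date'].max() - df['date']).dt.days
print(df)
```

Here the frame keeps only the four original rows, so the diff column holds 10, 9, 4 and 0.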