I have time series data stored in a pandas DataFrame, as shown below:
Date Open High Low Close Volume
0 2016-01-19 22.86 22.92 22.36 22.60 838024
1 2016-01-20 22.19 22.98 21.87 22.77 796745
2 2016-01-21 22.75 23.10 22.62 22.76 573068
3 2016-01-22 23.13 23.35 22.96 23.33 586967
4 2016-01-25 23.22 23.42 23.01 23.26 645551
5 2016-01-26 23.28 23.85 23.22 23.74 592658
6 2016-01-27 23.68 23.78 18.76 20.09 5351850
7 2016-01-28 20.05 20.69 19.11 19.37 2255635
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
9 2016-02-01 19.77 19.80 19.13 19.14 1203375
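I want to write a function to use with apply that receives a slice of the original dataset and aggregates it with an arbitrary custom aggregation operator.
Say the function is applied like this: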
aggregated_df = data.apply(calculateMySpecificAggregation, axis=1)
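where calculateMySpecificAggregation receives, for each row of the original DataFrame, a 3-row slice of that DataFrame. For each row, the DataFrame passed as the argument contains the row itself plus the previous and the next row of the original DataFrame.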
#pseudocode example
def calculateMySpecificAggregation(df_slice):
    # I want to know which row this function was applied on (an index I would like to have here)
    ri = ???  # index of the row this function was applied on
    # df_slice contains 3 rows and all columns
    return float(df_slice["Close"][ri-1] + \
                 ((df_slice["High"][ri] + df_slice["Low"][ri]) / 2) + \
                 df_slice["Open"][ri+1])
    # the return above will fail on the borders, but don't worry, I will handle it later...
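I would like to parameterize the sliding window size, access the other columns of the rows, and know the index of the original row the function was applied on.
That means that with slideWindow = 3, I want these argument DataFrames: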
#parameter dataframe when the function is applied on row[0]:
Date Open High Low Close Volume
0 2016-01-19 22.86 22.92 22.36 22.60 838024
1 2016-01-20 22.19 22.98 21.87 22.77 796745
#parameter dataframe when the function is applied on row[1]:
Date Open High Low Close Volume
0 2016-01-19 22.86 22.92 22.36 22.60 838024
1 2016-01-20 22.19 22.98 21.87 22.77 796745
2 2016-01-21 22.75 23.10 22.62 22.76 573068
#parameter dataframe when the function is applied on row[2]:
Date Open High Low Close Volume
1 2016-01-20 22.19 22.98 21.87 22.77 796745
2 2016-01-21 22.75 23.10 22.62 22.76 573068
3 2016-01-22 23.13 23.35 22.96 23.33 586967
#parameter dataframe when the function is applied on row[3]:
Date Open High Low Close Volume
2 2016-01-21 22.75 23.10 22.62 22.76 573068
3 2016-01-22 23.13 23.35 22.96 23.33 586967
4 2016-01-25 23.22 23.42 23.01 23.26 645551
...
#parameter dataframe when the function is applied on row[7]:
Date Open High Low Close Volume
6 2016-01-27 23.68 23.78 18.76 20.09 5351850
7 2016-01-28 20.05 20.69 19.11 19.37 2255635
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
#parameter dataframe when the function is applied on row[8]:
Date Open High Low Close Volume
7 2016-01-28 20.05 20.69 19.11 19.37 2255635
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
9 2016-02-01 19.77 19.80 19.13 19.14 1203375
#parameter dataframe when the function is applied on row[9]:
Date Open High Low Close Volume
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
9 2016-02-01 19.77 19.80 19.13 19.14 1203375
If possible, I would rather not use a loop combined with iloc indexing.
I have tried pandas.DataFrame.rolling and pandas.rolling_apply, but without success.
Does anyone know how to solve this?
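Answer 0 (score: 0)
OK, after a long struggle, I have solved this problem.
I could not avoid iloc (which is not a big problem in this case), but at least no explicit loop is used here.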
contextSizeLeft = 2
contextSizeRight = 3

def aggregateWithContext(df, row, func, contextSizeLeft, contextSizeRight):
    # row.name is the index label; this assumes the default 0..n-1 integer
    # index, so the label is also the row position.
    leftBorder = max(0, row.name - contextSizeLeft)
    # iloc slices exclude the end, hence the + 1; clamp to the frame length.
    rightBorder = min(len(df), row.name + contextSizeRight + 1)
    # Debug output, uncomment if needed:
    # print("pos: ", row.name,
    #       "\t", (row.name - contextSizeLeft, row.name + contextSizeRight),
    #       "\t", (leftBorder, rightBorder),
    #       "\t", len(df.iloc[leftBorder:rightBorder]))
    return func(df.iloc[leftBorder:rightBorder], row.name)

def aggregate(df, center):
    # Example aggregation: print the window and return its number of rows.
    print()
    print("center", center)
    print(df["Date"])
    return len(df)

df.apply(lambda x: aggregateWithContext(df, x, aggregate, contextSizeLeft, contextSizeRight), axis=1)
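As a usage sketch, the helper can be combined with the formula from the question. Everything below is illustrative: the frame holds only the first five rows of the question's data, the names aggregate_with_context and close_mid_open are hypothetical, and returning NaN at the borders is my assumption about how the edge rows might be handled. The last lines compute the same formula with Series.shift, which is a different technique from the windowed apply and only works because this particular formula touches fixed neighbours.

```python
import numpy as np
import pandas as pd

# First five rows of the question's data (price columns only).
df = pd.DataFrame({
    "Open":  [22.86, 22.19, 22.75, 23.13, 23.22],
    "High":  [22.92, 22.98, 23.10, 23.35, 23.42],
    "Low":   [22.36, 21.87, 22.62, 22.96, 23.01],
    "Close": [22.60, 22.77, 22.76, 23.33, 23.26],
})

def aggregate_with_context(df, row, func, left, right):
    # Assumes the default RangeIndex, so row.name is also the row position.
    start = max(0, row.name - left)
    stop = min(len(df), row.name + right + 1)
    return func(df.iloc[start:stop], row.name)

def close_mid_open(window, center):
    # Hypothetical aggregation: the question's formula.
    # Assumption: return NaN when a neighbouring row is missing (borders).
    if center - 1 not in window.index or center + 1 not in window.index:
        return float("nan")
    mid = (window.loc[center, "High"] + window.loc[center, "Low"]) / 2
    return float(window.loc[center - 1, "Close"] + mid
                 + window.loc[center + 1, "Open"])

# One row of context on each side, i.e. a window of 3.
windowed = df.apply(
    lambda r: aggregate_with_context(df, r, close_mid_open, 1, 1), axis=1)

# Same formula without a per-row function call, using shift.
shifted = (df["Close"].shift(1)
           + (df["High"] + df["Low"]) / 2
           + df["Open"].shift(-1))

print(np.allclose(windowed, shifted, equal_nan=True))
```

The slice returned by iloc keeps the original index labels, which is why the aggregation can address rows as center - 1 and center + 1 with loc.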
In case anyone needs it, here is the same approach with a date-based window:
from datetime import timedelta
import pandas as pd

def aggregateWithContext(df, row, func, timedeltaLeft, timedeltaRight):
    # Assumes df["Date"] has a datetime dtype (e.g. converted with pd.to_datetime).
    dateInRecord = row["Date"]
    leftBorder = pd.to_datetime(dateInRecord - timedeltaLeft)
    rightBorder = pd.to_datetime(dateInRecord + timedeltaRight)
    # Both borders are inclusive.
    dfs = df[(df['Date'] >= leftBorder) & (df['Date'] <= rightBorder)]
    #print(dateInRecord, ":\t", leftBorder, "\t", rightBorder, "\t", len(dfs))
    return func(dfs, row.name)

def aggregate(df, center):
    # Example aggregation: return the number of rows in the window.
    #print()
    #print("center", center)
    #print(df["Date"])
    return len(df)

timedeltaLeft = timedelta(days=2)
timedeltaRight = timedelta(days=2)
df.apply(lambda x: aggregateWithContext(df, x, aggregate, timedeltaLeft, timedeltaRight), axis=1)
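To show what the date-based window selects, here is a small self-contained run on the dates from the question. The snake_case names and the count_rows aggregation are hypothetical, and only the Date and Close columns are included to keep the sketch short.

```python
from datetime import timedelta

import pandas as pd

# Dates and closing prices from the question's data.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-01-19", "2016-01-20", "2016-01-21", "2016-01-22", "2016-01-25",
        "2016-01-26", "2016-01-27", "2016-01-28", "2016-01-29", "2016-02-01",
    ]),
    "Close": [22.60, 22.77, 22.76, 23.33, 23.26,
              23.74, 20.09, 19.37, 19.90, 19.14],
})

def aggregate_with_context(df, row, func, delta_left, delta_right):
    # Select every row whose date lies within the inclusive calendar window.
    start = row["Date"] - delta_left
    stop = row["Date"] + delta_right
    window = df[(df["Date"] >= start) & (df["Date"] <= stop)]
    return func(window, row.name)

def count_rows(window, center):
    # Hypothetical aggregation: number of rows that fall into the window.
    return len(window)

counts = df.apply(
    lambda r: aggregate_with_context(
        df, r, count_rows, timedelta(days=2), timedelta(days=2)),
    axis=1,
)
print(counts.tolist())  # [3, 4, 4, 3, 3, 4, 5, 4, 3, 1]
```

Because the window is measured in calendar days, the number of rows per window varies around weekends and at the ends of the series, unlike the position-based version, which always returns a fixed number of rows away from the borders.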