I'm trying to preprocess data for an ML regression problem.
I'm working with the following (simplified) dataframe:
grp day score
0 A 1 2
1 A 1 4
2 A 2 6
3 A 2 8
4 A 3 10
5 A 3 12
6 A 4 14
7 A 4 16
8 A 5 18
9 A 5 20
10 B 1 2
11 B 2 4
12 B 3 8
13 B 4 16
14 B 5 32
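I'm trying to create a list of "sliding window" sequences based on the 'day' column: for every window of X consecutive days, the target is the last score recorded Y days after the window ends.
In the example below, each group has 5 days; each window covers 2 days and its target lies 2 days ahead, and I stop when I reach the end of the dataframe.
For example, these are the first two sequences for group A: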
grp day score target
0 A 1 2 16
1 A 1 4 16
2 A 2 6 16
3 A 2 8 16 <- last score value of day 4 (group A)
grp day score target
0 A 2 6 20
1 A 2 8 20
2 A 3 10 20
3 A 3 12 20 <- last score value of day 5 (group A)
For group B:
grp day score target
10 B 1 2 16
11 B 2 4 16 <- last score value of day 4 (group B)
grp day score target
10 B 2 4 32
11 B 3 8 32 <- last score value of day 5 (group B)
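The "last score value of day N" used as the target above can be looked up per group and day. This is only a minimal sketch, assuming the rows are already in the desired order within each day; the name `last_scores` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'grp': list('AAAAAAAAAA') + list('BBBBB'),
    'day': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 2, 3, 4, 5],
    'score': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 2, 4, 8, 16, 32],
})

# Last score recorded for every (grp, day) pair, in row order
last_scores = df.groupby(['grp', 'day'])['score'].last()

print(last_scores.loc[('A', 4)])  # 16
print(last_scores.loc[('B', 5)])  # 32
```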
I've used factorize to get the day indices and the groups, like so:
groups = df.groupby(['grp'])
for _, grp in groups:
    days_row_index = grp['day'].factorize()[0]
    days_group = grp.groupby(days_row_index)
    ...
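As a quick illustration of what factorize returns for a day column like group A's (a toy Series built just for this sketch, not part of the actual pipeline): it gives a 0-based code per row, in order of first appearance, plus the unique values.

```python
import pandas as pd

# Toy stand-in for group A's 'day' column
days = pd.Series([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], name='day')

codes, uniques = days.factorize()
print(codes.tolist())    # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(uniques.tolist())  # [1, 2, 3, 4, 5]
```

Note that the codes start at 0 while the day values start at 1, which is easy to mix up when the codes are later used as group keys.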
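But I'm a bit lost... any help would be much appreciated.
Update:
I've written the following clumsy code to help me move forward... how can it be improved?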
import pandas as pd
df = pd.DataFrame({'grp':['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B'],
'day':['1','1','2','2','3','3','4','4','5','5','1','2','3','4','5'],
'score':[2,4,6,8,10,12,14,16,18,20,2,4,8,16,32]
})
print(df.head(15))
df2 = pd.DataFrame({'grp':[],
'day':[],
'score':[]})
groups = df.groupby(['grp'])
GROUP_SIZE = 2
LOOK_AHEAD = 2
sequences = []
for _, grp in groups:
    # factorize() gives 0-based day codes, used below as the keys of days_group
    days_row_index = grp['day'].factorize()[0]
    days_group = grp.groupby(days_row_index)
    for _, day in days_group:
        # 1-based day value; note it is mixed with the 0-based group keys below
        day_index = int(day['day'].values[0])
        if day_index + LOOK_AHEAD < len(days_group):
            # Last score of the target day
            target = days_group.get_group(day_index + LOOK_AHEAD)['score'].values[-1]
            print(day_index, day_index + LOOK_AHEAD, day['score'].values[-1], "----------->", target)
            day['target'] = target
            df2 = pd.concat([df2, day])
            # Append the remaining days of the window
            for i in range(0, GROUP_SIZE-1):
                if day_index + i >= len(days_group):
                    break
                next_day = days_group.get_group(day_index + i)
                next_day['target'] = target
                df2 = pd.concat([df2, next_day])
            sequences.append(df2.copy())
            df2 = df2.iloc[0:0]
sequences
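Answer 0 (score: 0)
Building on your proposed solution, I wrote this small piece of code. I'm sure it can be optimized, so anyone is welcome to improve it. Let me know if this is what you're after (I took the liberty of adding another "mixed" group "C" to test a more general approach).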
import pandas as pd
# Create test dataframe
df = [
['A', 1, 2],
['A', 1, 4],
['A', 2, 6],
['A', 2, 8],
['A', 3, 10],
['A', 3, 12],
['A', 4, 14],
['A', 4, 16],
['A', 5, 18],
['A', 5, 20],
['B', 1, 2],
['B', 2, 4],
['B', 3, 8],
['B', 4, 16],
['B', 5, 32],
['C', 1, 2],
['C', 1, 4],
['C', 2, 8],
['C', 3, 16],
['C', 3, 20],
['C', 4, 24],
['C', 5, 28]
]
df = pd.DataFrame(df, columns = ['grp', 'day', 'score'])
# Processing
groups = df.groupby(['grp'])
for _, grp in groups:
    days_row_index = grp['day'].factorize()[0]
    i = min(days_row_index)
    while i < max(days_row_index) - 2:
        idx = (days_row_index == i) | (days_row_index == i + 1)
        # Create list of targets for every subgroup
        print([grp['score'].values[days_row_index == i + 3][-1]]*sum(idx))
        i += 1
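The loop above only prints the target lists. One possible way to turn the same idea into actual sequences with a target column is sketched below. The helper `make_sequences` and its parameters `window` and `look_ahead` are hypothetical names introduced for illustration, and the sketch assumes the rows of each group are already sorted by day.

```python
import pandas as pd


def make_sequences(df, window=2, look_ahead=2):
    """Hypothetical helper: build per-group sliding windows with a 'target' column.

    `window` is the number of consecutive days in each sequence and
    `look_ahead` is how many days after the window's last day the target
    is taken from.
    """
    sequences = []
    for _, grp in df.groupby('grp', sort=False):
        # 0-based position of each row's day within this group
        codes, days = pd.factorize(grp['day'])
        scores = grp['score'].to_numpy()
        # The range is empty when the group has too few days
        for start in range(len(days) - window - look_ahead + 1):
            target_code = start + window - 1 + look_ahead
            in_window = (codes >= start) & (codes < start + window)
            # Last score recorded on the target day
            target = scores[codes == target_code][-1]
            sequences.append(grp[in_window].assign(target=target))
    return sequences


df = pd.DataFrame({
    'grp': list('AAAAAAAAAA') + list('BBBBB'),
    'day': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 2, 3, 4, 5],
    'score': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 2, 4, 8, 16, 32],
})

sequences = make_sequences(df)
for seq in sequences:
    print(seq, end='\n\n')
```

With the question's data this should yield four frames (two for group A, two for group B) that keep the original row index, so each one can be traced back to the source dataframe.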