I have a dataframe (game_df) and a Postgres table (team_stats_1970_2017).
game_df consists of a few thousand rows, with data like this:
   season_yr home_team visitor_team  home_team_runs  visitor_team_runs
0       2017       ARI          SFG               6                  5
1       2017       ARI          SFG               4                  8
2       2017       ARI          SFG               8                  6
3       2017       ARI          SFG               9                  3
4       2017       ARI          CLE               7                  3
5       2017       ARI          CLE              11                  2
6       2017       ATL          LAD               2                  3
team_stats_1970_2017 has the corresponding data:
  team  season_yr  r_per_g      pa    ab  b_r   b_h   b2  b3  b_hr
0  ARI       2017     5.01  6224.0  5525  812  1405  314  39   220
1  ATL       2017     4.52  6216.0  5584  732  1467  289  26   165
2  CLE       2017     5.05  6234.0  5511  818  1449  333  29   212
3  LAD       2017     4.75  6191.0  5408  770  1347  312  20   221
4  SFG       2017     3.94  6137.0  5551  639  1382  290  28   128
For example, for row 1 of game_df, the code selects the 'ARI' and 'SFG' data from the Postgres table team_stats_1970_2017 and creates features from it. It then repeats this for the remaining rows in game_df.
My current approach goes through game_df row by row as described, but I noticed it is rather slow: I have only tested it on a small portion of my data and it still takes a while. Does anyone have a better/faster option for this?
Answer 0 (score: 1)
Here is another approach that is easy to follow, but it uses merge, as @sacul's solution does. I would create two dataframes, df_visitor and df_home, holding the values from team_stats_1970_2017 for the team in each row of the columns 'visitor_team' and 'home_team' respectively:
df_visitor = (game_df[['season_yr','visitor_team']].rename(columns={'visitor_team':'team'})
              .merge(team_stats_1970_2017, how='left'))
df_home = (game_df[['season_yr','home_team']].rename(columns={'home_team':'team'})
           .merge(team_stats_1970_2017, how='left'))
For example, df_home looks like this:
   season_yr team  r_per_g      pa    ab  b_r   b_h   b2  b3  b_hr
0       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
1       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
2       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
3       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
4       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
5       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
6       2017  ATL     4.52  6216.0  5584  732  1467  289  26   165
Each row holds the values from team_stats_1970_2017 for the team that appears in the 'home_team' column of the same row of game_df.
Now, to add the differences to the original dataframe game_df, you can do:
# first get the list of columns you want to add
col_features = team_stats_1970_2017.columns[2:]
game_df[col_features] = df_visitor[col_features] - df_home[col_features]
Finally, to add a result column, you can use np.where.
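The np.where call itself is not shown in this answer. A minimal sketch of what it might look like, assuming the result column should be 1 when the home team scored more runs and 0 otherwise (the column name result and that encoding are assumptions; the question does not specify them):

```python
import numpy as np
import pandas as pd

# A few rows shaped like game_df from the question.
game_df = pd.DataFrame({
    "season_yr": [2017, 2017, 2017],
    "home_team": ["ARI", "ARI", "ATL"],
    "visitor_team": ["SFG", "SFG", "LAD"],
    "home_team_runs": [6, 4, 2],
    "visitor_team_runs": [5, 8, 3],
})

# Hypothetical label (an assumption): 1 if the home team won, 0 otherwise.
game_df["result"] = np.where(
    game_df["home_team_runs"] > game_df["visitor_team_runs"], 1, 0
)
```

np.where evaluates the comparison for the whole column at once, so no per-row loop is needed.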
Answer 1 (score: 0)
If I understand correctly, and if you can get team_stats_1970_2017 as a pandas dataframe, you can apply two merges: one on home_team and season_yr, and one on visitor_team and season_yr:
merged_df = (game_df.merge(team_stats_1970_2017,
                           left_on=['home_team', 'season_yr'],
                           right_on=['team', 'season_yr'])
             .merge(team_stats_1970_2017,
                    left_on=['visitor_team', 'season_yr'],
                    right_on=['team', 'season_yr'],
                    suffixes=['_home', '_visitor'])
             .drop(['team_visitor', 'team_home'], axis=1))
>>> merged_df
season_yr home_team visitor_team home_team_runs visitor_team_runs \
0 2017 ARI SFG 6 5
1 2017 ARI SFG 4 8
2 2017 ARI SFG 8 6
3 2017 ARI SFG 9 3
4 2017 ARI CLE 7 3
5 2017 ARI CLE 11 2
6 2017 ATL LAD 2 3
r_per_g_home pa_home ab_home b_r_home b_h_home ... b3_home \
0 5.01 6224.0 5525 812 1405 ... 39
1 5.01 6224.0 5525 812 1405 ... 39
2 5.01 6224.0 5525 812 1405 ... 39
3 5.01 6224.0 5525 812 1405 ... 39
4 5.01 6224.0 5525 812 1405 ... 39
5 5.01 6224.0 5525 812 1405 ... 39
6 4.52 6216.0 5584 732 1467 ... 26
b_hr_home r_per_g_visitor pa_visitor ab_visitor b_r_visitor \
0 220 3.94 6137.0 5551 639
1 220 3.94 6137.0 5551 639
2 220 3.94 6137.0 5551 639
3 220 3.94 6137.0 5551 639
4 220 5.05 6234.0 5511 818
5 220 5.05 6234.0 5511 818
6 165 4.75 6191.0 5408 770
b_h_visitor b2_visitor b3_visitor b_hr_visitor
0 1382 290 28 128
1 1382 290 28 128
2 1382 290 28 128
3 1382 290 28 128
4 1449 333 29 212
5 1449 333 29 212
6 1347 312 20 221
[7 rows x 21 columns]
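One caveat: merge defaults to how='inner', so a game whose team and season has no match in team_stats_1970_2017 would silently drop out of the result. Below is a sketch of a more defensive variant on a cut-down copy of the sample data. The helper add_side, the name checked_df, the single-stat frame and the unmatched team 'XXX' are all made up for illustration:

```python
import pandas as pd

game_df = pd.DataFrame({
    "season_yr": [2017, 2017, 2017],
    "home_team": ["ARI", "ATL", "ARI"],
    "visitor_team": ["SFG", "LAD", "XXX"],  # "XXX" has no stats row (made up)
})
team_stats = pd.DataFrame({
    "team": ["ARI", "ATL", "LAD", "SFG"],
    "season_yr": [2017, 2017, 2017, 2017],
    "pa": [6224.0, 6216.0, 6191.0, 6137.0],
})

def add_side(df, stats, side):
    """Left-join the stats for one side ('home' or 'visitor')."""
    keys = ("team", "season_yr")
    # Suffix every stat column, then rename 'team' to match the game column.
    renamed = stats.rename(
        columns={c: f"{c}_{side}" for c in stats.columns if c not in keys}
    ).rename(columns={"team": f"{side}_team"})
    # how="left" keeps every game; validate raises MergeError if a
    # team/season pair appears more than once in the stats table.
    return df.merge(renamed, on=[f"{side}_team", "season_yr"],
                    how="left", validate="many_to_one")

checked_df = add_side(add_side(game_df, team_stats, "home"), team_stats, "visitor")
```

With how='left', the unmatched game stays in checked_df with NaN in pa_visitor, which makes missing stats visible instead of shrinking the frame.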
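You can then use this merged_df to compute your features. For example (since you seem to want your features as np.arrays), to compute the difference between pa_home and pa_visitor (this is just a dummy example):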
>>> (merged_df['pa_home'] - merged_df['pa_visitor']).values
array([ 87., 87., 87., 87., -10., -10., 25.])
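To get every feature difference at once rather than one column at a time, the _home and _visitor suffixes can be used to pair up the columns. A minimal sketch, assuming a frame with the suffixed layout shown above. The small frame here carries only two stats, and the diff_ prefix and the names diff and diff_df are arbitrary choices:

```python
import pandas as pd

# Stand-in for merged_df with just two of the suffixed stat pairs.
merged_df = pd.DataFrame({
    "pa_home": [6224.0, 6224.0, 6216.0],
    "pa_visitor": [6137.0, 6234.0, 6191.0],
    "b_hr_home": [220, 220, 165],
    "b_hr_visitor": [128, 212, 221],
})

# Stat names are whatever precedes the "_home" suffix.
stats = [c[: -len("_home")] for c in merged_df.columns if c.endswith("_home")]

home = merged_df[[f"{s}_home" for s in stats]].to_numpy()
visitor = merged_df[[f"{s}_visitor" for s in stats]].to_numpy()

# One 2-D array: rows are games, columns are stats (home minus visitor).
diff = home - visitor
diff_df = pd.DataFrame(diff, columns=[f"diff_{s}" for s in stats],
                       index=merged_df.index)
```

Here diff is a plain NumPy array of shape (games, stats), and diff_df keeps the same numbers with labelled columns if that is more convenient.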