Alternative to df.iterrows() for joining two Postgres tables and computing features

Time: 2018-08-02 21:31:55

Tags: python python-3.x postgresql pandas

I have a dataframe (game_df) and a Postgres table (team_stats_1970_2017).

game_df consists of a few thousand rows, with data like this:

      season_yr home_team visitor_team  home_team_runs  visitor_team_runs
0         2017       ARI          SFG               6                  5
1         2017       ARI          SFG               4                  8
2         2017       ARI          SFG               8                  6
3         2017       ARI          SFG               9                  3
4         2017       ARI          CLE               7                  3
5         2017       ARI          CLE              11                  2
6         2017       ATL          LAD               2                  3

team_stats_1970_2017 holds the corresponding data:

  team  season_yr  r_per_g      pa    ab  b_r   b_h   b2  b3  b_hr
0  ARI       2017     5.01  6224.0  5525  812  1405  314  39   220
1  ATL       2017     4.52  6216.0  5584  732  1467  289  26   165
2  CLE       2017     5.05  6234.0  5511  818  1449  333  29   212
3  LAD       2017     4.75  6191.0  5408  770  1347  312  20   221
4  SFG       2017     3.94  6137.0  5551  639  1382  290  28   128

For example, for row 1 of game_df, the code selects the 'ARI' and 'SFG' data from the Postgres table team_stats_1970_2017 and creates features from it. It then repeats this for the remaining rows of game_df.

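I am currently using df.iterrows(), but I have noticed it is quite slow: I am only testing on a small part of my data and it still takes a while. Does anyone have a better/faster option for this?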


2 Answers:

Answer 0 (score: 1)

Here is another approach that is easy to follow, although it uses merge like @sacul's solution. I would create two dataframes, df_visitor and df_home, holding the values from team_stats_1970_2017 for the team in the 'visitor_team' and 'home_team' columns of each row, respectively:

df_visitor = (game_df[['season_yr','visitor_team']].rename(columns={'visitor_team':'team'})
                  .merge(team_stats_1970_2017, how='left'))
df_home = (game_df[['season_yr','home_team']].rename(columns={'home_team':'team'})
                  .merge(team_stats_1970_2017, how='left'))

For example, df_home looks like this:

   season_yr team  r_per_g      pa    ab  b_r   b_h   b2  b3  b_hr
0       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
1       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
2       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
3       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
4       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
5       2017  ARI     5.01  6224.0  5525  812  1405  314  39   220
6       2017  ATL     4.52  6216.0  5584  732  1467  289  26   165

Each row holds the values from team_stats_1970_2017 for the team found in the 'home_team' column of the same row of game_df.

To add the differences to the original dataframe game_df, you can do:

# first get the list of columns you want to add
col_features = team_stats_1970_2017.columns[2:]
game_df[col_features] = df_visitor[col_features] - df_home[col_features]

Finally, to add the result column, you can use np.where.
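
The np.where step could be sketched as follows. The column name `result` and its encoding (1 when the home team scored more runs, 0 otherwise) are assumptions made for illustration, not something stated in the answer. The sample rows are copied from game_df in the question.

```python
import numpy as np
import pandas as pd

# Sample rows copied from game_df in the question.
game_df = pd.DataFrame({
    "season_yr": [2017] * 7,
    "home_team": ["ARI", "ARI", "ARI", "ARI", "ARI", "ARI", "ATL"],
    "visitor_team": ["SFG", "SFG", "SFG", "SFG", "CLE", "CLE", "LAD"],
    "home_team_runs": [6, 4, 8, 9, 7, 11, 2],
    "visitor_team_runs": [5, 8, 6, 3, 3, 2, 3],
})

# Assumed encoding: 1 if the home team won, 0 otherwise.
# np.where evaluates the condition for the whole column at once,
# so no row-by-row loop is needed.
game_df["result"] = np.where(
    game_df["home_team_runs"] > game_df["visitor_team_runs"], 1, 0
)
```

If a different label is wanted (for example strings such as 'home' and 'visitor'), only the last two arguments of np.where need to change.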

Answer 1 (score: 0)

If I understand correctly, and if you can get team_stats_1970_2017 as a pandas dataframe, you can apply two merges: one on home_team and season_yr, and one on visitor_team and season_yr:

merged_df = (game_df.merge(team_stats_1970_2017,
                           left_on=['home_team', 'season_yr'],
                           right_on=['team', 'season_yr'])
             .merge(team_stats_1970_2017, left_on=['visitor_team', 'season_yr'],
                    right_on=['team', 'season_yr'],
                    suffixes=['_home', '_visitor'])
             .drop(['team_visitor', 'team_home'], axis=1))

>>> merged_df
   season_yr home_team visitor_team  home_team_runs  visitor_team_runs  \
0       2017       ARI          SFG               6                  5   
1       2017       ARI          SFG               4                  8   
2       2017       ARI          SFG               8                  6   
3       2017       ARI          SFG               9                  3   
4       2017       ARI          CLE               7                  3   
5       2017       ARI          CLE              11                  2   
6       2017       ATL          LAD               2                  3   

   r_per_g_home  pa_home  ab_home  b_r_home  b_h_home      ...       b3_home  \
0          5.01   6224.0     5525       812      1405      ...            39   
1          5.01   6224.0     5525       812      1405      ...            39   
2          5.01   6224.0     5525       812      1405      ...            39   
3          5.01   6224.0     5525       812      1405      ...            39   
4          5.01   6224.0     5525       812      1405      ...            39   
5          5.01   6224.0     5525       812      1405      ...            39   
6          4.52   6216.0     5584       732      1467      ...            26   

   b_hr_home  r_per_g_visitor  pa_visitor  ab_visitor  b_r_visitor  \
0        220             3.94      6137.0        5551          639   
1        220             3.94      6137.0        5551          639   
2        220             3.94      6137.0        5551          639   
3        220             3.94      6137.0        5551          639   
4        220             5.05      6234.0        5511          818   
5        220             5.05      6234.0        5511          818   
6        165             4.75      6191.0        5408          770   

   b_h_visitor  b2_visitor  b3_visitor  b_hr_visitor  
0         1382         290          28           128  
1         1382         290          28           128  
2         1382         290          28           128  
3         1382         290          28           128  
4         1449         333          29           212  
5         1449         333          29           212  
6         1347         312          20           221  

[7 rows x 21 columns]
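
Both merges assume team_stats_1970_2017 is already available as a DataFrame. One possible way to load it is pd.read_sql, sketched below. An in-memory SQLite database stands in for Postgres so that the example runs anywhere, and the three-column table is a trimmed-down assumption. For Postgres you would pass your own connection, for instance a SQLAlchemy engine, instead of `conn`.

```python
import sqlite3

import pandas as pd

# Stand-in database (assumption): SQLite in memory instead of Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE team_stats_1970_2017 "
    "(team TEXT, season_yr INTEGER, r_per_g REAL)"
)
conn.executemany(
    "INSERT INTO team_stats_1970_2017 VALUES (?, ?, ?)",
    [("ARI", 2017, 5.01), ("SFG", 2017, 3.94)],
)

# A single query loads the whole table into a DataFrame, which can then
# be merged with game_df without touching the database again.
team_stats_1970_2017 = pd.read_sql(
    "SELECT * FROM team_stats_1970_2017 ORDER BY team", conn
)
conn.close()
```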

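You can then use this merged_df to compute your features. For example (since you seem to want your features as np.arrays), to compute the difference between pa_home and pa_visitor (this is just a dummy example):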

>>> (merged_df['pa_home'] - merged_df['pa_visitor']).values
array([ 87.,  87.,  87.,  87., -10., -10.,  25.])
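
The same idea extends to every stat column at once. The following is a minimal sketch, not part of the original answer: it uses two games and two stat columns taken from the tables above, and the `_diff` column names and the `features` and `diff` variables are hypothetical.

```python
import pandas as pd

# Two games and two stat columns taken from the tables above.
game_df = pd.DataFrame({
    "season_yr": [2017, 2017],
    "home_team": ["ARI", "ATL"],
    "visitor_team": ["SFG", "LAD"],
})
team_stats_1970_2017 = pd.DataFrame({
    "team": ["ARI", "ATL", "LAD", "SFG"],
    "season_yr": [2017, 2017, 2017, 2017],
    "pa": [6224.0, 6216.0, 6191.0, 6137.0],
    "b_hr": [220, 165, 221, 128],
})

# Same two merges as in the answer: home stats first, then visitor stats.
merged_df = (
    game_df
    .merge(team_stats_1970_2017, left_on=["home_team", "season_yr"],
           right_on=["team", "season_yr"])
    .merge(team_stats_1970_2017, left_on=["visitor_team", "season_yr"],
           right_on=["team", "season_yr"], suffixes=("_home", "_visitor"))
    .drop(["team_home", "team_visitor"], axis=1)
)

# Every stat column except the join keys.
features = team_stats_1970_2017.columns.drop(["team", "season_yr"])

# Subtract the visitor block from the home block in one vectorized step.
diff = pd.DataFrame(
    merged_df[[f + "_home" for f in features]].to_numpy()
    - merged_df[[f + "_visitor" for f in features]].to_numpy(),
    columns=[f + "_diff" for f in features],
    index=merged_df.index,
)
```

Calling `diff.to_numpy()` gives the features as a single array, with one row per game and one column per stat.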