Question

I have a large pandas DataFrame (df_orig) and several lookup tables (also DataFrames), one for each segment in df_orig.

Here is a small subset of df_orig:

segment score1 score2 
 B3         0   700
 B1         0   120
 B1       400   950
 B1       100   220
 B1       200   320
 B1       650   340
 B5       300   400
 B5         0   320
 B1         0   240
 B1       100   360
 B1       940   700
 B3       100   340

Here is the complete lookup table for segment B5, called thresholds_b5 (there is one lookup table for every segment in the large dataset):

score1 score2   
990     220
980     280
970     200
960     260
950     260
940     200
930     240
920     220
910     220
900     220
850     120
800     220
750     220
700     120
650     200
600     220
550     220
500     240
400     240
300     260
200     300
100     320
  0     400

I want to create a new column in my large dataset that follows logic similar to this SQL:

case when segment = 'B5' then
   case when score1 = 990 and score2 >= 220 then 1
        when score1 = 980 and score2 >= 280 then 1
   .
   .
   .
   else 0 end
when segment = 'B1' then
.
.
.
else 0 end as indicator

I was able to get the correct output using a loop based on the solution from this question:

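# Copy the B5 rows so the assignments below do not write into a view of df_orig
df_b5 = df_orig[df_orig['segment'] == 'B5'].copy()
df_b5['indicator'] = 0

# Walk the threshold rows; enumerate() on a DataFrame would yield column names instead
for value1, value2 in thresholds_b5.itertuples(index=False):

    df_b5.loc[(df_b5['score1'] == value1) & (df_b5['score2'] >= value2), 'indicator'] = 1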

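However, I would need another loop to run this for each segment and then append all of the resulting DataFrames back together, which is a bit messy. Furthermore, while I only have three segments (B1, B3, B5) for now, I am going to have 20+ segments in the future.

Is there a way to do this more succinctly, preferably without loops? I have been warned that loops over DataFrames tend to be slow, and given the size of my dataset I think speed will matter.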

Answer 1

If you can sort the DataFrames ahead of time, you can replace your loop example with the new asof join in pandas 0.19:

import pandas as pd

# query (copy so later changes do not touch a view of df_orig)
df_b5 = df_orig.query('segment == "B5"').copy()

# sort ahead of time; merge_asof requires the `on` key to be sorted
df_b5 = df_b5.sort_values('score2')
thresholds_b5 = thresholds_b5.sort_values('score2')

# set the default indicator as 1
thresholds_b5['indicator'] = 1

# join the tables
df = pd.merge_asof(df_b5, thresholds_b5, on='score2', by='score1')

# fill missing indicators as 0
df['indicator'] = df['indicator'].fillna(0).astype('int64')

This is what I get:

  segment  score1  score2  indicator
0      B5       0     320          0
1      B5     300     400          1
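
For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It rebuilds only the two B5 rows and four of the rows of thresholds_b5 shown in the question; the names left, right and result are just illustrative. By default merge_asof matches backward, so each row of the left frame picks up the last threshold row whose score2 is less than or equal to its own, within the same score1. That is what makes the join equivalent to the "score1 = x and score2 >= y" test.

```python
import pandas as pd

# The two B5 rows from the question's sample of df_orig.
df_b5 = pd.DataFrame({
    "segment": ["B5", "B5"],
    "score1": [300, 0],
    "score2": [400, 320],
})

# A few rows copied from thresholds_b5 (the full table has 23 rows).
thresholds_b5 = pd.DataFrame({
    "score1": [300, 200, 100, 0],
    "score2": [260, 300, 320, 400],
})

# merge_asof needs both frames sorted by the `on` key.
left = df_b5.sort_values("score2")
right = thresholds_b5.sort_values("score2").assign(indicator=1)

# Default direction is "backward": match the last threshold row with
# right.score2 <= left.score2, restricted to rows with equal score1.
result = pd.merge_asof(left, right, on="score2", by="score1")

# Rows with no qualifying threshold get NaN; turn that into 0.
result["indicator"] = result["indicator"].fillna(0).astype("int64")
print(result)
```

The row with score1 = 0 and score2 = 320 finds no threshold at or below 320 (its threshold is 400), so it ends up with indicator 0, while the row with score1 = 300 clears its threshold of 260.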

If you need the original order, save the index in a new column of df_orig and then re-sort the final DataFrame by that column.
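
A rough sketch of that re-sorting step, assuming a helper column called orig_order (a made-up name, not something pandas provides). As far as I know the merged frame comes back with a fresh integer index, which is why the position is stored in a column rather than left in the index. Only two threshold rows from the question are used here to keep it short.

```python
import pandas as pd

df_b5 = pd.DataFrame({
    "segment": ["B5", "B5"],
    "score1": [300, 0],
    "score2": [400, 320],
})
# Two rows taken from thresholds_b5, with the indicator already set to 1.
thresholds_b5 = pd.DataFrame({
    "score1": [300, 0],
    "score2": [260, 400],
    "indicator": 1,
})

# Save the original row position in a column (orig_order is a made-up name).
df_b5["orig_order"] = range(len(df_b5))

merged = pd.merge_asof(
    df_b5.sort_values("score2"),
    thresholds_b5.sort_values("score2"),
    on="score2",
    by="score1",
)
merged["indicator"] = merged["indicator"].fillna(0).astype("int64")

# Re-sort by the saved position and drop the helper column.
restored = (
    merged.sort_values("orig_order")
    .drop(columns="orig_order")
    .reset_index(drop=True)
)
print(restored)
```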

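pandas 0.19.2 added multiple by parameters, so you can concat all of your thresholds into one DataFrame, with a segment column set for each one, and then (with both frames sorted by score2) call: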

pd.merge_asof(df_orig, thresholds, on='score2', by=['segment', 'score1'])
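
Here is one way the combined table might be built, as a sketch under a few assumptions: the per-segment tables live in a dict called lookup_tables (a hypothetical name), the B5 rows are taken from the question, and the single B1 row is an invented placeholder because the question does not show that table. Segments without any lookup table, such as B3 here, simply fall through to 0.

```python
import pandas as pd

# A few rows of df_orig from the question.
df_orig = pd.DataFrame({
    "segment": ["B3", "B1", "B5", "B5"],
    "score1": [0, 0, 300, 0],
    "score2": [700, 120, 400, 320],
})

# One lookup table per segment. The B5 rows come from the question;
# the B1 row is an invented placeholder for illustration only.
lookup_tables = {
    "B5": pd.DataFrame({"score1": [300, 0], "score2": [260, 400]}),
    "B1": pd.DataFrame({"score1": [0], "score2": [100]}),
}

# Stack the tables, tagging each row with the segment it belongs to.
thresholds = pd.concat(
    [table.assign(segment=name) for name, table in lookup_tables.items()],
    ignore_index=True,
)
thresholds["indicator"] = 1

# Both frames must be sorted by the `on` key before merge_asof.
result = pd.merge_asof(
    df_orig.sort_values("score2"),
    thresholds.sort_values("score2"),
    on="score2",
    by=["segment", "score1"],
)
result["indicator"] = result["indicator"].fillna(0).astype("int64")
print(result)
```

This keeps everything in a single join, so adding a new segment only means adding another entry to the dict rather than another loop.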
