Counting occurrences of one column's strings in another column

Time: 2018-07-16 13:08:13

Tags: python pandas

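I am working with two data frames: one contains a list of players, and the other contains play-by-play data for those players. The relevant columns of the two data frames look like this.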

0          Matt Carpenter
1           Jason Heyward
2           Peter Bourjos
3           Matt Holliday
4          Jhonny Peralta
5              Matt Adams
...
Name: Name, dtype: object


0     Matt Carpenter grounded out to second (Grounder).
1               Jason Heyward doubled to right (Liner).
2     Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object

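What I want to do is create a column in the first data frame that counts how many times the string df['Name'] + ' scored' occurs in the Play column of the other data frame. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", and so on. I know you can use str.contains for this kind of thing, but it only seems to work if I pass an explicit string. For example,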

batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)

works fine, but if I try

batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)

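it returns the error "'Series' objects are mutable, thus they cannot be hashed". I have looked at various similar questions but cannot for the life of me find a solution. Thanks for your help!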

1 answer:

Answer 0 (score: 2)

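I think you need findall with a regex built by joining all the Name values, then MultiLabelBinarizer to create indicator columns, and finally reindex to add any columns that are missing: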

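import re
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Build one alternation of "<Name> scored". Escape the names and group them
# so the \b anchors apply to every alternative, not just the first and last.
s = df1['Name'] + ' scored'
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, s)))

# findall returns the matched strings for each play; binarize them into indicator columns.
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
                  columns=mlb.classes_,
                  index=df2.index).reindex(columns=s, fill_value=0)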
print (df)
Name  Matt Carpenter scored  Jason Heyward scored  Peter Bourjos scored  \
0                         0                     0                     0   
1                         0                     0                     0   
2                         0                     1                     0   

Name  Matt Holliday scored  Jhonny Peralta scored  Matt Adams scored  
0                        0                      0                  0  
1                        0                      0                  0  
2                        0                      0                  0  
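
The indicator frame has one row per play, so summing each column gives the per-player total that the question asks for. The following is a minimal sketch, assuming only the sample rows shown in the question. It rebuilds the indicators with pandas' explode and get_dummies in place of MultiLabelBinarizer so that it runs without scikit-learn, and it reuses the column name 'R vs SP' from the question:

```python
import re
import pandas as pd

# Sample data copied from the rows shown in the question (assumed complete here).
df1 = pd.DataFrame({'Name': ['Matt Carpenter', 'Jason Heyward', 'Peter Bourjos',
                             'Matt Holliday', 'Jhonny Peralta', 'Matt Adams']})
df2 = pd.DataFrame({'Play': [
    'Matt Carpenter grounded out to second (Grounder).',
    'Jason Heyward doubled to right (Liner).',
    'Matt Holliday singled to right (Liner). Jason Heyward scored.',
]})

s = df1['Name'] + ' scored'
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, s)))

# One list of matched strings per play; explode gives one row per match
# (plays without a match become NaN, which get_dummies ignores).
matches = df2['Play'].str.findall(pat).explode()

# Collapse back to one row per play, then add the players who never matched.
indicators = (pd.get_dummies(matches)
                .groupby(level=0).sum()
                .reindex(columns=s, fill_value=0))

# Column sums are the totals; they follow the order of df1['Name'].
df1['R vs SP'] = indicators.sum().to_numpy()
print(df1)
```

This positional assignment relies on the names in df1 being unique; with duplicate names, mapping the totals by label would be the safer choice.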

Finally, if necessary, join the result back to the play data (df2 in the code below):

df = df2.join(df)
print (df)
                                                Play  Matt Carpenter scored  \
0  Matt Carpenter grounded out to second (Grounder).                      0   
1            Jason Heyward doubled to right (Liner).                      0   
2  Matt Holliday singled to right (Liner). Jason ...                      0   

   Jason Heyward scored  Peter Bourjos scored  Matt Holliday scored  \
0                     0                     0                     0   
1                     0                     0                     0   
2                     1                     0                     0   

   Jhonny Peralta scored  Matt Adams scored  
0                      0                  0  
1                      0                  0  
2                      0                  0
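
As for the error in the question: str.contains expects a single pattern string, so passing a whole Series is most likely what triggers the "cannot be hashed" message. A simpler, if slower, sketch is to apply the question's own per-name test to each player in turn. The helper name count_scored is hypothetical, and the data is again just the sample rows from the question:

```python
import pandas as pd

# Sample data copied from the rows shown in the question (assumed complete here).
df1 = pd.DataFrame({'Name': ['Matt Carpenter', 'Jason Heyward', 'Peter Bourjos',
                             'Matt Holliday', 'Jhonny Peralta', 'Matt Adams']})
df2 = pd.DataFrame({'Play': [
    'Matt Carpenter grounded out to second (Grounder).',
    'Jason Heyward doubled to right (Liner).',
    'Matt Holliday singled to right (Liner). Jason Heyward scored.',
]})

def count_scored(name: str) -> int:
    """Count the plays whose text contains '<name> scored' (hypothetical helper)."""
    # regex=False matches the text literally, so names containing regex
    # metacharacters need no escaping.
    return int(df2['Play'].str.contains(name + ' scored', regex=False).sum())

# One scan of the Play column per player.
df1['R vs SP'] = df1['Name'].apply(count_scored)
print(df1)
```

Like the len(...) expression in the question, this counts plays that contain the phrase rather than individual occurrences, and it does a plain substring match with no word boundaries. It scans the Play column once per player, so the single-regex approach is likely to scale better for long player lists.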