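I am working with two dataframes: one contains a list of players, and the other contains play-by-play data for the players in the first dataframe. Portions of the rows of interest in these two dataframes are shown below.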
0 Matt Carpenter
1 Jason Heyward
2 Peter Bourjos
3 Matt Holliday
4 Jhonny Peralta
5 Matt Adams
...
Name: Name, dtype: object
0 Matt Carpenter grounded out to second (Grounder).
1 Jason Heyward doubled to right (Liner).
2 Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object
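What I would like to do is create a column in the first dataframe that counts the number of times the string (df['Name'] + ' scored') occurs in the Play column of the other dataframe. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", and so on. I know you can use str.contains for this kind of thing, but it only seems to work if you pass in an explicit string. For example,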
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)
works fine, but if I try
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)
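it returns the error "'Series' objects are mutable, thus they cannot be hashed". I have looked at various similar questions, but cannot for the life of me find a solution to this problem. Thank you for your help!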
Answer 0: (score: 2)
I think you need findall with a regex built by joining all values of Name, then create indicator columns with MultiLabelBinarizer and add all missing columns with reindex:
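import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# One alternation of all "<Name> scored" phrases; the (?:...) group applies \b to every alternative
s = df1['Name'] + ' scored'
pat = r'\b(?:{})\b'.format('|'.join(s))

mlb = MultiLabelBinarizer()
# findall returns a list of matched phrases per play; binarize those lists into indicator columns
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
                  columns=mlb.classes_,
                  index=df2.index).reindex(columns=s, fill_value=0)
print (df)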
Name  Matt Carpenter scored  Jason Heyward scored  Peter Bourjos scored  \
0                         0                     0                     0
1                         0                     0                     0
2                         0                     1                     0

Name  Matt Holliday scored  Jhonny Peralta scored  Matt Adams scored
0                        0                      0                  0
1                        0                      0                  0
2                        0                      0                  0
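The question asks for one count per player rather than one indicator column per player. As a minimal sketch, the same findall matches could be tallied with pandas alone. It assumes the frames look like the sample data (rebuilt here as df1 and df2), uses the 'R vs SP' column name from the question, and adds re.escape as a precaution in case a name contains regex metacharacters:

```python
import re
import pandas as pd

# Sample data rebuilt from the question (assumed shape of the real frames).
df1 = pd.DataFrame({'Name': ['Matt Carpenter', 'Jason Heyward', 'Peter Bourjos',
                             'Matt Holliday', 'Jhonny Peralta', 'Matt Adams']})
df2 = pd.DataFrame({'Play': [
    'Matt Carpenter grounded out to second (Grounder).',
    'Jason Heyward doubled to right (Liner).',
    'Matt Holliday singled to right (Liner). Jason Heyward scored.',
]})

# One regex that matches any "<Name> scored" phrase.
s = df1['Name'] + ' scored'
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, s)))

# One row per matched phrase, then count how often each phrase appears.
counts = df2['Play'].str.findall(pat).explode().value_counts()

# Align the counts with df1; players without a match get 0.
df1['R vs SP'] = s.map(counts).fillna(0).astype(int)
print(df1)
```

With the three sample plays, only Jason Heyward ends up with a non-zero count.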
If necessary, join the indicator columns back to df2 at the end:
df = df2.join(df)
print (df)
                                                Play  Matt Carpenter scored  \
0  Matt Carpenter grounded out to second (Grounder).                      0
1            Jason Heyward doubled to right (Liner).                      0
2  Matt Holliday singled to right (Liner). Jason ...                      0

   Jason Heyward scored  Peter Bourjos scored  Matt Holliday scored  \
0                     0                     0                     0
1                     0                     0                     0
2                     1                     0                     0

   Jhonny Peralta scored  Matt Adams scored
0                      0                  0
1                      0                  0
2                      0                  0
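For comparison, here is a sketch of a more direct alternative that stays close to the str.contains attempt in the question. It is not the method from the answer above: it loops over the names and runs str.count once per player, which avoids scikit-learn but makes one pass over the plays per name. The frames df1 and df2 are rebuilt from the sample data, and the 'R vs SP' column name is taken from the question.

```python
import re
import pandas as pd

# Sample data rebuilt from the question (assumed shape of the real frames).
df1 = pd.DataFrame({'Name': ['Matt Carpenter', 'Jason Heyward', 'Peter Bourjos',
                             'Matt Holliday', 'Jhonny Peralta', 'Matt Adams']})
df2 = pd.DataFrame({'Play': [
    'Matt Carpenter grounded out to second (Grounder).',
    'Jason Heyward doubled to right (Liner).',
    'Matt Holliday singled to right (Liner). Jason Heyward scored.',
]})

# str.count takes a single pattern string, so build one pattern per name
# instead of passing a whole Series (which is what raised the error).
df1['R vs SP'] = [
    df2['Play'].str.count(re.escape(name + ' scored')).sum()
    for name in df1['Name']
]
print(df1)
```

For a handful of players this is likely fast enough; for long rosters, the single combined regex is probably the better fit.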