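I have two ways of doing the same thing and would like to know which one is more efficient.
The first approach loads a list from a text file or an array and uses that list to flag rows in the DataFrame: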
import pandas as pd
ban_list = ['Al Gore', 'Kim jong-un','Kim jong un','Kim Jong Un', 'Al Sharpton','Kim jong il', 'Richard Johnson', 'Dick Johnson']
df=pd.DataFrame({'Users': [ 'Al Gore', 'Kim jong il', 'Kim jong un', 'Al Sharpton', 'James', 'Richard Johnson', 'Bill Gates', 'Alf pig', 'Dick Johnson', 'Python Monte'],
'Time': ['D','D','N','D','L','N', 'N','L','L','N']})
df['Banned'] = ''
for name in ban_list:
    df.loc[df.Users.str.contains(name) & (df.Banned == ''), 'Banned'] = 'Yes'
The second approach uses regular expressions instead of a list of names:
import pandas as pd
ban_list = [r'(?i)^Al(\s)(Gore|Sharpton)$', r'(?i)^Kim\sjong(\s|-)(il|un)$', r'(?i)^(Dick|Richard)\sJohnson$']
df=pd.DataFrame({'Users': [ 'Al Gore', 'Kim jong il', 'Kim jong un', 'Al Sharpton', 'James', 'Richard Johnson', 'Bill Gates', 'Alf pig', 'Dick Johnson', 'Python Monte'],
'Time': ['D','D','N','D','L','N', 'N','L','L','N']})
df['Banned'] = ''
for pattern in ban_list:
    df.loc[df.Users.str.contains(pattern) & (df.Banned == ''), 'Banned'] = 'Yes'
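Both pieces of code work and do the same thing. The issues so far are that the first one is not case-insensitive, and the second one gives this warning: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
The first approach loads a large list into an array, while the second relies on regular expressions with several alternatives each. Which one should I use for efficiency? Or is there another way to improve this?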
Answer 0 (score: 1)
It seems a bit cleaner (to me at least) to use isin, since you already have a nice list of banned users (you can then map True/False to 'Yes'/''):
df['Banned'] = df.Users.isin(ban_list).map({True:'Yes',False:''})
print(df)
  Time            Users Banned
0    D          Al Gore    Yes
1    D      Kim jong il    Yes
2    N      Kim jong un    Yes
3    D      Al Sharpton    Yes
4    L            James
5    N  Richard Johnson    Yes
6    N       Bill Gates
7    L          Alf pig
8    L     Dick Johnson    Yes
9    N     Python Monte
Of course, if True/False is good enough, you can use just the first part of the command:
df['Banned'] = df.Users.isin(ban_list)
print(df)
  Time            Users Banned
0    D          Al Gore   True
1    D      Kim jong il   True
2    N      Kim jong un   True
3    D      Al Sharpton   True
4    L            James  False
5    N  Richard Johnson   True
6    N       Bill Gates  False
7    L          Alf pig  False
8    L     Dick Johnson   True
9    N     Python Monte  False
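Note that isin does an exact, case-sensitive comparison, which is why the name list in the question has to spell out variants such as 'Kim jong un' and 'Kim Jong Un'. One possible way around that, sketched here with a shortened ban list and made-up user names (both are assumptions for illustration, not data from the question), is to lower-case both sides before the lookup:

```python
import pandas as pd

# Shortened, illustrative ban list; one spelling per person is enough here.
ban_list = ['Al Gore', 'Kim Jong Un', 'Dick Johnson']
df = pd.DataFrame({'Users': ['al gore', 'KIM JONG UN', 'James', 'Dick Johnson']})

# Lower-case the ban list once, then lower-case the column for the lookup,
# so the exact-match test in isin ignores differences in case.
banned_lower = [name.lower() for name in ban_list]
df['Banned'] = df['Users'].str.lower().isin(banned_lower)

print(df)
```

This keeps the single vectorized isin call and only adds one string operation on the column, so it should stay cheap even with a long ban list.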
Edit: if you have a second list, I would do it as follows:
Adminlist = ['Bill Gates']
df['Banned'] = (df.Users.isin(ban_list).map({True:'Yes',False:''}) +
                df.Users.isin(Adminlist).map({True:'Admin',False:''}))
print(df)
  Time            Users Banned
0    D          Al Gore    Yes
1    D      Kim jong il    Yes
2    N      Kim jong un    Yes
3    D      Al Sharpton    Yes
4    L            James
5    N  Richard Johnson    Yes
6    N       Bill Gates  Admin
7    L          Alf pig
8    L     Dick Johnson    Yes
9    N     Python Monte
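If you would rather keep the regular-expression version from the question, the loop can probably be replaced by a single str.contains call. The sketch below is one way to do it, not the only one: it joins the alternatives into one anchored pattern, uses non-capturing groups (?:...) so that pandas has no match groups to warn about, and passes case=False instead of the inline (?i) flag. The names patterns and combined, and the sample user names, are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'Users': ['Al Gore', 'Kim jong il', 'kim JONG-un', 'James',
                             'Richard Johnson', 'Alf pig', 'Dick Johnson']})

# One alternative per banned person; (?:...) groups without capturing.
patterns = [
    r'Al\s(?:Gore|Sharpton)',
    r'Kim\sjong[\s-](?:il|un)',
    r'(?:Dick|Richard)\sJohnson',
]

# Anchor the whole alternation so only full-name matches count.
combined = '^(?:' + '|'.join(patterns) + ')$'

# A single vectorized pass over the column; case=False ignores case.
df['Banned'] = df['Users'].str.contains(combined, case=False, regex=True)

print(df)
```

Whether this beats isin on a large ban list is something to measure on your own data; a plain isin lookup on normalized names is usually the simpler choice when exact names are known.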