Python 3 pandas: flagging data in a DataFrame using strings vs. regex

Time: 2014-05-07 23:48:57

Tags: python regex pandas

I have two ways of doing the same thing and would like to know which one is more efficient:

The first way loads a list of names from a text file or array and uses that list to flag rows in the DataFrame:

import pandas as pd

ban_list = ['Al Gore', 'Kim jong-un','Kim jong un','Kim Jong Un', 'Al Sharpton','Kim jong il', 'Richard Johnson', 'Dick Johnson']

df=pd.DataFrame({'Users': [ 'Al Gore', 'Kim jong il', 'Kim jong un', 'Al Sharpton', 'James', 'Richard Johnson', 'Bill Gates', 'Alf pig', 'Dick Johnson', 'Python Monte'],
                 'Time': ['D','D','N','D','L','N', 'N','L','L','N']})

df['Banned'] = ''


# Flag every row whose user name contains one of the banned names.
for name in ban_list:
    df.loc[df.Users.str.contains(name) & (df.Banned == ''), 'Banned'] = 'Yes'

The second way uses regular expressions instead of a list of names:

import pandas as pd

# Raw strings keep \s intact; the inline (?i) flag has to lead the pattern on Python 3.11+.
ban_list = [r'(?i)^Al(\s)(Gore|Sharpton)$', r'(?i)^Kim\sjong(\s|-)(il|un)$', r'(?i)^(Dick|Richard)\sJohnson$']

df=pd.DataFrame({'Users': [ 'Al Gore', 'Kim jong il', 'Kim jong un', 'Al Sharpton', 'James', 'Richard Johnson', 'Bill Gates', 'Alf pig', 'Dick Johnson', 'Python Monte'],
                 'Time': ['D','D','N','D','L','N', 'N','L','L','N']})

df['Banned'] = ''


# Flag every row whose user name matches one of the banned patterns.
for pattern in ban_list:
    df.loc[df.Users.str.contains(pattern) & (df.Banned == ''), 'Banned'] = 'Yes'

Both pieces of code work and do the same thing. The problems so far: the first one is not case-insensitive, and the second one emits the warning "UserWarning: This pattern has match groups. To actually get the groups, use str.extract."

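The first way loads a large list into an array; the second way uses regular expressions with several steps. Which one should I use for efficiency? Or is there another way to improve this?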

1 Answer:

Answer 0: (score: 1)

It seems a bit cleaner (to me, at least) to use isin, since you already have a nice list of banned users. You can then map True/False to 'Yes'/'':

df['Banned'] = df.Users.isin(ban_list).map({True:'Yes',False:''})
print(df)

  Time            Users Banned
0    D          Al Gore    Yes
1    D      Kim jong il    Yes
2    N      Kim jong un    Yes
3    D      Al Sharpton    Yes
4    L            James       
5    N  Richard Johnson    Yes
6    N       Bill Gates       
7    L          Alf pig       
8    L     Dick Johnson    Yes
9    N     Python Monte       
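
Note that isin compares exact strings, so it is still case-sensitive, which was one of the concerns in the question. One possible workaround, sketched here under the assumption that lower-casing both sides is acceptable for your data, is to normalize the names before the lookup. The shortened ban_list and the sample user names below are made up for illustration:

```python
import pandas as pd

# Illustrative data only: a shortened ban list and user names with mixed case.
ban_list = ['Al Gore', 'Kim Jong Un', 'Dick Johnson']
df = pd.DataFrame({'Users': ['al gore', 'Kim jong un', 'James', 'DICK JOHNSON']})

# Lower-case both the ban list and the column so the comparison ignores case.
banned_lower = [name.lower() for name in ban_list]
df['Banned'] = df.Users.str.lower().isin(banned_lower).map({True: 'Yes', False: ''})
print(df)
```

This keeps the single vectorized isin call and only adds one str.lower() pass over the column.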

Of course, if True/False is good enough, you can just use the first part of the command:

df['Banned'] = df.Users.isin(ban_list)
print(df)

  Time            Users Banned
0    D          Al Gore   True
1    D      Kim jong il   True
2    N      Kim jong un   True
3    D      Al Sharpton   True
4    L            James  False
5    N  Richard Johnson   True
6    N       Bill Gates  False
7    L          Alf pig  False
8    L     Dick Johnson   True
9    N     Python Monte  False
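
If pattern matching is still needed (for example, to accept both "Kim jong un" and "Kim jong-un" in any case), a boolean column can also come from one combined regex rather than a loop. The following is only a sketch, not the asker's exact patterns: the variable names patterns and combined are made up here, non-capturing groups (?:...) are swapped in so that pandas does not emit the "match groups" UserWarning, and case=False replaces the inline (?i) flag:

```python
import pandas as pd

df = pd.DataFrame({'Users': ['Al Gore', 'Kim jong il', 'Kim jong un', 'Al Sharpton', 'James',
                             'Richard Johnson', 'Bill Gates', 'Alf pig', 'Dick Johnson', 'Python Monte'],
                   'Time': ['D', 'D', 'N', 'D', 'L', 'N', 'N', 'L', 'L', 'N']})

# Non-capturing groups (?:...) only group alternatives; they create no match groups.
patterns = [r'Al\s(?:Gore|Sharpton)',
            r'Kim\sjong[\s-](?:il|un)',
            r'(?:Dick|Richard)\sJohnson']

# Join everything into one anchored alternation so the column is scanned once.
combined = '^(?:' + '|'.join(patterns) + ')$'

# case=False makes the match case-insensitive without an inline (?i) flag.
df['Banned'] = df.Users.str.contains(combined, case=False, regex=True)
print(df)
```

Whether this is faster than isin depends on the size of the list and of the DataFrame, so it is worth timing both on your own data before choosing.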

Edit: if you have a second list, I would do it as follows:

Adminlist = ['Bill Gates']
df['Banned'] = (df.Users.isin(ban_list).map({True:'Yes',False:''}) +
                df.Users.isin(Adminlist).map({True:'Admin',False:''}))
print(df)

  Time            Users Banned
0    D          Al Gore    Yes
1    D      Kim jong il    Yes
2    N      Kim jong un    Yes
3    D      Al Sharpton    Yes
4    L            James       
5    N  Richard Johnson    Yes
6    N       Bill Gates  Admin
7    L          Alf pig       
8    L     Dick Johnson    Yes
9    N     Python Monte
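
With more than two lists, chaining isin(...).map(...) terms can get long. An alternative sketch, assuming each user should end up with exactly one label, is to build a single name-to-label dictionary and map the column through it. The names admin_list and labels are made up for this example. Unlike the string concatenation above, a user who appears in both lists would get only the last label assigned instead of 'YesAdmin':

```python
import pandas as pd

ban_list = ['Al Gore', 'Kim jong-un', 'Kim jong un', 'Kim Jong Un', 'Al Sharpton',
            'Kim jong il', 'Richard Johnson', 'Dick Johnson']
admin_list = ['Bill Gates']

df = pd.DataFrame({'Users': ['Al Gore', 'Kim jong il', 'Kim jong un', 'Al Sharpton', 'James',
                             'Richard Johnson', 'Bill Gates', 'Alf pig', 'Dick Johnson', 'Python Monte'],
                   'Time': ['D', 'D', 'N', 'D', 'L', 'N', 'N', 'L', 'L', 'N']})

# One lookup table: name -> label. Later updates win if a name is in both lists.
labels = {name: 'Yes' for name in ban_list}
labels.update({name: 'Admin' for name in admin_list})

# Names missing from the dictionary map to NaN, which is then replaced by ''.
df['Banned'] = df.Users.map(labels).fillna('')
print(df)
```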