Extract a substring from a pandas column

Time: 2019-12-11 13:31:46

Tags: python regex pandas

I am trying to extract a substring from a username column, but I am not getting the result I expect. My df is shown below:

import pandas as pd

data = {'Name': ['inf.negem.netmgmt', 'infbe_cdb', 'inf_igh', 'INF_EONLOG', 'inf.dkprime.netmgmt', 'infaus_mgo', 'infau_abr']}
df = pd.DataFrame(data)

print(df)

                  Name
0    inf.negem.netmgmt
1            infbe_cdb
2              inf_igh
3           INF_EONLOG
4  inf.dkprime.netmgmt
5           infaus_mgo
6            infau_abr    

I tried the following code, but I am not getting the expected result:
df['Country'] = df['Name'].str.slice(3,6)

I would like to see output like this:
output  = {'Country':['No_Country', 'be', 'No_Country', 'No_Country','No_Country','aus','au']}
df = pd.DataFrame(output) 

print(df)

      Country
0  No_Country
1          be
2  No_Country
3  No_Country
4  No_Country
5         aus
6          au

Note: I would like to extract the text between 'inf' and '_' as the country and store it in a new column named Country. If nothing follows 'inf' before the '_', the value should be 'No_Country'.
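A quick check on the sample names suggests why the fixed-position slice falls short: the country code varies in length (or is absent), so characters 3 to 5 often include the separator or unrelated letters. This is only an illustrative sketch; the variable names `names` and `sliced` are not from the original post.

```python
import pandas as pd

# Sample names copied from the question
names = pd.Series(['inf.negem.netmgmt', 'infbe_cdb', 'inf_igh', 'INF_EONLOG',
                   'inf.dkprime.netmgmt', 'infaus_mgo', 'infau_abr'], name='Name')

# str.slice(3, 6) always takes the characters at positions 3, 4 and 5,
# no matter where the '_' separator actually sits
sliced = names.str.slice(3, 6)

print(sliced.tolist())
# ['.ne', 'be_', '_ig', '_EO', '.dk', 'aus', 'au_']
```

Only 'infaus_mgo' happens to give the wanted value, because its country code is exactly three characters long.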

2 answers:

Answer 0 (score: 1)

Here is one approach using str.extract:

df['Country'] = (df.Name.str.lower()
                        .str.extract(r'inf(.*?)_')
                        .replace('', float('nan'))
                        .fillna('No_Country'))

print(df)

                  Name     Country
0    inf.negem.netmgmt  No_Country
1            infbe_cdb          be
2              inf_igh  No_Country
3           INF_EONLOG  No_Country
4  inf.dkprime.netmgmt  No_Country
5           infaus_mgo         aus
6            infau_abr          au
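A possible variation on the same idea, sketched under a few assumptions: the capture group `[^_]+` requires at least one non-underscore character, so the empty-string replacement step should not be needed, and `flags=re.IGNORECASE` stands in for the lowercasing. Note that with this flag the captured text keeps its original case, which differs from the code above if a name contains an uppercase country code. The variable name `country` is illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({'Name': ['inf.negem.netmgmt', 'infbe_cdb', 'inf_igh', 'INF_EONLOG',
                            'inf.dkprime.netmgmt', 'infaus_mgo', 'infau_abr']})

# expand=False returns a Series for a single capture group.
# Rows where the pattern does not match come back as NaN.
country = df['Name'].str.extract(r'inf([^_]+)_', flags=re.IGNORECASE, expand=False)

# Replace the missing values with the default label
df['Country'] = country.fillna('No_Country')

print(df)
```

On the sample data this should produce the same Country column as the output above.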

Answer 1 (score: 0)

Using a list comprehension and re.findall:

import re
df['Country'] = ["".join(re.findall(r'inf(.*?)_', i)) for i in df['Name']]


print(df)
                  Name  Country
0    inf.negem.netmgmt
1            infbe_cdb       be
2              inf_igh
3           INF_EONLOG
4  inf.dkprime.netmgmt
5           infaus_mgo      aus
6            infau_abr       au
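The output above leaves empty strings where the question asks for 'No_Country'. One way to fill them in while staying with plain `re` and a list comprehension might look like the sketch below. The helper `country_of` and the constant `PATTERN` are hypothetical names introduced here, and the case-insensitive flag is an assumption based on the 'INF_EONLOG' row.

```python
import re
import pandas as pd

df = pd.DataFrame({'Name': ['inf.negem.netmgmt', 'infbe_cdb', 'inf_igh', 'INF_EONLOG',
                            'inf.dkprime.netmgmt', 'infaus_mgo', 'infau_abr']})

# Same pattern as the answer, compiled once and made case-insensitive
PATTERN = re.compile(r'inf(.*?)_', flags=re.IGNORECASE)

def country_of(name):
    """Return the text between 'inf' and the next '_', or 'No_Country'."""
    match = PATTERN.search(name)
    # group(1) is '' for names like 'inf_igh'; `or` maps '' to the default
    return (match.group(1) if match else '') or 'No_Country'

df['Country'] = [country_of(name) for name in df['Name']]

print(df)
```

Names without any '_' (such as 'inf.negem.netmgmt') do not match at all and therefore also fall back to 'No_Country'.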