Question

I have a dataset with two columns:

Index            Text
 1               *some text* address13/b srs mall, indirapuram,sann-444000 *some text*
 2               *some text*   
 3               *some text* contactus 12J 1st floor, jajan,totl-996633 *some text*
 4               ..........
 5               ........

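I want a DataFrame with a new column named "location" that holds only the part of the "Text" column that follows the keyword "address" or "contactus", up to and including the 6-digit number, and "NA" where the string does not match. The output I want looks like this: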

Index                location
1                 13/b srs mall, indirapuram,sann-444000
2                 NA
3                 12J 1st floor, jajan,totl-996633
4                 NA

Answer 1

Use str.extract:

df['location'] = df.Text.str.extract(r'(?:address|contactus)(.*?\d{6})', expand=False)
df.drop(columns='Text')

   Index                                location
0      1  13/b srs mall, indirapuram,sann-444000
1      2                                     NaN
2      3        12J 1st floor, jajan,totl-996633
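For anyone who wants to reproduce this end to end, here is a minimal, self-contained sketch. The sample frame is rebuilt from the three complete rows in the question. The `\s*` after the keyword group is an addition of mine, not part of the answer above; it is there so that the space following "contactus" does not end up in the extracted value.

```python
import pandas as pd

# Sample data rebuilt from the rows shown in the question
df = pd.DataFrame({
    'Index': [1, 2, 3],
    'Text': [
        '*some text* address13/b srs mall, indirapuram,sann-444000 *some text*',
        '*some text*',
        '*some text* contactus 12J 1st floor, jajan,totl-996633 *some text*',
    ],
})

# \s* (an addition, not in the answer above) skips whitespace after the keyword
pattern = r'(?:address|contactus)\s*(.*?\d{6})'
df['location'] = df['Text'].str.extract(pattern, expand=False)

# Rows without a match are left as NaN
result = df.drop(columns='Text')
print(result)
```

If the literal string "NA" is needed rather than NaN, something like result['location'].fillna('NA') should do it.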

As a helpful aside, when you have several keywords to check, put them in a list and combine them with str.join:

terms = ['address', 'contactus', ...]

# {{6}} is doubled so that str.format leaves the regex quantifier {6} intact
df['location'] = df.Text.str\
         .extract(r'(?:{})(.*?\d{{6}})'.format('|'.join(terms)), expand=False)
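One caveat when building the pattern from a list: a keyword that contains regex metacharacters (a dot, a plus sign, and so on) would be interpreted as regex syntax. A rough sketch of one way to guard against that follows; `build_pattern` is a hypothetical helper name used only for illustration, and it relies on nothing beyond the standard `re` module.

```python
import re

def build_pattern(terms):
    """Return a regex capturing the text after any keyword, up to six digits.

    build_pattern is a hypothetical helper, shown for illustration only.
    """
    # re.escape keeps characters such as '.' or '+' in a keyword literal
    alternatives = '|'.join(re.escape(t) for t in terms)
    # Plain concatenation avoids any clash between str.format and {6}
    return r'(?:' + alternatives + r')(.*?\d{6})'

pattern = build_pattern(['address', 'contactus'])
match = re.search(pattern, 'x address13/b srs mall, indirapuram,sann-444000 y')
print(pattern)
print(match.group(1))
```

The resulting string can be passed to df.Text.str.extract(pattern, expand=False) in the same way as the hand-written pattern.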

Regex details

(?:        # non-capturing group
address    # "address" 
|          # regex OR
contactus  # "contactus"
)  
(.*?       # non-greedy match-all
\d{6}      # 6 digit zipcode
)
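The commented breakdown above is close to what Python's `re.VERBOSE` flag accepts directly, so the annotated form can be kept in the code itself. A small sketch using the standard `re` module (the name `LOCATION_RE` is my own choice, not something from the answer):

```python
import re

# In VERBOSE mode, whitespace in the pattern is ignored and # starts a comment
LOCATION_RE = re.compile(r'''
    (?:            # non-capturing group
        address    # "address"
      |            # regex OR
        contactus  # "contactus"
    )
    (.*?           # non-greedy match-all
     \d{6}         # six digits
    )
''', re.VERBOSE)

m = LOCATION_RE.search(
    '*some text* contactus 12J 1st floor, jajan,totl-996633 *some text*'
)
# The captured text keeps the space after "contactus", hence the strip()
print(m.group(1).strip())
```

With pandas, the same multi-line pattern string should work through the `flags` argument of str.extract, for example flags=re.VERBOSE.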

Keyword-based text extraction in pandas
