I have a dataset with two columns:
Index Text
1 *some text* address13/b srs mall, indirapuram,sann-444000 *some text*
2 *some text*
3 *some text* contactus 12J 1st floor, jajan,totl-996633 *some text*
4 ..........
5 ........
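I want a DataFrame with a new column, "location", containing only the substring of the "Text" column that follows the keyword "address" or "contactus" and runs up to and including the 6-digit number, with "NA" where nothing matches. The output I want looks like this: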
Index location
1 13/b srs mall, indirapuram,sann-444000
2 NA
3 12J 1st floor, jajan,totl-996633
4 NA
Answer 0 (score: 1)
Use str.extract:
df['location'] = df.Text.str.extract(r'(?:address|contactus)(.*?\d{6})', expand=False)
df.drop(columns='Text')
Index location
0 1 13/b srs mall, indirapuram,sann-444000
1 2 NaN
2 3 12J 1st floor, jajan,totl-996633
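For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample rows in the question, and the `\s*` after the keyword group is an addition of mine (not part of the pattern above) so that the space after "contactus" does not end up in the captured value.

```python
import pandas as pd

# Sample rows modeled on the question; '*some text*' is filler.
df = pd.DataFrame({
    'Index': [1, 2, 3],
    'Text': [
        '*some text* address13/b srs mall, indirapuram,sann-444000 *some text*',
        '*some text*',
        '*some text* contactus 12J 1st floor, jajan,totl-996633 *some text*',
    ],
})

# \s* is an assumption added here: it skips whitespace after the keyword.
pattern = r'(?:address|contactus)\s*(.*?\d{6})'

# expand=False returns a Series for a single capture group;
# rows without a match become NaN.
df['location'] = df['Text'].str.extract(pattern, expand=False)

# drop returns a new DataFrame; df itself keeps the Text column.
result = df.drop(columns='Text')
print(result)
```

Without the `\s*`, the third row's value would start with a space, which is easy to miss in the printed frame.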
As a helpful tip, when you have several terms to check, put them in a list and join them with str.join:
terms = ['address', 'contactus', ...]
df['location'] = df.Text.str\
.extract(r'(?:{})(.*?\d{{6}})'.format('|'.join(terms)), expand=False)
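One thing to watch when building the pattern from a list: a term containing regex metacharacters (a dot, a plus sign, and so on) would change the meaning of the pattern. A rough sketch of a safer builder follows, using only the standard `re` module so it can be checked outside pandas. The helper name `build_pattern` and the extra term `'addr.'` are hypothetical, made up for illustration.

```python
import re

def build_pattern(terms):
    """Build the extraction pattern from a list of literal keywords."""
    # Escape each term so characters like '.' are matched literally.
    alternation = '|'.join(re.escape(term) for term in terms)
    # Doubled braces keep {6} as a literal quantifier under str.format.
    return r'(?:{})(.*?\d{{6}})'.format(alternation)

# 'addr.' is a hypothetical extra keyword, not from the question.
pattern = build_pattern(['address', 'contactus', 'addr.'])

hit = re.search(pattern, 'see addr. 4 main rd, town-123456 for details')
miss = re.search(pattern, 'see addrX 4 main rd, town-123456 for details')

location = hit.group(1) if hit else None
print(pattern)
print(repr(location))
```

The resulting string can be passed to `str.extract` exactly as in the snippet above.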
Regex details
(?: # non-capturing group
address # "address"
| # regex OR
contactus # "contactus"
)
(.*? # non-greedy match-all
\d{6} # 6 digit zipcode
)
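The commented layout above can also be kept in the code itself. As a sketch, assuming you are happy to compile the pattern with `re.VERBOSE` (which ignores whitespace and `#` comments inside the pattern), it might look like this; the name `LOCATION_RE` is my own.

```python
import re

# Same pattern as above, written out with inline comments.
LOCATION_RE = re.compile(r"""
    (?:             # non-capturing group
        address     # "address"
        |           # regex OR
        contactus   # "contactus"
    )
    (.*?            # non-greedy match-all
        \d{6}       # 6 digit zipcode
    )
""", re.VERBOSE)

sample = '*some text* address13/b srs mall, indirapuram,sann-444000 *some text*'
match = LOCATION_RE.search(sample)
location = match.group(1) if match else None
print(location)
```

`str.extract` takes a `flags` argument, so the same multi-line pattern string should work there with `flags=re.VERBOSE`.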