How to build a name detector in pandas

Time: 2018-04-19 06:11:51

Tags: python regex pandas dataframe

Here is my dataset:

Id.   Text
1     Dear Mr. Alpha Terra, your food is delivered
2     Dear Mrs. Betta Irina Viruva, your drink is delivered

I want to detect the words that come after "Mr." or "Mrs." and before the comma, so that I can extract the name. This is the result I am after:

Id.   Text                                                       Name
1     Dear Mr. Alpha Terra, your food is delivered               Alpha Terra 
2     Dear Mrs. Betta Irina Viruva, your drink is delivered      Betta Irina Viruva

4 answers:

Answer 0 (score: 2)

Try this:

In [134]: df.Text.str.split('.',expand=True)[1].str.split(',',expand=True)[0]
Out[134]: 
0            Alpha Terra
1     Betta Irina Viruva
Name: 0, dtype: object
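
For anyone who wants to run the split approach end to end, here is a minimal sketch. The construction of `df` is an assumption rebuilt from the question's sample rows, since the answer does not show it. The added `.str.strip()` is also not part of the original one-liner; it is there because splitting on '.' leaves a space in front of each name. The approach only works when the title's period is the first period in the text.

```python
import pandas as pd

# Sample data rebuilt from the question; this construction is assumed.
df = pd.DataFrame({
    "Id.": [1, 2],
    "Text": [
        "Dear Mr. Alpha Terra, your food is delivered",
        "Dear Mrs. Betta Irina Viruva, your drink is delivered",
    ],
})

# Take the piece right after the first '.', then keep what precedes its first ','.
after_title = df["Text"].str.split(".", expand=True)[1]
df["Name"] = after_title.str.split(",", expand=True)[0].str.strip()

print(df["Name"].tolist())
```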

Answer 1 (score: 2)

One option is to match against the following pattern:

.*Mrs?\.\s+([^,]+).*

This captures everything after "Mr." or "Mrs." up to, but not including, the first comma that follows.

import re

line = "Dear Mrs. Betta Irina Viruva, your drink is delivered"
# Group 1 holds the text after "Mr."/"Mrs." up to the first comma.
matches = re.match(r'.*Mrs?\.\s+([^,]+).*', line, re.M|re.I)

if matches:
    print("Name:", matches.group(1))
else:
    print("No match!!")

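
The same pattern can also be applied to a whole column instead of a single string. The following is a sketch of one way to do that, assuming the data lives in a DataFrame with a `Text` column as in the question. The helper name `extract_name` and the third, non-matching row are hypothetical and only there for illustration.

```python
import re
import pandas as pd

# Same pattern as above, compiled once; re.I makes "mr."/"MRS." match too.
NAME_PATTERN = re.compile(r".*Mrs?\.\s+([^,]+).*", re.I)

def extract_name(text):
    """Return the text after 'Mr.'/'Mrs.' up to the next comma, or None."""
    match = NAME_PATTERN.match(text)
    return match.group(1) if match else None

df = pd.DataFrame({
    "Text": [
        "Dear Mr. Alpha Terra, your food is delivered",
        "Dear Mrs. Betta Irina Viruva, your drink is delivered",
        "Hello there, your order is delayed",  # hypothetical row with no title
    ],
})

# Rows without a title end up with a missing value in the Name column.
df["Name"] = df["Text"].apply(extract_name)
print(df)
```

Handling the no-match case explicitly keeps rows without a title from raising an error when the column is processed.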

Answer 2 (score: 1)

Use str.extract:

df['Name'] = df['Text'].str.extract(r'Mrs?\.\s+(.*?),', expand=False)
print (df)
   Id.                                               Text                Name
0    1       Dear Mr. Alpha Terra, your food is delivered         Alpha Terra
1    2  Dear Mrs. Betta Irina Viruva, your drink is de...  Betta Irina Viruva

Answer 3 (score: 1)

Since you asked for a regex, try this:

import pandas
data = [{'ID': 1, 'Text': 'Dear Mr. Alpha Terra, your food is delivered'},
        {'ID': 2, 'Text': 'Dear Mrs. Betta Irina Viruva, your drink is delivered'}]
df = pandas.DataFrame(data)
# \s* skips the space after the title's period; expand=False returns a Series.
df['Name'] = df.Text.str.extract(r'\.\s*(.*?),', expand=False)
print(df)
