Modification of Alter text in pandas column based on names

Date: 2019-08-11 14:23:19

Tags: regex python-3.x string pandas replace

Background

I have the following df, which is a modification of the one in Alter text in pandas column based on names:

import pandas as pd

df = pd.DataFrame({
    'Text': ['Jon J Doe works ',
             'So is Mary Doe, works too',
             'Jane Ann, Doe doesnt',
             'Jone, Dow doesnt either'],
    'P_ID': [1, 2, 3, 4],
    'P_Name': ['Doe, Jon J', 'Doe, Mary', 'Doe, Jane Ann', 'Dow, Jone'],
})


P_ID    P_Name           Text
0   1   Doe, Jon J       Jon J Doe works
1   2   Doe, Mary        So is Mary Doe, works too
2   3   Doe, Jane Ann    Jane Ann, Doe doesnt
3   4   Dow, Jone        Jone, Dow doesnt either

The code below blocks names such as Jon J Doe, but it does not work when there is a character between the parts of the name, for example Jane Ann, Doe or Jone! Dow

df['NewText'] = df['Text'].replace(df['P_Name'].str.split(', *').apply(lambda l: ' '.join(l[::-1])),'**BLOCK**',regex=True) 

Output

    P_ID    P_Name         Text                       NewText
0   1       Doe, Jon J     Jon J Doe works            **BLOCK** works
1   2       Doe, Mary      So is Mary Doe, works too  So is **BLOCK**, works too
2   3       Doe, Jane Ann  Jane Ann, Doe doesnt       Jane Ann, Doe doesnt
3   4       Dow, Jone      Jone, Dow doesnt either    Jone, Dow doesnt either
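
To see why rows 2 and 3 are left untouched, it may help to look at the patterns the lambda produces. The sketch below uses plain `re` instead of pandas and assumes each pattern is only tested against the text in its own row; the list names `p_names`, `texts`, `patterns` and `found` are illustrative and not part of the original code.

```python
import re

p_names = ['Doe, Jon J', 'Doe, Mary', 'Doe, Jane Ann', 'Dow, Jone']
texts = ['Jon J Doe works ',
         'So is Mary Doe, works too',
         'Jane Ann, Doe doesnt',
         'Jone, Dow doesnt either']

# Same transformation as the lambda in the question:
# 'Doe, Jon J' -> ['Doe', 'Jon J'] -> reversed -> 'Jon J Doe'
patterns = [' '.join(re.split(', *', name)[::-1]) for name in p_names]

# Check whether each pattern occurs in the text of the same row.
found = [bool(re.search(p, t)) for p, t in zip(patterns, texts)]

for p, t, hit in zip(patterns, texts, found):
    print(repr(p), 'in', repr(t), '->', hit)
```

The parts are joined with a single literal space, so `'Jane Ann Doe'` cannot match `'Jane Ann, Doe'`, and `'Jone Dow'` cannot match `'Jone, Dow'`.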

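Goal

1) Adjust the code above so that it accounts for , (or any other character that may appear between the parts of a name)

(I know I could remove the commas, but I need to keep them)

Desired output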

    P_ID    P_Name         Text                       NewText
0   1       Doe, Jon J     Jon J Doe works            **BLOCK** works
1   2       Doe, Mary      So is Mary Doe, works too  So is **BLOCK**, works too
2   3       Doe, Jane Ann  Jane Ann, Doe doesnt       **BLOCK** doesnt
3   4       Dow, Jone      Jone, Dow doesnt either    **BLOCK** doesnt either

Question

How do I adjust the code to get the desired output?

2 answers:

Answer 0: (score: 1)

Try:

df['NewText'] = df['Text'].replace(r'(' + df['P_Name'].str.split(r'\W+').str.join('|') + r'|\W+){3,}', ' **BLOCK** ', regex=True)
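
To make the pattern above easier to follow, here is a rough sketch that rebuilds it with plain `re` and applies it one row at a time. Applying each pattern only to its own row is an assumption of this sketch, and the names `p_names`, `texts`, `patterns` and `new_texts` are illustrative.

```python
import re

p_names = ['Doe, Jon J', 'Doe, Mary', 'Doe, Jane Ann', 'Dow, Jone']
texts = ['Jon J Doe works ',
         'So is Mary Doe, works too',
         'Jane Ann, Doe doesnt',
         'Jone, Dow doesnt either']

# Rebuild the per-row pattern: the name parts joined with '|', plus \W+
# as one more alternative, with the whole group repeated at least 3 times.
# 'Doe, Jon J' -> '(Doe|Jon|J|\W+){3,}'
patterns = ['(' + '|'.join(re.split(r'\W+', name)) + r'|\W+){3,}'
            for name in p_names]

# Assumption: each pattern is applied only to the text in its own row.
new_texts = [re.sub(p, ' **BLOCK** ', t) for p, t in zip(patterns, texts)]

for p, t in zip(patterns, new_texts):
    print(p, '->', repr(t))
```

Run this way, the match also takes in the non-word characters around the name, so row 0 comes out with a leading space and the comma after Mary Doe in row 1 is replaced together with the name. That is close to the desired output but not identical to it.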

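Answer 1: (score: 1)

I don't know whether there are many such cases, but if you only have a limited number of them, this approach may be enough.

Sample dataset: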

>>> df
   P_ID         P_Name                       Text
0     1     Doe, Jon J           Jon J Doe works
1     2      Doe, Mary  So is Mary Doe, works too
2     3  Doe, Jane Ann       Jane Ann, Doe doesnt
3     4      Dow, Jone    Jone, Dow doesnt either

You can create a dictionary of the name combinations and apply it to the DataFrame to get the result.

>>> replace_values = {'Jon J Doe': '**BLOCK**', 'Mary Doe': '**BLOCK**', 'Jane Ann, Doe': '**BLOCK**', 'Jone, Dow': '**BLOCK**'}

Resulting DataFrame:

>>> df = df.replace(replace_values, regex=True)
>>> df
   P_ID         P_Name                        Text
0     1     Doe, Jon J            **BLOCK** works
1     2      Doe, Mary  So is **BLOCK**, works too
2     3  Doe, Jane Ann            **BLOCK** doesnt
3     4      Dow, Jone     **BLOCK** doesnt either
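
If typing the dictionary by hand is not practical, its keys could be generated from P_Name instead. The following is only a sketch: `name_pattern` is a hypothetical helper, and it assumes every P_Name has the form "Last, First Middle" with exactly one comma. It joins the name parts with `\W+`, so any run of non-word characters between them (a space, a comma, an exclamation mark) is accepted. Plain `re` is used here to check the patterns.

```python
import re

p_names = ['Doe, Jon J', 'Doe, Mary', 'Doe, Jane Ann', 'Dow, Jone']
texts = ['Jon J Doe works ',
         'So is Mary Doe, works too',
         'Jane Ann, Doe doesnt',
         'Jone, Dow doesnt either']


def name_pattern(p_name):
    """Hypothetical helper: 'Doe, Jane Ann' -> r'\\bJane\\W+Ann\\W+Doe\\b'."""
    # Assumes 'Last, First Middle'; raises ValueError if there is no comma.
    last, first = re.split(r',\s*', p_name, maxsplit=1)
    tokens = first.split() + last.split()
    # \W+ between tokens tolerates any separator; \b avoids partial words.
    return r'\b' + r'\W+'.join(re.escape(tok) for tok in tokens) + r'\b'


replace_values = {name_pattern(name): '**BLOCK**' for name in p_names}

# Apply every pattern to every text, as a dictionary-based replace would.
new_texts = []
for text in texts:
    for pattern, repl in replace_values.items():
        text = re.sub(pattern, repl, text)
    new_texts.append(text)

for text in new_texts:
    print(repr(text))
```

A dictionary built this way should be usable in the same call as above, `df.replace(replace_values, regex=True)`. Because only the characters between the name parts are matched, the comma after Mary Doe is kept.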