Pandas: parsing a string with multiple elements and multiple delimiters

Time: 2019-11-28 17:52:26

Tags: python pandas split

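I have an Excel file with 27 columns. One column, called "User", contains multiple elements. Each element sits on its own line within the same Excel cell, and its sub-elements are enclosed in parentheses and separated by semicolons (;). However, a sub-element may itself contain parentheses. Below is a picture showing how the table looks in Excel with sample data.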

Data Table

This is how it looks when imported into Python as a DataFrame with pandas.

import pandas as pd

df = pd.DataFrame({'CN ON': ['WB-01','ZD-DD','DE-02','WZ-D8','HJ-78'],
                   'Type': ['First','Second','First','Second','Third'],
                   'Status': ['Completed','Started','Started','Final','Pending'],
                   'User': ['Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)',  'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)', 'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nAdmin Assistant (WRIST PAD; wristpad@domain.com; 999 999 9999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)']
                   })

Logic

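Now I want to apply the following logic and create a new column called Ownership. If Status = 'Completed', then Ownership = 'Completed'. If Status = 'Started', then Ownership = the name of every Admin. If Status = 'Final', then Ownership = the name of every Supervisor. If Status = 'Pending', then Ownership = the name of every Admin Assistant.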

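The possible user roles are 'Admin', 'Admin Assistant', 'Supervisor', and 'Alternative Supervisor'. The first sub-element inside the parentheses is the name of the person holding that role. The second sub-element is the email address. The third sub-element is a non-standardized phone number: it may contain dashes, parentheses, or spaces, or be written as one run of digits. The delimiter between sub-elements is a semicolon (;). I believe the delimiter between elements is \n in Python, because that is what shows up when I build the DataFrame with the script above.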

Status     User Role
Started    Admin
Pending    Admin Assistant
Final      Supervisor
Completed  Completed
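
To make the cell layout described above concrete, here is a minimal sketch that parses a single "User" cell with the standard `re` module. The helper name `parse_user_cell` and the record keys are hypothetical, and the sketch assumes every element sits on its own line in the form `Role (NAME; email; phone)`. Only the final closing parenthesis is removed, so a phone number such as `(999) 999-999` keeps its own parentheses.

```python
import re

# Hypothetical helper, not part of the original question.
# Assumes each element looks like "Role (NAME; email; phone)" on its own line.
ENTRY = re.compile(r"^\s*(?P<role>[^(]+?)\s*\((?P<body>.*)$")

def parse_user_cell(cell):
    """Return one dict per element found in a single 'User' cell."""
    records = []
    for line in cell.split("\n"):
        match = ENTRY.match(line)
        if match is None:
            continue  # skip lines that do not follow the expected shape
        body = match.group("body").strip()
        # Drop only the last ")" so parentheses inside the phone number survive.
        if body.endswith(")"):
            body = body[:-1]
        parts = [part.strip() for part in body.split(";")]
        # Pad with None in case a sub-element is missing.
        name, email, phone = (parts + [None] * 3)[:3]
        records.append({"role": match.group("role"), "name": name,
                        "email": email, "phone": phone})
    return records

cell = ("Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\n"
        "Alternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)")
for record in parse_user_cell(cell):
    print(record)
```

This is only meant to show that the role is everything before the first opening parenthesis and that the semicolon split is safe because none of the three sub-elements contains a semicolon in the sample data.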

Desired Results

Python script that builds the expected result:

df_results = pd.DataFrame({'CN ON': ['WB-01','ZD-DD','DE-02','WZ-D8','HJ-78'],
                           'Type': ['First','Second','First','Second','Third'],
                           'Status': ['Completed','Started','Started','Final','Pending'],
                           'User': ['Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)',  'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)', 'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nAdmin Assistant (WRIST PAD; wristpad@domain.com; 999 999 9999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)'],
                           'Ownership': ['Completed','PAPER CLIP','PAPER CLIP, THUMB TACK','WHITE BOARD','MOUSE PAD, WRIST PAD']
                           })

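I tried using .split, but I can't figure out how to split on multiple elements and multiple delimiters, especially when there may be several parentheses. I also don't know how to pull only certain elements out of that field, since there can be multiple instances depending on a condition in another field.

Any guidance or help would be greatly appreciated! Please let me know if anything needs clarification.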

1 Answer:

Answer 0 (score: 1)

The first task is to extract the names for every role; the rest is straightforward:

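roles = (df['User'].str.split('\n', expand=True)
     .stack()                       # one row per (original row, element)
     # column 0 captures the role, column 1 the name before the first ';'
     .str.extract(r'^([\w\s]*)\s+\(([\w\s]*)[;|\)]')
     .reset_index()
     .groupby(['level_0', 0])[1]
     .agg(', '.join)                # join names that share a row and a role
     .unstack(level=0)              # 0 resolves to the level *named* 0 (the role)
)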

# assign Completed ownership
roles['Completed'] = 'Completed'

ownership_mask = {
    'Started' : 'Admin',
    'Pending' : 'Admin Assistant',
    'Final'   : 'Supervisor',
    'Completed': 'Completed'
}

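# DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0;
# a label-based .at access per row gives the same result.
wanted_role = df['Status'].map(ownership_mask)
df['ownership'] = [roles.at[row, role] for row, role in zip(df.index, wanted_role)]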

Output (df['ownership']):

0                 Completed
1                PAPER CLIP
2    PAPER CLIP, THUMB TACK
3               WHITE BOARD
4      MOUSE PAD, WRIST PAD
Name: ownership, dtype: object
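
For readers who want to inspect the intermediate `roles` table, the following sketch rebuilds it on a two-row toy frame. It is a variant of the answer's pipeline, not the answer's exact code: it uses `Series.explode` in place of `split(expand=True).stack()` and named regex groups in place of the numeric columns 0 and 1, and the names `toy`, `pairs`, and `toy_roles` are made up for illustration.

```python
import pandas as pd

# Toy data (assumption): two rows in the same "Role (NAME; email; phone)" layout.
toy = pd.DataFrame({
    "User": [
        "Admin (PAPER CLIP; a@domain.com; 999)\n"
        "Supervisor (WHITE BOARD; b@domain.com; (999) 999)",
        "Admin (PAPER CLIP; a@domain.com; 999)\n"
        "Admin (THUMB TACK; c@domain.com; 999-999)",
    ]
})

# One entry per element; explode repeats the original row label.
elements = toy["User"].str.split("\n").explode()

# Named groups label the extracted columns "role" and "name".
pairs = elements.str.extract(r"^(?P<role>[\w\s]*)\s+\((?P<name>[\w\s]*)[;)]")

# Join names sharing a row and role, then pivot the roles into columns.
toy_roles = (pairs.rename_axis("row")
                  .reset_index()
                  .groupby(["row", "role"])["name"]
                  .agg(", ".join)
                  .unstack("role"))
print(toy_roles)
```

The result has one row per original row and one column per role; a role that does not occur in a row shows up as NaN, which is why a status-to-role lookup on this table is enough to fill the ownership column.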

Note that np.select can be used instead of the df['Status'].map lookup:

import numpy as np

# no need to set roles['Completed'] = 'Completed' here
df['Ownership'] = np.select([df['Status'].eq('Started'),
                             df['Status'].eq('Pending'),
                             df['Status'].eq('Final')],
                            [roles['Admin'], roles['Admin Assistant'], roles['Supervisor'] ],
                            'Completed'
                           )
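
As a further alternative, the whole task can be sketched without building the wide `roles` table at all: keep the extracted pairs in long form, attach the role each row is asking for, and filter. This is a different formulation from the answer above, offered under the same layout assumptions; `add_ownership` and `STATUS_TO_ROLE` are hypothetical names, and the toy frame only mimics the question's data.

```python
import pandas as pd

# Assumed mapping, taken from the Status / User Role table in the question.
STATUS_TO_ROLE = {"Started": "Admin", "Pending": "Admin Assistant", "Final": "Supervisor"}

def add_ownership(frame):
    """Hypothetical helper: return a copy of frame with an 'Ownership' column."""
    long = (frame["User"].str.split("\n").explode()
            .str.extract(r"^(?P<role>[\w\s]*)\s+\((?P<name>[\w\s]*)[;)]"))
    # Attach the role each original row wants (NaN for e.g. 'Completed').
    wanted = frame["Status"].map(STATUS_TO_ROLE).rename("wanted")
    long = long.join(wanted)
    # Keep only the elements whose role matches exactly, then join the names per row.
    keep = long["role"] == long["wanted"]
    names = long.loc[keep, "name"].groupby(level=0).agg(", ".join)
    out = frame.copy()
    out["Ownership"] = names.reindex(frame.index).astype(object)
    out.loc[frame["Status"].eq("Completed"), "Ownership"] = "Completed"
    return out

toy = pd.DataFrame({
    "Status": ["Completed", "Started", "Final", "Pending"],
    "User": [
        "Admin (PAPER CLIP; paper.clip@domain.com; 999999999)",
        "Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\n"
        "Admin (THUMB TACK; thumbtack@domain.com; 999-999-999)",
        "Supervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\n"
        "Alternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)",
        "Admin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\n"
        "Admin Assistant (WRIST PAD; wristpad@domain.com; 999 999 9999)",
    ],
})
print(add_ownership(toy)["Ownership"].tolist())
```

Because the comparison is an exact match on the extracted role, 'Alternative Supervisor' is not picked up when the wanted role is 'Supervisor'. Rows whose status has no matching role in the cell are left as NaN.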