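I have an Excel file with 27 columns. One column, called "User", contains multiple elements per cell. Each element sits on its own line within the same Excel cell, and each element's sub-elements are enclosed in parentheses and separated by semicolons (;). However, a sub-element may itself contain parentheses. Below is a picture showing how the table appears in Excel with sample data.
This is how it is imported into Python as a DataFrame using Pandas.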
import pandas as pd

df = pd.DataFrame({'CN ON': ['WB-01','ZD-DD','DE-02','WZ-D8','HJ-78'],
'Type': ['First','Second','First','Second','Third'],
'Status': ['Completed','Started','Started','Final','Pending'],
'User': ['Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)', 'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)', 'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nAdmin Assistant (WRIST PAD; wristpad@domain.com; 999 999 9999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)']
})
Logic
Now I want to apply the following logic and create a new column called Ownership.
If Status = 'Completed', then Ownership = 'Completed'.
If Status = 'Started', then Ownership = the name of every Admin.
If Status = 'Final', then Ownership = the name of every Supervisor.
If Status = 'Pending', then Ownership = the name of every Admin Assistant.
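The possible user roles are 'Admin', 'Admin Assistant', 'Supervisor', and 'Alternative Supervisor'. The first sub-element in the parentheses is the name of the person holding that role. The second sub-element is the email address. The third sub-element is a non-standardized phone number: it can contain dashes, parentheses, or spaces, or be written as one run of digits. The delimiter between sub-elements is a semicolon (;). I believe the delimiter between elements is \n in Python, because that is what shows up when I import the DataFrame (using the script above).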
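To make the cell layout described above concrete, here is a minimal sketch that parses a single "User" cell with the standard `re` module. The helper name `parse_user_cell` and the pattern `ENTRY_RE` are hypothetical, and the sketch assumes every line has the shape `Role (NAME; email; phone)`. The closing parenthesis is treated as optional because one of the sample entries omits it.

```python
import re

# Hypothetical pattern: role, then "(name; email; phone" with an optional final ")".
# The phone part is matched lazily so parentheses inside it, e.g. "(999) 999-999", survive.
ENTRY_RE = re.compile(r'^\s*([^(]+?)\s*\(([^;]*);([^;]*);(.*?)\)?\s*$')

def parse_user_cell(cell):
    """Return a list of (role, name, email, phone) tuples for one cell."""
    records = []
    for line in cell.split('\n'):  # elements are separated by newlines
        match = ENTRY_RE.match(line)
        if match:
            records.append(tuple(part.strip() for part in match.groups()))
    return records

cell = ('Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\n'
        'Alternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)')
print(parse_user_cell(cell))
```

One caveat of this sketch: because the final parenthesis is optional, an entry that lacks its closing parenthesis but whose phone number itself ends in ")" would lose that character.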
Status      User Role
Started     Admin
Pending     Admin Assistant
Final       Supervisor
Completed   Completed
Python script that produces the expected result:
df_results = pd.DataFrame({'CN ON': ['WB-01','ZD-DD','DE-02','WZ-D8','HJ-78'],
'Type': ['First','Second','First','Second','Third'],
'Status': ['Completed','Started','Started','Final','Pending'],
'User': ['Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)', 'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)', 'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nAdmin Assistant (WRIST PAD; wristpad@domain.com; 999 999 9999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)'],
'Ownership': ['Completed','PAPER CLIP','PAPER CLIP, THUMB TACK','WHITE BOARD','MOUSE PAD, WRIST PAD']
})
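I tried using the .split function, but I can't figure out how to split on multiple elements and multiple delimiters, especially when there may be more than one pair of parentheses. I also don't know how to extract only certain elements from that field, since there may be several matching entries depending on a condition in another field.
Any guidance or assistance would be greatly appreciated! Please let me know if anything needs clarification.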
Answer 0 (score: 1)
The first task is to extract the names for all roles; the rest is straightforward:
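roles = (df['User'].str.split('\n', expand=True)   # one column per line in the cell
           .stack()                                  # long format: one row per entry
           .str.extract(r'^([\w\s]*)\s+\(([\w\s]*)[;|\)]')  # column 0 = role, column 1 = name
           .reset_index()
           .groupby(['level_0', 0])[1]               # group by original row and role
           .agg(', '.join)
           .unstack(level=0)                         # the level named 0 (role) becomes the columns
        )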
# assign Completed ownership
roles['Completed'] = 'Completed'
ownership_mask = {
'Started' : 'Admin',
'Pending' : 'Admin Assistant',
'Final' : 'Supervisor',
'Completed': 'Completed'
}
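# DataFrame.lookup was removed in pandas 2.0, so pick one (row, column) cell per row instead
role_per_row = df['Status'].map(ownership_mask)
df['ownership'] = [roles.at[row, role] for row, role in zip(df.index, role_per_row)]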
Output (df['ownership']):
0 Completed
1 PAPER CLIP
2 PAPER CLIP, THUMB TACK
3 WHITE BOARD
4 MOUSE PAD, WRIST PAD
Name: ownership, dtype: object
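As an alternative that avoids building the wide `roles` table, the same result can probably be reached by exploding the cell into one row per entry and keeping only the entries whose role matches the row's Status. This is a sketch under assumptions, not the method above: the small `df` here is a stand-in with the same layout as the question's data, and the names `status_to_role`, `entries`, `parsed`, and `wanted` are made up for illustration.

```python
import pandas as pd

# Stand-in data (assumption: same 'Status' / 'User' layout as the question's df).
df = pd.DataFrame({
    'Status': ['Completed', 'Started', 'Final', 'Pending'],
    'User': [
        'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)',
        'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\n'
        'Admin (THUMB TACK; thumbtack@domain.com; 999-999-999\n'
        'Supervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)',
        'Supervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\n'
        'Alternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)',
        'Admin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\n'
        'Admin Assistant (WRIST PAD; wristpad@domain.com; 999 999 9999)',
    ],
})

status_to_role = {'Started': 'Admin', 'Pending': 'Admin Assistant', 'Final': 'Supervisor'}

# One row per "Role (NAME; ...)" entry; the original row label is repeated in the index.
entries = df['User'].str.split('\n').explode()
parsed = entries.str.extract(r'^(?P<role>[\w\s]*?)\s+\((?P<name>[\w\s]*)[;)]')

# Attach the role each original row is looking for (NaN for 'Completed').
wanted = df['Status'].map(status_to_role).rename('wanted')
parsed = parsed.join(wanted)

# Keep matching entries and join their names per original row.
matches = parsed[parsed['role'] == parsed['wanted']]
df['Ownership'] = matches.groupby(level=0)['name'].agg(', '.join)
df.loc[df['Status'] == 'Completed', 'Ownership'] = 'Completed'

print(df[['Status', 'Ownership']])
```

Rows whose required role does not appear in the cell end up with NaN in `Ownership`, which may or may not be the desired behaviour.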
Note that we can use np.select instead of df['Status'].map:
import numpy as np

# no need to do `roles['Completed'] = 'Completed'`:
df['Ownership'] = np.select([df['Status'].eq('Started'),
                             df['Status'].eq('Pending'),
                             df['Status'].eq('Final')],
                            [roles['Admin'], roles['Admin Assistant'], roles['Supervisor']],
                            'Completed'
                           )