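I've just started coding in Python and am looking for ways to improve my code. I'm currently analysing a DataFrame that contains multiple workflows. Each workflow has different process steps that initiate, end, or delete an initiated workflow. In a simplified version, my data looks like this: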
   Workflow  Initiate_1  Initiate_2   End_1   End_2   End_3   Del_1  Del_2
0         1      Name_1          na      na  Name_1      na      na     na
1         2      Name_2          na      na      na      na  Name_2     na
2         3          na      Name_3      na      na  Name_5      na     na
3         4          na      Name_4  Name_5      na      na      na     na
4         5          na          na      na      na  Name_5      na     na
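To make the snippets in this post reproducible, the sample above could be built as follows. This is a minimal sketch that assumes the whitespace-separated values map onto the columns in the order shown, and that 'na' is a literal placeholder string rather than a real missing value.

```python
import pandas as pd

# Column order taken from the header of the sample table.
columns = ["Workflow", "Initiate_1", "Initiate_2",
           "End_1", "End_2", "End_3", "Del_1", "Del_2"]

# Row values copied from the sample; 'na' is kept as a plain string.
rows = [
    [1, "Name_1", "na", "na", "Name_1", "na", "na", "na"],
    [2, "Name_2", "na", "na", "na", "na", "Name_2", "na"],
    [3, "na", "Name_3", "na", "na", "Name_5", "na", "na"],
    [4, "na", "Name_4", "Name_5", "na", "na", "na", "na"],
    [5, "na", "na", "na", "na", "Name_5", "na", "na"],
]

df = pd.DataFrame(rows, columns=columns)
print(df)
```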
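For each workflow, I want to check whether the name that ended the workflow differs from the name that initiated it. In addition, I want to determine whether an initiated workflow has been deleted. With the help of Stack Overflow in this question, I wrote the following code, which seems to give the expected result: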
import numpy as np

df = df.replace('na', np.nan)  # convert the 'na' placeholder to NaN before filtering
end_scenarios = df.filter(items=['End_1',
                                 'End_2',
                                 'End_3'])  # filter by columns which can end a mutation
delete_scenarios = df.filter(items=['Del_1',
                                    'Del_2'])  # filter by columns which delete a mutation
nulls = end_scenarios.isnull().all(axis=1)  # checks which rows are all null
delete = delete_scenarios.notnull().any(axis=1)  # checks if a row contains a value in one of the removal scenarios
match = end_scenarios.ffill(axis=1).iloc[:, -1] == df['Initiate_1']  # last non-null name across the End columns
# use np.select to analyse each row for the first initiate scenario
df['Analysis Initiate_1'] = np.select([match, delete, nulls],
                                      ['Name end equals initiate', 'Deleted mutation', 'No name ended'],
                                      'Different name ended')
# use np.select to analyse the second initiate scenario
match = end_scenarios.ffill(axis=1).iloc[:, -1] == df['Initiate_2']
df['Analysis Initiate_2'] = np.select([match, delete, nulls],
                                      ['Name end equals initiate', 'Deleted mutation', 'No name ended'],
                                      'Different name ended')
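I then decided to rewrite the analysis so that it compares against each of the End columns, instead of forward-filling and comparing only with the last column, in case more than one name is stored in the end scenarios. The result is: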
conditions = [
    # the initiating name is present and equals any of the End columns
    (df['Initiate_1'] != 'na') &
    ((df['Initiate_1'] == df['End_1']) |
     (df['Initiate_1'] == df['End_2']) |
     (df['Initiate_1'] == df['End_3'])),
    # any Del column holds a name
    (df['Del_1'] != 'na') |
    (df['Del_2'] != 'na'),
    # no End column holds a name
    (df['End_1'] == 'na') &
    (df['End_2'] == 'na') &
    (df['End_3'] == 'na')]
answers = ['Name end equals initiate', 'Deleted', 'No name ended']
df['Analysis Initiate_1'] = np.select(conditions, answers, default='Different name ended')
conditions = [
    # same three checks, now for the second initiate column
    (df['Initiate_2'] != 'na') &
    ((df['Initiate_2'] == df['End_1']) |
     (df['Initiate_2'] == df['End_2']) |
     (df['Initiate_2'] == df['End_3'])),
    (df['Del_1'] != 'na') |
    (df['Del_2'] != 'na'),
    (df['End_1'] == 'na') &
    (df['End_2'] == 'na') &
    (df['End_3'] == 'na')]
df['Analysis Initiate_2'] = np.select(conditions, answers, default='Different name ended')
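This code seems to give the expected result, but I'd like to post it for further improvement. Are there any pitfalls in the code I wrote to compare multiple columns? Are there other ways to analyse multiple columns with more concise code, or anything else worth improving?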
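One direction that might shorten the repeated blocks is to move the three checks into a function and loop over the initiate columns. The sketch below is only an illustration: `analyse_initiate`, `END_COLS`, `DEL_COLS` and the `missing` parameter are hypothetical names, the small DataFrame is the sample from the top of this post, and 'na' is assumed to still be a literal string.

```python
import numpy as np
import pandas as pd

END_COLS = ["End_1", "End_2", "End_3"]   # columns that can end a workflow
DEL_COLS = ["Del_1", "Del_2"]            # columns that delete a workflow
ANSWERS = ["Name end equals initiate", "Deleted", "No name ended"]


def analyse_initiate(df, initiate_col, missing="na"):
    """Hypothetical helper: classify every row for one initiate column."""
    initiate = df[initiate_col]
    # Initiating name is present and equals at least one End value in the same row.
    match = initiate.ne(missing) & df[END_COLS].eq(initiate, axis=0).any(axis=1)
    # At least one Del column holds a name.
    deleted = df[DEL_COLS].ne(missing).any(axis=1)
    # No End column holds a name.
    none_ended = df[END_COLS].eq(missing).all(axis=1)
    # np.select takes the first condition that is True, so a match wins over a deletion.
    return np.select([match, deleted, none_ended], ANSWERS,
                     default="Different name ended")


# Sample data from the top of the post.
df = pd.DataFrame(
    [
        [1, "Name_1", "na", "na", "Name_1", "na", "na", "na"],
        [2, "Name_2", "na", "na", "na", "na", "Name_2", "na"],
        [3, "na", "Name_3", "na", "na", "Name_5", "na", "na"],
        [4, "na", "Name_4", "Name_5", "na", "na", "na", "na"],
        [5, "na", "na", "na", "na", "Name_5", "na", "na"],
    ],
    columns=["Workflow", "Initiate_1", "Initiate_2"] + END_COLS + DEL_COLS,
)

for col in ["Initiate_1", "Initiate_2"]:
    df[f"Analysis {col}"] = analyse_initiate(df, col)

print(df[["Workflow", "Analysis Initiate_1", "Analysis Initiate_2"]])
```

Two things stand out when the checks are written this way. The order of the conditions decides the label, so a row that both matches and is deleted is reported as a match. And a row whose initiate column is 'na' but which has an End name falls through to 'Different name ended', which may or may not be the intended label.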