Ideas on Python code to compare multiple columns in a dataframe

Time: 2019-02-10 15:07:01

Tags: python pandas numpy dataframe

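I have just started coding in Python and am looking for ways to improve my code. I am currently analysing a dataframe that contains multiple workflows. Each workflow has different process steps that initiate, end, or delete an initiated workflow. In a simplified version, my data looks like this: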

   Workflow Initiate_1  Initiate_2   End_1   End_2   End_3   Del_1   Del_2 
0         1   Name_1            na      na  Name_1      na      na      na
1         2   Name_2            na      na      na      na  Name_2      na
2         3       na        Name_3      na      na  Name_5      na      na
3         4       na        Name_4  Name_5      na      na      na      na 
4         5       na            na      na      na  Name_5      na      na
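
For anyone who wants to run the snippets, the sample above can be rebuilt roughly as follows. This is a minimal sketch that assumes the 'na' placeholders are literal strings, as they are in my data:

```python
import pandas as pd

# Sample data copied from the table above; 'na' is a literal string placeholder
df = pd.DataFrame({
    'Workflow':   [1, 2, 3, 4, 5],
    'Initiate_1': ['Name_1', 'Name_2', 'na', 'na', 'na'],
    'Initiate_2': ['na', 'na', 'Name_3', 'Name_4', 'na'],
    'End_1':      ['na', 'na', 'na', 'Name_5', 'na'],
    'End_2':      ['Name_1', 'na', 'na', 'na', 'na'],
    'End_3':      ['na', 'na', 'Name_5', 'na', 'Name_5'],
    'Del_1':      ['na', 'Name_2', 'na', 'na', 'na'],
    'Del_2':      ['na', 'na', 'na', 'na', 'na'],
})
```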

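For each workflow, I want to check whether the name that ended the workflow differs from the name that initiated it. I also want to determine whether an initiated workflow was deleted. With help from Stack Overflow (this question), I wrote the following code, which seems to give the expected result: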

import numpy as np
import pandas as pd

# replace the 'na' placeholder first, so the filtered frames below contain real NaN values
df = df.replace('na', np.nan)

end_scenarios = df.filter(items = ['End_1',
                                   'End_2',
                                   'End_3'])  # filter by columns which can end a mutation

delete_scenarios = df.filter(items = ['Del_1',
                                      'Del_2']) # filter by columns which delete a mutation

nulls = end_scenarios.isnull().all(axis=1)  # checks which rows are all null
delete = delete_scenarios.notnull().any(axis=1) # checks if a row contains a value in one of the removal scenarios
match = end_scenarios.ffill(axis=1).iloc[:, -1] == df['Initiate_1'] # take the last non-null end name per row and compare it with Initiate_1

# use np.select to analyse each row for the first initiate scenario
df['Analysis Initiate_1'] = np.select([match, delete, nulls], 
                                      ['Name end equals initiate', 'Deleted mutation', 'No name ended'], 
                                       'Different name ended')

# use np.select to analyse the second initiate scenario
match = end_scenarios.ffill(axis=1).iloc[:, -1] == df['Initiate_2']

df['Analysis Initiate_2'] = np.select([match, delete, nulls], 
                                      ['Name end equals initiate', 'Deleted mutation', 'No name ended'], 
                                       'Different name ended')

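I then decided to rewrite the analysis so that it compares against all of the end columns, rather than forward filling and comparing only against the last column, in case more than one name is stored in the end scenarios. The result is: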

conditions = [
    (df['Initiate_1'] != 'na') &
    ((df['Initiate_1'] == df['End_1']) |
     (df['Initiate_1'] == df['End_2']) |
     (df['Initiate_1'] == df['End_3'])),
    (df['Del_1'] != 'na') |
    (df['Del_2'] != 'na'),
    (df['End_1'] == 'na') &
    (df['End_2'] == 'na') &
    (df['End_3'] == 'na')]

answers = ['Name end equals initiate','Deleted','No name ended']

df['Analysis Initiate_1'] = np.select(conditions, answers, default = 'Different name ended')

conditions = [
    (df['Initiate_2'] != 'na') &
    ((df['Initiate_2'] == df['End_1']) |
     (df['Initiate_2'] == df['End_2']) |
     (df['Initiate_2'] == df['End_3'])),
    (df['Del_1'] != 'na') |
    (df['Del_2'] != 'na'),
    (df['End_1'] == 'na') &
    (df['End_2'] == 'na') &
    (df['End_3'] == 'na')]

df['Analysis Initiate_2'] = np.select(conditions, answers, default = 'Different name ended')
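
To illustrate why I moved away from forward filling, here is a small sketch with a made-up row (not from my real data) in which two different names appear in the end columns. The frame `demo` and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical row: the initiating name ended the workflow in End_1,
# but a different name also appears later in End_3
demo = pd.DataFrame({
    'Initiate_1': ['Name_1'],
    'End_1': ['Name_1'],
    'End_2': [np.nan],
    'End_3': ['Name_5'],
})
end_cols = demo[['End_1', 'End_2', 'End_3']]

# Forward fill keeps only the last non-null name in each row
last_name = end_cols.ffill(axis=1).iloc[:, -1]
match_last = last_name == demo['Initiate_1']

# Comparing every end column finds a match in any position
match_any = end_cols.eq(demo['Initiate_1'], axis=0).any(axis=1)

print(match_last.tolist())  # [False]
print(match_any.tolist())   # [True]
```

With forward filling, only 'Name_5' is compared, so the match in End_1 is missed.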

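This code seems to give the expected result, but I wanted to post it to get suggestions for further improvement. Are there pitfalls in the code I wrote to compare multiple columns? Are there other ways to analyse multiple columns with more concise code?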
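
One direction I am considering is to wrap the repeated conditions in a function and loop over the initiate columns. This is only a rough sketch: `analyse` and its parameters are names I made up for illustration, and it assumes the 'na' strings have not been replaced with NaN:

```python
import numpy as np
import pandas as pd

def analyse(df, initiate_col, end_cols, del_cols, missing='na'):
    """Hypothetical helper: label each row for one initiate column."""
    started = df[initiate_col] != missing
    ends = df[end_cols]
    # True where the initiating name appears in any of the end columns
    match = started & ends.eq(df[initiate_col], axis=0).any(axis=1)
    deleted = (df[del_cols] != missing).any(axis=1)
    nothing_ended = (ends == missing).all(axis=1)
    # np.select picks the first condition that is True for each row
    return np.select(
        [match, deleted, nothing_ended],
        ['Name end equals initiate', 'Deleted', 'No name ended'],
        default='Different name ended',
    )

# Sample data from the question, with 'na' as a literal string
df = pd.DataFrame({
    'Workflow':   [1, 2, 3, 4, 5],
    'Initiate_1': ['Name_1', 'Name_2', 'na', 'na', 'na'],
    'Initiate_2': ['na', 'na', 'Name_3', 'Name_4', 'na'],
    'End_1':      ['na', 'na', 'na', 'Name_5', 'na'],
    'End_2':      ['Name_1', 'na', 'na', 'na', 'na'],
    'End_3':      ['na', 'na', 'Name_5', 'na', 'Name_5'],
    'Del_1':      ['na', 'Name_2', 'na', 'na', 'na'],
    'Del_2':      ['na', 'na', 'na', 'na', 'na'],
})

end_cols = ['End_1', 'End_2', 'End_3']
del_cols = ['Del_1', 'Del_2']

for col in ['Initiate_1', 'Initiate_2']:
    df['Analysis ' + col] = analyse(df, col, end_cols, del_cols)
```

The order of the conditions still matters here, because a row that both matches and has a deletion is labelled as a match.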

0 answers