How do I create a new column with values by comparing two other columns?

Time: 2019-07-03 19:22:48

Tags: python pandas

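I am working on a project that audits employees who have computer accounts. I want to print a dataframe with two new columns. This is different from the "comparing columns in a dataframe" questions because I am working with strings. I will also need to do some fuzzy logic, but that is still a long way off.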

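The data I receive is in an Excel sheet. It comes from two sources I have no control over, so I formatted it as [First Name, Last Name] and printed it to the console to make sure the data I am working with is correct. I converted the .xls to a .csv file, formatted the information, and was able to output the two lists of names in a single dataframe with two columns, but I cannot get the values I want into the last two columns. I have tried query (which returns True/False, not the names), diff, and regex. I assume I am not using the right tool.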

    import pandas as pd

    nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson",
                   "Amelia H. Hayden","Abraham Oliver"],
          'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson",
                   "DJ McMahon","Amelia H. Hayden"]}
    info = pd.DataFrame(data=nd)

    for row in info:
        if info.col1.value not in info.col2:
            info["Need Account"] = info.col1.value

        if info.col2.value not in info.col1:
            info["Delete Account"] = info.col2.value

    print(info)

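I would like a new dataframe with 2 columns, "Need Account" and "Delete Account", filled in with the appropriate values based on the other columns in the dataframe. With the code above I get an error saying that 'Series' has no attribute 'value'. Here is an example of my expected output: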

    df_out: 
    Need Account       Delete Account
    Demetrius McMahon  Abe Oliver
    Abraham Oliver     Hillary Emerson
    Hilary Emerson     DJ McMahon

From that list I can see who shows up under a nickname, and then prune the list from there.

3 Answers:

Answer 0 (score: 0)

I happened not to see your expected output, but I read what you tried in your code. Let me know if this is what you are looking for.

nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
      'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"], 
      'Need Account':"", 
      'Delete Account':""
     }
info = pd.DataFrame(data=nd)

print(info)

               col1              col2 Need Account Delete Account
0     Abraham Hansen    Abraham Hansen                            
1  Demetrius McMahon        Abe Oliver                            
2     Hilary Emerson   Hillary Emerson                            
3   Amelia H. Hayden        DJ McMahon                            
4     Abraham Oliver  Amelia H. Hayden    

Instead of a loop, use vectorized operations...

info.loc[info['col1'] != info['col2'], 'Need Account'] = info['col1']
info.loc[info['col2'] != info['col1'], 'Delete Account'] = info['col2']

print(info)

               col1              col2       Need Account    Delete Account
0     Abraham Hansen    Abraham Hansen                                     
1  Demetrius McMahon        Abe Oliver  Demetrius McMahon        Abe Oliver
2     Hilary Emerson   Hillary Emerson     Hilary Emerson   Hillary Emerson
3   Amelia H. Hayden        DJ McMahon   Amelia H. Hayden        DJ McMahon
4     Abraham Oliver  Amelia H. Hayden     Abraham Oliver  Amelia H. Hayden
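
Note that the comparison above is row by row, so a name that appears in both columns but on different rows (Amelia H. Hayden here) is still flagged. If what matters is whether the name appears anywhere in the other column, a minimal sketch with the same .loc pattern could use isin instead. The mask names missing_from_col2 and missing_from_col1 are made up for this illustration:

```python
import pandas as pd

nd = {'col1': ["Abraham Hansen", "Demetrius McMahon", "Hilary Emerson",
               "Amelia H. Hayden", "Abraham Oliver"],
      'col2': ["Abraham Hansen", "Abe Oliver", "Hillary Emerson",
               "DJ McMahon", "Amelia H. Hayden"]}
info = pd.DataFrame(data=nd)

# True where the name in one column appears nowhere in the other column
# (hypothetical helper names, not from the original answer)
missing_from_col2 = ~info['col1'].isin(info['col2'])
missing_from_col1 = ~info['col2'].isin(info['col1'])

# Start with empty strings, then fill only the flagged rows
info['Need Account'] = ''
info['Delete Account'] = ''
info.loc[missing_from_col2, 'Need Account'] = info.loc[missing_from_col2, 'col1']
info.loc[missing_from_col1, 'Delete Account'] = info.loc[missing_from_col1, 'col2']

print(info)
```

With the sample data, Amelia H. Hayden and Abraham Hansen are left blank in both new columns, because each appears somewhere in both lists.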

Answer 1 (score: 0)

You want to use isin and np.where to conditionally assign new values:

import numpy as np

# Keep the name where it is absent from the other column, otherwise NaN
info['Need Account'] = np.where(~info['col1'].isin(info['col2']), info['col1'], np.nan)
info['Delete Account'] = np.where(~info['col2'].isin(info['col1']), info['col2'], np.nan)

                col1              col2       Need Account   Delete Account
0     Abraham Hansen    Abraham Hansen                NaN              NaN
1  Demetrius McMahon        Abe Oliver  Demetrius McMahon       Abe Oliver
2     Hilary Emerson   Hillary Emerson     Hilary Emerson  Hillary Emerson
3   Amelia H. Hayden        DJ McMahon                NaN       DJ McMahon
4     Abraham Oliver  Amelia H. Hayden     Abraham Oliver              NaN

Or, if you want a new dataframe, as stated in the question:

need = np.where(~info['col1'].isin(info['col2']), info['col1'], np.nan)
delete = np.where(~info['col2'].isin(info['col1']), info['col2'], np.nan)

newdf = pd.DataFrame({'Need Account':need,
                      'Delete Account':delete})

        Need Account   Delete Account
0                NaN              NaN
1  Demetrius McMahon       Abe Oliver
2     Hilary Emerson  Hillary Emerson
3                NaN       DJ McMahon
4     Abraham Oliver              NaN
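
The frame above still has NaN gaps, while the expected output in the question lists the names packed together. One possible way to get that shape, as a sketch only, is to filter each column with the isin mask, reset the index, and concatenate. The name df_out is borrowed from the question; the rest is an assumption about what is wanted:

```python
import pandas as pd

nd = {'col1': ["Abraham Hansen", "Demetrius McMahon", "Hilary Emerson",
               "Amelia H. Hayden", "Abraham Oliver"],
      'col2': ["Abraham Hansen", "Abe Oliver", "Hillary Emerson",
               "DJ McMahon", "Amelia H. Hayden"]}
info = pd.DataFrame(data=nd)

# Keep only the names missing from the other column, renumbered from 0
need = info.loc[~info['col1'].isin(info['col2']), 'col1'].reset_index(drop=True)
delete = info.loc[~info['col2'].isin(info['col1']), 'col2'].reset_index(drop=True)

# concat aligns on the fresh index; a shorter column would be padded with NaN
df_out = pd.concat([need.rename('Need Account'),
                    delete.rename('Delete Account')], axis=1)
print(df_out)
```

The rows keep the order in which the names appear in each original column, so the pairing across a row carries no meaning by itself.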

Answer 2 (score: 0)

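IIUC, it seems there isn't much "structure" from the input dataframe that needs to be maintained, so you can use sets to compare membership across the groups directly.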

nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
      'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"]}
df = pd.DataFrame(data=nd)

col1 = set(df['col1'])
col2 = set(df['col2'])

need = col1 - col2
delete = col2 - col1

print('need = ', need)
print('delete =  ', delete)

yields

need =  {'Hilary Emerson', 'Demetrius McMahon', 'Abraham Oliver'}
delete =   {'Hillary Emerson', 'DJ McMahon', 'Abe Oliver'}

You can then put that into a new dataframe:

data = {'need':list(need), 'delete':list(delete)}
new_df = pd.DataFrame.from_dict(data, orient='index').transpose()

(Edited to account for the possibility that need and delete have unequal lengths.)
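
Because sets are unordered, the row order of the resulting frame can differ from run to run. If a stable order matters, one option (a sketch, not the only way) is to sort each set and wrap it in a pd.Series, which also pads the shorter column with NaN when the lengths differ:

```python
import pandas as pd

nd = {'col1': ["Abraham Hansen", "Demetrius McMahon", "Hilary Emerson",
               "Amelia H. Hayden", "Abraham Oliver"],
      'col2': ["Abraham Hansen", "Abe Oliver", "Hillary Emerson",
               "DJ McMahon", "Amelia H. Hayden"]}
df = pd.DataFrame(data=nd)

# Set differences, sorted so the output order is deterministic
need = sorted(set(df['col1']) - set(df['col2']))
delete = sorted(set(df['col2']) - set(df['col1']))

# A dict of Series aligns on the index, so unequal lengths are padded with NaN
new_df = pd.DataFrame({'need': pd.Series(need), 'delete': pd.Series(delete)})
print(new_df)
```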