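I am working on a project that audits employees who have computer accounts. I want to print a dataframe with two new columns. This is different from the "comparing columns in a dataframe" questions because I am working with strings. I will also need some fuzzy logic later, but that is still a long way off.
The data I receive is in Excel sheets. It comes from two sources I have no control over, so I formatted it as [First Name, Last Name] and printed it to the console to make sure the data I am working with is correct. I converted the .xls to a .csv file, formatted the information, and was able to output both lists of names in a single dataframe with two columns, but I cannot get the values I want into the last two columns. I have tried query (which returns True/False, not names), diff, and regex. I suspect I am not using the right tool.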
import pandas as pd
nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
      'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"]}
info = pd.DataFrame(data=nd)
for row in info:
    if info.col1.value not in info.col2:
        info["Need Account"] = info.col1.value
    if info.col2.value not in info.col1:
        info["Delete Account"] = info.col2.value
print(info)
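I want a new dataframe with two columns, "Need Account" and "Delete Account", filled with the appropriate values based on the other columns in the dataframe. As written, I get an error saying that 'Series' has no attribute 'value'. Here is an example of my expected output: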
df_out:
Need Account         Delete Account
Demetrius McMahon    Abe Oliver
Abraham Oliver       Hillary Emerson
Hilary Emerson       DJ McMahon
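From that list I can see which entries are nicknames, and then prune the list from there.
Answer 0 (score: 0)
I missed your expected output at first and went by what you were trying to do in your code. Let me know if this is what you are looking for.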
import pandas as pd
nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
      'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"],
      'Need Account': "",
      'Delete Account': ""
      }
info = pd.DataFrame(data=nd)
print(info)
col1 col2 Need Account Delete Account
0 Abraham Hansen Abraham Hansen
1 Demetrius McMahon Abe Oliver
2 Hilary Emerson Hillary Emerson
3 Amelia H. Hayden DJ McMahon
4 Abraham Oliver Amelia H. Hayden
No loop is needed; use vectorized operations instead:
info.loc[info['col1'] != info['col2'], 'Need Account'] = info['col1']
info.loc[info['col2'] != info['col1'], 'Delete Account'] = info['col2']
print(info)
col1 col2 Need Account Delete Account
0 Abraham Hansen Abraham Hansen
1 Demetrius McMahon Abe Oliver Demetrius McMahon Abe Oliver
2 Hilary Emerson Hillary Emerson Hilary Emerson Hillary Emerson
3 Amelia H. Hayden DJ McMahon Amelia H. Hayden DJ McMahon
4 Abraham Oliver Amelia H. Hayden Abraham Oliver Amelia H. Hayden
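One thing to watch with the `!=` comparison above: it compares the two columns position by position, so a name that appears in both columns but on different rows (Amelia H. Hayden here) still gets flagged. If what matters is whether a name appears anywhere in the other column, a mask built with `Series.isin` may be closer to the intent. A minimal sketch contrasting the two masks on the question's data (the names `rowwise`, `membership`, and `flagged_only_rowwise` are mine, not from the original code):

```python
import pandas as pd

info = pd.DataFrame({
    'col1': ["Abraham Hansen", "Demetrius McMahon", "Hilary Emerson",
             "Amelia H. Hayden", "Abraham Oliver"],
    'col2': ["Abraham Hansen", "Abe Oliver", "Hillary Emerson",
             "DJ McMahon", "Amelia H. Hayden"],
})

# Position-by-position comparison: True wherever the two cells on the same row differ
rowwise = info['col1'] != info['col2']

# Membership test: True where the col1 name appears nowhere in col2
membership = ~info['col1'].isin(info['col2'])

# Names flagged by the row-wise mask even though they do exist somewhere in col2
flagged_only_rowwise = info.loc[rowwise & ~membership, 'col1'].tolist()
print(flagged_only_rowwise)  # ['Amelia H. Hayden']
```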
Answer 1 (score: 0)
import numpy as np
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
info['Need Account'] = np.where(~info['col1'].isin(info['col2']), info['col1'], np.nan)
info['Delete Account'] = np.where(~info['col2'].isin(info['col1']), info['col2'], np.nan)
col1 col2 Need Account Delete Account
0 Abraham Hansen Abraham Hansen NaN NaN
1 Demetrius McMahon Abe Oliver Demetrius McMahon Abe Oliver
2 Hilary Emerson Hillary Emerson Hilary Emerson Hillary Emerson
3 Amelia H. Hayden DJ McMahon NaN DJ McMahon
4 Abraham Oliver Amelia H. Hayden Abraham Oliver NaN
Or, if you want a new dataframe as described in the question:
need = np.where(~info['col1'].isin(info['col2']), info['col1'], np.nan)
delete = np.where(~info['col2'].isin(info['col1']), info['col2'], np.nan)
newdf = pd.DataFrame({'Need Account': need,
                      'Delete Account': delete})
Need Account Delete Account
0 NaN NaN
1 Demetrius McMahon Abe Oliver
2 Hilary Emerson Hillary Emerson
3 NaN DJ McMahon
4 Abraham Oliver NaN
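The question's expected output has no blank rows. One way to get that shape, sketched here under the assumption that the two result columns do not need to stay aligned with the original rows, is to filter each column separately and place the results side by side. The name `df_out` is borrowed from the question; `need` and `delete` hold filtered Series in this sketch rather than the arrays used above.

```python
import pandas as pd

info = pd.DataFrame({
    'col1': ["Abraham Hansen", "Demetrius McMahon", "Hilary Emerson",
             "Amelia H. Hayden", "Abraham Oliver"],
    'col2': ["Abraham Hansen", "Abe Oliver", "Hillary Emerson",
             "DJ McMahon", "Amelia H. Hayden"],
})

# Keep only the names missing from the other column, then renumber from 0
need = info.loc[~info['col1'].isin(info['col2']), 'col1'].reset_index(drop=True)
delete = info.loc[~info['col2'].isin(info['col1']), 'col2'].reset_index(drop=True)

# concat aligns on the fresh 0..n-1 index; a shorter column would be padded with NaN
df_out = pd.concat(
    [need.rename('Need Account'), delete.rename('Delete Account')],
    axis=1,
)
print(df_out)
```

The rows in `df_out` are only paired by position, so a name in "Need Account" is not necessarily related to the name next to it in "Delete Account".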
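Answer 2 (score: 0)
IIUC, it doesn't look like much of the "structure" of the input dataframe needs to be preserved, so you can use sets to compare membership between the two groups directly.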
nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"]}
df = pd.DataFrame(data=nd)
col1 = set(df['col1'])
col2 = set(df['col2'])
need = col1 - col2
delete = col2 - col1
print('need = ', need)
print('delete = ', delete)
This yields:
need = {'Hilary Emerson', 'Demetrius McMahon', 'Abraham Oliver'}
delete = {'Hillary Emerson', 'DJ McMahon', 'Abe Oliver'}
You can then put these into a new dataframe:
data = {'need':list(need), 'delete':list(delete)}
new_df = pd.DataFrame.from_dict(data, orient='index').transpose()
(Edited to account for the possibility that need and delete have unequal lengths.)
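To illustrate why the unequal-length case matters, here is a small sketch in which the `delete` set is deliberately shortened (a hypothetical input, not the question's actual result). Passing such a dict straight to `pd.DataFrame` raises a `ValueError`, while the `from_dict(..., orient='index').transpose()` route pads the shorter column with missing values. `sorted` is used instead of `list` only to make the row order predictable, since sets are unordered; `direct_ok` is a name introduced for this example.

```python
import pandas as pd

need = {'Hilary Emerson', 'Demetrius McMahon', 'Abraham Oliver'}
delete = {'Abe Oliver', 'DJ McMahon'}  # shortened on purpose: 2 names vs 3

data = {'need': sorted(need), 'delete': sorted(delete)}

# The plain constructor rejects columns of different lengths
try:
    pd.DataFrame(data)
    direct_ok = True
except ValueError:
    direct_ok = False

# Building row-wise and transposing pads the shorter column instead
new_df = pd.DataFrame.from_dict(data, orient='index').transpose()

print(direct_ok)
print(new_df)
```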