I have two CSV files:
CSV1:
"Hypervisor","IP","ABCD","Operating System","Domain","Memory","No. CPU","Availability (%)","Last Collection Time","lol"
"lglac125.lss.com","10.247.52.125","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599031E9"
"lglac126.lss.com","10.247.52.126","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9"
"lglac127.lss.com","10.247.52.127","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","0.0","1.558599031E9"
"lglac128.lss.com","10.247.52.128","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9"
"lglac129.lss.com","10.247.52.129","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9"
CSV2:
"Hypervisor","IP","Arrays","Operating System","Domain","Memory","No. CPU","Availability (%)","Last Collection Time","DummyColumn"
"lglac125.lss.com","10.247.52.125",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599031E9","A"
"lglac126.lss.com","10.247.52.126",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","B"
"lglac127.lss.com","10.247.52.127",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","0.0","1.558599031E9","C"
"lglac128.lss.com","10.247.52.128",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","D"
"lglac129.lss.com","10.247.52.129",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","E"
"DummyRow","10.247.52.129",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","F"
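I am trying to compare all entries of each column (if the column is available in csv2) against the corresponding row. If any entry is missing or changed, I need to raise a flag. Columns may be added to or removed from either file, so I first need to check whether column x exists in csv2 and then match the entries of the same column in csv1.
I have been struggling with this for three days and cannot solve it. I would really appreciate your help.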
Answer 0 (score: 2)
You can try merge with indicator=True, then use query() to drop the rows whose _merge value is both:
matching_cols=df1.columns.intersection(df2.columns).tolist() #find matching columns to merge
df1.merge(df2,on=matching_cols,how='outer',indicator=True).query("_merge!='both'")
This shows you the rows that are not common to both dataframes:
Hypervisor IP Operating System \
0 lglac125.lss.emc.com 10.247.52.125 VMware ESXi 5.5.0 build-9919047
5 lglac125.lss.emc.com VMware ESXi 5.5.0 build-9919047
6 DummyRow 10.247.52.129 VMware ESXi 5.5.0 build-9919047
Domain Memory No. CPU Availability (%) Last Collection Time \
0 lss.emc.com 524278.03125 4.0 100.0 1.558599e+09
5 lss.emc.com 524278.03125 4.0 100.0 1.558599e+09
6 lss.emc.com 524278.03125 4.0 100.0 1.558600e+09
Arrays DummyColumn _merge
0 NaN NaN left_only
5 NaN A right_only
6 NaN F right_only
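The merge approach above can be sketched end to end as follows. This is only an illustration: the two small frames are made-up stand-ins for csv1 and csv2 (hypothetical hostnames and IPs, not the data from the question), and the boolean named flag is an assumed way of "raising a flag". Unlike the snippet above, the sketch selects only the shared columns before merging, so columns that exist in just one file do not appear in the result.

```python
import pandas as pd

# Hypothetical stand-ins for csv1 and csv2 (not the data from the question).
df1 = pd.DataFrame({
    "Hypervisor": ["a", "b", "c"],
    "IP": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "lol": [None, None, None],            # column present only in df1
})
df2 = pd.DataFrame({
    "Hypervisor": ["a", "b", "DummyRow"],
    "IP": ["10.0.0.1", "10.0.0.9", "10.0.0.3"],
    "DummyColumn": ["A", "B", "F"],       # column present only in df2
})

# Columns that exist in both frames; only these are compared.
matching_cols = df1.columns.intersection(df2.columns).tolist()

# Outer merge on every shared column. The _merge column records whether a
# row was found in both frames, only in df1 (left_only) or only in df2
# (right_only).
merged = df1[matching_cols].merge(
    df2[matching_cols], on=matching_cols, how="outer", indicator=True
)

# Rows that are missing from, or changed in, one of the two frames.
diff = merged.query("_merge != 'both'")

# Assumed notion of "raising a flag": True if any difference was found.
flag = not diff.empty
print(diff)
```

A changed row shows up twice in diff, once as left_only with the csv1 values and once as right_only with the csv2 values, because the merge matches on the full set of shared columns rather than on a key such as Hypervisor.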
Answer 1 (score: 0)
IIUC,
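Assume csv1 and csv2 are imported into pandas as df1 and df2. Use intersection on the columns to find the matching columns and sort them. Pass the result to df1 and df2 to select those columns. Finally, call eq on df1 and df2: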
matched_list = df1.columns.intersection(df2.columns).sort_values()
df1_mask = df1[matched_list].eq(df2[matched_list])
Out[853]:
Availability (%) Domain Hypervisor IP Last Collection Time Memory \
0 True True True False True True
1 True True True True True True
2 True True True True True True
3 True True True True True True
4 True True True True True True
5 False False False False False False
No. CPU Operating System
0 True True
1 True True
2 True True
3 True True
4 True True
5 False False
Note: I changed df1.loc[0, 'IP'] to 10.247.52.124 so that one value in row 0 of df1 shows False, for demonstration.
You can plug this df1_mask into df1 to check for NaN. Any NaN is either an original NaN value or a value that differs between df1 and df2:
df1[df1_mask]
Out[854]:
Hypervisor IP Operating System Domain \
0 lglac125.lss.com NaN VMware ESXi 5.5.0 build-9919047 lss.com
1 lglac126.lss.com 10.247.52.126 VMware ESXi 5.5.0 build-9919047 lss.com
2 lglac127.lss.com 10.247.52.127 VMware ESXi 5.5.0 build-9919047 lss.com
3 lglac128.lss.com 10.247.52.128 VMware ESXi 5.5.0 build-9919047 lss.com
4 lglac129.lss.com 10.247.52.129 VMware ESXi 5.5.0 build-9919047 lss.com
Memory No. CPU Availability (%) Last Collection Time lol
0 524278.03125 4.0 100.0 1.558599e+09 NaN
1 524278.03125 4.0 100.0 1.558600e+09 NaN
2 524278.03125 4.0 0.0 1.558599e+09 NaN
3 524278.03125 4.0 100.0 1.558600e+09 NaN
4 524278.03125 4.0 100.0 1.558600e+09 NaN
Note: your df1 has the column lol but no values in it, so it was NaN to begin with.
Or you can check df2:
df2[df1_mask]
Out[855]:
Hypervisor IP Arrays Operating System \
0 lglac125.lss.com NaN NaN VMware ESXi 5.5.0 build-9919047
1 lglac126.lss.com 10.247.52.126 NaN VMware ESXi 5.5.0 build-9919047
2 lglac127.lss.com 10.247.52.127 NaN VMware ESXi 5.5.0 build-9919047
3 lglac128.lss.com 10.247.52.128 NaN VMware ESXi 5.5.0 build-9919047
4 lglac129.lss.com 10.247.52.129 NaN VMware ESXi 5.5.0 build-9919047
5 NaN NaN NaN NaN
Domain Memory No. CPU Availability (%) Last Collection Time \
0 lss.com 524278.03125 4.0 100.0 1.558599e+09
1 lss.com 524278.03125 4.0 100.0 1.558600e+09
2 lss.com 524278.03125 4.0 0.0 1.558599e+09
3 lss.com 524278.03125 4.0 100.0 1.558600e+09
4 lss.com 524278.03125 4.0 100.0 1.558600e+09
5 NaN NaN NaN NaN NaN
DummyColumn
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
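Because a NaN in df1[df1_mask] can be either an original NaN or a mismatch, it may help to separate the two cases. The sketch below is one possible way to do that, not part of the answer above: it treats a value that is missing on both sides as equal, then lists the row positions that changed and the rows that exist only in df2. The frames and the names shared, changed_rows and extra_rows are hypothetical, and rows are assumed to be matched by position, as in the eq example.

```python
import pandas as pd

# Hypothetical stand-ins for csv1 and csv2 (not the data from the question).
df1 = pd.DataFrame({
    "Hypervisor": ["h1", "h2", "h3"],
    "IP": ["10.0.0.1", "10.0.0.2", None],
    "lol": [None, None, None],
})
df2 = pd.DataFrame({
    "Hypervisor": ["h1", "h2", "h3", "DummyRow"],
    "IP": ["10.0.0.1", "10.0.0.9", None, "10.0.0.4"],
    "DummyColumn": ["A", "B", "C", "D"],
})

# Compare only the columns present in both frames.
shared = df1.columns.intersection(df2.columns)
left = df1[shared]
right = df2[shared].reindex(left.index)  # keep only the row labels df1 has

# eq() reports False when a value is missing, even if it is missing on both
# sides, so OR in the case where both values are missing.
equal = left.eq(right) | (left.isna() & right.isna())

# Row labels where at least one shared column differs.
changed_rows = equal.index[(~equal.all(axis=1)).to_numpy()].tolist()

# Row labels that exist in df2 but not in df1.
extra_rows = df2.index.difference(df1.index).tolist()

print("changed:", changed_rows)
print("only in df2:", extra_rows)
```

If the rows in the two files are not guaranteed to be in the same order, setting a key column such as Hypervisor as the index on both frames before comparing would align rows by name instead of by position.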