I am working with Python pandas DataFrames and am having trouble comparing two of them.
DataFrame columns:
A => string (name)
B => timestamp
C => timestamp
D => int
E => float
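import numpy as np
import pandas as pd

cols = ['A', 'B', 'C', 'D', 'E']
# transform is a converter function defined elsewhere (not shown here).
# dtype must be a single dict. Note that when a column has both a converter
# and a dtype, pandas uses only the converter and emits a ParserWarning.
df1 = pd.read_csv(file1, sep=',', header=None, names=cols,
                  usecols=['A', 'B', 'C', 'D', 'E'],
                  converters={'B': transform, 'C': transform, 'D': transform, 'E': transform},
                  dtype={'B': np.float64, 'C': np.float64, 'D': np.float64, 'E': np.float64})
df2 = pd.read_csv(file2, sep=',', header=None, names=cols,
                  usecols=['A', 'B', 'C', 'D', 'E'],
                  converters={'B': transform, 'C': transform, 'D': transform, 'E': transform},
                  dtype={'B': np.float64, 'C': np.float64, 'D': np.float64, 'E': np.float64})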
where df1 is:
   A         B         C    D     E
0  g  08:15:32  08:12:12  100  11.5
1  g  08:17:45  08:12:12  101  11.3
2  g  08:25:22  08:12:12  102  11.4
3  m  08:36:15  08:30:15  250  17.5
4  m  08:45:14  08:30:15  250  17.6
and df2 is:
   A         B         C    D     E
0  g  08:15:15  08:12:12  105  11.5
1  m  08:37:07  08:30:15  200  17.3
2  m  08:38:13  08:30:15  250  17.6
3  m  08:45:12  08:46:14  200  23.4
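For anyone who wants to reproduce this without the CSV files, the two frames can be built inline. This is only a convenience sketch: the time columns are kept as plain strings here, which is an assumption, because the real files go through the transform converter that is not shown.

```python
import pandas as pd

cols = ['A', 'B', 'C', 'D', 'E']

# Sample data copied from the df1 listing; times kept as strings (assumption).
df1 = pd.DataFrame([
    ['g', '08:15:32', '08:12:12', 100, 11.5],
    ['g', '08:17:45', '08:12:12', 101, 11.3],
    ['g', '08:25:22', '08:12:12', 102, 11.4],
    ['m', '08:36:15', '08:30:15', 250, 17.5],
    ['m', '08:45:14', '08:30:15', 250, 17.6],
], columns=cols)

# Sample data copied from the df2 listing.
df2 = pd.DataFrame([
    ['g', '08:15:15', '08:12:12', 105, 11.5],
    ['m', '08:37:07', '08:30:15', 200, 17.3],
    ['m', '08:38:13', '08:30:15', 250, 17.6],
    ['m', '08:45:12', '08:46:14', 200, 23.4],
], columns=cols)

print(df1)
print(df2)
```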
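I want to compare the two DataFrames on all rows that match on the key ['A', 'C'], without dropping any duplicates, because I want to know which DataFrame may have extra records. So my result DataFrame would be: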
DF12
      A         B         C    D     E   df  difference     diff-D        diff-E
0     g  08:15:32  08:12:12  100  11.5  df1           Y  100-->105           NaN
0     g  08:15:15  08:12:12  105  11.5  df2           Y  105-->100           NaN
1     g  08:17:45  08:12:12  101  11.3  df1
NaN NaN       NaN       NaN  NaN   NaN  df2     missing        NaN           NaN
2     g  08:25:22  08:12:12  102  11.4  df1
NaN NaN       NaN       NaN  NaN   NaN  df2     missing        NaN           NaN
3     m  08:36:15  08:30:15  250  17.5  df1           Y  250-->200   17.5-->17.3
1     m  08:37:07  08:30:15  200  17.3  df2           Y  200-->250   17.3-->17.5
4     m  08:45:14  08:30:15  250  17.6  df1           N
2     m  08:38:13  08:30:15  250  17.6  df2           N
NaN NaN       NaN       NaN  NaN   NaN  df1     missing        NaN           NaN
3     m  08:45:12  08:46:14  200  23.4  df2
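One direction that might get close to this is sketched below; it is a rough sketch, not a confirmed solution. It assumes that duplicate rows should be paired in order of appearance within each ['A', 'C'] group, which the desired output seems to imply. The helper with_occurrence and the names occ and paired are made up for illustration, and the time columns are plain strings here. The result is a wide table (one row per pair, with _df1 and _df2 columns) rather than the interleaved layout shown above.

```python
import pandas as pd

cols = ['A', 'B', 'C', 'D', 'E']
df1 = pd.DataFrame([
    ['g', '08:15:32', '08:12:12', 100, 11.5],
    ['g', '08:17:45', '08:12:12', 101, 11.3],
    ['g', '08:25:22', '08:12:12', 102, 11.4],
    ['m', '08:36:15', '08:30:15', 250, 17.5],
    ['m', '08:45:14', '08:30:15', 250, 17.6],
], columns=cols)
df2 = pd.DataFrame([
    ['g', '08:15:15', '08:12:12', 105, 11.5],
    ['m', '08:37:07', '08:30:15', 200, 17.3],
    ['m', '08:38:13', '08:30:15', 250, 17.6],
    ['m', '08:45:12', '08:46:14', 200, 23.4],
], columns=cols)

key = ['A', 'C']

def with_occurrence(df):
    """Return a copy with 'occ': the row's position (0, 1, 2, ...) within its key group."""
    out = df.copy()
    out['occ'] = out.groupby(key).cumcount()
    return out

# Outer merge on key + occurrence number, so the n-th duplicate in df1 is
# paired with the n-th duplicate in df2 and no row is dropped.
paired = with_occurrence(df1).merge(
    with_occurrence(df2),
    on=key + ['occ'],
    how='outer',
    suffixes=('_df1', '_df2'),
    indicator=True,  # adds '_merge': 'both', 'left_only' or 'right_only'
)

# Rows found on only one side are 'missing' in the other frame; rows found
# on both sides are 'Y' if D or E differ, otherwise 'N'.
both = paired['_merge'] == 'both'
changed = (paired['D_df1'] != paired['D_df2']) | (paired['E_df1'] != paired['E_df2'])
paired['difference'] = 'missing'
paired.loc[both & changed, 'difference'] = 'Y'
paired.loc[both & ~changed, 'difference'] = 'N'

print(paired)
```

The _merge column tells which frame has the extra record: left_only means the row exists only in df1, right_only only in df2.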