Merge 2 data frames with different column names to show the common elements and the differences between them

Time: 2019-06-03 11:52:28

Tags: python-3.x pandas

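I have a script with two connections to two different databases. I need to compare the query results and display the common elements as well as the differences between the results.

I have a function that compares the data frames and should return the differences and the common elements, but it raises an error. I think this is because the column names in the two queries differ.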

def compare(a, b):
    if a.equals(b):
        print("SAME!")
    else:
        df = a.merge(b, how='outer', indicator=True)
        x = df.loc[df['_merge'] == 'both', 'm.id']
        y = df.loc[df['_merge'] == 'left_only', 'm.id']
        z = df.loc[df['_merge'] == 'right_only', 'm.id']
        print(f'Display Common Elements contained in Neo4j and MySQL: {", ".join(x)}')
        print(f'Elements found only in Neo4j: {", ".join(y)}')
        print(f'Elements found only in MySQL: {", ".join(z)}')

I expect:

Common elements: C0012345
Elements found only in Neo4j: C027415, C189274
Elements found only in MySQL: C086356, C098876
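
A minimal sketch of one likely cause, assuming the two query results share no column name (the question does not show the traceback, so this is a guess). When merge is called without on, left_on, or right_on, pandas joins on the columns common to both frames and raises a MergeError if there are none. The frame names and the "id" column below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the two query results; the column names are assumptions.
neo4j_df = pd.DataFrame({"m.id": ["C0012345", "C027415", "C189274"]})
mysql_df = pd.DataFrame({"id": ["C0012345", "C086356", "C098876"]})

try:
    neo4j_df.merge(mysql_df, how="outer", indicator=True)
    merge_error = None
except pd.errors.MergeError as exc:
    # The frames share no column name, so pandas cannot infer the join key.
    merge_error = exc

print(type(merge_error).__name__, merge_error)
```

Passing left_on and right_on explicitly tells pandas which column to use on each side and avoids the error.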

3 Answers:

Answer 0 (score: 2)

This works:

import pandas as pd

df1 = pd.DataFrame({"a": ["1", "2", "3", "4", "5", "6", "7"]})
df2 = pd.DataFrame({"b": ["1", "3", "2", "9", "11", "23", "4"]})

def compare(df1, df2):
    # Outer merge on the differently named key columns;
    # a value with no match gets NaN in the other frame's column.
    result = pd.merge(df1, df2, how='outer', left_on='a', right_on='b')
    missing_from_a = result.loc[result.a.isna(), 'b']  # only in df2
    missing_from_b = result.loc[result.b.isna(), 'a']  # only in df1
    have_both = result.loc[result.a.notna() & result.b.notna(), 'a']
    print(", ".join(missing_from_b))  # values only in df1
    print(", ".join(missing_from_a))  # values only in df2
    print(", ".join(have_both))       # values in both
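
The same merge can keep the indicator=True column from the question, as long as the key column of each frame is named explicitly. The following is a sketch under assumptions, not the asker's actual code: compare_ids, neo4j_df, mysql_df, and the "id" column name are hypothetical, and the two key columns are assumed to have different names.

```python
import pandas as pd

def compare_ids(neo4j_df, mysql_df, neo4j_col="m.id", mysql_col="id"):
    """Return (common, only_neo4j, only_mysql) as lists of IDs."""
    # Keep only the key columns and drop duplicates so each ID appears once.
    left = neo4j_df[[neo4j_col]].drop_duplicates()
    right = mysql_df[[mysql_col]].drop_duplicates()
    merged = left.merge(
        right, how="outer", left_on=neo4j_col, right_on=mysql_col, indicator=True
    )
    common = merged.loc[merged["_merge"] == "both", neo4j_col].tolist()
    only_neo4j = merged.loc[merged["_merge"] == "left_only", neo4j_col].tolist()
    # Rows found only on the right have NaN in the left key, so read the right key.
    only_mysql = merged.loc[merged["_merge"] == "right_only", mysql_col].tolist()
    return common, only_neo4j, only_mysql

# Sample data built from the expected output in the question.
neo4j_df = pd.DataFrame({"m.id": ["C0012345", "C027415", "C189274"]})
mysql_df = pd.DataFrame({"id": ["C0012345", "C086356", "C098876"]})

common, only_neo4j, only_mysql = compare_ids(neo4j_df, mysql_df)
print(f"Common elements: {', '.join(common)}")
print(f"Elements found only in Neo4j: {', '.join(sorted(only_neo4j))}")
print(f"Elements found only in MySQL: {', '.join(sorted(only_mysql))}")
```

The right_only values are read from the MySQL column because the Neo4j column holds NaN for those rows, which is also why joining 'm.id' directly fails for them.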

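Answer 1 (score: 2)

Besides the merge approach that @Anna Semjen described above, you can also use the isin() method to check which values are present in the other data frame: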

import pandas as pd

df1 = pd.DataFrame({0: ["1", "2", "3", "4", "5", "6", "7"]})  # as MySQL
df2 = pd.DataFrame({"m.id": ["1", "3", "2", "9", "11", "23", "4"]})  # as Neo4j

# isin() returns a boolean mask; ~ negates it to select the values missing from the other frame.
print('Elements found only in MySQL: ' + ','.join(df1.loc[~df1[0].isin(df2['m.id']), 0]))
print('Elements found only in Neo4j: ' + ','.join(df2.loc[~df2['m.id'].isin(df1[0]), 'm.id']))
print('Elements found in both Neo4j & MySQL: ' + ','.join(df1.loc[df1[0].isin(df2['m.id']), 0]))

Output:

Elements found only in MySQL: 5,6,7
Elements found only in Neo4j: 9,11,23
Elements found in both Neo4j & MySQL: 1,2,3,4

Hope this helps as a reference for another approach :)
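
If only the membership of the values matters, the same three groups can be computed with plain Python sets instead of isin(). This is a sketch of that alternative, not part of the answer above; the variable names are made up, and sorting with key=int assumes every ID is a numeric string as in this sample data.

```python
import pandas as pd

df1 = pd.DataFrame({0: ["1", "2", "3", "4", "5", "6", "7"]})  # as MySQL
df2 = pd.DataFrame({"m.id": ["1", "3", "2", "9", "11", "23", "4"]})  # as Neo4j

mysql_ids = set(df1[0])
neo4j_ids = set(df2["m.id"])

# Set difference and intersection; sorted numerically because sets are unordered.
only_mysql = sorted(mysql_ids - neo4j_ids, key=int)
only_neo4j = sorted(neo4j_ids - mysql_ids, key=int)
in_both = sorted(mysql_ids & neo4j_ids, key=int)

print("Elements found only in MySQL: " + ",".join(only_mysql))
print("Elements found only in Neo4j: " + ",".join(only_neo4j))
print("Elements found in both Neo4j & MySQL: " + ",".join(in_both))
```

Unlike isin(), sets drop duplicate values and do not preserve the original row order.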

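Answer 2 (score: 0)

Update for pandas 1.1.0+

pandas provides a pd.DataFrame.compare method:

We can compare data frames that have the same index and column labels like this, using df1 and df2 from the first answer. Note that the comparison is row by row, so it reports values that differ at the same index position, not set membership: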

df1.compare(df2.rename(columns={'b':'a'}))

Output:

     a      
  self other
1    2     3
2    3     2
3    4     9
4    5    11
5    6    23
6    7     4

Or we can use pd.Series.compare like this:

df1['a'].compare(df2['b'])

Output:

  self other
1    2     3
2    3     2
3    4     9
4    5    11
5    6    23
6    7     4
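
Two details of compare are worth a short sketch. It raises a ValueError when the two objects do not have identical labels, so it does not suit query results of different lengths, and the keep_shape and keep_equal parameters keep the matching rows in the result. The frame named longer is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"a": ["1", "2", "3", "4", "5", "6", "7"]})
df2 = pd.DataFrame({"b": ["1", "3", "2", "9", "11", "23", "4"]})

# Hypothetical frame with one extra row, so its index no longer matches df1.
longer = pd.DataFrame({"a": ["1", "2", "3", "4", "5", "6", "7", "8"]})

try:
    df1.compare(longer)
    labels_error = None
except ValueError as exc:
    # compare only accepts identically labeled objects.
    labels_error = exc

# Default: only the rows that differ are returned.
diff = df1["a"].compare(df2["b"])

# keep_shape=True keeps every row; keep_equal=True shows equal values instead of NaN.
full = df1["a"].compare(df2["b"], keep_shape=True, keep_equal=True)

print(labels_error)
print(len(diff), len(full))
```

For finding which IDs exist in only one database, the merge or isin() approaches in the earlier answers are a better fit, since compare works by position.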