Compare 2 columns in different data frames on a primary-key condition, without merging

Time: 2018-10-23 06:00:49

Tags: python python-3.x pandas validation dataframe

I have 2 different data frames, for example:

Df1:

User_id    User_name     User_phn
1          Alex          1234123
2          Danny         4234123
3          Bryan         5234123

Df2:

User_id    User_name     User_phn
1          Alex          3234123
2          Chris         4234123
3          Bryan         5234123
4          Bexy          6234123

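User_id is the primary key in both tables. I need to compare the two data frames using User_id as the condition and get back which values match and which do not, without merging the data frames into a new one. We will be working with huge datasets of more than 100 million records, which is why I do not want to merge into a new data frame: I think that would consume the memory all over again.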

Expected result:

User_id    User_name     User_phn
1          Alex          Mismatch
2          Mismatch      4234123
3          Bryan         5234123
4          Mismatch      Mismatch

4 Answers:

Answer 0 (score: 1)

This is not easy, but it can be done by building a Series of tuples from each combination of columns and comparing them with isin:

s11 = pd.Series(list(map(tuple, Df1[['User_id','User_name']].values.tolist())))
s12 = pd.Series(list(map(tuple, Df2[['User_id','User_name']].values.tolist())))

s21 = pd.Series(list(map(tuple, Df1[['User_id','User_phn']].values.tolist())))
s22 = pd.Series(list(map(tuple, Df2[['User_id','User_phn']].values.tolist())))


Df2.loc[~s12.isin(s11), 'User_name'] = 'Mismatch'
Df2.loc[~s22.isin(s21), 'User_phn'] = 'Mismatch'

print (Df2)
   User_id User_name  User_phn
0        1      Alex  Mismatch
1        2  Mismatch   4234123
2        3     Bryan   5234123
3        4  Mismatch  Mismatch
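
The same key-plus-value comparison can be written as a loop over the non-key columns. The following is only a minimal sketch: the helper name mark_mismatches is hypothetical, it uses pd.MultiIndex.from_frame in place of the tuple Series, and it returns a copy of the second frame for clarity, so for very large data you may prefer to assign the columns in place as the answer does.

```python
import pandas as pd

df1 = pd.DataFrame({
    "User_id": [1, 2, 3],
    "User_name": ["Alex", "Danny", "Bryan"],
    "User_phn": [1234123, 4234123, 5234123],
})
df2 = pd.DataFrame({
    "User_id": [1, 2, 3, 4],
    "User_name": ["Alex", "Chris", "Bryan", "Bexy"],
    "User_phn": [3234123, 4234123, 5234123, 6234123],
})


def mark_mismatches(left, right, key="User_id", label="Mismatch"):
    """Hypothetical helper: copy of `right` with unmatched cells replaced by `label`."""
    result = right.copy()
    for col in right.columns.drop(key):
        # (key, value) pairs of each frame, held as MultiIndex objects.
        pairs_left = pd.MultiIndex.from_frame(left[[key, col]])
        pairs_right = pd.MultiIndex.from_frame(right[[key, col]])
        # True where the (key, value) pair of `right` also occurs in `left`.
        matched = pairs_right.isin(pairs_left)
        # Cast to object so the string label can sit next to numeric values.
        result[col] = right[col].astype(object).where(matched, label)
    return result


result = mark_mismatches(df1, df2)
print(result)
```

The cast to object is what lets a numeric column such as User_phn hold the string label; it also means the column loses its compact integer storage.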

An alternative solution with merge, testing the non-matching pairs (missing values) with isna:

s1 = Df2.merge(Df1, how='left', on=['User_id','User_name'], suffixes=('_',''))['User_phn']
print (s1)
0    1234123.0
1          NaN
2    5234123.0
3          NaN
Name: User_phn, dtype: float64

s2 = Df2.merge(Df1, how='left', on=['User_id','User_phn'], suffixes=('_',''))['User_name']
print (s2)
0      NaN
1    Danny
2    Bryan
3      NaN
Name: User_name, dtype: object

Df2.loc[s1.isna(), 'User_name'] = 'Mismatch'
Df2.loc[s2.isna(), 'User_phn'] = 'Mismatch'

print (Df2)
   User_id User_name  User_phn
0        1      Alex  Mismatch
1        2  Mismatch   4234123
2        3     Bryan   5234123
3        4  Mismatch  Mismatch

Answer 1 (score: 1)

You can also use Series map. Try this approach:

df2.set_index('fruit', inplace=True)
mask = df1.fruit.isin(df2.index)
df1.loc[mask, 'type'] = df2.loc[df1.loc[mask, 'fruit'], 'type'].values

    fruit  name type
0   apple  anna    B
1  banana  lisa    B
2  orange   red    A
3    pine   tin    A
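
As a minimal sketch of what a map-based comparison could look like on the question's own data (an illustration, not the answerer's original code): each column of Df1 is turned into a lookup Series keyed by User_id, and the values of Df2 are compared against what that lookup returns. It assumes User_id is unique in Df1, as a primary key should be; the variable names are made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "User_id": [1, 2, 3],
    "User_name": ["Alex", "Danny", "Bryan"],
    "User_phn": [1234123, 4234123, 5234123],
})
df2 = pd.DataFrame({
    "User_id": [1, 2, 3, 4],
    "User_name": ["Alex", "Chris", "Bryan", "Bexy"],
    "User_phn": [3234123, 4234123, 5234123, 6234123],
})

for col in ["User_name", "User_phn"]:
    # Lookup Series: User_id -> value in df1 (assumes User_id is unique in df1).
    lookup = df1.set_index("User_id")[col]
    # Value df1 holds for each User_id of df2; NaN when the id is absent from df1.
    expected = df2["User_id"].map(lookup)
    # NaN never compares equal, so ids missing from df1 count as mismatches.
    mismatch = df2[col].ne(expected)
    # Cast to object so the string label can sit next to numeric values.
    df2[col] = df2[col].astype(object).where(~mismatch, "Mismatch")

print(df2)
```

Only one lookup Series and one temporary column exist at a time, so no merged frame is built, although each temporary still has as many rows as Df2.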

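Answer 2 (score: 0)

I wrote some code that solves the problem in a very simple way. I just compare the two data frames row by row and append each resulting row to a result data frame. Let me know whether this works.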

import pandas as pd

data = [[1,'Alex','1234123'],[2,'Danny','4234123'],[3,'Bryan','5234123']]
df = pd.DataFrame(data, columns=['User_id','User_name','User_phn'])
print(df)
data = [[1,'Alex','3234123'],[2,'Chris','4234123'],[3,'Bryan','5234123'],[4,'Bexy','6234123']]
df_2 = pd.DataFrame(data, columns=['User_id','User_name','User_phn'])
print(df_2)

# Iterate up to the length of the longer frame.
l = max(len(df.index), len(df_2.index))
df_res = pd.DataFrame(columns=['User_id','User_name','User_phn'])
# as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
df_mat = df.to_numpy()
df_2_mat = df_2.to_numpy()
for i in range(l):
    try:
        arr = [df_mat[i][0]]
        # Compare User_name and User_phn position by position.
        for k in range(1, 3):
            if df_mat[i][k] == df_2_mat[i][k]:
                arr.append(df_mat[i][k])
            else:
                arr.append("Mismatch")
        df_res.loc[i] = arr
    except IndexError:
        # Row i is missing from one of the frames.
        df_res.loc[i] = [i+1, "Mismatch", "Mismatch"]
print(df_res)
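
The loop above pairs rows by position, so it gives the expected result only when both frames list the users in the same order. As a hedged sketch of a key-based variant, the rows can be aligned on User_id with reindex before comparing. The shuffled df_2 below is made up for the illustration, and the sketch assumes User_id is unique in df. Note that reindex builds an intermediate frame the size of df_2, which may matter for very large data.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, "Alex", "1234123"], [2, "Danny", "4234123"], [3, "Bryan", "5234123"]],
    columns=["User_id", "User_name", "User_phn"],
)
# Rows deliberately shuffled to show that matching follows User_id, not position.
df_2 = pd.DataFrame(
    [[4, "Bexy", "6234123"], [2, "Chris", "4234123"],
     [1, "Alex", "3234123"], [3, "Bryan", "5234123"]],
    columns=["User_id", "User_name", "User_phn"],
)

left = df.set_index("User_id")
right = df_2.set_index("User_id")

# Rows of df reordered to follow df_2; ids missing from df become all-NaN rows.
aligned = left.reindex(right.index)
# Element-wise equality; NaN never compares equal, so missing ids give False.
matches = right.eq(aligned)
# Keep matching cells, label everything else.
df_res = right.where(matches, "Mismatch").reset_index()
print(df_res)
```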

Answer 3 (score: 0)

Hi Narayana Kandukuri,

I think my code may be simpler, take a look.

import pandas as pd

df1 = pd.DataFrame([[1,'Alex',1234123],[2,'Danny',4234123],[3,'Bryan',5234123]],columns=['User_id','User_name','User_phn'])
df2 = pd.DataFrame([[1,'Alex',3234123],[2,'Chris',4234123],[3,'Bryan',5234123],[4,'Bexy',6234123]],columns=['User_id','User_name','User_phn'])

temp = df2[['User_id']]  # Save the User_id column for later use.
Bool_Data = (df1 == df2[:df1.shape[0]])  # Boolean frame: True where the cells match, row by row.
df2 = df2[Bool_Data].fillna('mismatch')  # Keep the matching cells; everything else becomes 'mismatch'.
df2['User_id'] = temp['User_id']  # Restore the saved User_id column.

df2 =
   User_id User_name     User_phn
0        1      Alex     mismatch
1        2  mismatch     4234123
2        3     Bryan     5234123
3        4  mismatch     mismatch