I have two different DataFrames, for example:
Df1:
User_id User_name User_phn
1 Alex 1234123
2 Danny 4234123
3 Bryan 5234123
Df2:
User_id User_name User_phn
1 Alex 3234123
2 Chris 4234123
3 Bryan 5234123
4 Bexy 6234123
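User_id is the primary key in both tables. I need to compare the two DataFrames on User_id and get a result that shows which values match and which do not, without merging the DataFrames into a new one. We will be processing a huge dataset of more than 100 million records, which is why I do not want to merge into a new DataFrame: I believe that would consume additional memory.
Expected result: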
User_id User_name User_phn
1 Alex Mismatch
2 Mismatch 4234123
3 Bryan 5234123
4 Mismatch Mismatch
Answer 0: (score: 1)
This is not easy, but it is possible: build a Series of tuples from each combination of columns and compare them with isin:
s11 = pd.Series(list(map(tuple, Df1[['User_id','User_name']].values.tolist())))
s12 = pd.Series(list(map(tuple, Df2[['User_id','User_name']].values.tolist())))
s21 = pd.Series(list(map(tuple, Df1[['User_id','User_phn']].values.tolist())))
s22 = pd.Series(list(map(tuple, Df2[['User_id','User_phn']].values.tolist())))
Df2.loc[~s12.isin(s11), 'User_name'] = 'Mismatch'
Df2.loc[~s22.isin(s21), 'User_phn'] = 'Mismatch'
print (Df2)
User_id User_name User_phn
0 1 Alex Mismatch
1 2 Mismatch 4234123
2 3 Bryan 5234123
3 4 Mismatch Mismatch
Alternatively, starting again from the original Df2, use merge with a left join and test the result for missing values:
s1 = Df2.merge(Df1, how='left', on=['User_id','User_name'], suffixes=('_',''))['User_phn']
print (s1)
0 1234123.0
1 NaN
2 5234123.0
3 NaN
Name: User_phn, dtype: float64
s2 = Df2.merge(Df1, how='left', on=['User_id','User_phn'], suffixes=('_',''))['User_name']
print (s2)
0 NaN
1 Danny
2 Bryan
3 NaN
Name: User_name, dtype: object
Df2.loc[s1.isna(), 'User_name'] = 'Mismatch'
Df2.loc[s2.isna(), 'User_phn'] = 'Mismatch'
print (Df2)
User_id User_name User_phn
0 1 Alex Mismatch
1 2 Mismatch 4234123
2 3 Bryan 5234123
3 4 Mismatch Mismatch
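The tuple-and-isin idea above can also be written with MultiIndex objects instead of Python lists of tuples. The following is only a sketch under some assumptions: the helper name `mismatch_mask` is hypothetical, the phone numbers are stored as strings so that writing 'Mismatch' does not change the column dtype, and both masks are computed before Df2 is modified. It still builds temporary key structures, so it is not free in memory.

```python
import pandas as pd

# Sample data from the question; phone numbers kept as strings (assumption)
Df1 = pd.DataFrame({
    "User_id": [1, 2, 3],
    "User_name": ["Alex", "Danny", "Bryan"],
    "User_phn": ["1234123", "4234123", "5234123"],
})
Df2 = pd.DataFrame({
    "User_id": [1, 2, 3, 4],
    "User_name": ["Alex", "Chris", "Bryan", "Bexy"],
    "User_phn": ["3234123", "4234123", "5234123", "6234123"],
})

def mismatch_mask(left, right, cols):
    """Hypothetical helper: True where the `cols` combination of a row in
    `left` has no identical combination anywhere in `right`."""
    left_keys = pd.MultiIndex.from_frame(left[cols])
    right_keys = pd.MultiIndex.from_frame(right[cols])
    return ~left_keys.isin(right_keys)

# Compute both masks first, then overwrite Df2 in place
name_bad = mismatch_mask(Df2, Df1, ["User_id", "User_name"])
phn_bad = mismatch_mask(Df2, Df1, ["User_id", "User_phn"])
Df2.loc[name_bad, "User_name"] = "Mismatch"
Df2.loc[phn_bad, "User_phn"] = "Mismatch"
print(Df2)
```

Because each key pair includes User_id, a name or phone number only counts as a match when it belongs to the same user in both frames.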
Answer 1: (score: 1)
Try this approach (the example uses frames with fruit, name and type columns):
df2.set_index('fruit', inplace=True)
mask = df1.fruit.isin(df2.index)
df1.loc[mask, 'type'] = df2.loc[df1.loc[mask, 'fruit'], 'type'].values
fruit name type
0 apple anna B
1 banana lisa B
2 orange red A
3 pine tin A
You can also use Series.map.
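As a rough illustration of how map could be applied to the frames in the question, the sketch below looks up each User_id of the second frame in the first and compares the values. It assumes User_id is unique in df1 (it is described as the primary key) and stores phone numbers as strings; the variable names `name_ref`, `phn_ref`, `expected_name` and `expected_phn` are made up for this example.

```python
import pandas as pd

# Sample data from the question; phone numbers kept as strings (assumption)
df1 = pd.DataFrame({
    "User_id": [1, 2, 3],
    "User_name": ["Alex", "Danny", "Bryan"],
    "User_phn": ["1234123", "4234123", "5234123"],
})
df2 = pd.DataFrame({
    "User_id": [1, 2, 3, 4],
    "User_name": ["Alex", "Chris", "Bryan", "Bexy"],
    "User_phn": ["3234123", "4234123", "5234123", "6234123"],
})

# Lookup Series keyed by the primary key
name_ref = df1.set_index("User_id")["User_name"]
phn_ref = df1.set_index("User_id")["User_phn"]

# For every id in df2, fetch the value df1 holds (NaN if the id is absent)
expected_name = df2["User_id"].map(name_ref)
expected_phn = df2["User_id"].map(phn_ref)

# A missing id compares as unequal, so it is flagged as a mismatch too
df2.loc[df2["User_name"] != expected_name, "User_name"] = "Mismatch"
df2.loc[df2["User_phn"] != expected_phn, "User_phn"] = "Mismatch"
print(df2)
```

Only one mapped column exists at a time alongside df2, which keeps the temporary data smaller than a full merged frame.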
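Answer 2: (score: 0)
I wrote some code that solves the problem in a very simple way: I compare the two DataFrames row by row and append each resulting row to a result DataFrame. Let me know whether this works for you.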
import pandas as pd
data = [[1,'Alex','1234123'],[2,'Danny','4234123'],[3,'Bryan','5234123']]
df = pd.DataFrame(data,columns=['User_id','User_name','User_phn'])
print (df)
data = [[1,'Alex','3234123'],[2,'Chris','4234123'],[3,'Bryan','5234123'],[4,'Bexy','6234123']]
df_2 = pd.DataFrame(data,columns=['User_id','User_name','User_phn'])
print (df_2)
l=max(len(df.index),len(df_2.index))
df_res = pd.DataFrame(columns=['User_id','User_name','User_phn'])
# as_matrix() was removed in pandas 1.0; to_numpy() replaces it
df_mat = df.to_numpy()
df_2_mat = df_2.to_numpy()
# Rows are compared by position, so both frames must be sorted by User_id
for i in range(0,l):
    try:
        arr=[]
        arr.append(df_mat[i][0])
        for k in range(1,3):
            if df_mat[i][k] == df_2_mat[i][k]:
                arr.append(df_mat[i][k])
            else:
                arr.append("Mismatch")
        df_res.loc[i] = arr
    except IndexError:
        # Row i does not exist in one of the frames
        df_res.loc[i] = [i+1,"Mismatch","Mismatch"]
print(df_res)
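The same row-by-row comparison can probably be expressed without an explicit Python loop, which matters at 100 million rows. The sketch below is one possible vectorized variant, not the answer author's code: it aligns the frames on User_id rather than on row position. The names `a`, `b`, `a_aligned` and `same` are made up here, and the reindexed copy does cost extra memory.

```python
import pandas as pd

# Same sample data as in the answer above
df = pd.DataFrame(
    [[1, "Alex", "1234123"], [2, "Danny", "4234123"], [3, "Bryan", "5234123"]],
    columns=["User_id", "User_name", "User_phn"],
)
df_2 = pd.DataFrame(
    [[1, "Alex", "3234123"], [2, "Chris", "4234123"],
     [3, "Bryan", "5234123"], [4, "Bexy", "6234123"]],
    columns=["User_id", "User_name", "User_phn"],
)

# Index both frames by the primary key
a = df.set_index("User_id")
b = df_2.set_index("User_id")

# Line up df with the ids of df_2; ids missing from df become all-NaN rows
a_aligned = a.reindex(b.index)

# Element-wise equality; a NaN never equals anything, so missing ids are False
same = a_aligned.eq(b)

# Keep the values that agree and replace everything else
df_res = b.where(same, "Mismatch").reset_index()
print(df_res)
```

Unlike the loop, this does not require the two frames to be in the same row order, only that User_id is unique in each.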
Answer 3: (score: 0)
Hi Narayana Kandukuri,
I think my code may be simpler; take a look.
import pandas as pd
df1 = pd.DataFrame([[1,'Alex',1234123],[2,'Danny',4234123],[3,'Bryan',5234123]],columns=['User_id','User_name','User_phn'])
df2 = pd.DataFrame([[1,'Alex',3234123],[2,'Chris',4234123],[3,'Bryan',5234123],[4,'Bexy',6234123]],columns=['User_id','User_name','User_phn'])
temp = df2[['User_id']] #Saving this for later use.
Bool_Data = (df1==df2[:df1.shape[0]]) #This will give you a boolean frame
df2 = df2[Bool_Data].fillna('mismatch') #Keep this boolean frame to df2
df2['User_id'] = temp['User_id'] #Assign the before temp.
df2 =
User_id User_name User_phn
0 1 Alex mismatch
1 2 mismatch 423412
2 3 Bryan 523412
3 4 mismatch mismatch