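I am querying two DataFrames from two separate databases that share some, but not always the same, fields, and I need a way to reliably join the two.
For example: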
import pandas as pd
inp = [{'Name':'Jose', 'Age':12,'Location':'Frankfurt','Occupation':'Student','Mothers Name':'Rosy'}, {'Name':'Katherine','Age':23,'Location':'Maui','Occupation':'Lawyer','Mothers Name':'Amy'}, {'Name':'Larry','Age':22,'Location':'Dallas','Occupation':'Nurse','Mothers Name':'Monica'}]
df = pd.DataFrame(inp)
print(df)
   Age   Location Mothers Name       Name Occupation
0   12  Frankfurt         Rosy       Jose    Student
1   23       Maui          Amy  Katherine     Lawyer
2   22     Dallas       Monica      Larry      Nurse
inp2 = [{'Name': '','Occupation':'Nurse','Favorite Hobby':'Basketball','Mothers Name':'Monica'},{'Name':'Jose','Occupation':'','Favorite Hobby':'Sewing','Mothers Name':'Rosy'},{'Name':'Katherine','Occupation':'Lawyer','Favorite Hobby':'Reading','Mothers Name':''}]
df2 = pd.DataFrame(inp2)
print(df2)
  Favorite Hobby Mothers Name       Name Occupation
0     Basketball       Monica                 Nurse
1         Sewing         Rosy       Jose
2        Reading               Katherine     Lawyer
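I need to figure out a way to reliably join these two DataFrames even though the data is not always consistent. To complicate matters further, the two databases are not always the same length. Any ideas?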
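Answer 0 (score: 0)
You can merge on each possible combination of the shared columns, concatenate the resulting DataFrames, and then merge that new DataFrame onto the first (complete) DataFrame: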
# do your three possible merges on 'Mothers Name', 'Name', and 'Occupation'
# then concat your dataframes
new_df = pd.concat([df.merge(df2, on=['Mothers Name', 'Name']),
                    df.merge(df2, on=['Name', 'Occupation']),
                    df.merge(df2, on=['Mothers Name', 'Occupation'])], sort=False)
# take the first dataframe, which is complete, and merge with your new_df and drop dups
df.merge(new_df[['Age', 'Location', 'Favorite Hobby']], on=['Age', 'Location']).drop_duplicates()
   Age   Location Mothers Name       Name Occupation Favorite Hobby
0   12  Frankfurt         Rosy       Jose    Student         Sewing
2   23       Maui          Amy  Katherine     Lawyer        Reading
4   22     Dallas       Monica      Larry      Nurse     Basketball
This assumes that the combination of Age and Location is unique for each row.
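If that uniqueness assumption is a problem, the same idea can be written more generally. The following is only a sketch, not a drop-in solution: the helper `merge_on_any_pair` and the temporary `_left_row` column are names made up for illustration, and it assumes that missing values in the second DataFrame are empty strings, as in the example data. It tries every pair of shared columns, skips rows of `df2` where a key is blank, and removes duplicates by the row position in `df` rather than by Age and Location.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame([
    {'Name': 'Jose', 'Age': 12, 'Location': 'Frankfurt',
     'Occupation': 'Student', 'Mothers Name': 'Rosy'},
    {'Name': 'Katherine', 'Age': 23, 'Location': 'Maui',
     'Occupation': 'Lawyer', 'Mothers Name': 'Amy'},
    {'Name': 'Larry', 'Age': 22, 'Location': 'Dallas',
     'Occupation': 'Nurse', 'Mothers Name': 'Monica'},
])
df2 = pd.DataFrame([
    {'Name': '', 'Occupation': 'Nurse',
     'Favorite Hobby': 'Basketball', 'Mothers Name': 'Monica'},
    {'Name': 'Jose', 'Occupation': '',
     'Favorite Hobby': 'Sewing', 'Mothers Name': 'Rosy'},
    {'Name': 'Katherine', 'Occupation': 'Lawyer',
     'Favorite Hobby': 'Reading', 'Mothers Name': ''},
])


def merge_on_any_pair(left, right, n_keys=2):
    """Hypothetical helper: attach right-only columns to `left` when any
    combination of `n_keys` shared columns matches."""
    shared = [c for c in left.columns if c in right.columns]
    extra = [c for c in right.columns if c not in shared]

    # Tag each left row with its position so duplicates can be dropped
    # without relying on Age/Location being unique.
    left_keyed = left.assign(_left_row=range(len(left)))

    matches = []
    for keys in combinations(shared, n_keys):
        keys = list(keys)
        # Assumption: blanks are empty strings. Rows with a blank key are
        # skipped so that '' never matches ''.
        usable = right[(right[keys] != '').all(axis=1)]
        hit = left_keyed[keys + ['_left_row']].merge(usable[keys + extra], on=keys)
        matches.append(hit[['_left_row'] + extra])

    # If a left row matched more than once, keep only the first match.
    found = pd.concat(matches).drop_duplicates(subset='_left_row')
    return (left_keyed.merge(found, on='_left_row', how='left')
                      .drop(columns='_left_row'))


result = merge_on_any_pair(df, df2)
print(result)
```

Because the final merge uses `how='left'`, rows of `df` with no match in `df2` are kept, with NaN in the added columns. If one row of `df` matches several different rows of `df2`, this sketch silently keeps the first match, so you may want to inspect `found` before dropping duplicates.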