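I usually use DataFrame.merge to combine data frames in pandas. As I understand it, this only works for equality joins. What is the idiomatic way to join two data frames using other kinds of checks (for example, inequality)?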
Answer 0 (score: 1)
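Pandas merge() supports outer, left, and right joins between two data frames (not just inner joins), so you can return unmatched records. In addition, merge() can be generalized to return a cross join (every combination of rows between the two data frames), which you can then filter to return unmatched records. There is also the pandas isin() method.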
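Consider the following demo. Below are two data frames on everyone's favorite subject, computer languages. As shown, the first data frame is a subset of the second. The outer join returns records with NaN in the columns that have no match, and those NaN values can then be used as a filter. The cross join returns the full set of row combinations, which can be filtered, and isin() searches for values in a column: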
import pandas as pd
df1 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP'],
'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web'],
'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source', 'Open-Source']})
df2 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP',
'Perl', 'R', 'Ruby', 'VB.NET', 'Javascript', 'Matlab'],
'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web',
'application', 'computing', 'web', 'application', 'web', 'computing'],
'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source',
'Open-Source', 'Open-Source', 'Open-Source', 'Open-Source', 'Proprietary',
'Open-Source', 'Proprietary']})
# OUTER JOIN
mergedf = pd.merge(df1, df2, on=['Languages'], how='outer')
# KEEP ONLY ROWS WHERE THE SMALLER FRAME'S COLUMNS ARE NULL (NO MATCH IN df1)
mergedf = mergedf[pd.isnull(mergedf['Type_x'])][['Languages', 'Uses_y', 'Type_y']]
# Languages Uses_y Type_y
#6 Perl application Open-Source
#7 R computing Open-Source
#8 Ruby web Open-Source
#9 VB.NET application Proprietary
#10 Javascript web Open-Source
#11 Matlab computing Proprietary
# ISIN COMPARISON, RETURNING RECORDS IN LARGER NOT IN SMALLER
unequaldf = df2[~df2.Languages.isin(df1['Languages'])]
# Languages Type Uses
#6 Perl Open-Source application
#7 R Open-Source computing
#8 Ruby Open-Source web
#9 VB.NET Proprietary application
#10 Javascript Open-Source web
#11 Matlab Proprietary computing
# CROSS JOIN
df1['key'] = 1 # (REQUIRES A JOIN KEY OF SAME VALUE)
df2['key'] = 1
crossjoindf = pd.merge(df1, df2, on=['key'])
# FILTER FOR LANGUAGES IN LARGER NOT IN SMALLER (ALSO USING ISIN)
crossjoindf = crossjoindf[~crossjoindf['Languages_y'].isin(crossjoindf['Languages_x'])]\
[['Languages_y', 'Uses_y', 'Type_y']].drop_duplicates()
# Languages_y Uses_y Type_y
#6 Perl application Open-Source
#7 R computing Open-Source
#8 Ruby web Open-Source
#9 VB.NET application Proprietary
#10 Javascript web Open-Source
#11 Matlab computing Proprietary
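Admittedly, the cross join is probably redundant and verbose here, but it can come in handy if your non-matching requirement needs permutations across the data frames.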
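For the inequality case the question asks about, the same cross-join-then-filter idea can be sketched as follows. This is only a minimal illustration: the `events` and `windows` frames and their column names are made up, and it assumes pandas 1.2 or later, where merge() accepts how='cross' and no helper key column is needed.

```python
import pandas as pd

# Hypothetical data: events with a time, and windows with start/end bounds.
events = pd.DataFrame({'event_id': [1, 2, 3], 'time': [5, 12, 25]})
windows = pd.DataFrame({'window': ['a', 'b'], 'start': [0, 10], 'end': [10, 20]})

# Cross join: every event paired with every window (3 * 2 = 6 rows).
pairs = events.merge(windows, how='cross')

# Keep only the pairs that satisfy the inequality start <= time < end.
in_window = pairs[(pairs['time'] >= pairs['start']) & (pairs['time'] < pairs['end'])]
in_window = in_window.reset_index(drop=True)

print(in_window[['event_id', 'time', 'window']])
```

Note that the cross join materializes len(events) * len(windows) rows before filtering, so this approach is best kept to small or moderately sized frames.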
Answer 1 (score: 0)
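merge() is fairly limited. You can do more complex joins with pandasql.sqldf. You can write almost any SQL query and refer to your existing data frames as table names in the SQL statement: https://github.com/yhat/pandasql/ One known bug is that you cannot select from multiple tables in a product join, for example: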
select d1.something, d2.something_else from df1 as d1, df2 as d2 where d1.date=d2.date
However, explicit joins work without any problem, and the statement above can be rewritten as a join.
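To show what an explicit join with a non-equality condition might look like in SQL, here is a rough sketch. It deliberately does not call pandasql itself: since pandasql uses SQLite syntax, the example goes through Python's built-in sqlite3 module together with DataFrame.to_sql and pd.read_sql_query. The frames, column names, and the `cutoff` condition are all hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical frames; the names mirror the df1/df2 aliases used above.
df1 = pd.DataFrame({'name': ['x', 'y'], 'date': [3, 8]})
df2 = pd.DataFrame({'label': ['early', 'late'], 'cutoff': [5, 10]})

# Load both frames into an in-memory SQLite database as tables.
conn = sqlite3.connect(':memory:')
df1.to_sql('df1', conn, index=False)
df2.to_sql('df2', conn, index=False)

# Explicit JOIN ... ON with an inequality instead of a comma-separated product.
query = """
SELECT d1.name, d1.date, d2.label
FROM df1 AS d1
JOIN df2 AS d2
  ON d1.date < d2.cutoff
ORDER BY d1.name, d2.label
"""
result = pd.read_sql_query(query, conn)
conn.close()

print(result)
```

With pandasql installed, the same query string could presumably be passed to sqldf, with the data frames referenced by their variable names.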