I have several pandas DataFrames; for simplicity, suppose there are three.
>> df1=
col1 col2
id1 A B
id2 C D
id3 B A
id4 E F
>> df2=
col1 col2
id1 B A
id2 D C
id3 M N
id4 F E
>> df3=
col1 col2
id1 A B
id2 D C
id3 N M
id4 E F
The desired result is:
>> df=
col1 col2
id1 A B
id2 C D
id3 E F
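because the pairs (A, B), (C, D) and (E, F) appear in every DataFrame, although possibly reversed.
pandas merge only matches columns in the order they are passed. To check this, I tried the following code on two of the DataFrames: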
df1['reverse_1'] = (df1.col1+df1.col2).isin(df2.col1 + df2.col2)
df1['reverse_2'] = (df1.col1+df1.col2).isin(df2.col2 + df2.col1)
The two results differ:
col1 col2 reverse_1 reverse_2
a b False True
c d False True
b a True False
e f False True
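So if I collect the rows where either reverse_1 or reverse_2 is True, I get the intersection of the two DataFrames. Even so, it is not clear to me how to extend this to more than two DataFrames. I am a little confused about this. Any suggestions?
Answer 0 (score: 5)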
You can build a list of DataFrames, then in a list comprehension sort the values within each row and drop duplicate rows:
import numpy as np
import pandas as pd
dfs = [df1, df2, df3]
L = [pd.DataFrame(np.sort(x.values, axis=1), columns=x.columns).drop_duplicates()
for x in dfs]
print (L)
[ col1 col2
0 A B
1 C D
3 E F, col1 col2
0 A B
1 C D
2 M N
3 E F, col1 col2
0 A B
1 C D
2 M N
3 E F]
Then merge the list of DataFrames one after another on all columns (no on parameter):
from functools import reduce
df = reduce(lambda left,right: pd.merge(left,right), L)
print (df)
col1 col2
0 A B
1 C D
2 E F
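For reference, the two steps above can be combined into one self-contained sketch. The helper name `common_pairs` and the way the sample frames are constructed are assumptions for illustration, not part of the original answer.

```python
from functools import reduce

import numpy as np
import pandas as pd


def common_pairs(frames):
    """Return the unordered row pairs present in every frame (hypothetical helper)."""
    normalized = [
        # Sort the values within each row so (B, A) and (A, B) compare equal,
        # then drop rows repeated inside the same frame.
        pd.DataFrame(np.sort(f.values, axis=1), columns=f.columns).drop_duplicates()
        for f in frames
    ]
    # pd.merge without `on` joins on all shared columns; the default is an inner join.
    return reduce(lambda left, right: pd.merge(left, right), normalized)


# Sample data from the question, rebuilt here so the sketch runs on its own.
ids = ["id1", "id2", "id3", "id4"]
df1 = pd.DataFrame({"col1": list("ACBE"), "col2": list("BDAF")}, index=ids)
df2 = pd.DataFrame({"col1": list("BDMF"), "col2": list("ACNE")}, index=ids)
df3 = pd.DataFrame({"col1": list("ADNE"), "col2": list("BCMF")}, index=ids)

result = common_pairs([df1, df2, df3])
print(result)
```

Note that this discards the original id index and returns each pair in sorted order, which may differ from the order in which the pair appeared in any one frame.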
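Another solution from @pygo:
Build the index from frozensets, join the frames with concat using an inner join, then drop duplicated index entries using duplicated with boolean indexing, and use iloc to select the first two columns: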
df = pd.concat([x.set_index(x.apply(frozenset, axis=1)) for x in dfs], axis=1, join='inner')
df = df.iloc[~df.index.duplicated(), :2]
print (df)
col1 col2
(B, A) A B
(C, D) C D
(F, E) E F
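The reason frozenset works as a key here is that it compares and hashes without regard to element order. Below is a small sketch of that property; the helper `row_key` is hypothetical and not part of the answer above.

```python
import pandas as pd


def row_key(row):
    """Order-insensitive key for one row (hypothetical helper)."""
    return frozenset(row)


# Reversed pairs produce equal keys with equal hashes.
assert row_key(("A", "B")) == row_key(("B", "A"))
assert hash(row_key(("A", "B"))) == hash(row_key(("B", "A")))

df1 = pd.DataFrame(
    {"col1": list("ACBE"), "col2": list("BDAF")},
    index=["id1", "id2", "id3", "id4"],
)

# One key per row, in row order.
keys = [row_key(row) for row in df1.itertuples(index=False)]
print(keys[0] == keys[2])  # id1 (A, B) and id3 (B, A) share a key
```

One caveat of this design: a row whose two values are identical, such as (A, A), collapses to a one-element set, so it is only distinguishable from other rows by that single value.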
Answer 1 (score: 0)
Somewhat similar to the previous answer.
import pandas as pd
from io import StringIO
# Test data
df1 = pd.read_table(StringIO ("""
id col1 col2
id1 A B
id2 C D
id3 B A
id4 E F
"""), sep = r'\s+')
df2 = pd.read_table(StringIO ("""
id col1 col2
id1 B A
id2 D C
id3 M N
id4 F E
"""), sep = r'\s+')
df3 = pd.read_table(StringIO("""
id col1 col2
id1 A B
id2 D C
id3 N M
id4 E F
"""), sep = r'\s+')
# List of n dataframes
dfs = [df1, df2, df3]
# Use frozenset to define the column values without regard for order
# pandas apply iterates over each row
# list expression iterates over each dataframe
combined_columns = [pd.Series(df.apply(lambda r: frozenset((r.col1, r.col2)), axis=1), name = 'combined') for df in dfs]
print(combined_columns)
# Results in a list of Series named 'combined'
#[0 (B, A)
# 1 (D, C)
# 2 (B, A)
# 3 (F, E)
# Name: combined, dtype: object,
# 0 (B, A)
# 1 (D, C)
# 2 (N, M)
# 3 (E, F)
# Name: combined, dtype: object,
# 0 (B, A)
# 1 (D, C)
# 2 (M, N)
# 3 (F, E)
# Name: combined, dtype: object]
dfs_combined = [pd.concat([dfs[i], combined_columns[i]], axis = 1) for i in range(len(dfs))]
print(dfs_combined)
# Results in a list of dataframes with the extra column
#[ id col1 col2 combined
# 0 id1 A B (B, A)
# 1 id2 C D (D, C)
# 2 id3 B A (B, A)
# 3 id4 E F (F, E),
# id col1 col2 combined
# 0 id1 B A (B, A)
# 1 id2 D C (D, C)
# 2 id3 M N (N, M)
# 3 id4 F E (E, F),
# id col1 col2 combined
# 0 id1 A B (B, A)
# 1 id2 D C (D, C)
# 2 id3 N M (M, N)
# 3 id4 E F (F, E)]
# The reduce function operates on pairs, with previous result as the first argument
from functools import reduce
result = reduce(lambda df1, df2: df1[df1['combined'].isin(df2['combined'])], dfs_combined).drop_duplicates(subset='combined')
print(result)
# id col1 col2 combined
#0 id1 A B (B, A)
#1 id2 C D (D, C)
#3 id4 E F (F, E)
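A variation on the same idea, for any number of frames, is to intersect plain Python sets of frozenset keys and then filter the first frame. This is only a sketch: the helper `row_keys` is hypothetical, and the sample frames are rebuilt here with the ids as the index rather than as a column.

```python
import pandas as pd

ids = ["id1", "id2", "id3", "id4"]
df1 = pd.DataFrame({"col1": list("ACBE"), "col2": list("BDAF")}, index=ids)
df2 = pd.DataFrame({"col1": list("BDMF"), "col2": list("ACNE")}, index=ids)
df3 = pd.DataFrame({"col1": list("ADNE"), "col2": list("BCMF")}, index=ids)
dfs = [df1, df2, df3]


def row_keys(frame):
    """One order-insensitive key per row (hypothetical helper)."""
    pairs = frame[["col1", "col2"]].itertuples(index=False)
    return [frozenset(pair) for pair in pairs]


# Keys that occur in every frame.
common = set.intersection(*(set(row_keys(f)) for f in dfs))

# Keep the first row of df1 for each common key, skipping later repeats.
seen = set()
keep = []
for key in row_keys(df1):
    keep.append(key in common and key not in seen)
    seen.add(key)

result = df1[keep]
print(result)
```

Unlike the merge-based approach, this keeps the original row labels and the column order of the first frame, at the cost of iterating over rows in Python.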