I have two pandas dataframes of different sizes. The first has about 500k rows; here is a sample:
df=
   Name1   Name2        Date  place  pet  Value1  Value2  Value3
0    Jim      Al  2015-09-28   work  cat       3       9       4
1   Rick   Sarah  2015-09-28   home  cat      12      11       2
2   Gary   Sasha  2015-09-28   home  cat       8      11       7
3    Tom    Ryan  2015-09-27   bank  dog       8       1       3
4   Jane     Bob  2015-09-27    gym  cat       6       5       9
5  Chris   Steve  2015-09-26    car  cat       4       4       2
6   Jack  Ashley  2015-09-26   home  cat       2       6       7
The second has about 40k rows:
df_2=
         Date  place  pet  Value1  Value2  Value3
0  2015-09-28   home  cat       2       1       2
1  2015-09-28   work  cat       1       1       3
2  2015-09-27    gym  cat       4       4       1
3  2015-09-27   bank  dog       2       3       3
4  2015-09-26    car  cat       3       2       1
5  2015-09-26   home  cat       4       1       1
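What I want to do is compare the two dataframes and find the rows where Date, place and pet match. For each match I would like to multiply df.Value1 by df_2.Value1, df.Value2 by df_2.Value2, and so on, returning a dataframe that not only contains those products but also keeps some of the information from df, such as Name1, Name2, Date, place and pet.
Thanks!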
Answer 0 (score: 0)
This script should take care of it:
import pandas as pd

# The sample files are whitespace-delimited, so split on runs of whitespace.
df1 = pd.read_csv('df1.csv', sep=r'\s+')
df2 = pd.read_csv('df2.csv', sep=r'\s+')

# Inner join on the key columns; the overlapping Value* columns get _x / _y suffixes.
df_combined = df1.merge(df2, on=['Date', 'place', 'pet'])

for elem in ['Value1', 'Value2', 'Value3']:
    df_combined[elem] = df_combined[elem + '_x'].mul(df_combined[elem + '_y'])
    # drop the old columns
    df_combined = df_combined.drop(columns=[elem + '_x', elem + '_y'])

print(df_combined)
The output is:
   Name1   Name2        Date  place  pet  Value1  Value2  Value3
0    Jim      Al  2015-09-28   work  cat       3       9      12
1   Rick   Sarah  2015-09-28   home  cat      24      11       4
2   Gary   Sasha  2015-09-28   home  cat      16      11      14
3    Tom    Ryan  2015-09-27   bank  dog      16       3       9
4   Jane     Bob  2015-09-27    gym  cat      24      20       9
5  Chris   Steve  2015-09-26    car  cat      12       8       2
6   Jack  Ashley  2015-09-26   home  cat       8       6       7
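For anyone without df1.csv and df2.csv at hand, here is a minimal self-contained sketch of the same merge-then-multiply approach. It assumes the sample rows from the question, fed through io.StringIO in place of the files; the names value_cols, merged and result are just illustrative.

```python
import io

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the CSV files.
df = pd.read_csv(io.StringIO("""\
Name1 Name2 Date place pet Value1 Value2 Value3
Jim Al 2015-09-28 work cat 3 9 4
Rick Sarah 2015-09-28 home cat 12 11 2
Gary Sasha 2015-09-28 home cat 8 11 7
Tom Ryan 2015-09-27 bank dog 8 1 3
Jane Bob 2015-09-27 gym cat 6 5 9
Chris Steve 2015-09-26 car cat 4 4 2
Jack Ashley 2015-09-26 home cat 2 6 7
"""), sep=r"\s+")

df_2 = pd.read_csv(io.StringIO("""\
Date place pet Value1 Value2 Value3
2015-09-28 home cat 2 1 2
2015-09-28 work cat 1 1 3
2015-09-27 gym cat 4 4 1
2015-09-27 bank dog 2 3 3
2015-09-26 car cat 3 2 1
2015-09-26 home cat 4 1 1
"""), sep=r"\s+")

value_cols = ["Value1", "Value2", "Value3"]

# Inner join on the key columns; overlapping columns get the default _x / _y suffixes.
merged = df.merge(df_2, on=["Date", "place", "pet"])

# Multiply each pair of suffixed columns into a new, unsuffixed column.
for col in value_cols:
    merged[col] = merged[col + "_x"] * merged[col + "_y"]

# Drop the suffixed inputs in one call.
result = merged.drop(columns=[c + s for c in value_cols for s in ("_x", "_y")])
print(result)
```

Note that an inner join silently discards rows of df that have no matching key in df_2; passing how="left" to merge would keep them, with NaN in the products.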
Answer 1 (score: 0)
Here is a vectorized solution that uses the merge() method (d1 and d2 stand for the question's df and df_2):
In [64]: m = pd.merge(d1, d2, on=['Date','place','pet'], suffixes=['', '_z'])
In [65]: m
Out[65]:
   Name1   Name2        Date  place  pet  Value1  Value2  Value3  Value1_z  Value2_z  Value3_z
0    Jim      Al  2015-09-28   work  cat       3       9       4         1         1         3
1   Rick   Sarah  2015-09-28   home  cat      12      11       2         2         1         2
2   Gary   Sasha  2015-09-28   home  cat       8      11       7         2         1         2
3    Tom    Ryan  2015-09-27   bank  dog       8       1       3         2         3         3
4   Jane     Bob  2015-09-27    gym  cat       6       5       9         4         4         1
5  Chris   Steve  2015-09-26    car  cat       4       4       2         3         2         1
6   Jack  Ashley  2015-09-26   home  cat       2       6       7         4         1         1
Now we can multiply the corresponding Value* columns and then drop the Value*_z helpers:
In [66]: m.loc[:, m.columns.str.contains(r'^Value\d+$')] *= m.filter(regex=r'Value\d+_z').values
In [69]: m = m.drop(m.columns[m.columns.str.contains(r'^Value\d+_z')], axis=1)
In [70]: m
Out[70]:
   Name1   Name2        Date  place  pet  Value1  Value2  Value3
0    Jim      Al  2015-09-28   work  cat       3       9      12
1   Rick   Sarah  2015-09-28   home  cat      24      11       4
2   Gary   Sasha  2015-09-28   home  cat      16      11      14
3    Tom    Ryan  2015-09-27   bank  dog      16       3       9
4   Jane     Bob  2015-09-27    gym  cat      24      20       9
5  Chris   Steve  2015-09-26    car  cat      12       8       2
6   Jack  Ashley  2015-09-26   home  cat       8       6       7
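One thing worth checking with a 500k-row and a 40k-row frame: the merge above only yields one output row per row of the first frame if each (Date, place, pet) combination appears once in the second frame. The sketch below shows one way to guard against that, using merge's validate argument. The frames left and right_dup are hypothetical, and the second row of right_dup (Value1 = 5) is invented purely to create a duplicate key.

```python
import pandas as pd

keys = ["Date", "place", "pet"]

# Two rows from the question's first frame that share the same key.
left = pd.DataFrame({
    "Date": ["2015-09-28", "2015-09-28"],
    "place": ["home", "home"],
    "pet": ["cat", "cat"],
    "Name1": ["Rick", "Gary"],
    "Value1": [12, 8],
})

# Hypothetical lookup frame in which that key occurs twice.
right_dup = pd.DataFrame({
    "Date": ["2015-09-28", "2015-09-28"],
    "place": ["home", "home"],
    "pet": ["cat", "cat"],
    "Value1": [2, 5],
})

# Without validation every left row pairs with every matching right row.
unchecked_rows = len(left.merge(right_dup, on=keys))
print(unchecked_rows)

# validate="many_to_one" asks pandas to raise if the right-hand keys are not unique.
try:
    left.merge(right_dup, on=keys, validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
print(duplicate_keys_detected)
```

If the check fails on real data, the second frame would need to be de-duplicated or aggregated on the key columns first; which of those is appropriate depends on what the duplicates mean.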
Some explanation:
In [74]: m.columns.str.contains(r'^Value\d+$')
Out[74]: array([False, False, False, False, False, True, True, True, False, False, False], dtype=bool)
In [75]: m.loc[:, m.columns.str.contains(r'^Value\d+$')]
Out[75]:
   Value1  Value2  Value3
0       3       9       4
1      12      11       2
2       8      11       7
3       8       1       3
4       6       5       9
5       4       4       2
6       2       6       7
In [76]: m.filter(regex=r'Value\d+_z')
Out[76]:
   Value1_z  Value2_z  Value3_z
0         1         1         3
1         2         1         2
2         2         1         2
3         2         3         3
4         4         4         1
5         3         2         1
6         4         1         1
In [77]: m.filter(regex=r'Value\d+_z').values
Out[77]:
array([[1, 1, 3],
       [2, 1, 2],
       [2, 1, 2],
       [2, 3, 3],
       [4, 4, 1],
       [3, 2, 1],
       [4, 1, 1]], dtype=int64)
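The reason for taking .values on the right-hand side is that pandas aligns DataFrame arithmetic on column labels, and Value1 never matches Value1_z. The small sketch below illustrates the difference on the first two rows of the merged frame, restricted to two of the three value pairs; left, right, aligned and positional are illustrative names, and to_numpy() is used as the current equivalent of .values.

```python
import pandas as pd

# First two rows of the merged frame from above, limited to two value pairs.
m = pd.DataFrame({
    "Value1": [3, 12],
    "Value2": [9, 11],
    "Value1_z": [1, 2],
    "Value2_z": [1, 1],
})

left = m[["Value1", "Value2"]]
right = m[["Value1_z", "Value2_z"]]

# DataFrame * DataFrame aligns on column labels; no labels match, so every cell is NaN.
aligned = left * right
print(aligned)

# DataFrame * ndarray multiplies by position, which is what we want here.
positional = left * right.to_numpy()
print(positional)
```

The positional form relies on the two column selections being in the same order, which holds here because merge keeps Value1..Value3 and Value1_z..Value3_z in matching order.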