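Compare 2 unequal-size Pandas DataFrames for matching values, then multiply on a match

Date: 2017-01-12 18:47:44

Tags: python pandas dataframe

I have two pandas DataFrames of different sizes. The first has about 500k rows; here is a sample -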

df=
   Name1   Name2         Date   place  pet   Value1   Value2  Value3
0    Jim      Al   2015-09-28    work  cat        3        9       4
1   Rick   Sarah   2015-09-28    home  cat       12       11       2
2   Gary   Sasha   2015-09-28    home  cat        8       11       7
3    Tom    Ryan   2015-09-27    bank  dog        8        1       3
4   Jane     Bob   2015-09-27     gym  cat        6        5       9
5  Chris   Steve   2015-09-26     car  cat        4        4       2
6   Jack  Ashley   2015-09-26    home  cat        2        6       7

The second has about 40k rows -

df_2=
         Date  place  pet   Value1  Value2  Value3
0  2015-09-28   home  cat        2       1       2
1  2015-09-28   work  cat        1       1       3
2  2015-09-27    gym  cat        4       4       1
3  2015-09-27   bank  dog        2       3       3
4  2015-09-26    car  cat        3       2       1
5  2015-09-26   home  cat        4       1       1

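What I want to accomplish is to compare the two DataFrames and find the rows where Date, place and pet are the same in both. Where they match, I would like to multiply df.Value1 by df_2.Value1, df.Value2 by df_2.Value2, and so on, returning a DataFrame that contains not only these products but also keeps some of the information from df, such as Name1, Name2, Date, place and pet.

Roughly the result I am looking for -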

Thanks!

2 Answers:

Answer 0: (score: 0)

This script should take care of it:

import pandas as pd

# Read the whitespace-delimited sample files
df1 = pd.read_csv('df1.csv', sep=r"\s+")
df2 = pd.read_csv('df2.csv', sep=r"\s+")

# Inner join: keep only rows whose Date, place and pet match in both frames
df_combined = df1.merge(df2, on=['Date', 'place', 'pet'])

for elem in ['Value1', 'Value2', 'Value3']:
    # merge() suffixed the overlapping columns with _x (df1) and _y (df2)
    df_combined[elem] = df_combined[elem + '_x'].mul(
        df_combined[elem + '_y'])
    # drop the old columns
    df_combined = df_combined.drop(columns=[elem + '_x', elem + '_y'])

print(df_combined)

The output is:

   Name1   Name2        Date place  pet  Value1  Value2  Value3
0    Jim      Al  2015-09-28  work  cat       3       9      12
1   Rick   Sarah  2015-09-28  home  cat      24      11       4
2   Gary   Sasha  2015-09-28  home  cat      16      11      14
3    Tom    Ryan  2015-09-27  bank  dog      16       3       9
4   Jane     Bob  2015-09-27   gym  cat      24      20       9
5  Chris   Steve  2015-09-26   car  cat      12       8       2
6   Jack  Ashley  2015-09-26  home  cat       8       6       7
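
For readers without the CSV files at hand, the same merge-then-multiply idea can be sketched with in-memory frames. This is only an illustration under a few assumptions: it uses a subset of the question's sample rows, the names `keys`, `value_cols` and `z_cols` are made up for this sketch, and `validate="many_to_one"` is an optional extra that is not part of the answer above. It guards against df_2 containing the same Date/place/pet combination more than once, which would otherwise duplicate rows of df.

```python
import pandas as pd

# Subset of the sample rows from the question.
df = pd.DataFrame({
    "Name1": ["Jim", "Rick", "Gary", "Tom"],
    "Name2": ["Al", "Sarah", "Sasha", "Ryan"],
    "Date": ["2015-09-28", "2015-09-28", "2015-09-28", "2015-09-27"],
    "place": ["work", "home", "home", "bank"],
    "pet": ["cat", "cat", "cat", "dog"],
    "Value1": [3, 12, 8, 8],
    "Value2": [9, 11, 11, 1],
    "Value3": [4, 2, 7, 3],
})
df_2 = pd.DataFrame({
    "Date": ["2015-09-28", "2015-09-28", "2015-09-27"],
    "place": ["home", "work", "bank"],
    "pet": ["cat", "cat", "dog"],
    "Value1": [2, 1, 2],
    "Value2": [1, 1, 3],
    "Value3": [2, 3, 3],
})

# Illustrative helper names (not from the original answer).
keys = ["Date", "place", "pet"]
value_cols = ["Value1", "Value2", "Value3"]
z_cols = [c + "_z" for c in value_cols]

# validate="many_to_one" raises MergeError if df_2 repeats a key combination.
merged = df.merge(df_2, on=keys, suffixes=("", "_z"), validate="many_to_one")

# Multiply position by position, then drop the helper columns.
merged[value_cols] = merged[value_cols].to_numpy() * merged[z_cols].to_numpy()
result = merged.drop(columns=z_cols)

print(result)
```

Because the default merge is an inner join, rows of df with no matching key in df_2 are dropped from `result`.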

Answer 1: (score: 0)

Here is a vectorized solution that uses the merge() method:

In [64]: m = pd.merge(d1, d2, on=['Date','place','pet'], suffixes=['', '_z'])

In [65]: m
Out[65]:
   Name1   Name2        Date place  pet  Value1  Value2  Value3  Value1_z  Value2_z  Value3_z
0    Jim      Al  2015-09-28  work  cat       3       9       4         1         1         3
1   Rick   Sarah  2015-09-28  home  cat      12      11       2         2         1         2
2   Gary   Sasha  2015-09-28  home  cat       8      11       7         2         1         2
3    Tom    Ryan  2015-09-27  bank  dog       8       1       3         2         3         3
4   Jane     Bob  2015-09-27   gym  cat       6       5       9         4         4         1
5  Chris   Steve  2015-09-26   car  cat       4       4       2         3         2         1
6   Jack  Ashley  2015-09-26  home  cat       2       6       7         4         1         1

Now we can multiply the corresponding Value* columns:

In [66]: m.loc[:, m.columns.str.contains(r'^Value\d+$')] *= m.filter(regex=r'Value\d+_z').values

In [69]: m = m.drop(m.columns[m.columns.str.contains(r'^Value\d+_z')], axis=1)

In [70]: m
Out[70]:
   Name1   Name2        Date place  pet  Value1  Value2  Value3
0    Jim      Al  2015-09-28  work  cat       3       9      12
1   Rick   Sarah  2015-09-28  home  cat      24      11       4
2   Gary   Sasha  2015-09-28  home  cat      16      11      14
3    Tom    Ryan  2015-09-27  bank  dog      16       3       9
4   Jane     Bob  2015-09-27   gym  cat      24      20       9
5  Chris   Steve  2015-09-26   car  cat      12       8       2
6   Jack  Ashley  2015-09-26  home  cat       8       6       7
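
Both answers use an inner join, so any row of the large frame without a match disappears from the result. If such rows should instead be kept unchanged, one possible variation is a left join with a fill factor of 1. This is a sketch only, not something the question asks for: the "Zed" row is a made-up unmatched row, and `keys`, `value_cols`, `z_cols` and `factors` are names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name1": ["Jim", "Rick", "Zed"],       # "Zed" is hypothetical, no match in df_2
    "Date": ["2015-09-28", "2015-09-28", "2015-09-25"],
    "place": ["work", "home", "park"],
    "pet": ["cat", "cat", "dog"],
    "Value1": [3, 12, 5],
    "Value2": [9, 11, 6],
})
df_2 = pd.DataFrame({
    "Date": ["2015-09-28", "2015-09-28"],
    "place": ["home", "work"],
    "pet": ["cat", "cat"],
    "Value1": [2, 1],
    "Value2": [1, 1],
})

keys = ["Date", "place", "pet"]
value_cols = ["Value1", "Value2"]
z_cols = [c + "_z" for c in value_cols]

# how="left" keeps every row of df; unmatched rows get NaN in the *_z columns.
m = df.merge(df_2, on=keys, how="left", suffixes=("", "_z"))

# Treat a missing match as a factor of 1 so the original value is kept.
factors = m[z_cols].fillna(1).to_numpy()
m[value_cols] = (m[value_cols].to_numpy() * factors).astype("int64")
m = m.drop(columns=z_cols)

print(m)
```

The cast back to `int64` is needed because the NaN values introduced by the left join turn the `*_z` columns into floats.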

Some explanation:

In [74]: m.columns.str.contains(r'^Value\d+$')
Out[74]: array([False, False, False, False, False,  True,  True,  True, False, False, False], dtype=bool)

In [75]: m.loc[:, m.columns.str.contains(r'^Value\d+$')]
Out[75]:
   Value1  Value2  Value3
0       3       9       4
1      12      11       2
2       8      11       7
3       8       1       3
4       6       5       9
5       4       4       2
6       2       6       7

In [76]: m.filter(regex=r'Value\d+_z')
Out[76]:
   Value1_z  Value2_z  Value3_z
0         1         1         3
1         2         1         2
2         2         1         2
3         2         3         3
4         4         4         1
5         3         2         1
6         4         1         1

In [77]: m.filter(regex=r'Value\d+_z').values
Out[77]:
array([[1, 1, 3],
       [2, 1, 2],
       [2, 1, 2],
       [2, 3, 3],
       [4, 4, 1],
       [3, 2, 1],
       [4, 1, 1]], dtype=int64)
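
The `.values` call in In [66] is what makes the multiplication positional. Arithmetic between two DataFrames aligns on column labels, and `Value1` never matches `Value1_z`, so multiplying the two frames directly would give only NaN. A small sketch of the difference, using a two-row frame built for this illustration (`left`, `right`, `aligned` and `positional` are names invented here; `to_numpy()` is used as the equivalent of `.values`):

```python
import pandas as pd

# Two rows shaped like the merged frame m, value columns only.
m = pd.DataFrame({
    "Value1": [3, 12],
    "Value2": [9, 11],
    "Value1_z": [1, 2],
    "Value2_z": [1, 1],
})

left = m[["Value1", "Value2"]]
right = m[["Value1_z", "Value2_z"]]

# DataFrame * DataFrame aligns on column labels. No label is shared,
# so the result is the union of the four columns, all NaN.
aligned = left * right

# DataFrame * ndarray ignores labels and multiplies position by position.
positional = left * right.to_numpy()

print(aligned)
print(positional)
```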