Vectorized lookup on a pandas DataFrame

Time: 2017-09-08 10:51:47

Tags: python pandas dataframe vectorization lookup

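I have two DataFrames...

df1 is a table from which I need to extract values, using (index, column) pairs taken from several columns of df2.

I see there is a function get_value that works fine when given a single index and column value, but my attempts to vectorize it to create a new column have failed...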

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(20).reshape((4, 5)))

df1.columns = list('abcde')

df1.index = ['cat', 'dog', 'fish', 'bird']

        a   b   c   d   e
cat     0   1   2   3   4
dog     5   6   7   8   9
fish    10  11  12  13  14
bird    15  16  17  18  19

df1.get_value('bird', 'c')

17

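What I need to do now is create an entire new column on df2 by indexing df1 with the (index, column) pairs given in the animal and letter columns of df2, effectively vectorizing the get_value call above.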

df2 = pd.DataFrame(np.arange(20).reshape((4, 5)))

df2['animal'] = ['cat', 'dog', 'fish', 'bird']

df2['letter'] = list('abcd')

    0   1   2   3   4   animal  letter
0   0   1   2   3   4   cat     a
1   5   6   7   8   9   dog     b
2   10  11  12  13  14  fish    c
3   15  16  17  18  19  bird    d

Resulting in...

    0   1   2   3   4   animal  letter   looked_up
0   0   1   2   3   4   cat     a        0
1   5   6   7   8   9   dog     b        6
2   10  11  12  13  14  fish    c        12
3   15  16  17  18  19  bird    d        18

3 Answers:

Answer 0 (score: 4)

If you are looking for a faster approach, zip helps for small DataFrames, i.e.

k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]

Output:

   0   1   2   3   4 animal letter  looked_up
0   0   1   2   3   4    cat      a          0
1   5   6   7   8   9    dog      b          6
2  10  11  12  13  14   fish      c         12
3  15  16  17  18  19   bird      d         18

As John suggested, you can simplify the code, which is also faster.

df2['looked_up'] = [df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]

If there is missing data, use an if/else, i.e.

df2['looked_up'] = [df1.get_value(r, c) if pd.notnull(r) and pd.notnull(c) else np.nan for r, c in zip(df2.animal, df2.letter)]
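On pandas versions where get_value is no longer available (it was deprecated in 0.21 and removed in 1.0), the same loop can be sketched with the .at scalar accessor. This is only an illustration: the helper name safe_lookup and the sample df2 with missing keys are assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(20).reshape((4, 5)),
                   columns=list('abcde'),
                   index=['cat', 'dog', 'fish', 'bird'])

# Assumed sample data: the last two rows each have one missing key.
df2 = pd.DataFrame({'animal': ['cat', 'dog', 'fish', None],
                    'letter': ['a', 'b', None, 'd']})

def safe_lookup(table, row, col):
    """Hypothetical helper: return table.at[row, col], or NaN if either key is missing."""
    if pd.isnull(row) or pd.isnull(col):
        return np.nan
    return table.at[row, col]

# Same list-comprehension pattern as above, with .at in place of get_value.
df2['looked_up'] = [safe_lookup(df1, r, c) for r, c in zip(df2.animal, df2.letter)]
```

Note that a key which is present but not found in df1 would still raise a KeyError here; only null keys are mapped to NaN.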

Timings for a small dataset:

%%timeit
df2['looked_up'] = df1.lookup(df2.animal, df2.letter)
1000 loops, best of 3: 801 µs per loop

k = list(zip(df2['animal'].values,df2['letter'].values))
df2['looked_up'] = [df1.get_value(*i) for i in k]
1000 loops, best of 3: 399 µs per loop

[df1.get_value(r, c) for r, c in zip(df2.animal, df2.letter)]
10000 loops, best of 3: 87.5 µs per loop

Timings for a large DataFrame:

df3 = pd.concat([df2]*10000)

%%timeit
k = list(zip(df3['animal'].values,df3['letter'].values))
df3['looked_up'] = [df1.get_value(*i) for i in k]
1 loop, best of 3: 185 ms per loop

df3['looked_up'] = [df1.get_value(r, c) for r, c in zip(df3.animal, df3.letter)]
1 loop, best of 3: 165 ms per loop

df3['looked_up'] = df1.lookup(df3.animal, df3.letter)
100 loops, best of 3: 8.82 ms per loop

Answer 1 (score: 3)

There is an aptly named lookup function that does exactly this.

df2['looked_up'] = df1.lookup(df2.animal, df2.letter)

df2

    0   1   2   3   4 animal letter  looked_up
0   0   1   2   3   4    cat      a          0
1   5   6   7   8   9    dog      b          6
2  10  11  12  13  14   fish      c         12
3  15  16  17  18  19   bird      d         18
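DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On those versions, one possible replacement (a sketch, not the method used in this answer) is to translate the labels into integer positions with Index.get_indexer and then index the underlying NumPy array. The variable names rows and cols are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(20).reshape((4, 5)),
                   columns=list('abcde'),
                   index=['cat', 'dog', 'fish', 'bird'])
df2 = pd.DataFrame({'animal': ['cat', 'dog', 'fish', 'bird'],
                    'letter': list('abcd')})

# Translate row and column labels into integer positions.
rows = df1.index.get_indexer(df2['animal'])
cols = df1.columns.get_indexer(df2['letter'])

# get_indexer returns -1 for unknown labels, which NumPy would silently
# treat as "last element", so fail loudly instead.
if (rows < 0).any() or (cols < 0).any():
    raise KeyError('some (animal, letter) pairs are not present in df1')

# Pairwise fancy indexing: element i is df1 at (rows[i], cols[i]).
df2['looked_up'] = df1.to_numpy()[rows, cols]
```

This assumes the index and columns of df1 are unique, since get_indexer requires a unique index.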

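Answer 2 (score: 2)

lookup and get_value are great answers if your values exist in the lookup DataFrame.

However, if some (row, column) pairs are not present in the lookup DataFrame and you want the looked-up value to be NaN for them, merge combined with stack is one way to do it.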

In [206]: df2.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
                    left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
                    how='left').drop(['level_0', 'level_1'], axis=1)
Out[206]:
    0   1   2   3   4 animal letter  looked_up
0   0   1   2   3   4    cat      a          0
1   5   6   7   8   9    dog      b          6
2  10  11  12  13  14   fish      c         12
3  15  16  17  18  19   bird      d         18

Adding a non-existent (animal, letter) pair to test:

In [207]: df22
Out[207]:
      0     1     2     3     4 animal letter
0   0.0   1.0   2.0   3.0   4.0    cat      a
1   5.0   6.0   7.0   8.0   9.0    dog      b
2  10.0  11.0  12.0  13.0  14.0   fish      c
3  15.0  16.0  17.0  18.0  19.0   bird      d
4   NaN   NaN   NaN   NaN   NaN  dummy    NaN

In [208]: df22.merge(df1.stack().reset_index().rename(columns={0: 'looked_up'}),
                    left_on=['animal', 'letter'], right_on=['level_0', 'level_1'],
                    how='left').drop(['level_0', 'level_1'], axis=1)
Out[208]:
      0     1     2     3     4 animal letter  looked_up
0   0.0   1.0   2.0   3.0   4.0    cat      a        0.0
1   5.0   6.0   7.0   8.0   9.0    dog      b        6.0
2  10.0  11.0  12.0  13.0  14.0   fish      c       12.0
3  15.0  16.0  17.0  18.0  19.0   bird      d       18.0
4   NaN   NaN   NaN   NaN   NaN  dummy    NaN        NaN
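For reference, the merge-plus-stack idea can be written as a self-contained sketch. The construction of df22 is an assumption (the answer only shows its printed form, and the numeric columns are omitted here for brevity), and naming the stacked index levels animal and letter is a variation that avoids the level_0/level_1 cleanup step.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(20).reshape((4, 5)),
                   columns=list('abcde'),
                   index=['cat', 'dog', 'fish', 'bird'])

# Assumed test frame: the four original pairs plus one pair absent from df1.
df22 = pd.DataFrame({'animal': ['cat', 'dog', 'fish', 'bird', 'dummy'],
                     'letter': ['a', 'b', 'c', 'd', np.nan]})

# Long-format lookup table: one row per (animal, letter) pair in df1.
pairs = (df1.stack()
            .rename_axis(['animal', 'letter'])
            .rename('looked_up')
            .reset_index())

# A left merge keeps every row of df22; unmatched pairs get NaN.
result = df22.merge(pairs, on=['animal', 'letter'], how='left')
```

Because the unmatched row introduces NaN, the looked_up column is upcast to float, which matches the 0.0, 6.0, ... values in the output above.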